DevOps Consultant Systems Engineer - Core Technology Infrastructure

Richardson, Texas

Job Description:

Role Description:

  • As a Site Reliability Engineer, you will establish and improve monitoring that measures end-to-end performance and end-user availability of systems, using a suite of common monitoring tools.
  • You will interface with business partners and operations teams to develop business and technical monitoring requirements.
  • Your core responsibility is to build unified monitoring and maintenance tools. As part of this, you will need to build an inventory system that serves as a hub to capture and preserve all observability metrics.
  • You will work with teams within the organization (Engineering and Operations) to assist with development and implementation of monitoring to meet business requirements, including KPIs, service mapping, dependency mapping, alerting thresholds, etc.
  • You will be working with other site reliability engineers and dedicated monitoring engineers to support this initiative.

Responsibilities:

  • Work with application owners, both Business owners and Engineering teams, along with operation services, to establish Business and Technical monitoring strategies, including instrumentation of the systems, collection of metrics, development of KPIs, and configuration of alerting by static and dynamic thresholds through use of statistical analysis and machine learning.
  • Drive standardization and centralization, and build tools to achieve operational efficiencies and product insight.
  • Design and build an inventory system with a comprehensive list of KPIs and metrics built in and preserved.
  • Develop a performance test plan and test harness that can support 100+ varied products and platforms.
  • Devise programmatic capacity planning routines.
  • Utilize technical area expertise to assess, select, manage, and implement enterprise application components, and to ensure that the technical solution solves the business problem as an organic part of the organization’s operational and functional baseline.
  • Participate in the support of Major Incidents with Major Incident Management (MIM), Operations Triage Group (OTG), ECC, and Problem Management (PM) throughout the major incident life cycle by providing monitoring data on the system(s) in question and by addressing deficiencies in technical and business monitoring KPIs.
  • Support Triage efforts during Major Incidents by deconstructing application performance, interoperability, instrumentation, and human factors to facilitate resolution and development of resilient solutions.
  • Support PM’s enterprise root cause analysis (RCA) processes in collaboration with appropriate OI&T organizations.
  • Capture technical information from the relevant stakeholders and synthesize it into useful information in various formats for OIT senior management and other VA components.
  • Demonstrate proficiency with DevOps tools, JIRA, ServiceNow, and MS Project, and perform tasks using these tools.

Qualifications

Education and Experience:

  • Master’s degree preferred in Business Administration, Business Management, Computer Science, Information Systems, Information Resource Management, Industrial Engineering, Operations Research, or related fields
  • 3+ years of relevant experience
  • Certifications in relevant software development or analytics plus 3-5 years of relevant experience
  • 8 to 10 years of relevant experience may be substituted for education (13-15 years total)

Skills:

  • Strong experience in Java and Front-end development (UI and UX) (React JS, Angular)
  • Experience with Apache/Tomcat middleware and Java/RESTful services frameworks (MuleSoft is a plus)
  • Backend database experience is a must: Oracle, SQL Server, Hadoop
  • Strong Python, UNIX, Wintel, Perl/Shell scripting
  • Strong experience working with CI/CD tools: Bitbucket, JFrog, Jenkins, Artifactory, Ansible
  • Experience working with Business and Technical leaders to develop KPIs for application monitoring.
  • Experience with modern performance monitoring and diagnostics tools (for example, Splunk, Splunk ITSI, AppDynamics, Dynatrace, SolarWinds)
  • Technical expertise across multiple technology areas, with the ability to diagnose complex issues spanning many technologies and to apply this knowledge to effective monitoring of applications.
  • Must be able to provide oral and written discussion of analytical findings using narrative and graphic forms.
  • Must be able to use qualitative and quantitative analytical skills to assess the effectiveness of the operations.
  • Ability to identify symptoms that point to opportunities for process improvement.
  • Analytical, investigative, and organizational skills.
  • Communication skills, including the ability to craft content for executive-level presentations.
  • IT background and ability to understand technical content.

Job Band:

H5

Shift: 

1st shift (United States of America)

Hours Per Week:

40

Referral Bonus Amount:

0

Learn more about this role

Full time

JR-21082620

Manages People: No

Travel: Yes, 5% of the time

Talent Acquisition Contact:

Kathleen Jones-Griffith
