Production Services Lead - Core Technology Infrastructure

Plano, Texas

Job Description:

The resource will combine software and systems engineering skills to help technology teams run large-scale, distributed, fault-tolerant systems. They will use their software development skills to automate routine operational activities and improve reliability and efficiency. They will also draw on their knowledge and experience as a site reliability engineer to instill SRE best practices into the day-to-day routines of the broader SWAT SRE team and to influence technology partners to adopt SRE practices.

The resource will also be responsible for the following:

- Apply extensive technical experience and skills to drive the triage of complex, high-impact Production incidents and quickly restore service

- Partner with application and product managers to identify the root cause of complex, high-impact Production problems and the actions to correct them. Also work with those teams to identify other opportunities to improve overall Production stability, including actions to mitigate the recurrence of any problem and opportunities to improve overall monitoring.

- Socialize advanced triage techniques with Production Support teams to improve their ability to detect disruptions and restore service

- Proactively assess the portfolio of critical applications and identify stability and resiliency concerns. Ensure those concerns are escalated to the appropriate partners for remediation

- Partner with monitoring architects and tooling teams to drive out best practices on how to leverage advanced monitoring tools (e.g., Splunk, AppDynamics, Dynatrace, NetScout) to more quickly detect Production incidents and restore service.

Required Skills:

The resource will have the following skills:

- 5+ years of experience in information technology

- Ability to program (structured and OO) with two or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript

- Experience developing scripts to automate routine operational activities, ideally executed using a tool like Bladelogic or Ansible Tower

- Experience developing advanced monitoring capabilities using tools such as Splunk, AppDynamics, Dynatrace, Glassbox, and/or NetScout

- Experience troubleshooting complex Production incidents with at least one strength in the following areas:

  • Diagnosing database performance problems (Oracle, SQL Server and/or DB2);
  • Diagnosing networking problems;
  • Diagnosing middleware problems, including extensive experience analyzing heap and/or thread dumps to determine root cause and restoral actions;
  • Diagnosing storage problems;
  • Diagnosing server problems with bare metal implementations, VMs, or the underlying cloud infrastructure (e.g., ESX clusters)

- Experience identifying the root cause of a Production incident by analyzing application and thread dumps

- Experience defining Service Level Objectives (SLOs), Error Budgets and Error Policies to help prioritize limited development capacity between new features/service onboardings and stability stories for upcoming releases

- Strong, courageous communicator capable of communicating effectively, verbally and via email and instant messaging, with both technical and business teams

- Capable of periodically providing on-call support outside of normal working hours

- Capable of working in high-pressure situations

Desired Skills:

  • Experience as a system administrator, database administrator, network administrator and/or middleware administrator
  • Experience supporting or developing applications that utilize SAN and NAS storage
  • Any experience with Amazon S3 or Hitachi HCP storage is a plus
  • Experience leaning out and automating processes aimed at improving overall efficiency and quality of the work product
  • Experience developing and/or supporting applications that leverage products from vendors such as Pega or MuleSoft
  • Familiarity with the ITIL framework
  • Bachelor’s degree in business, computer science, MIS or related field

Job Band:

H5

Shift: 

1st shift (United States of America)

Hours Per Week:

40

Weekly Schedule:

Referral Bonus Amount:

0

Learn more about this role

Full time

JR-21078555

Manages People: No

Travel: Yes, 5% of the time

Manager:

Talent Acquisition Contact:

Kathleen Jones-Griffith
