Site Reliability Engineer

Plano, Texas

About us:

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. Responsible Growth is how we run our company and how we deliver for our clients, teammates, communities and shareholders every day.

One of the keys to driving Responsible Growth is being a great place to work for our teammates around the world. We’re devoted to being a diverse and inclusive workplace for everyone. We hire individuals with a broad range of backgrounds and experiences and invest heavily in our teammates and their families by offering competitive benefits to support their physical, emotional, and financial well-being.

Bank of America believes both in the importance of working together and offering flexibility to our employees. We use a multi-faceted approach for flexibility, depending on the various roles in our organization.

Working at Bank of America will give you a great career with opportunities to learn, grow and make an impact, along with the power to make a difference. Join us!

Job Description:


This job is responsible for partnering with engineering and technology teams to implement measures as prescribed by lead/senior Site Reliability Engineers. Key responsibilities include ensuring appropriate instrumentation, tooling, ticketing, alerting and on-call routines are in place for key services, identifying root causes of issues through production triage efforts, and suggesting code enhancements to technology teams to automate services and improve reliability and efficiency. Job expectations include using software development skills to improve efficiency and to address gaps in reliability.

Overview:

This position is for a Site Reliability Engineer (SRE) who provides 24x7 application support for CrowdStrike Falcon on Linux and Windows operating systems. The candidate should also have experience in diagnosing performance-related issues and escalating them to a third-party vendor for review and remediation. It is preferable that the candidate have experience in a large corporation and 5+ years of experience in supporting enterprise-level applications. This role also requires working with other enterprise-level business and administrative groups and communicating effectively, both in speech and in writing.

Responsibilities:

  • Develops and maintains reliability scripts, tools and libraries, leverages them for common instrumentation, automation, and operational needs, and uses them when mentoring other Site Reliability Engineers (SREs) on reliability practices and established tools/capabilities
  • Collaborates with Development and Infrastructure teams to understand technical solutions and implement monitoring capabilities outlined in the application and system monitoring designs put forward by the SRE Lead
  • Partners to implement code changes to make use of common reliability libraries and tools and helps Application Production Services and Application Development teammates understand how to use them
  • Identifies vulnerabilities and opportunities for reliability improvement, such as investigating low-level error rates and 'noise' in monitoring, and defines solutions to reduce manual support effort and/or improve system reliability
  • Engages as a subject matter expert in major incident triage efforts and failure scenario modelling, and works with the Problem Manager to diagnose root causes for major incident/problem management investigations
  • Participates regularly in an on-call rotation with Production Support teammates to learn more about reliability issues affecting their portfolio

Required Qualifications:

  • 5+ years of experience in supporting CrowdStrike Falcon and other enterprise security scanning and patching solutions
  • Experience with enterprise monitoring and reporting tools and providing 24x7x365 support
  • Ability to work with product managers and development leads to design, build and maintain enterprise solutions
  • Proficiency in Linux and Windows

Desired Qualifications:

  • Development languages (Java, Python)
  • Additional security scanning and patching solutions (BladeLogic, Microsoft SCCM, Tanium, BMC Atrium Orchestrator)
  • Remedy ITSM
  • Ansible Tower, BladeLogic, BMC Atrium Orchestrator
  • Monitoring tools (Tivoli ITM, SiteScope, Dynatrace)

Skills:

  • Analytical Thinking
  • Automation
  • Collaboration
  • Production Support
  • Result Orientation
  • Application Development
  • Architecture
  • Influence
  • Project Management
  • Solution Design
  • Adaptability
  • DevOps Practices
  • Risk Management
  • Solution Delivery Process
  • Stakeholder Management

Shift:

1st shift (United States of America)

Hours Per Week: 

40

Learn more about this role

Full time

JR-24040332

Manages People: No

Travel: Yes, 5% of the time

Street Address

Primary Location:
7105 Corporate Dr, Plano, TX 75024