Site Reliability Engineer II - GBS IND

Hyderabad, India

Job Description:

About Us

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. Responsible Growth is how we run our company and how we deliver for our clients, teammates, communities and shareholders every day.

One of the keys to driving Responsible Growth is being a great place to work for our teammates around the world. We’re devoted to being a diverse and inclusive workplace for everyone. We hire individuals with a broad range of backgrounds and experiences and invest heavily in our teammates and their families by offering competitive benefits to support their physical, emotional, and financial well-being.

Bank of America believes both in the importance of working together and offering flexibility to our employees. We use a multi-faceted approach for flexibility, depending on the various roles in our organization.

Working at Bank of America will give you a great career with opportunities to learn, grow and make an impact, along with the power to make a difference. Join us!

Global Business Services

Global Business Services delivers Technology and Operations capabilities to Lines of Business and Staff Support Functions of Bank of America through a centrally managed, globally integrated delivery model and globally resilient operations.

Global Business Services is recognized for flawless execution, sound risk management, operational resiliency, operational excellence and innovation.

In India, we are present in five locations and operate as BA Continuum India Private Limited (BACI), a non-banking subsidiary of Bank of America Corporation and the operating company for India operations of Global Business Services.

Process Overview

The Enterprise Cloud Platforms team in the CTI organization offers Private and Public Cloud platforms that help Bank of America’s developers achieve faster time-to-market, innovate with private and public cloud capabilities, and reduce complexity through built-in integrations. We believe in a high-quality engineering culture: we engineer our platforms with a customer and platform mindset, design for large enterprise scale and resilience, and accelerate market innovation into the technical platforms we deliver.

This position is part of the Technology Infrastructure Services (TIS) organization and falls under the Cloud Services team (ECP). As part of this team, you will have a large impact on the evolution of next-generation Cloud and Container services for Bank of America and explore an extensive list of new technologies that will drive innovation across our company.

Job Description

We are looking for an Ops/Site Reliability Engineer (SRE) for the Hybrid Cloud Container platform running on OpenShift 4.x. The individual in this role will develop SRE tools and automation for day-to-day proactive maintenance and operations of the hybrid cloud container platform.

The candidate should provide end-to-end support coverage for the platform and work on building, upgrading and maintaining OCP clusters. An understanding of, or exposure to, agile as well as ITSM incident/change/request management processes is expected. Experience implementing platform resiliency, self-healing, health and compliance dashboards, and automation for day-to-day operational tasks over hybrid cloud in an enterprise-class, production-grade environment is desired.

Responsibilities

Provide SRE support for container platforms and apply SRE knowledge to identify potential gaps in the observability design or implementation.
Work with clients and application development teams to onboard applications and integrate them with the CI/CD platform.
Provide technical expertise to configure, deploy, and support Bank workloads so that they run and operate securely in container infrastructure (K8s/Red Hat OpenShift/AKS).
Engineer new capabilities for the OpenShift/container platforms and deliver those capabilities in a fully automated and supportable fashion.
Implement cluster services to manage on-prem bare-metal OpenShift cluster deployments and off-prem deployments.
Work with monitoring tools and Application Development teams to enhance monitoring capabilities and modify monitoring dashboards for new observability plans created in support of initiatives or continuous improvement efforts.
Develop software or system scripts to simplify or eliminate the dependence on human intervention for recurring tasks.
Work with Production Support teams to perform knowledge transfer, playbook updates and training for new monitoring capabilities.
Identify vulnerabilities and opportunities for reliability improvement, such as investigating low-level error rates and 'noise' in monitoring, and help define solutions to improve system reliability.
Develop and maintain a catalog of extensible reliability scripts, tools, and libraries that can be leveraged for common instrumentation, automation and operational needs.

Requirements

Education: B.E. / B. Tech / M.E. / M. Tech / MCA

Certifications If Any: N/A

Experience Range: 8 to 10 years

Foundational Skills

Experience as a Site Reliability Engineer within large, multinational organizations, with a preference for implementations of new technologies with a proven track record of success.
Demonstrated ability to design and develop significant components within an application.
Expertise in supporting container production environments (K8s/Red Hat OpenShift) and the associated maintenance, change control, incident and problem management.
Strong experience in Linux administration, programming experience in at least one language (Python, Shell scripting, Java, etc.) and cloud-native technologies.
Strong experience in onboarding applications to container and multi-cloud platforms: Azure, AWS, GCP, IBM Cloud.
Strong experience in infrastructure automation using any of Terraform/Packer, Ansible or Python.
Understanding of, or exposure to, agile as well as ITSM incident/change/request management processes.
Experience implementing platform resiliency, self-healing, health and compliance dashboards, and automation for day-to-day operational tasks over hybrid cloud in an enterprise-class, production-grade environment is desired.
Experience in PaaS logging, monitoring, and observability tools such as ELK, FluentD, Prometheus, Splunk, Nagios, Datadog, etc.
Experience in building large scale distributed enterprise platforms with focus on performance, scale, security, and reliability
Self-motivated and results oriented with excellent analytical, problem solving, interpersonal, presentation and communication skills.
Operate in a fast-paced environment with multiple concurrent priorities

Desired Skills

Experience in designing, analyzing and troubleshooting large scale distributed systems and good understanding of multi-vendor Cloud offerings.
Experience in cloud-native network, storage, and virtualization technologies
Experience in DevOps and GitOps models with IaaS, Config-as-Code, Policy-as-Code and CI/CD tools such as Bitbucket, Jenkins, JFrog Artifactory and Ansible.
Experience with modern performance monitoring and diagnostics tools (examples: Splunk, Splunk ITSI, AppD, Dynatrace, SolarWinds, etc.)
Understand relevant application technologies and development life cycles.
Operational Process & Routines: Strong adherence to operating controls, risk management, process review and creation, documentation and collaborative knowledge sharing.
Interpersonal and communication skills.
Red Hat OpenShift/Kubernetes certifications; Cloud (Azure/AWS/GCP) certifications.
Ability to use qualitative and quantitative analytical skills to assess the effectiveness of the operations, manage competing priorities and adapt to change in project scope.
Proven ability to work independently with minimal supervision and as part of a team with direct responsibilities.

Work Timings: 06:30 AM - 03:30 PM / 12:30 PM - 09:30 PM, with weekend support on a rotational basis.

Location: Hyderabad / Gurugram
