Monitoring Engineering Production Services Specialist II
Job Description:
At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.
Being a Great Place to Work and providing a culture of caring is core to how we drive Responsible Growth. We are intentional about fostering an inclusive workplace where every teammate has the opportunity to succeed, build a career and contribute to our shared success. This includes attracting and developing exceptional talent, recognizing and rewarding performance, and supporting our teammates’ physical, emotional, and financial wellness through affordable, competitive and flexible benefits.
We value the unique perspectives individuals bring from all backgrounds and career paths - whether shaped by military service, community college education, or a wide range of work and life experiences. These journeys foster resilience, leadership and innovation, strengthening our workforce and positively impacting the communities we serve.
Bank of America is committed to an in-office culture that supports collaboration, engagement, and career development. Our approach includes clear in-office expectations, while providing an appropriate level of flexibility based on role-specific responsibilities and business needs.
At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!
Job Description:
This job is responsible for providing support to end users and responding to issues related to incident and problem management for multiple applications, focusing on leading triage activities on all business-impacting incidents. Key responsibilities include ensuring compliance with incident management and problem management policies and procedures. Job expectations include serving as a key focal point for the customer, client, and associate experience and restoring any impacts to those experiences regardless of where the root cause of the impact lies.
Responsibilities:
- Leads production support triage efforts, manages bridge line troubleshooting, engages in technical research, and escalates issues to leadership as needed
- Ensures all impacts are accurately recorded and documented in the system of record, verifies documents and wikis are updated and available for use during triage, and supports on-call responsibilities for incidents, the documentation of application flows, impacts during outages, the customer experience, and contacts for support needs
- Provides status updates and technical detail for awareness communications, such as infrastructure, application and client impact, and component points of failure, oversees accuracy of all communications sent, and ensures any necessary reconvenes are scheduled
- Identifies business impact, interprets monitors, dashboards, and logs, and writes queries to accurately calculate and communicate impacts to leadership in partnership with senior team members or specialists within Technology Services
- Promotes and enforces production governance during triage/testing, identifies production failure scenarios, vulnerabilities, and opportunities for improvement, determines appropriate actions, and escalates issues as needed
- Analyzes, manages, and coordinates incident management activities to detect problems that potentially affect the service level
- Fulfills research requests, ad hoc reports, and offline incidents at the direction of senior team members or the Technology/Production Services teams
Required Qualifications
- Hands-on experience with Splunk (search, SPL, dashboards, alerts, data onboarding, and tuning).
- Hands-on experience with Dynatrace (APM, services/entities, alerting profiles, management zones, dashboards).
- Strong understanding of monitoring and observability concepts: logs, metrics, traces, events, and correlation.
- Experience supporting production systems and participating in incident management and operational support.
- Knowledge of SRE concepts such as reliability engineering, alert hygiene, post-incident reviews, and automation.
- Experience working with ITSM processes (incident, problem, change) and tracking SI actions to closure.
- Basic to intermediate scripting experience (e.g., Python, Shell) for automation and analysis.
- Strong communication skills and ability to work across distributed teams in the APAC region.
Desired Qualifications
- Experience with advanced Splunk or Dynatrace features (custom metrics, anomaly detection, DQL/SPL optimization, synthetic monitoring).
- Experience integrating monitoring tools with ServiceNow or similar ITSM platforms.
- Familiarity with capacity monitoring, performance engineering, or business transaction monitoring.
- Relevant certifications (Splunk, Dynatrace, SRE/DevOps, Cloud) are a plus.
Skills:
- Adaptability
- Analytical Thinking
- Influence
- Production Support
- Risk Management
- Automation
- Collaboration
- Innovative Thinking
- Result Orientation
- Solution Design
- Business Acumen
- DevOps Practices
- Project Management
- Solution Delivery Process
- Stakeholder Management
Shift:
1st shift (United States of America)
Hours Per Week:
40