Resiliency and Site Reliability Expert
Vanguard
Malvern, PA 19355
Posted 9 months ago
-
Job Type(s)
Full Time
-
Industry
Engineering
-
Job Description
At Vanguard, we pride ourselves on delivering an exceptional client experience to all investors; at the core of this experience are systems that reside in a technically complex and constantly evolving resiliency landscape. Passionate, technically skilled engineers are at the center of our resiliency operations, and we are looking to grow our team.
As a Lead Resiliency and Reliability Engineer at Vanguard, you will play a critical role in solving impactful operational problems. You will think creatively to find opportunities to improve system performance and efficiency, scalability, fault tolerance, and self-healing capabilities. You'll apply Chaos Engineering principles to challenge our systems and discover hidden weaknesses, all while understanding the big picture of how our systems work together to create the ultimate client experience.
Core Responsibilities:
- Instruments, enhances, and advocates for system observability. Identifies and develops solutions to bridge system observability gaps.
- Collaborates with internal teams to evaluate the health, stability, and reliability of systems/platforms. Looks for opportunities to improve system availability, performance, efficiency, and resiliency.
- Develops and communicates new standards and newly available tools and frameworks across subdivisions. Enforces reliability standards. Designs and develops new automated solutions for reliability.
- Contributes to centrally managed (IT-wide) inner source libraries for reliability, such as the OpenTelemetry wrapper libraries.
- Provides technical leadership, consultancy, and coaching on designing and implementing both traditional and serverless architectures in AWS, with an emphasis on repeatability, scaling options, resilience, reliability, telemetry, networking, and design patterns for resilient systems.
- Leads failure modes analysis spanning product families when new features and architecture patterns are introduced. Leads cross-product or cross-subdivision chaos experimentation. Facilitates post-incident reviews for any high severity client impacting events local to the product family.
- Designs, reviews, and coaches others on performance tests using appropriate components (e.g., requests per minute, number of threads, the construction of a request with headers and cookies).
- Consults, reviews, coaches, and influences architectural decisions, including non-functional aspects, proposing potential technical solutions/enhancements, and explaining convincingly which is better and why.
- Contributes to or leads Reliability Engineering and Resilience communities of practice. Remains informed about site reliability engineering activities happening within the subdivision.
- Provides technical leadership, guidance, consulting, training, and governance on SRE to one or more product families in a subdivision. Works with product owners and teams to set subdivision goals for higher availability and SRE impact, and tracks progress toward achieving them.
- Identifies opportunities to automate away toil and develops solutions, monitors error budget exhaustion rates, configures auto scaling thresholds for the product, and incorporates resilience patterns, such as circuit breakers, into the application code. Develops complex deployment and/or routing strategies for high availability.
- Maintains and looks for opportunities to improve the centralized incident response playbook for the subdivision, which documents standards for managing communication and escalation during an incident. Oversees blameless post-incident reviews for high-severity incidents involving multiple product families.
- Onboard and train new SRE Practitioners and Leads within the subdivision
- Maintain and enforce subdivisional reliability engineering standards
- Communicate new standards and newly available tools and frameworks back to other SREs within their subdivisions, for example, news regarding observability tools, libraries for resilience patterns, and emerging cloud platforms.
- Aggregation of quantifiable data about availability to report back to senior leadership
- Coordination of cross-product and/or cross-subdivision Chaos experimentation
- Maintenance of any centralized incident response playbooks for the subdivision
- Note: These differ significantly from runbooks. Runbooks document the step-by-step process to recover a specific component within a system. Incident Playbooks document standards for managing communication and escalation during an incident, including handoffs to other teams.
- Facilitation of blameless post-incident reviews for high severity incidents or incidents involving more than one product family
- Regularly attend any Reliability Engineering and Resilience communities of practice
- Host Retail subdivision Reliability Engineering community of practice
- Remain informed about SRE activities happening within the subdivision
What it Takes:
- Minimum of eight years related work experience, with at least three years of development experience.
- Undergraduate degree or equivalent combination of training and experience. Graduate degree preferred.
- Full-stack development experience, JDK 8+ preferred, with Spring Boot, REST APIs, multithreaded and multiprocessing applications, and GraphQL. Experience with UI development (familiarity with Angular, TypeScript, NodeJS, etc.) is a plus.
- Ability to diagnose and resolve problems in high-throughput applications.
- Experience with one or more observability frameworks or tools, such as OpenTelemetry (Java, JS, etc.), CloudWatch, Grafana, or Splunk.
- Exposure to *nix environments including some shell script development and basic command execution.
- Strong understanding of database principles and working knowledge in distributed storage and infrastructural solutions.
- Experience with container management (such as Docker) and microservices architectures in cloud and on-premises infrastructure.
- Working knowledge of AWS network foundations, application networking, edge, and network security.
- Excellent communication and documentation skills.
Special Factors:
- Vanguard is not offering visa sponsorship for this position.
Special Factors
Sponsorship
Vanguard is offering visa sponsorship for this position.
About Vanguard
We are Vanguard. Together, we're changing the way the world invests.
For us, investing doesn't just end in value. It starts with values. Because when you invest with courage, when you invest with clarity, and when you invest with care, you can get so much more in return. We invest with purpose and that's how we've become a global market leader. Here, we grow by doing the right thing for the people we serve. And so can you.
We want to make success accessible to everyone. This is our opportunity. Let's make it count.
Inclusion Statement
Vanguard's continued commitment to diversity and inclusion is firmly rooted in our culture. Every decision we make to best serve our clients, crew (internally employees are referred to as crew), and communities is guided by one simple statement: Do the right thing.
We believe that a critical aspect of doing the right thing requires building diverse, inclusive, and highly effective teams of individuals who are as unique as the clients they serve. We empower our crew to contribute their distinct strengths to achieving Vanguard's core purpose through our values.
When all crew members feel valued and included, our ability to collaborate and innovate is amplified, and we are united in delivering on Vanguard's core purpose.
Our core purpose: To take a stand for all investors, to treat them fairly, and to give them the best chance for investment success.
How We Work
Vanguard has implemented a hybrid working model for the majority of our crew members, designed to capture the benefits of enhanced flexibility while enabling in-person learning, collaboration, and connection. We believe our mission-driven and highly collaborative culture is a critical enabler to support long-term client outcomes and enrich the employee experience.