Senior SRE Platform Engineer

Posted by PeopleNTech LLC

San Jose, CA

Permanent

Posted 3 days ago

Senior SRE Platform Engineer - In-person interview

In This Role, You Will:

Platform & Reliability Engineering

Embed SRE and production engineering principles into Payments Modernization from design through early life support
Define and validate non-functional requirements (NFRs) covering resilience, scalability, observability, recovery, and operability
Drive replay, retry, and exception-handling validation for event-driven payment flows
Lead capacity and performance testing, including volume growth and peak event scenarios (e.g., FedNow, CHIPS, SWIFT)

Service Transition & Operational Readiness

Own Permit-to-Operate readiness across environments, including NFR testing
Define cutover, shadow support, and early life support models
Ensure runbooks, support procedures, on-call readiness, and escalation paths are production-grade before go-live
Partner with Change Assurance to apply risk-based release controls, canary/blue-green strategies, and rollback automation

Observability & Stability

Implement end-to-end observability across Kafka, MongoDB, API layers, and downstream payment components
Define and monitor SLOs, error budgets, and golden signals
Reduce alert noise through signal design, correlation, and automation
Analyze early defects and exception patterns (ACK/NACKs, business errors) to drive stabilization

Chaos Engineering & Continuous Improvement

Design and execute controlled failure testing (chaos engineering) to validate recovery patterns and blast radius
Lead blameless RCAs, ensuring corrective actions are owned and recurrence is prevented
Drive continuous service improvement (CSI) initiatives, including automation, resilience uplift, and technical debt reduction

Required Qualifications:

4+ years of Systems Engineering or Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, or education
2+ years of application support experience
2+ years of application framework experience with Spring Boot, Spring WebFlux, or similar
2+ years of data store and caching experience with MongoDB and Redis
2+ years of platform experience with Kubernetes or other container orchestration
2+ years of CI/CD and automation experience, including progressive delivery, automated rollback, and reliability-as-code concepts

Desired Qualifications:

2+ years of resilience experience with Resilience4j and retry/replay patterns
2+ years of observability experience: distributed tracing, metrics, logging, and SLO tooling
2+ years of testing and resilience validation experience (BlazeMeter, Chaos Monkey)
Strong experience in SRE, Production Engineering, Platform Engineering, or Service Transition within a complex technology or financial services environment
Demonstrated ability to productionize new platforms, not just support them
Solid understanding of high-value payment systems (Wires, RTP, SWIFT, CHIPS, FedNow) and their operational risk profile
Experience working with event-driven, distributed architectures
Proven ability to partner with engineering teams while representing the production and operational lens
Comfortable operating in early-stage, ambiguous transformation environments
Strong communication skills, with the ability to explain technical risk to senior stakeholders

Job Type: Permanent

Job ID: 254864193