Senior SRE Platform Engineer

San Jose, CA
Permanent

Senior SRE Platform Engineer - In person interview

In This Role, You Will
Platform & Reliability Engineering
  • Embed SRE and production engineering principles into Payments Modernization from design through early life support
  • Define and validate non-functional requirements (NFRs) covering resilience, scalability, observability, recovery, and operability
  • Drive replay, retry, and exception-handling validation for event-driven payment flows
  • Lead capacity and performance testing, including volume growth and peak event scenarios (e.g. FedNow, CHIPS, SWIFT)
Service Transition & Operational Readiness
  • Own Permit-to-Operate readiness across environments (NFR Testing)
  • Define cutover, shadow support, and early life support models
  • Ensure runbooks, support procedures, on-call readiness, and escalation paths are production-grade before go-live
  • Partner with Change Assurance to apply risk-based release controls, canary/blue-green strategies, and rollback automation
Observability & Stability
  • Implement end-to-end observability across Kafka, MongoDB, API layers, and downstream payment components
  • Define and monitor SLOs, error budgets, and golden signals
  • Reduce alert noise through signal design, correlation, and automation
  • Analyze early defects and exception patterns (ACK/NACKs, business errors) to drive stabilization
Chaos Engineering & Continuous Improvement
  • Design and execute controlled failure testing (chaos engineering) to validate recovery patterns and blast radius
  • Lead blameless RCAs, ensuring corrective actions are owned and recurrence is prevented
  • Drive continuous service improvement (CSI) initiatives, including automation, resilience uplift, and technical debt reduction

Required Qualifications:
  • 4+ years of Systems Engineering or Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 2+ years of application support experience
  • 2+ years of Application Frameworks experience in Spring Boot, Spring WebFlux, etc.
  • 2+ years of Data Stores & Caching experience with MongoDB and Redis
  • 2+ years of Platform experience in Kubernetes / container orchestration
  • 2+ years of CI/CD & Automation experience in progressive delivery, automated rollback, and reliability-as-code concepts

Desired Qualifications:
  • 2+ years of Resilience experience with Resilience4j and retry/replay patterns
  • 2+ years of Observability experience with distributed tracing, metrics, logging, and SLO tooling
  • 2+ years of Testing & Resilience Validation experience with BlazeMeter and Chaos Monkey
  • Strong experience in SRE, Production Engineering, Platform Engineering, or Service Transition within a complex technology or financial services environment
  • Demonstrated ability to productionize new platforms, not just support them
  • Solid understanding of high-value payment systems (Wires, RTP, SWIFT, CHIPS, FedNow) and their operational risk profile
  • Experience working with event-driven, distributed architectures
  • Proven ability to partner with engineering teams while representing the production and operational lens
  • Comfortable operating in early-stage, ambiguous transformation environments
  • Strong communication skills, with the ability to explain technical risk to senior stakeholders

Job ID: 254864193