Site-Reliability Engineer
Our Client, an IT Services and Consulting company, is looking for a Site-Reliability Engineer for their Scottsdale, AZ location.
Requirements:
- Service reliability/operation experience running large-scale, high-performance applications in a hybrid environment (on-prem and cloud).
- Experience writing automation scripts and building dashboards for application performance management (APM) to track transaction journeys.
- Experience working with programming languages such as Go, Python, Java, or Rust.
- Working knowledge of one or more databases: Oracle, SQL Server, Redis, ClickHouse, PostgreSQL, MongoDB, or any time-series database.
- Experience transitioning platforms to the cloud and with containerization on GCP and Rancher.
- Experience maintaining containerized applications in GKE/RKE/AKE environments.
- Experience implementing cloud observability using OpenTelemetry (OTel) to enable real-time monitoring, distributed tracing, and incident resolution.
- Experience working with a GraphQL framework (Apollo, Prisma, Hasura, etc.).
- Experience applying knowledge of networking (TCP/IP, HTTP, DNS, load balancing, and service mesh) to troubleshoot issues in high-pressure situations.
- Proven experience managing application availability, building creative solutions to automate repetitive activities, and improving gating and detection for applications at every touchpoint of a 24x7 high-availability platform exposed to critical clients and customers.
- Working knowledge of monitoring tools: Splunk, AppDynamics, Grafana/Prometheus, and Dynatrace.
- Experience with tools like Rally, Confluence and other CI/CD extenders.
- Hands-on experience implementing in-memory caching solutions. Experience with Redis is a plus.
- Excellent debugging skills across a variety of integrated technical platforms on an API gateway.
- Hands-on experience with GCS, Cloud SQL, Spanner, and Firestore.
- Extensive experience in enterprise-level infrastructure and operations.
- Experience in high availability and distributed systems, and in Linux and Windows administration, troubleshooting, and support.
- Experience monitoring and troubleshooting HashiCorp Vault environments, ensuring minimal downtime and rapid recovery from incidents.
- Working knowledge of Vertex AI, generative AI, and BigQuery
- Google Cloud Platform (GCP), containerization, and Kubernetes
- Infrastructure as Code (Terraform), CI/CD (GitHub Actions), and Helm
- Automation and scripting using Python, Ansible, and Node.js
- Monitoring and observability with Prometheus and Grafana
- Linux systems and troubleshooting
- Years of experience required: 7+
- 14.00 Years of Experience
Why Should You Apply?
- Health Benefits
- Referral Program
- Excellent growth and advancement opportunities
