Infrastructure Engineer
We are looking for an Infrastructure Engineer
We are seeking a highly skilled Infrastructure Engineer to help design, build, automate, and operate scalable, high-availability production infrastructure in a fast-paced enterprise technology environment. This individual will play a key role in driving reliability, automation, cloud infrastructure strategy, operational excellence, and AI-enabled engineering practices across mission-critical systems.
Responsibilities:
- Design, build, automate, and support large-scale, highly available cloud infrastructure environments
- Manage and optimize containerized production platforms and orchestration environments
- Develop and maintain Infrastructure as Code (IaC) solutions using tools such as Terraform or Pulumi
- Build automation tooling, operational utilities, and platform enhancements using Python or Go
- Drive infrastructure reliability, scalability, observability, and resiliency initiatives
- Partner closely with engineering, product, security, AI/ML, and platform teams to support enterprise-wide initiatives
- Implement and maintain monitoring, logging, alerting, and performance management solutions
- Troubleshoot complex production issues and proactively identify systemic risks or operational weaknesses
- Lead infrastructure improvements with a focus on reversibility, risk mitigation, and minimizing production blast radius
- Create operational standards, automation frameworks, and deployment strategies that improve engineering velocity and reliability
- Support AI-driven infrastructure operations, intelligent automation initiatives, and AI-assisted engineering workflows
- Evaluate and implement emerging AI-enabled operational tooling to improve efficiency, incident response, automation, and developer productivity
- Collaborate with engineering teams supporting AI/ML workloads, data platforms, and model deployment pipelines
- Own infrastructure initiatives end-to-end, including architecture, implementation, rollout, rollback planning, and operational support
Qualifications:
- 5 years of experience in Infrastructure Engineering, DevOps, Site Reliability Engineering, or similar roles supporting large-scale production environments
- Hands-on experience operating containerized production environments and orchestration platforms in enterprise or high-growth environments
- Strong experience with Kubernetes, Helm, and Infrastructure as Code tools such as Terraform or Pulumi
- Experience supporting cloud infrastructure environments, preferably AWS
- Proficiency in Python or Go for automation, tooling, and infrastructure development
- Strong experience with monitoring, observability, and logging platforms such as Prometheus, Grafana, ELK, or equivalent technologies
- Experience implementing resilient infrastructure designs focused on scalability, reliability, rollback strategies, and operational safety
- Strong understanding of infrastructure tradeoffs involving reliability, cost optimization, deployment velocity, and operational risk
- Demonstrated experience leveraging AI-assisted engineering tools and agentic AI workflows within day-to-day development and operational practices
- Experience utilizing AI-enabled platforms such as Claude Code, Codex, GitHub Copilot, or similar tools to improve automation, troubleshooting, deployment efficiency, and operational workflows
- Familiarity with infrastructure requirements for AI/ML environments, including compute scalability, data processing pipelines, model deployment, or GPU-enabled workloads (highly desirable)
- Excellent communication and cross-functional collaboration skills
- Strong analytical and problem-solving capabilities
- Ability to challenge assumptions, identify operational gaps, and recommend innovative infrastructure solutions
- Proven ownership mindset with experience leading infrastructure initiatives from concept through production deployment
- Strong organizational skills with the ability to prioritize and execute in fast-paced environments
- Passion for continuous improvement, emerging technologies, and modern AI-enabled operational practices
- Software engineering background with experience building and maintaining production-grade applications, services, libraries, or internal frameworks
- Ability to read, troubleshoot, and modify application codebases supporting infrastructure platforms
- Experience bridging infrastructure engineering and software development practices
- Experience building reusable platform tooling, developer enablement frameworks, or internal infrastructure products
- Experience supporting enterprise-scale cloud transformation or modernization initiatives
- Exposure to MLOps, AI infrastructure, vector databases, model serving frameworks, or intelligent automation platforms
- Experience supporting AI/ML engineering teams through scalable infrastructure and deployment automation
