Infrastructure admin for AI services
Job title: Infrastructure admin for AI services (Azure & AWS)
Location: Remote
$50/hr
Key Responsibilities
- Design, deploy, and manage cloud infrastructure supporting AI/ML workloads on AWS and Azure
- Manage compute and related cloud resources such as EC2, Azure Virtual Machines, GPU instances, EKS, ECS, Lambda, Kubernetes clusters, VPC, S3, and Route 53
- Provision and configure storage, networking, and security services for AI platforms
- Ensure high availability, scalability, and reliability of AI environments
- Deploy and maintain AI/ML services such as Amazon SageMaker, Microsoft Foundry, and Azure Machine Learning
- Support data scientists and ML engineers by providing optimized infrastructure for model training and deployment
- Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, and ARM templates / Bicep, and build container images with Dockerfiles
- Set up and automate environment provisioning, patching, and scaling
- Deploy and manage containerized AI workloads using Docker, Kubernetes, Amazon EKS, Azure Kubernetes Service (AKS), and ECS
- Monitor system health, performance, and resource utilization using CloudWatch, Azure Monitor, and Datadog or Prometheus
- Optimize infrastructure for cost, performance, and GPU utilization
- Implement cloud security best practices including IAM / RBAC management, network security groups, encryption, and secrets management
- Ensure compliance with organizational and regulatory standards
- Integrate AI infrastructure with CI/CD pipelines
- Support automated deployment of models and AI services
Qualifications
- Bachelor's degree in Computer Science, Information Systems, or a related field
- 5+ years of experience in infrastructure administration or cloud engineering
- Strong hands-on experience with AWS cloud services and Microsoft Azure cloud services
- Experience supporting AI/ML infrastructure or data platforms
- Proficiency with Linux administration, scripting (Python, Bash, PowerShell), and IaC tooling (Terraform, Terragrunt)
- Experience with Docker and Kubernetes
- Experience with GitHub Actions
- Experience with LLM infrastructure setup
- Experience working in a centralized team, including triaging incoming issues and requests
