AI Infra Engineer
Title: AI Infra Engineer
Duration: 10+ Months
Location: Morrisville, NC, 27560
Short Description:
This role combines IT operations, hardware troubleshooting, and AI infrastructure expertise. Expect to handle day-to-day system administration, diagnose and resolve issues, and ensure optimal performance for ML workloads.
Key Responsibilities
- Hardware Management and Troubleshooting: Monitor and maintain GPU servers/workstations, including diagnosing and resolving hardware failures (e.g., GPU faults, power issues, cooling problems). Coordinate repairs, replacements, or upgrades as needed to ensure system uptime.
- Software and Driver Management: Install, update, and configure NVIDIA drivers, the CUDA toolkit, Linux operating systems (e.g., Ubuntu or CentOS), and related dependencies. Ensure compatibility across hardware and software stacks for seamless ML operations.
- Performance Benchmarking: Run and analyze MLPerf benchmarks to evaluate system performance, identify bottlenecks, and optimize configurations for ML training tasks.
- System Diagnostics and Problem Resolution: Proactively monitor systems for issues, perform root-cause analysis on failures or performance degradation, and implement fixes. This includes debugging kernel errors, network issues, or resource contention during LLM training.
- General Infrastructure Ops: Implement best practices for security, backups, logging, and monitoring. Handle routine maintenance, such as firmware updates, patch management, and capacity planning for the GPU cluster.
Required Qualifications
- Proven experience (3+ years) in managing GPU-accelerated servers or high-performance computing (HPC) environments, preferably in AI/ML contexts.
- Strong knowledge of Linux system administration, including shell scripting, package management, and networking.
- Hands-on experience with NVIDIA CUDA toolkit, drivers, and GPU hardware (e.g., A100, H100, or similar).
- Familiarity with ML benchmarking tools like MLPerf and frameworks such as TensorFlow, PyTorch, or Hugging Face for LLM training.
- Ability to diagnose hardware and software issues using tools like nvidia-smi, dmesg, top/htop, or Prometheus/Grafana for monitoring.
- Understanding of AI infrastructure ops, including containerization (Docker/Kubernetes) and orchestration for distributed training.
- Excellent problem-solving skills with a proactive approach to preventing downtime.
Preferred Skills
- Experience with cluster management tools like Slurm, Kubernetes, or Ray for scaling ML workloads.
- Knowledge of hardware diagnostics for servers (e.g., IPMI, BIOS configuration, RAID setups).
- Background in IT operations with AI focus, such as DevOps for ML (MLOps).
- Certifications like RHCE (Red Hat Certified Engineer), NVIDIA certifications, or similar.
- Ability to work independently in a remote or on-site setup, with strong communication skills for reporting issues.
