AI Infra engineer

Raleigh, NC

Permanent

Posted 5 days ago

Title: AI Infra Engineer
Duration: 10+ Months
Location: Morrisville, NC, 27560

Short Description:
This role combines IT operations, hardware troubleshooting, and AI infrastructure expertise. Expect to handle day-to-day system administration, diagnose and resolve issues, and ensure optimal performance for ML workloads.

Key Responsibilities

- Hardware Management and Troubleshooting: Monitor and maintain GPU servers and workstations, including diagnosing and resolving hardware failures (e.g., GPU faults, power issues, cooling problems). Coordinate repairs, replacements, or upgrades as needed to ensure system uptime.
- Software and Driver Management: Install, update, and configure NVIDIA drivers, the CUDA toolkit, Linux operating systems (e.g., Ubuntu or CentOS), and related dependencies. Ensure compatibility across hardware and software stacks for seamless ML operations.
- Performance Benchmarking: Run and analyze MLPerf benchmarks to evaluate system performance, identify bottlenecks, and optimize configurations for ML training tasks.
- System Diagnostics and Problem Resolution: Proactively monitor systems for issues, perform root-cause analysis on failures or performance degradation, and implement fixes. This includes debugging kernel errors, network issues, and resource contention during LLM training.
- General Infrastructure Ops: Implement best practices for security, backups, logging, and monitoring. Handle routine maintenance, such as firmware updates, patch management, and capacity planning for the GPU cluster.

Required Qualifications

- Proven experience (3+ years) in managing GPU-accelerated servers or high-performance computing (HPC) environments, preferably in AI/ML contexts.
- Strong knowledge of Linux system administration, including shell scripting, package management, and networking.
- Hands-on experience with the NVIDIA CUDA toolkit, drivers, and GPU hardware (e.g., A100, H100, or similar).
- Familiarity with ML benchmarking tools like MLPerf and frameworks such as TensorFlow, PyTorch, or Hugging Face for LLM training.
- Ability to diagnose hardware and software issues using tools like nvidia-smi, dmesg, top/htop, or Prometheus/Grafana for monitoring.
- Understanding of AI infrastructure ops, including containerization (Docker) and orchestration (Kubernetes) for distributed training.
- Excellent problem-solving skills with a proactive approach to preventing downtime.

Preferred Skills

- Experience with cluster management tools like Slurm, Kubernetes, or Ray for scaling ML workloads.
- Knowledge of hardware diagnostics for servers (e.g., IPMI, BIOS configuration, RAID setups).
- Background in IT operations with an AI focus, such as DevOps for ML (MLOps).
- Certifications like RHCE (Red Hat Certified Engineer), NVIDIA certifications, or similar.
- Ability to work independently in a remote or on-site setup, with strong communication skills for reporting issues.

Job Type: Permanent

Job ID: 254737557