Skip to main content

Systems Engineer

Charlottesville, VA
Permanent

Posted

*Work with Progression, Inc. get your application bumped to the front of the line*

HPC Engineer
Charlottesville, VA

MUST:
Active TS/SCI
5+ years of Linux experience including command line system usage, scripting, and troubleshooting applications in multi-user server environments.
Professional experience administering or supporting command line Linux systems (RHEL derivatives preferred).
Experience developing scripts using Bash, Python, or similar scripting languages.
Experience troubleshooting software execution issues in distributed computing environments.
Working knowledge of job scheduling systems such as Slurm, PBS, Torque, or similar platforms.
Experience supporting users in technical computing or engineering environments.
Strong troubleshooting and analytical skills.
Experience as a user or administrator of HPC clusters.
Experience supporting parallel computing frameworks such as MPI, OpenMP, or CUDA based GPU workloads.
Experience supporting scientific or engineering applications requiring large scale compute resources.
Experience using performance monitoring and optimization tools for compute workloads.
Experience compiling applications using C, C++, Fortran, or Python based environments.
Experience working in classified computing environments.
Experience supporting GPU enabled workloads.
Ability to obtain DoD 8140 (8570) IAT Level II certification

DUTIES:
Provide user support for computational workloads running on HPC clusters in classified and unclassified environments.
Assist users in developing, submitting, and troubleshooting scheduler job scripts for systems such as Slurm or PBS, including resource allocation for CPU, GPU, and distributed compute workloads.
Troubleshoot slow, hanging, or failing HPC jobs including MPI based distributed workloads, GPU jobs, and large scale parallel applications.
Support users compiling and executing scientific, modeling, or data processing applications within Linux based HPC environments.
Provide guidance on HPC best practices for job scheduling, compute resource allocation, and workload performance.
Monitor workload execution patterns and provide guidance to improve cluster throughput and resource utilization.
Develop scripts or tools using Bash or Python to automate common operational tasks.
Maintain documentation and knowledge base articles describing system capabilities, job execution procedures, and troubleshooting guidance.
Support performance analysis of compute workloads to identify inefficiencies or configuration issues.
Coordinate with HPC systems engineers when infrastructure or cluster configuration issues impact workload performance.
Provide responsive on site support for users executing HPC workloads in mission environments.
Maintain source controlled scripting and tools using Git or similar version control platforms.
*Progression Inc. is an affirmative action/equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, status as a protected veteran, or status as an individual with a disability.* #INDPRO

Job Type: Permanent

Job ID: 254901971