Program Manager

Tyler, TX
Permanent

We are looking for an experienced ML Engineer with a strong background in ML who is ready to take on the tasks and responsibilities of this position.
ML Infrastructure Performance Engineer
Focus:
This role focuses on the "serving plane." The engineer will integrate high-speed inference runtimes with streaming loaders and take ownership of the performance benchmarking mandate.
Key Responsibilities:
Integrate SGLang with the Run:ai Model Streamer to enable concurrent tensor streaming directly to GPU memory, reducing model "cold start" times.
Optimize SGLang's backend runtime, leveraging features like RadixAttention for prefix caching and compressed finite-state machines for faster decoding.
Design and execute rigorous performance benchmarking suites to identify bottlenecks in the inference stack and provide code-level fixes that improve time-to-first-token (TTFT).
Required Expertise:
Proficiency in Python and experience with asynchronous programming (asyncio) for ML serving frameworks.
Experience with Ray for distributed compute and managing Reinforcement Learning (RL) workloads.
Hands-on experience with profiling tools such as NVIDIA Nsight, PyTorch Profiler, or Intel Gaudi instrumentation.

Job Type: Permanent

Job ID: 254677514