Robotic Reliability Systems Engineer
What We Are Looking For
We are seeking a Robotic Reliability (Systems) Engineer to drive the reliability, performance, and scalability of our autonomous warehouse platform powered by mobile robots. This is a high-impact, hands-on engineering role focused on solving complex system-level challenges across large-scale robotic fleets deployed at customer sites.
This role sits at the intersection of robotics software, hardware integration, and operational performance. The primary objective is to diagnose, resolve, and prevent system-level issues, ensuring our robotic systems operate reliably and consistently meet customer performance KPIs.
We are looking for a technically strong, data-driven engineer who thrives in complex, real-world environments and can translate ambiguous system behaviors into structured analysis and actionable engineering improvements.
Responsibilities
- Fleet-Scale System Reliability
  - Identify, triage, and root-cause system-level issues impacting large-scale robotic fleets.
  - Drive improvements in system reliability, availability, and performance across thousands of deployed robots.
  - Define and monitor system performance guardrails tied to customer KPIs (throughput, error rates, recovery time, uptime).
  - Partner with field teams to debug and resolve production issues in live environments.
- End-to-End Systems Debugging & Integration
  - Work across robotics software, hardware, controls, perception, and infrastructure to diagnose complex system interactions.
  - Debug issues spanning embedded systems, distributed services, real-time control loops, and operational workflows.
  - Collaborate with cross-functional teams to drive fixes and long-term solutions.
  - Contribute to system design improvements that enhance robustness, fault tolerance, and scalability.
- Data-Driven Performance Optimization
  - Analyze robot logs, telemetry, and diagnostics data to identify failure modes and performance bottlenecks.
  - Build and use tools (SQL, Python, dashboards) to investigate trends and validate hypotheses.
  - Develop mechanisms for regression detection, failure trend analysis, and performance monitoring.
  - Drive continuous improvement through structured experiments and data-backed decisions.
- Operational Excellence & Continuous Improvement
  - Own reliability metrics and contribute to improving system observability and debuggability.
  - Document failure modes, learnings, and standard operating procedures for issue resolution.
  - Support release validation and help ensure changes meet reliability and performance expectations.
  - Act as a technical escalation point for complex system issues.
Qualifications
- 5+ years of experience in robotics, automation, or complex distributed systems engineering.
- Strong systems engineering mindset with experience in robotics control software, real-time systems, and hardware-software integration.
- Demonstrated experience in structured root-cause analysis and failure investigation.
- Proficiency in data analysis and scripting (Python, SQL, or similar).
- Experience working with logs, telemetry systems, and large-scale operational data.
- Familiarity with Linux environments and version control systems (Git).
- Experience working in production environments with deployed systems (not just lab prototypes).
- Strong problem-solving skills and ability to work across ambiguous, cross-functional system boundaries.
- Experience in Agile development environments.
