Technical Support Lead

Posted by PeopleNTech LLC

Arlington, VA

Permanent

Posted 4 days ago

Role: Technical Support Lead
Location: Remote
Rate: $75/hr
Indent: PSL(phone number removed)_1-124-1

Role Overview
We are seeking an experienced lead for AI Operations, Observability, Automation, and
AI Operation Orchestration to join our Platform Engineering team. In this highly
cross-functional role, you will be responsible for ensuring the reliability, scalability,
observability, and operational excellence of Aisera's Agentic AI platform in production
enterprise environments.
You will architect and operate the infrastructure that enables Aisera's AI agents to perform
reliably at scale: monitoring model behavior, automating operational runbooks, and
orchestrating multi-agent workflows across complex enterprise ecosystems. This is a rare
opportunity to build foundational AI Ops capabilities at the intersection of enterprise AI,
agentic automation, and large-scale production systems.

Core Technologies / Skills (Must Have)
Monitoring & APM: Azure Monitor, Application Insights, Log Analytics.
Tracing/Telemetry: OpenTelemetry SDK integration.
Metrics & Dashboards: Prometheus, Grafana, Datadog, LogicMonitor, and
SolarWinds Observability.
Log Platforms: ELK Stack (or equivalent log management platforms).
IaC: Terraform, ARM templates, or Bicep.
CI/CD: Azure DevOps and other CI/CD pipelines.
Scripting: PowerShell and Bash.

Key Responsibilities
Implement and maintain observability capabilities across environments
(monitoring, logging, tracing).
Deploy and configure monitoring/logging/tracing tools and integrations.
Implement correlation/trace context propagation patterns across services and
integration points.
Build and maintain dashboards for real-time system health visibility.
Configure alerting rules, routing, and escalation workflows aligned to operational
needs.
Automate telemetry pipelines including log aggregation, retention, and archival
policies.
Support incident response with rapid diagnostics, troubleshooting, and root cause
analysis.
Ensure observability and operational coverage during migrations/refactoring
activities.
Maintain runbooks and operational documentation to improve repeatability and
response time.
Optimize observability costs and data storage through smart sampling, retention,
and governance.

Required Experience (Must Have)
5+ years of experience in SRE, DevOps, or Platform Engineering roles.
Strong hands-on experience implementing observability tooling and practices.
Experience with infrastructure-as-code and automation.
Proven incident management and troubleshooting skills for production systems.
Experience supporting enterprise applications in production environments.

Additional Key Responsibilities
AI Operations & Platform Reliability
Own end-to-end operational health of Aisera's Agentic AI platform, including model
inference pipelines, agent orchestration services, and integration endpoints.
Define and enforce SLOs, SLAs, and error budgets for AI-powered services in
production.
Lead incident management, root-cause analysis, and post-mortem processes for AI
platform outages and degradations.
Collaborate with engineering teams to build reliability into AI features from design
through deployment.
Observability & Monitoring
Design and implement observability stacks (metrics, logs, traces, alerts) for AI
inference workloads, including LLM response quality, agent decision accuracy,
latency, and throughput.
Build dashboards and alerting frameworks to surface anomalies in AI model
behavior, intent classification accuracy, and workflow execution success rates.
Implement AI-specific observability patterns (hallucination detection, confidence
scoring trends, retrieval relevance monitoring) in real-time production pipelines.
Partner with Data Science to instrument model performance and enable feedback
loops for continuous improvement.
Automation & Runbook Engineering
Design, develop, and maintain automated remediation systems and operational
runbooks to reduce mean time to recovery (MTTR).
Build self-healing automation for common failure scenarios in AI pipelines, API
integrations, and enterprise connector workflows.
Drive automation of deployment pipelines, canary releases, and rollback
procedures for AI model updates.
Champion an 'automate-first' culture across the operations organization.
AI Operation Orchestration
Architect and manage multi-agent orchestration frameworks that coordinate
parallel AI agents across ITSM, HR, Finance, and Customer Service domains.
Design orchestration patterns for stateful, long-running agentic workflows
including agent handoff, task decomposition, escalation logic, and tool-call retry
strategies.
Define and enforce governance policies for AI agent actions, including approval
workflows, rate limiting, and audit trails for compliance.
Evaluate and integrate emerging AI orchestration technologies (LangGraph,
Temporal, Dapr, internal frameworks) to improve agent reliability and scalability.
Collaborate with Product and Solutions Engineering to deliver orchestration
capabilities that meet enterprise security and compliance requirements.
Capacity Planning & Cost Optimization
Own GPU/CPU capacity planning and cost allocation for AI inference workloads
across cloud providers (AWS, Azure, GCP).
Implement FinOps practices for AI compute, including model batching strategies,
caching layers, and tiered inference routing.
Forecast capacity needs based on customer growth, new AI feature releases, and
the enterprise onboarding pipeline.
Cross-Functional Leadership
Act as the operational bridge between AI Research, Product Engineering, Customer
Success, and Enterprise customers.
Represent AI Ops in architecture reviews, security audits, and enterprise customer
technical evaluations.
Mentor junior engineers and build a culture of operational excellence within the
platform team.

Job Type: Permanent

Job ID: 254812052
