Monitoring and Observability Engineer
Job Title: Monitoring and Observability Engineer
Duration: 12+ Months (Possible extension)
Location: New York, NY 10286
Onsite Role (4 days a week)
Responsibilities:
Seeking a Cloud Monitoring and Observability Engineer to own the design, implementation, and continuous improvement of observability solutions for critical applications at the firm. The ideal candidate brings deep expertise across leading observability platforms (spanning APM, NPM, infrastructure monitoring, and cloud-native telemetry) and applies industry best practices to deliver reliable, actionable insight across every layer of the stack.
- Architect and continuously refine a unified observability strategy covering the application, infrastructure, network, and user-experience layers of Azure-hosted and on-premises applications, proactively identifying coverage gaps and driving improvements without waiting to be asked.
- Integrate Azure-native telemetry (Azure Monitor, Log Analytics/KQL, Application Insights) with enterprise platforms (AppDynamics, Dynatrace, ThousandEyes, SolarWinds, Prometheus/Grafana) to deliver correlated, cross-domain visibility.
- Define and enforce telemetry standards (metrics, logs, and distributed traces) aligned to SLIs and SLOs; establish data collection pipelines using OpenTelemetry and equivalent frameworks.
- Build and maintain high-signal dashboards, synthetic tests, and alerting workflows; rigorously tune thresholds, anomaly detection, and de-duplication to maximize signal-to-noise ratio.
- Instrument services with APM tooling for business-transaction tracing, dependency mapping, code-level diagnostics, and root-cause analysis.
- Implement network performance and digital-experience monitoring (ThousandEyes, NetScout), including path visualization, BGP/DNS tests, and endpoint-agent configuration, to correlate network health with application performance.
- Embed observability into CI/CD and infrastructure-as-code workflows so every new service launches with monitoring from day one.
- Author and maintain runbooks, escalation paths, and post-incident review artifacts; lead data-driven root-cause analysis and remediation during incidents.
- Perform capacity and performance trend analysis; deliver actionable recommendations for optimization, right-sizing, and resilience hardening.
- Ensure all monitoring solutions satisfy security and compliance requirements; maintain audit-ready documentation and evidence.
Qualifications:
- 5+ years designing and operating enterprise monitoring/observability for cloud or hybrid environments, including mission-critical applications.
- Demonstrated ability to work independently: diagnosing complex, cross-domain issues, proposing solutions, and driving them to completion with minimal oversight.
- Proven, production-level expertise with at least one tool in each category (or equivalent):
  - APM: AppDynamics, Dynatrace, or New Relic, including business-transaction tracing, service maps, anomaly detection, and alert-policy design.
  - NPM / Digital Experience Monitoring: ThousandEyes and NetScout, including synthetic testing, path visualization, and WAN/internet performance analysis.
  - Infrastructure Monitoring & Event Management: SolarWinds, Datadog, Moogsoft, BigPanda, or Prometheus/Grafana, including availability/capacity dashboards, alert routing, and event correlation.
  - Azure Monitoring: Azure Monitor, Log Analytics (KQL), and Application Insights, with third-party integration experience.
- Strong grasp of observability best practices: distributed tracing, structured logging, metric cardinality management, and OpenTelemetry instrumentation pipelines.
- Scripting and automation proficiency (PowerShell, Python, or Bash) for agent deployment, monitoring-as-code, synthetic-test creation, and reporting.
- Solid networking fundamentals (DNS, BGP, HTTP, TLS, TCP/IP) and the ability to correlate application and network telemetry for end-to-end troubleshooting.
- Working understanding of CI/CD best practices using Git-backed pipelines, including gated merge requests, automated testing stages, and progressive deployment strategies to ensure changes are consistently validated before reaching production.
- Clear, concise documentation and communication skills; ability to translate complex observability data into actionable guidance for engineering, operations, and leadership stakeholders.
- Experience in regulated industries (financial services, government, healthcare) with compliance-aware monitoring design.
- Familiarity with log aggregation and SIEM/SOAR platforms (Splunk, Elastic) and their integration with APM/NPM tooling.
- ITSM platform integration experience (e.g., ServiceNow) for incident, change, and problem management workflows.
- Hands-on infrastructure-as-code experience (ARM/Bicep/Terraform) with observability baked into deployment templates.
- Grounding in SRE practices: error budgets, reliability reviews, and capacity/performance planning.
- Ability to write instrumentation and automation code in one or more of the following:
  - Java: OpenTelemetry SDK/agent integration, custom instrumentation, APM tagging.
  - .NET (C#): ASP.NET service instrumentation, auto-instrumentation configuration, custom exporters and health probes.
  - Python: Automation scripts, custom collectors/exporters, synthetic tests, and monitoring API integration.
