AWS wires its DevOps Agent into PagerDuty incidents

AWS has paired its DevOps Agent with PagerDuty, the tool most on-call engineers see first when something breaks. AWS announced the integration as generally available on the AWS DevOps Blog on June 19. When an incident fires in PagerDuty, the agent runs an automated investigation across AWS and a defined set of third-party observability tools, and writes its findings back onto the incident record. The pitch is that the operator no longer leaves PagerDuty to find out why the alert fired. That holds only to the degree the wiring is complete; outside the listed integrations the agent is blind.

The framing AWS uses for the pain is operator-shaped: when a page lands, engineers cross-reference dashboards, logs and recent deploys before they have a hypothesis to test. The post calls that step "manual correlation" and identifies it as the place where resolution time balloons.

The plumbing

Per the post, PagerDuty and the DevOps Agent are connected through a native OAuth 2.0 link. The agent is triggered by the incident itself rather than by a human invoking it, runs a root-cause investigation, generates a mitigation plan, and writes the result back to the PagerDuty incident timeline.

Data sources the agent correlates, as listed in the announcement:

  • Amazon CloudWatch metrics
  • AWS CloudTrail logs
  • AWS X-Ray traces
  • Application topology
  • Deployment history
  • Third-party observability — Datadog, Splunk, New Relic, Dynatrace

A separate PagerDuty MCP Server is called out as in early access for Advance-tier customers; that part is not GA.

What this changes on call

The operational read is narrow but real. Two things move.

The context-gathering tab-switch collapses into one place. By the time the on-call has finished reading the page, a candidate hypothesis is sitting in the incident timeline. That trims the front edge of MTTR, which is where most teams have the largest variance — the time between page and first plausible theory.
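One way to check whether that front edge actually shrinks is to measure it before and after turning the integration on. The sketch below is a minimal illustration, assuming you can export a page timestamp and a first-hypothesis timestamp per incident; the function name and the sample data are hypothetical and do not come from the announcement.

```python
from datetime import datetime
from statistics import mean, pstdev


def time_to_first_hypothesis(incidents):
    """Return minutes from page to first recorded hypothesis, per incident."""
    return [
        (first_hypothesis - paged).total_seconds() / 60
        for paged, first_hypothesis in incidents
    ]


# Hypothetical sample data: (paged_at, first_hypothesis_at)
SAMPLE = [
    (datetime(2026, 6, 1, 2, 0), datetime(2026, 6, 1, 2, 12)),
    (datetime(2026, 6, 3, 14, 30), datetime(2026, 6, 3, 15, 10)),
    (datetime(2026, 6, 9, 23, 5), datetime(2026, 6, 9, 23, 13)),
]

minutes = time_to_first_hypothesis(SAMPLE)
# The spread matters as much as the mean: the article's claim is about variance.
print(f"mean={mean(minutes):.1f} min, spread={pstdev(minutes):.1f} min")
```

Tracking the spread alongside the mean is deliberate, since a lower average with the same worst-case nights would not change much for the people carrying the pager.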

Deployment-history correlation also stops being a step an engineer has to remember. "Did we ship something around the same time?" is the first question on every incident review; pulling it automatically every time turns "was it a deploy?" from a discipline into a default signal.
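The check itself is simple, which is why it is a good candidate for automation. The sketch below shows the idea under stated assumptions: the Deploy record, the deploys_before helper and the 30-minute window are all hypothetical choices for illustration, not how the agent is documented to work.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Deploy(NamedTuple):
    service: str
    finished_at: datetime


def deploys_before(incident_at, deploys, window=timedelta(minutes=30)):
    """Return deploys that finished within `window` before the incident, newest first."""
    hits = [
        d for d in deploys
        if timedelta(0) <= incident_at - d.finished_at <= window
    ]
    return sorted(hits, key=lambda d: d.finished_at, reverse=True)


# Hypothetical deployment history for one afternoon.
HISTORY = [
    Deploy("search", datetime(2026, 6, 3, 12, 0)),     # too old for the window
    Deploy("payments", datetime(2026, 6, 3, 14, 5)),   # 25 minutes before the page
    Deploy("checkout", datetime(2026, 6, 3, 14, 18)),  # 12 minutes before the page
    Deploy("cart", datetime(2026, 6, 3, 14, 45)),      # after the page, ignored
]

suspects = deploys_before(datetime(2026, 6, 3, 14, 30), HISTORY)
print([d.service for d in suspects])
```

A deploy inside the window is only a lead, not a cause; the value is that the lead shows up without anyone having to remember to look.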

What it does not solve

A few caveats, and they are the ones any auto-triage agent inherits.

The agent only sees what it is wired into. If your reliability story leans on a signal that is not on the list (a feature-flag service, queue depths, a CDN outside the supported set), the root-cause writeup can look confident and still miss it. AWS names four observability vendors and its own CloudWatch, CloudTrail and X-Ray. Everything else is on you to plumb, and the announcement does not document a public extension point.
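A quick inventory makes the blind spots explicit before an incident does. The sketch below compares a team's signal sources against the vendors and AWS services named in the announcement; the blind_spots helper and the example dependency names are hypothetical, and the listed set reflects only what this post mentions.

```python
# Sources named in the announcement, lowercased for comparison.
LISTED_SOURCES = {
    "cloudwatch", "cloudtrail", "x-ray",
    "datadog", "splunk", "new relic", "dynatrace",
}


def blind_spots(dependencies):
    """Return the dependencies the announcement does not list, sorted by name."""
    normalized = {name.strip().lower() for name in dependencies}
    return sorted(normalized - LISTED_SOURCES)


# Hypothetical list of where one team's incident signals actually live.
OUR_SIGNALS = ["Datadog", "CloudWatch", "feature-flag service", "CDN logs"]

print(blind_spots(OUR_SIGNALS))
```

Anything this returns is a place where the agent's writeup should be read with extra suspicion until the data is plumbed in some other way.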

There is also no signal in the announcement about how the agent behaves when several incidents collide. A real on-call hour is rarely one alert at a time. How the agent prioritises across overlapping incidents, whether it reuses recently fetched context, and how its findings interact when one incident is downstream of another are all questions the post leaves unanswered.

And the usual one: an automated mitigation plan in the incident timeline is not the same as a mitigation plan a human signed off on. Treat the agent's writeup as a first draft of the incident review, not the verdict.

The wider field

The pattern is now common enough to have a shape. PagerDuty itself has been adding correlation and assist features inside the incident tool for a while. The major observability vendors — Datadog, New Relic, Splunk, Dynatrace — increasingly graft incident workflows onto their own telemetry, so the agent and the context live next to each other. ITSM platforms approach the same problem from the ticket side, drawing on change records rather than metrics. Each picks a different seam: where the agent lives, whose data it sees by default, and how much cross-vendor stitching the customer carries.

The AWS-and-PagerDuty wiring picks one specific tradeoff: AWS owns the agent and the reasoning; PagerDuty owns the incident timeline and the human-in-the-loop. For teams already deep in both, the seam is short. For teams whose primary observability is elsewhere, the seam is exactly where you will trip.

What the GA label does not guarantee is how the agent will behave during a long, ambiguous, multi-service incident — the only kind that really matters. That answer comes from the first weekend it works one.

Source: AWS DevOps Blog (aws.amazon.com)
