AWS pushes its DevOps Agent's diagnostic reach down to the EKS node via a custom MCP server

AWS has published a pattern for extending its autonomous DevOps Agent's diagnostic reach down to the EKS node. According to the AWS DevOps Blog, the approach puts a custom Model Context Protocol (MCP) server in front of the OS and runtime data the agent cannot otherwise see. The agent already handles the cluster-control-plane work: it diagnoses CrashLoopBackOff failures, traces ConfigMap deletions through Kubernetes audit logs, and correlates CloudWatch metrics with cluster events. What it does not do is read what is happening underneath the kubelet. That is the gap the post sets out to close.

The pattern, in three moves

The post describes the shape every MCP integration tends to settle into. Identify the data the agent cannot reach. Stand up an MCP server that exposes that data as structured tools. Wire it to the agent via Amazon Bedrock AgentCore Gateway, with Cognito-based OAuth fronting the call surface and Lambda doing the mediation. The reference implementation ships through AWS CDK.
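To make the second move concrete, here is a minimal Python sketch of what "exposing data as structured tools" can look like at the protocol level. MCP is built on JSON-RPC 2.0, and clients invoke a tool with a `tools/call` request. Everything else in the sketch is an assumption for illustration: the tool name `get_dns_config`, the `TOOLS` registry, and the canned result are hypothetical and are not taken from the AWS reference implementation, which also sits behind AgentCore Gateway, Cognito, and Lambda rather than a bare function like this.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool. A real server would collect this from the node
# (in the AWS pattern, via Systems Manager Automation, not a shell).
def get_dns_config(node: str) -> Dict[str, Any]:
    return {
        "tool": "get_dns_config",
        "node": node,
        "severity": "info",
        "findings": [],
    }

# Registry mapping tool names to handlers. Names are illustrative only.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_dns_config": get_dns_config,
}

def _error(req_id: Any, code: int, message: str) -> str:
    """Build a JSON-RPC 2.0 error response."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": req_id,
         "error": {"code": code, "message": message}}
    )

def handle_request(raw: str) -> str:
    """Handle one JSON-RPC 2.0 request carrying an MCP `tools/call`."""
    req = json.loads(raw)
    req_id = req.get("id")
    if req.get("method") != "tools/call":
        return _error(req_id, -32601, "Method not found")
    params = req.get("params", {})
    handler = TOOLS.get(params.get("name"))
    if handler is None:
        return _error(req_id, -32602, "Unknown tool")
    payload = handler(**params.get("arguments", {}))
    # Tool results come back as a list of content items; the structured
    # payload is serialized as JSON text in this sketch.
    result = {
        "content": [{"type": "text", "text": json.dumps(payload)}],
        "isError": False,
    }
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})
```

The design point is that the agent only ever sees named tools with typed arguments and a typed result. There is no code path in the dispatcher that accepts an arbitrary command string.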

Three design rules are stated explicitly, and they matter more than the plumbing. First, return structured findings with severity classification, not raw log text. Second, do not hand the agent a shell: execution on nodes is routed through Systems Manager Automation, which keeps every node-side action inside an auditable, controlled execution path. Third, make the tools composable, so one tool's output is the next tool's input. That third rule is what unlocks parallel investigation: the demonstrated scenario fans out into four concurrent diagnostic tasks whose structured reports feed root-cause synthesis.
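The first and third rules can be sketched together in a few lines of Python: several diagnostic tasks run concurrently, each returns a structured report, and the reports are merged and ranked by severity before any synthesis step. This is an illustration under stated assumptions, not the post's implementation. The four task names, the `Finding` and `Report` types, and the three-level severity scale are all hypothetical, and the tasks return canned placeholder data instead of touching a node.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Assumed severity scale; the source does not specify the levels.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}

@dataclass
class Finding:
    kind: str
    severity: str
    detail: str

@dataclass
class Report:
    tool: str
    node: str
    findings: List[Finding] = field(default_factory=list)

# Four hypothetical diagnostic tasks returning canned reports.
def check_ipamd(node: str) -> Report:
    return Report("check_ipamd", node,
                  [Finding("ip-exhaustion", "critical", "<detail>")])

def check_dns(node: str) -> Report:
    return Report("check_dns", node,
                  [Finding("dns-timeout", "warning", "<detail>")])

def check_conntrack(node: str) -> Report:
    return Report("check_conntrack", node)  # nothing found

def check_kubelet(node: str) -> Report:
    return Report("check_kubelet", node,
                  [Finding("kubelet-restart", "info", "<detail>")])

TASKS: Sequence[Callable[[str], Report]] = (
    check_ipamd, check_dns, check_conntrack, check_kubelet,
)

def investigate(node: str, tasks: Sequence[Callable[[str], Report]]) -> List[Finding]:
    """Run all tasks concurrently and rank findings, most severe first."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        reports = list(pool.map(lambda task: task(node), tasks))
    findings = [f for report in reports for f in report.findings]
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity], reverse=True)
```

Because every task returns the same typed shape, the merge step needs no per-tool parsing. That is the practical payoff of returning classified findings instead of raw log text.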

What the reference server exposes

The reference MCP server publishes more than twenty diagnostic sources from EKS nodes. The list is roughly what an SRE reaches for after the cluster API has told them everything is fine: iptables rules and conntrack entries, containerd runtime logs, the kubelet journal and logs, CNI configuration and IPAMD state, route tables and ENI metadata, dmesg kernel messages, sysctl parameters, and DNS configuration.

The class of incident this addresses is the one the API server cannot speak to: IP allocation failures, DNS resolution problems, network policy enforcement issues, storage mount timeouts, and node registration failures. Operationally, that is the long tail of EKS pages. Pods that never leave ContainerCreating because IPAMD has no IPs left to hand out. DNS lookups that the application sees as failures and the control plane sees as nothing. CNI traffic broken by node-local network state. None of those rendered the agent useless before. They just required a human to log in to the node.

A structured tool response looks more like a typed payload than a log dump. A rough sketch:

tool: collect_kubelet_journal
node: <node-name>
severity: warning
findings:
  - kind: kubelet-restart-loop
    detail: <short-classified-detail>
correlate_with: [containerd_logs, dmesg]

The placeholders are intentional — the point is the shape, not the content.
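The `correlate_with` field in that sketch is what makes tools composable: one tool's output names the next tools to run. A minimal Python illustration of following those hints is below. It assumes the field names from the sketch above, which is itself illustrative and not a confirmed schema from the reference server. The registry, the stub tools, and the `follow` helper are hypothetical.

```python
from collections import deque
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]

# Stub tools returning canned payloads in the shape of the sketch above.
def collect_kubelet_journal(node: str) -> Payload:
    return {"tool": "collect_kubelet_journal", "node": node,
            "severity": "warning",
            "correlate_with": ["containerd_logs", "dmesg"]}

def containerd_logs(node: str) -> Payload:
    return {"tool": "containerd_logs", "node": node,
            "severity": "info", "correlate_with": []}

def dmesg(node: str) -> Payload:
    # Points back at the first tool to show that cycles are handled.
    return {"tool": "dmesg", "node": node, "severity": "info",
            "correlate_with": ["collect_kubelet_journal"]}

REGISTRY: Dict[str, Callable[[str], Payload]] = {
    "collect_kubelet_journal": collect_kubelet_journal,
    "containerd_logs": containerd_logs,
    "dmesg": dmesg,
}

def follow(start: str, node: str,
           registry: Dict[str, Callable[[str], Payload]]) -> List[str]:
    """Breadth-first walk of correlate_with hints, running each tool once."""
    seen = set()
    order: List[str] = []
    queue = deque([start])
    while queue:
        name = queue.popleft()
        if name in seen or name not in registry:
            continue  # skip repeats and tools that are not registered
        seen.add(name)
        order.append(name)
        payload = registry[name](node)
        queue.extend(payload.get("correlate_with", []))
    return order
```

Calling `follow("collect_kubelet_journal", "<node-name>", REGISTRY)` visits the kubelet journal first, then the two tools it points at, and stops when the hints loop back to a tool that has already run.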

The boundaries, named

The post is careful, and this is worth crediting, about what the pattern is not. It is positioned as a proof of concept for non-production validation, not a production drop-in. It requires the Systems Manager Agent on every node with the matching IAM permissions, which is its own deployment concern. It is explicitly not a replacement for continuous monitoring, for compliance-grade log shipping, or for native integrations where the agent already has direct access. The cost surface is real and spans AgentCore Gateway, Lambda invocations, S3 with KMS-encrypted log storage, and Cognito.

The operational read

The interesting claim is not that the agent is smarter. It is that the trust boundary around an autonomous investigator can be drawn narrow on purpose. Severity-classified, structured tool outputs are easier to audit than free-form prompts. Routing all node-side work through SSM Automation keeps the agent off shells while preserving a record of every command it asked for. That maps to how on-call already wants this to work: agents that propose, humans that press the button.

The catch is the usual one. Any pattern that widens what an LLM-driven loop can read also widens what a misconfigured loop can read on a bad day. Twenty-plus diagnostic sources means twenty-plus paths into node state sitting behind one OAuth scope. The pattern is a credible first-line responder for CI/CD-driven Kubernetes incidents. The part it leaves to the platform team is the part no CDK template can ship: deciding which clusters the agent is allowed to look at, and which calls still need a human confirmation before they run.
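Those two decisions, which clusters are in scope and which calls wait for a human, can be written down as an explicit policy check. The Python sketch below is one hypothetical way a platform team might encode it. Nothing like it is described as part of the reference implementation, and the cluster names, tool names, and the `AgentPolicy` type are invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class AgentPolicy:
    # Clusters the agent may inspect at all.
    allowed_clusters: FrozenSet[str]
    # Tools that must wait for a human to confirm before running.
    needs_confirmation: FrozenSet[str]

def authorize(policy: AgentPolicy, cluster: str, tool: str,
              human_confirmed: bool = False) -> str:
    """Return 'deny', 'pending_confirmation', or 'allow' for one tool call."""
    if cluster not in policy.allowed_clusters:
        return "deny"
    if tool in policy.needs_confirmation and not human_confirmed:
        return "pending_confirmation"
    return "allow"

# Hypothetical policy: one non-production cluster, one gated tool.
POLICY = AgentPolicy(
    allowed_clusters=frozenset({"staging-eks"}),
    needs_confirmation=frozenset({"collect_iptables_rules"}),
)
```

With this policy, a DNS check on `staging-eks` is allowed immediately, the iptables collection waits for confirmation, and any call against a cluster outside the allowlist is denied regardless of the tool.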

Source: AWS DevOps Blog (aws.amazon.com)
