AWS reworks the EKS control plane and starts selling it in sized tiers

Hundreds of thousands of Kubernetes clusters, more than thirty AWS regions, and an argument that operating a managed Kubernetes service at that scale needs a different set of primitives than the ones Kubernetes was originally built around. That is the frame Amazon EKS engineers Neelendra Bhandari and Sri Saran Balaji Vellore Rajakumar use in a sponsored post on The New Stack, walking through a set of architectural changes to EKS and a new tier-priced product on top of them.

Their argument, in one sentence: at scale, availability dies when a component reacts badly to a fault and turns the fault into an outage. The failing component itself is usually the minor part of the story. The examples they lead with are familiar to anyone who has been on call: a stale cache that serves wrong answers, a health check that restarts the very process keeping a cluster alive.

Etcd, off disk

The most consequential piece of engineering in the post is a rewrite of how EKS stores cluster state. Traditional etcd relies on Raft consensus: a majority of members must agree before any write commits, which means two of three members going unhealthy takes the cluster offline. EKS replaced Raft with a purpose-built journal that provides durable, ordered storage independently of etcd. Writes commit as soon as the journal acknowledges them; the etcd peers no longer negotiate quorum with each other. That, the authors write, made an entire class of etcd quorum-loss failures disappear.
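The quorum arithmetic behind that "two of three" figure can be worked through in a few lines. This is a minimal sketch of the standard majority rule for a Raft group, not code from EKS or etcd; the function names are illustrative.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a Raft group with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum_size(members)


# A three-member group needs two members to agree, so it survives one
# failure; losing a second member stops writes from committing.
for n in (1, 3, 5):
    print(f"members={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

Under this rule a three-member group tolerates exactly one failure, which is the class of outage the journal-based design is said to remove.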

Because the journal now owns durability, etcd no longer needs to fsync every write to a local disk. Its data store has moved to an in-memory filesystem, and what was a disk-bound system is now compute-bound. Storage latency is out of the critical path.

Two follow-on moves ride on that change. For ultra-scale clusters, etcd is partitioned into resource-specific shards, each with its own storage budget. The example the authors give is failure isolation: if a misbehaving controller floods the events keyspace, the events partition hits its quota but the leases partition keeps accepting writes, so healthy nodes are not wrongly marked unhealthy. And because quorum is no longer negotiated between local peers, etcd can be colocated on the same host as the API server, with reads and writes travelling over loopback instead of the network. If the local etcd is impaired, the API server fails over to another etcd member reading from the same journal.
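The isolation property in the events-versus-leases example can be illustrated with a toy model. Everything here is hypothetical: the class name, the partition names as dictionary keys, and the budget units are assumptions made for illustration, and the post does not describe how EKS actually implements its partitions.

```python
class PartitionedStore:
    """Toy key-value store where each partition has its own storage budget."""

    def __init__(self, budgets: dict[str, int]):
        # budgets maps a partition name to its capacity in arbitrary units.
        self.budgets = dict(budgets)
        self.used = {name: 0 for name in budgets}

    def write(self, partition: str, size: int) -> bool:
        """Accept the write only if it fits in that partition's own budget."""
        if self.used[partition] + size > self.budgets[partition]:
            return False  # this partition is full; others are unaffected
        self.used[partition] += size
        return True


store = PartitionedStore({"events": 100, "leases": 100})

# A misbehaving controller floods the events partition with 50 writes.
flood_accepted = sum(store.write("events", 10) for _ in range(50))

# A node heartbeat still lands, because leases has a separate budget.
lease_ok = store.write("leases", 1)

print(flood_accepted, lease_ok)
```

With a single shared budget, the same flood would have consumed the space that lease writes depend on; separate budgets confine the damage to the partition that caused it.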

Upstream, not internal

The post is more useful than the usual architecture blog because it names bottlenecks that hit at hyperscale and points at where AWS is fixing them upstream, not only inside EKS. Three examples. The watch cache holds a shared read lock while building a WatchList snapshot, which at hundreds of thousands of objects starves the writer that advances the cache's resource version; the authors say a refactor is being worked on with the community. A single mutex in the Horizontal Pod Autoscaler serialises scaling decisions and is being replaced by a redesigned data store in PR #139142. And the scheduler rebuilds in-use persistent-volume state by scanning every node in the cluster every cycle, tracked as issue #138426, which is being changed to compute that information lazily.
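The scheduler item is the easiest of the three to picture. The sketch below contrasts scanning every node on every cycle with computing the same information only when a pod actually needs it. It is an illustration of lazy evaluation in general, under invented names (`Cluster`, `LazyInUseVolumes`, `run_cycles`); it is not the Kubernetes scheduler's code and makes no claim about how the upstream change is implemented.

```python
class Cluster:
    """Toy cluster: maps node name to the set of volumes in use on it."""

    def __init__(self, node_volumes: dict[str, set[str]]):
        self.node_volumes = node_volumes
        self.scans = 0  # counts full passes over all nodes

    def scan_in_use_volumes(self) -> set[str]:
        """Full scan: visits every node, so cost grows with cluster size."""
        self.scans += 1
        in_use: set[str] = set()
        for volumes in self.node_volumes.values():
            in_use |= volumes
        return in_use


class LazyInUseVolumes:
    """Defers the full scan until the result is first requested."""

    def __init__(self, cluster: Cluster):
        self._cluster = cluster
        self._cached = None

    def get(self) -> set[str]:
        if self._cached is None:
            self._cached = self._cluster.scan_in_use_volumes()
        return self._cached


def run_cycles(cluster: Cluster, pod_needs_volume: list[bool], lazy: bool) -> int:
    """Run one scheduling cycle per pod and return how many scans happened."""
    cluster.scans = 0
    for needs in pod_needs_volume:
        if lazy:
            state = LazyInUseVolumes(cluster)  # fresh per cycle, nothing computed yet
            if needs:
                state.get()
        else:
            cluster.scan_in_use_volumes()  # eager: pay the cost every cycle
    return cluster.scans


cluster = Cluster({"node-a": {"pv-1"}, "node-b": {"pv-2"}, "node-c": set()})
workload = [False, False, True, False]  # only one pod uses a persistent volume
eager_scans = run_cycles(cluster, workload, lazy=False)
lazy_scans = run_cycles(cluster, workload, lazy=True)
print(eager_scans, lazy_scans)
```

In this toy workload, the eager version scans the cluster four times and the lazy version once; the gap widens as the share of pods without persistent volumes grows.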

For a CI/CD team, the useful read is narrower. These fixes are not shipping today, but the specific failure modes now have names and issue numbers, which is what you need to know if your own pipelines burst thousands of pods and the control plane starts to wobble.

Control plane as a sized resource

The other announcement in the post is EKS Provisioned Control Plane, which turns those engineering gains into a purchasable dimension. Instead of hoping the shared control plane keeps up with a large training run, you pick a tier that maps to concrete numbers: API request concurrency, pod scheduling rate, and cluster database size. Tiers run XL through 8XL. The top tier, 8XL on Kubernetes 1.34, is quoted at 16,000 concurrent API request seats, a 400 pods-per-second scheduling rate, and 16 GB of cluster database storage, backed by a 99.99% availability SLA measured in one-minute intervals. Tier changes are done through the console, CLI, eksctl, CloudFormation, or Terraform, and the post says they happen without cluster recreation or downtime.
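Two of the quoted 8XL figures are easier to reason about once converted into time. The sketch below does that arithmetic. The 30-day measurement window and the 10,000-pod burst are assumptions chosen for illustration; the post gives only the 99.99% figure, the one-minute interval, and the 400 pods-per-second rate.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed window: 43,200 one-minute intervals


def downtime_budget_minutes(availability_pct: float, window_minutes: int) -> float:
    """Minutes in the window that may be unavailable at the given percentage."""
    return round(window_minutes * (100 - availability_pct) / 100, 4)


def min_seconds_to_schedule(pods: int, pods_per_second: float) -> float:
    """Lower bound on scheduling time if the scheduling rate is the only limit."""
    return pods / pods_per_second


budget = downtime_budget_minutes(99.99, MINUTES_PER_30_DAYS)
burst_seconds = min_seconds_to_schedule(10_000, 400)  # hypothetical burst size

print(f"downtime budget: {budget} minutes per 30 days")
print(f"10,000-pod burst: at least {burst_seconds} seconds to schedule")
```

The second number is a floor, not a forecast: it assumes the scheduling rate is the only constraint, which is exactly the assumption the next paragraph questions.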

The pitch is that orchestration capacity now behaves like compute: you reserve it before a burst, you release it after. The catch is upstream of the price sheet. If the binding constraint on your workload is not one of the three dimensions being sold, more API seats will not save you.

What the announcement does not tell you

Two caveats are worth carrying forward. The post is sponsored by AWS Marketplace, so the framing of every trade-off is AWS's own. And a control plane that no longer suffers quorum-loss outages still has failure modes. They live in the journal itself, in the observability of the failover path, and in the moments when the convenient local etcd is impaired and the API server has to reach across for another member.

The authors acknowledge that last point in the post: "When you optimize for the common case, design just as deliberately for the moment that optimization is not available." That line, more than any specific tier number, tells you where the next EKS incident writeup is likely to start.

Source: The New Stack (thenewstack.io)
