AWS reworks the EKS control plane and starts selling it in sized tiers
Maya Okonkwo
Hundreds of thousands of Kubernetes clusters, more than thirty AWS regions, and an argument that operating a managed Kubernetes service at that scale needs a different set of primitives than the ones Kubernetes was originally built around. That is the frame Amazon EKS engineers Neelendra Bhandari and Sri Saran Balaji Vellore Rajakumar use in a sponsored post on The New Stack, walking through a set of architectural changes to EKS and a new tier-priced product on top of them.
Their argument, in one sentence: at scale, availability dies when a component reacts badly to a fault and turns the fault into an outage. The failing component itself is usually the minor part of the story. The examples they lead with are familiar to anyone who has been on call: a stale cache that serves wrong answers, a health check that restarts the very process keeping a cluster alive.
Etcd, off disk
The most consequential piece of engineering in the post is a rewrite of how EKS stores cluster state. Traditional etcd relies on Raft consensus: a majority of members must agree before any write commits, which means two of three members going unhealthy takes the cluster offline. EKS replaced Raft with a purpose-built journal that provides durable, ordered storage independently of etcd. Writes commit as soon as the journal acknowledges them; the etcd peers no longer negotiate quorum with each other. That, the authors write, made an entire class of etcd quorum-loss failures disappear.
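The majority rule described above comes down to a few lines of arithmetic. The sketch below is illustrative only: the function names are invented for this example and are not etcd or EKS code.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


def can_commit(members: int, healthy: int) -> bool:
    """A majority-based write can commit only if a majority is healthy."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"{n} members: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
    # With three members and two unhealthy, only one is left: no majority.
    print("3 members, 1 healthy, can commit:", can_commit(3, 1))
```

A three-member cluster needs two healthy members and so survives one failure, which is why a second unhealthy member stops writes under the majority rule.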
Because the journal now owns durability, etcd no longer needs to fsync every write to a local disk. Its data store has moved to an in-memory filesystem, and what was a disk-bound system is now compute-bound. Storage latency is out of the critical path.
Two follow-on moves ride on that change. For ultra-scale clusters, etcd is partitioned into resource-specific shards, each with its own storage budget. The benefit the authors cite is failure isolation: if a misbehaving controller floods the events keyspace, the events partition hits its quota but the leases partition keeps accepting writes. Node heartbeats are lease renewals, so healthy nodes do not start appearing unhealthy. And because quorum is no longer negotiated between local peers, etcd can be colocated on the same host as the API server, with reads and writes travelling over loopback instead of the network. If the local etcd is impaired, the API server fails over to another etcd member reading from the same journal.
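The isolation property of per-partition budgets can be shown with a toy model. This is a minimal sketch under stated assumptions, not how EKS implements its partitions: the class, the exception, and the byte-counting rule (values only, keys ignored) are all hypothetical.

```python
from __future__ import annotations


class QuotaExceeded(Exception):
    """Raised when a partition has used up its storage budget."""


class PartitionedStore:
    """Toy key-value store with an independent byte budget per partition."""

    def __init__(self, budgets: dict[str, int]) -> None:
        self._budgets = dict(budgets)
        self._used = {name: 0 for name in budgets}
        self._data: dict[str, dict[str, bytes]] = {name: {} for name in budgets}

    def put(self, partition: str, key: str, value: bytes) -> None:
        data = self._data[partition]
        # Overwrites free the old value's bytes before charging the new ones.
        old_size = len(data.get(key, b""))
        new_used = self._used[partition] - old_size + len(value)
        if new_used > self._budgets[partition]:
            raise QuotaExceeded(partition)
        data[key] = value
        self._used[partition] = new_used

    def used(self, partition: str) -> int:
        return self._used[partition]


if __name__ == "__main__":
    store = PartitionedStore({"events": 32, "leases": 32})
    try:
        for i in range(10):  # a controller flooding the events keyspace
            store.put("events", f"event-{i}", b"x" * 8)
    except QuotaExceeded as exc:
        print("quota hit in partition:", exc)
    # The leases partition has its own budget, so heartbeats still land.
    store.put("leases", "node-a", b"renewed")
    print("leases bytes used:", store.used("leases"))
```

Because each partition is charged against its own budget, exhausting one leaves the others writable. With a single shared budget, the same flood would also block the lease write.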
Upstream, not internal
The post is more useful than the usual architecture blog because it names bottlenecks that hit at hyperscale and points at where AWS is fixing them upstream, not only inside EKS. Three examples. The watch cache holds a shared read lock while building a WatchList snapshot, which at hundreds of thousands of objects starves the writer that advances the cache's resource version; the authors say a refactor is being worked on with the community. A single mutex in the Horizontal Pod Autoscaler serialises scaling decisions and is being replaced by a redesigned data store in PR #139142. And the scheduler rebuilds in-use persistent-volume state by scanning every node in the cluster every cycle, tracked as issue #138426, which is being changed to compute that information lazily.
For a CI/CD team, the useful read is narrower. These fixes are not shipping today. The specific failure modes now have names and issue numbers, which is what you need to know if your own pipelines burst thousands of pods and the control plane starts to wobble.
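The third upstream example, moving from a full scan every cycle to lazy computation, follows a general pattern that can be sketched in a few lines. This is not the scheduler's code and does not reflect the actual change tracked in the issue. The data shape (node name mapped to a list of pods, each a list of volume names) and every identifier here are assumptions made for illustration.

```python
from __future__ import annotations


def volumes_on(pods: list[list[str]]) -> set[str]:
    """Collect the volume names mounted by the pods on one node."""
    return {volume for pod_volumes in pods for volume in pod_volumes}


def eager_in_use(pods_by_node: dict[str, list[list[str]]]) -> dict[str, set[str]]:
    """Eager variant: rebuild state for every node, needed or not."""
    return {node: volumes_on(pods) for node, pods in pods_by_node.items()}


class LazyInUse:
    """Lazy variant: compute a node's state on first request, then reuse it.

    One instance is meant to live for a single cycle, so the cache never
    outlives the snapshot it was built from.
    """

    def __init__(self, pods_by_node: dict[str, list[list[str]]]) -> None:
        self._pods_by_node = pods_by_node
        self._cache: dict[str, set[str]] = {}
        self.nodes_scanned = 0  # instrumentation for this example only

    def for_node(self, node: str) -> set[str]:
        if node not in self._cache:
            self.nodes_scanned += 1
            self._cache[node] = volumes_on(self._pods_by_node[node])
        return self._cache[node]


if __name__ == "__main__":
    cluster = {f"node-{i}": [[f"pv-{i}"]] for i in range(1000)}
    lazy = LazyInUse(cluster)
    lazy.for_node("node-7")
    print("eager scanned:", len(eager_in_use(cluster)), "nodes")
    print("lazy scanned:", lazy.nodes_scanned, "node(s)")
```

The eager version pays for every node on every cycle, while the lazy version pays only for the nodes a given cycle actually asks about.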
Control plane as a sized resource
The other announcement in the post is EKS Provisioned Control Plane, which turns those engineering gains into a purchasable dimension. Instead of hoping the shared control plane keeps up with a large training run, you pick a tier that maps to concrete numbers: API request concurrency, pod scheduling rate, and cluster database size. Tiers run XL through 8XL. The top tier, 8XL on Kubernetes 1.34, is quoted at 16,000 concurrent API request seats, a 400 pods-per-second scheduling rate, and 16 GB of cluster database storage, backed by a 99.99% availability SLA measured in one-minute intervals. Tier changes are done through the console, CLI, eksctl, CloudFormation, or Terraform, and the post says they happen without cluster recreation or downtime.
The pitch is that orchestration capacity now behaves like compute: you reserve it before a burst, you release it after. The catch is upstream of the price sheet. If the binding constraint on your workload is not one of the three dimensions being sold, more API seats will not save you.
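To make the tier numbers concrete, here is a rough back-of-the-envelope sketch using only the 8XL figures quoted above. It assumes the quoted scheduling rate acts as a hard ceiling, which the post does not state in those terms. The class and function names are hypothetical and are not part of any AWS API.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    api_request_seats: int
    pods_per_second: int
    database_gb: int


# The only tier the article gives figures for (8XL on Kubernetes 1.34).
TIER_8XL = TierLimits(api_request_seats=16_000, pods_per_second=400, database_gb=16)


def min_scheduling_seconds(pods: int, limits: TierLimits) -> int:
    """Lower bound on time to schedule a burst, if the rate is a ceiling."""
    return math.ceil(pods / limits.pods_per_second)


def exceeded_dimensions(limits: TierLimits, demand: TierLimits) -> list[str]:
    """Names of the sold dimensions where demand is above the tier's limit."""
    names = ("api_request_seats", "pods_per_second", "database_gb")
    return [n for n in names if getattr(demand, n) > getattr(limits, n)]


if __name__ == "__main__":
    print("10,000-pod burst, seconds:", min_scheduling_seconds(10_000, TIER_8XL))
    demand = TierLimits(api_request_seats=20_000, pods_per_second=300, database_gb=10)
    print("over the 8XL limit on:", exceeded_dimensions(TIER_8XL, demand))
```

Under that assumption, a 10,000-pod burst needs at least 25 seconds of scheduling time on 8XL. A check like this covers only the three dimensions on the price sheet, so a bottleneck elsewhere would not show up in it.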
What the announcement does not tell you
Two caveats worth carrying forward. The post is sponsored by AWS Marketplace, so the framing of every trade-off is AWS's own. And the operational story of a control plane that no longer takes a quorum-loss outage still has failure modes. They live in the journal itself, in the observability of the failover path, and in the moments when the convenient local etcd is impaired and the API server has to reach across for another member.
The authors acknowledge that last point in the post: "When you optimize for the common case, design just as deliberately for the moment that optimization is not available." That line, more than any specific tier number, tells you where the next EKS incident writeup is likely to start.
Source: The New Stack (thenewstack.io)