Flipkart's chaos platform runs 90% of fault injection in staging, ships five fixes back to LitmusChaos

Flipkart's central reliability engineering team won CNCF's End User Case Study contest for a multi-tenant chaos platform that runs roughly 90% of its infrastructure fault-injection experiments inside Kubernetes staging clusters ahead of festive-sale traffic, and that has shipped five fixes back to LitmusChaos upstream. The announcement landed on 17 June 2026, ahead of the team's case-study presentation at KubeCon + CloudNativeCon India in Mumbai. The operational read is straightforward: chaos is moving out of quarterly game days and into the boring, repeatable layer of the SDLC.

What the CRE team actually built

Per the CNCF write-up, the platform is a multi-tenant subscriber model deployed in a central namespace, with LitmusChaos, a CNCF incubating project, as the experiment engine. The custom pieces are where the engineering shows. A DaemonSet-based high-availability model lets the team run fault injections in parallel rather than serializing them. A Script Runner fault was added so target selection can be dynamic rather than hard-coded into each experiment spec. And because not everything at Flipkart is on Kubernetes, the team built a hybrid extension that keeps legacy VM workloads in scope. None of those are off-the-shelf LitmusChaos primitives; they are the multi-tenant guard rails any team trying to copy this design has to build itself.
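To make "dynamic rather than hard-coded" concrete, here is a minimal sketch of label-based target selection resolved at run time. It is not Flipkart's Script Runner fault or any LitmusChaos API; the `Workload` record, the `select_targets` function, and the inventory entries are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Workload:
    """Hypothetical inventory record; not a LitmusChaos type."""
    name: str
    tenant: str
    labels: dict = field(default_factory=dict)


def select_targets(inventory, tenant, selector):
    """Return names of the tenant's workloads whose labels match every selector pair."""
    return [
        w.name
        for w in inventory
        if w.tenant == tenant
        and all(w.labels.get(k) == v for k, v in selector.items())
    ]


# Illustrative inventory; in practice this would be queried when the experiment starts.
inventory = [
    Workload("cart-7f9c", "checkout", {"app": "cart", "tier": "backend"}),
    Workload("cart-2b1d", "checkout", {"app": "cart", "tier": "backend"}),
    Workload("search-9a3e", "discovery", {"app": "search", "tier": "backend"}),
]

# The experiment spec carries only the selector, never a fixed list of names.
targets = select_targets(inventory, "checkout", {"app": "cart"})
print(targets)
```

The point of the design is that the experiment spec stays valid as workloads are rescheduled or renamed, and the tenant filter keeps one team's selector from matching another team's workloads.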

The 90% number is the headline; the split is what matters

A platform that runs 90% of its chaos experiments in staging is the part of the story everyone will quote. It is also the part most likely to mislead. Staging is the right venue for routine drills against deterministic failure modes: pod evictions, resource starvation, dependency latency. The failures SREs are actually afraid of are environmental: control-plane outages, DNS misbehaviour, region-level capacity loss. Those do not faithfully reproduce in staging, and the announcement does not claim they do. Read the 90% as "the cheap, frequent experiments live in staging," with the implication that the rest happen somewhere closer to real traffic. The split, not the average, is the bit a reliability lead would want documented in the runbook.

Five upstream contributions, all small, all telling

The contributions returned to LitmusChaos are not headline features. They are platform-fit bug fixes: database index changes for project-scoped probe uniqueness, validation repairs for duplicate names during tag edits, and workflow configuration fixes for setups running custom image registries. Those are exactly the bugs a single-team user does not hit. They show up when many teams share one chaos platform, each editing tags, each pushing their own probe definitions, each pulling images from a non-default registry. Treating this as five small patches undersells it: the contributions are the tax of running a CNCF incubating project as production infrastructure, and Flipkart paid it instead of forking. That is the part of an upstream case study an SRE should actually care about.
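For readers who have not run into it, the idea behind project-scoped uniqueness can be sketched in a few lines. This is not the upstream patch and does not use the database LitmusChaos runs on; it uses SQLite from the Python standard library purely for illustration, and the table, column, and function names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE probes (project_id TEXT NOT NULL, name TEXT NOT NULL)")
# Uniqueness is enforced on the (project, name) pair, not on the name alone.
conn.execute("CREATE UNIQUE INDEX probes_project_name ON probes (project_id, name)")


def add_probe(project_id, name):
    """Insert a probe; return False if the name is already taken in that project."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO probes (project_id, name) VALUES (?, ?)",
                (project_id, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(add_probe("team-a", "http-health"))  # first use in team-a
print(add_probe("team-b", "http-health"))  # same name, different project
print(add_probe("team-a", "http-health"))  # duplicate within team-a
```

With an index on the name alone, the second call would be rejected, which is the kind of collision that only appears once several teams share one installation.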

Chaos as a mandatory pipeline phase

The roadmap stated in the announcement is that automated chaos testing becomes a mandatory SDLC phase, integrated directly into Flipkart's CI/CD pipelines. In effect: a chaos gate that runs on changes, not a quarterly drill. The upside is obvious: failure modes get exercised on the same cadence as code review and unit tests, and the cost of a resilience regression drops from "discovered in production at 3am" to "discovered in pipeline at lunchtime." The catch is upstream of the rollout. Making chaos a required pipeline step rewards experiments that finish in tens of seconds and produce a deterministic pass/fail signal. The deep ones (partial network partitions, slow memory leaks, region-level capacity drains) do not fit that budget. Without an explicit carve-out for long-running experiments, the library will quietly drift toward the cheap end, and the failure modes that take a real on-call rotation to surface will fall out of the gate.
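One way to picture the carve-out is a gate that routes experiments by expected duration instead of silently dropping the slow ones. The sketch below is an assumption about how such a split could look, not anything described in the announcement; the `Experiment` record, the `plan_gate` function, the 60-second per-experiment budget, and the durations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    expected_seconds: int  # hypothetical estimate supplied by the experiment owner


def plan_gate(experiments, gate_budget_seconds):
    """Split experiments into those run inline in the pipeline and those scheduled separately."""
    inline, scheduled = [], []
    for exp in experiments:
        bucket = inline if exp.expected_seconds <= gate_budget_seconds else scheduled
        bucket.append(exp.name)
    return inline, scheduled


# Illustrative library; names echo the fault classes discussed above.
library = [
    Experiment("pod-eviction", 30),
    Experiment("dependency-latency", 45),
    Experiment("partial-network-partition", 900),
    Experiment("slow-memory-leak", 14400),
]

inline, scheduled = plan_gate(library, gate_budget_seconds=60)
print(inline)     # experiments that block the pipeline
print(scheduled)  # experiments that still run, on their own schedule
```

Keeping the second list explicit is the whole point: a long-running experiment that is scheduled elsewhere remains visible, while one that is merely excluded from the gate tends to disappear.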

What other teams can lift, and what they cannot

The portable pieces are architectural: the subscriber model, the DaemonSet-based parallel injection, dynamic target selection, an explicit bridge for legacy VMs. The non-portable piece is the multi-tenancy itself. A smaller engineering org adopting LitmusChaos for one or two squads will not hit the tag-edit and registry bugs that Flipkart's team ended up upstreaming, and will not need the same guard rails. A larger one will, and should plan to file its own patches before it plans to file its first chaos-as-CI-gate runbook. The case study is now public; the multi-tenant tax is still on the adopting team's plate.
