Docker Engine 29.4.3 moves the 'Copy Fail' mitigation off seccomp after the first fix broke 32-bit containers
Maya Okonkwo
Six weeks after CVE-2026-31431 ("Copy Fail") went public with no kernel patch on most distros, Docker has shipped Engine 29.4.3 with a second-attempt mitigation that blocks the vulnerable syscall path through AppArmor and SELinux instead of seccomp. Per Docker's post-mortem from May 27, the first attempt in 29.4.2 also blocked the path, but did so by denying socketcall(2) outright and broke 32-bit binaries, Go programs built with GOARCH=386, Wine and SteamCMD inside containers. The catch upstream of either fix is that the actual flaw is in the Linux kernel's algif_aead module, going back to 2017, and the kernel patch is what closes it; on Ubuntu that patch is still not out.
For container operators, the consequence is straightforward. If the host kernel is unpatched and Docker Engine is below 29.4.3, a container with code execution can corrupt the host page cache and read or modify any file mapped by any process on the node, including other containers sharing image layers. That is a host-wide blast radius from inside a default-profile container, with no extra capabilities required. The mitigation in 29.4.3 narrows the entry point; it does not remove it.
What does Copy Fail actually let an attacker do?
Per the disclosure on April 29 and Docker's write-up by Paweł Gronowski, the bug is in the AF_ALG crypto subsystem. An unprivileged user that can open an AF_ALG socket — which a default-profile container could — gets a controlled-write primitive into the page cache. The page cache backs file reads across the entire system. Corrupt the right page of a setuid binary and you have local root. Corrupt a shared library page and every process that maps it sees the modified bytes until the page is evicted.
The relevant fact for a container team is that the page cache is not namespaced. The compromise crosses the container boundary on the read side by construction. A workload on the same node that opens /usr/bin/sudo gets the attacker's version, briefly, regardless of whether it lives in another container, a host process or a sibling pod. That is the part operators should price in, not the local-root demo.
Why did the first mitigation break 32-bit workloads?
Engine 29.4.2 added a seccomp filter that denied address families AF_ALG and AF_VSOCK on the socket(2) syscall. That is the right shape for an x86_64 process going through the modern entry point. It does not cover socketcall(2), the older multiplexed syscall that wraps socket, bind, connect and friends behind a single number. seccomp BPF cannot dereference the userspace pointer that socketcall uses to pack its real arguments, so a selective deny by address family is not possible at that layer.
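To make the pointer limitation concrete, here is a toy Python model of what a seccomp filter is handed. It is not real BPF, and the names SeccompData and family_filter are illustrative only. The facts borrowed from Linux are that AF_ALG is address family 38 and that the socketcall sub-call number for socket() is 1.

```python
from dataclasses import dataclass

AF_ALG = 38      # Linux address family number for the kernel crypto API
AF_INET = 2
SYS_SOCKET = 1   # socketcall(2) sub-call number for socket()


@dataclass(frozen=True)
class SeccompData:
    """Toy stand-in for a filter's input: syscall name plus raw register values."""
    nr: str
    args: tuple


def family_filter(data: SeccompData) -> str:
    """Deny AF_ALG wherever the family is visible as a register value."""
    if data.nr == "socket" and data.args[0] == AF_ALG:
        return "ERRNO"
    # For socketcall, args[1] is only an address. The family sits in user
    # memory behind it, which the filter has no way to read.
    return "ALLOW"


# Simulated user memory: two argument blocks the filter never gets to see.
user_memory = {
    0x7000: (AF_INET, 1, 0),  # socket(AF_INET, SOCK_STREAM, 0)
    0x8000: (AF_ALG, 5, 0),   # socket(AF_ALG, SOCK_SEQPACKET, 0)
}

direct = family_filter(SeccompData("socket", (AF_ALG, 5, 0)))
benign = family_filter(SeccompData("socketcall", (SYS_SOCKET, 0x7000)))
hostile = family_filter(SeccompData("socketcall", (SYS_SOCKET, 0x8000)))

print(direct, benign, hostile)  # prints: ERRNO ALLOW ALLOW
```

The benign and hostile socketcall invocations differ only in what sits behind the pointer, so a filter at this layer has to allow both or deny both.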
Docker chose to deny socketcall entirely on the assumption that modern toolchains had moved to the direct socket syscalls Linux 4.3 added. That assumption was wrong on three fronts, per the moby issue tracker: older glibc on i386 still routes through socketcall, the Go runtime unconditionally uses it for GOARCH=386 regardless of glibc, and several legacy and gaming workloads — Wine, SteamCMD — depend on it. The deny broke networking for those containers. It also failed to actually close the hole on amd64, because any process can switch to ia32 compatibility mode with int $0x80 and invoke socketcall directly, bypassing the socket(2) arg filter.
The kindest reading of the 29.4.2 release is that the seccomp layer was the wrong tool for the job and that became obvious only when 32-bit users filed bugs.
What did 29.4.3 ship instead?
Per the moby/profiles change, Engine 29.4.3 reverts the socketcall deny and moves enforcement to LSMs. The default AppArmor profile gains the rule "deny network alg," (the trailing comma is part of AppArmor rule syntax), which blocks AF_ALG through both socket(2) and socketcall(2) because the kernel hook fires at security_socket_create(), after the syscall dispatch and regardless of entry point. For SELinux distributions, Docker ships a CIL policy module that denies alg_socket creation for all container_domain types and is loaded via semodule.
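One way to check the effect from inside a container is to try to create an AF_ALG socket and look at the result. The sketch below is a hedged illustration, not Docker's test: probe_af_alg and its socket_factory parameter are hypothetical names, and a "denied" result cannot by itself tell an LSM denial apart from a seccomp one.

```python
import errno
import socket

# socket.AF_ALG exists on Linux builds of Python; 38 is the Linux value.
AF_ALG = getattr(socket, "AF_ALG", 38)


def probe_af_alg(socket_factory=socket.socket) -> str:
    """Try to create an AF_ALG socket and classify the outcome."""
    try:
        sock = socket_factory(AF_ALG, socket.SOCK_SEQPACKET, 0)
    except PermissionError:
        # EACCES or EPERM: some policy layer refused the socket.
        return "denied"
    except OSError as exc:
        if exc.errno in (errno.EAFNOSUPPORT, errno.EPROTONOSUPPORT, errno.EINVAL):
            # The kernel or platform does not offer AF_ALG at all.
            return "unavailable"
        return "error"
    sock.close()
    return "open"
```

Calling probe_af_alg() in a container started with the default profile and seeing "open" would suggest the mitigation is not in effect for that container. The factory parameter exists only so the function can be exercised without a real socket.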
The trade-off is honest. On Ubuntu and Debian, AppArmor is the default and the new rule applies automatically when 29.4.3 is installed. On Fedora, RHEL and CentOS, SELinux must be enforcing and the CIL module must be loaded. The daemon flag is on by default on those distros, but a permissive SELinux configuration on a workaround-laden host will silently skip the mitigation. On a host with neither LSM active, 29.4.3 adds no protection: with the seccomp deny reverted, the container is back to relying on the kernel patch.
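The coverage rules in that paragraph can be written down as a small predicate. This is a sketch of the article's description only; lsm_mitigation_active and its parameters are hypothetical names, and the inputs are assumed to have been collected from the host by other means.

```python
from typing import Tuple

FIXED_ENGINE = (29, 4, 3)  # first Engine release with the LSM-based mitigation


def lsm_mitigation_active(
    engine: Tuple[int, int, int],
    apparmor_enforcing: bool,
    selinux_mode: str,
    cil_module_loaded: bool,
) -> bool:
    """Return True when one of the two LSM paths described above is in effect."""
    if engine < FIXED_ENGINE:
        return False
    if apparmor_enforcing:
        return True
    # SELinux only helps when it is enforcing and the CIL module is loaded.
    return selinux_mode.strip().lower() == "enforcing" and cil_module_loaded
```

A False result means the host is relying on the kernel patch alone.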
How do other runtimes and orchestrators handle default-profile updates?
The default-profile machinery is more shared than the brand names suggest, and the response to Copy Fail mostly tracked along those shared lines.
- containerd uses the same seccomp default profile family that Docker maintains, so the underlying change landed in the shared blobs and is picked up by any runtime consuming them. Operators running Kubernetes on containerd inherit the fix when the node's container runtime package is upgraded, not when kubelet is. The dependency is easy to miss on a rolling node refresh.
- Podman with SELinux on RHEL was less exposed during the window where no kernel patch existed, because the SELinux container policy on RHEL already had stricter defaults around uncommon socket families than the AppArmor defaults on Ubuntu. For shops already standardised on Podman plus SELinux enforcing, the LSM mitigation lands with less drama than retrofitting it on Ubuntu where the policy gap is bigger. Podman is the better fit here for teams whose host OS is RHEL or Fedora and who have invested in SELinux: the audit story is simpler and the post-CVE patch path was shorter.
- gVisor sidesteps the question. Its user-space kernel intercepts the syscall and never opens an AF_ALG socket on the host kernel, which is the design intent of running it in the first place. The trade-off is the long-standing one: a compatibility tax and a syscall-performance hit that not every workload can absorb.
- Kubernetes itself has no direct lever here beyond Pod Security Standards and the seccompDefault kubelet flag that pins RuntimeDefault. Either way the runtime ships the actual rules; kubelet is upstream of the audit, not the enforcement.
- AWS Bottlerocket ships an opinionated host with SELinux enforcing and a smaller package surface, so the CVE mitigation arrives bundled with the AMI refresh rather than as three separate package upgrades. For teams that have already accepted the immutable-host model, that pattern reduces the toil window after a CVE materially.
- Buddy is one option for the audit job that sits outside the cluster: a scheduled pipeline that SSHes to a sample of nodes and reports docker version, apparmor_status and SELinux mode in one digest. The reason to reach for it is narrow: small teams without a config-management system who still need a weekly "is the mitigation actually loaded everywhere" check, owned by a job whose failure does not depend on the cluster it audits.
The honest read across the field: nothing in this CVE was a runtime-vendor failing — it is a kernel bug — and the response work was mostly about which LSM was already in front of the bug on which distro. The mitigation gap is largest on default-Ubuntu hosts without a host kernel patch, and that is also where most of the install base sits.
What does the post-CVE audit pipeline look like?
A weekly Buddy pipeline that fans out across a node inventory and reports Engine version plus LSM status:
- pipeline: "container-runtime-cve-audit"
  on: "SCHEDULE"
  cron: "0 7 * * MON"
  variables:
  - key: "NODES"
    value: "node-01,node-02,node-03"
    type: "VAR"
  - key: "SSH_KEY"
    value: "secure!"
    encrypted: true
    settable: false
  actions:
  - action: "Inventory engine + LSM state"
    type: "BUILD"
    docker_image_name: "alpine:latest"
    execute_commands:
    - apk add --no-cache openssh-client
    - mkdir -p ~/.ssh && echo "$SSH_KEY" > ~/.ssh/id_ed25519 && chmod 600 ~/.ssh/id_ed25519
    - |
      IFS=','; for n in $NODES; do
        # The first connection records the host key; later calls reuse it.
        ver=$(ssh -o StrictHostKeyChecking=no audit@$n "docker version --format '{{.Server.Version}}'" 2>/dev/null)
        # cat | grep -c gives one total even if the glob matches several files.
        # This counts the rule in on-disk profile files; it does not prove the profile is loaded.
        aa=$(ssh audit@$n "cat /etc/apparmor.d/docker* 2>/dev/null | grep -c 'deny network alg' || true")
        se=$(ssh audit@$n "getenforce 2>/dev/null || echo n/a")
        # Compare per component; a plain string compare would rank 29.10.0 below 29.4.3.
        awk -v n="$n" -v v="$ver" -v a="$aa" -v s="$se" \
          'BEGIN{split(v,p,"[.]"); ok=(p[1]*1000000+p[2]*1000+p[3]>=29004003)?"OK":"OLD"; print n, "engine="v, "rule="a, "selinux="s, ok}'
      done
The shape ports to any scheduler that can run ssh on cron. The point is that the digest exists, the comparison against 29.4.3 is a single check, and someone reads it on Monday morning before the next CVE redraws the floor. See the Buddy pipeline variables docs for the secret-handling model used above.
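A version threshold like this is safer compared component by component than as raw text, because lexical ordering puts "29.10.0" before "29.4.3". A minimal Python sketch of that check follows; parse_version and engine_is_current are hypothetical helper names, and pre-release suffixes are simply ignored, which is an assumption rather than Docker's versioning policy.

```python
import re
from typing import Optional, Tuple


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract the leading MAJOR.MINOR.PATCH triple, or None if absent."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def engine_is_current(reported: str, minimum: str = "29.4.3") -> bool:
    """True when the reported Engine version is at or above the minimum."""
    got = parse_version(reported)
    floor = parse_version(minimum)
    # An unparseable or empty report (for example a failed ssh) counts as not current.
    return got is not None and floor is not None and got >= floor


print(engine_is_current("29.10.0"))   # True: numeric comparison
print("29.10.0" >= "29.4.3")          # False: lexical comparison
```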
What is still unresolved?
Three things. The Ubuntu kernel patch is not out as of Docker's write-up, which means the LSM mitigation is the only line of defence on the most common container host, and only when AppArmor is loaded and enforcing — a non-default unconfined profile or a custom seccomp profile that re-enables socketcall without an LSM in front of AF_ALG reopens the hole. The 29.4.3 release does not change that; it just provides a default profile that closes it on a stock install.
The second is that the page-cache primitive is more general than the setuid-binary demo. Detection guidance from vendors has so far focused on suspicious AF_ALG socket creation, which the LSM rule prevents from succeeding; it does not catch a workload that switched to a custom seccomp profile and then opened the socket, because the LSM denial may be the only event left on a hardened host. Inventory needs to confirm the rule is loaded, not infer it from absence of alerts.
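On AppArmor hosts, one positive signal is the kernel's own list of loaded profiles, exposed at /sys/kernel/security/apparmor/profiles as lines of the form "name (mode)". The sketch below parses that text. The helper names are hypothetical, reading the file usually needs root, and the check confirms only that the docker-default profile is loaded in enforce mode, not that it contains the AF_ALG rule.

```python
from typing import Dict


def loaded_profiles(text: str) -> Dict[str, str]:
    """Parse 'name (mode)' lines into a {name: mode} mapping."""
    profiles = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(")") or " (" not in line:
            continue  # skip blank or unexpected lines
        name, mode = line[:-1].rsplit(" (", 1)
        profiles[name] = mode
    return profiles


def profile_enforcing(text: str, profile: str = "docker-default") -> bool:
    """True when the named profile is loaded and in enforce mode."""
    return loaded_profiles(text).get(profile) == "enforce"


sample = "docker-default (enforce)\n/usr/sbin/tcpdump (complain)\n"
print(profile_enforcing(sample))  # True
```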
The third is the shape of the post-mortem itself. The 29.4.2 → 29.4.3 cycle took weeks, with the first version regressing 32-bit workloads after the public disclosure. That is the pattern operators should plan for the next time a kernel-level CVE leaves the runtime as the only mitigation layer: the first patch will be too broad, and the second one will land at a different layer. Roll forward fast, but do not roll forward once.
Source: Docker Blog (docker.com)