runtime: normalize tar headers for reproducible build contexts #86
tarDirectory stamps mtime/uid/gid into headers from live FileInfo, so synthesized contexts (uid-reconcile, etc.) carry wall-clock mtimes that shift BuildKit's COPY vertex digest across invocations of byte-identical content. Downstream consumers that snapshot a workspace after one Up and restore it for a later Up hit full cache misses and re-extract GBs of image layers into new snapshotter dirs.

Normalize ModTime to epoch, zero AccessTime/ChangeTime, and clear uid/gid/uname/gname in every tar header. Also pin useruid's temp context files to the epoch via os.Chtimes so determinism is a local property of the synthesizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
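The header normalization described in the commit message might look roughly like the following. This is a minimal sketch, not the PR's actual code: `normalizeHeader` and `exampleHeader` are hypothetical names, and the mode, size, and ownership values are made up for illustration. Only the fields named in the commit message are touched.

```go
package main

import (
	"archive/tar"
	"fmt"
	"time"
)

// normalizeHeader strips host-specific metadata from h in place.
// Hypothetical helper for illustration; the real tarDirectory may differ.
func normalizeHeader(h *tar.Header) {
	h.ModTime = time.Unix(0, 0) // pin mtime to the Unix epoch
	h.AccessTime = time.Time{}  // zero value: not carried in the header
	h.ChangeTime = time.Time{}
	h.Uid, h.Gid = 0, 0
	h.Uname, h.Gname = "", ""
}

// exampleHeader builds a header with host-like metadata (assumed values)
// and returns it after normalization.
func exampleHeader() tar.Header {
	now := time.Now()
	h := tar.Header{
		Typeflag:   tar.TypeReg,
		Name:       "uid-fix.sh",
		Mode:       0o755,
		Size:       42,
		ModTime:    now,
		AccessTime: now,
		ChangeTime: now,
		Uid:        1000,
		Gid:        1000,
		Uname:      "dev",
		Gname:      "dev",
	}
	normalizeHeader(&h)
	return h
}

func main() {
	h := exampleHeader()
	fmt.Println(h.ModTime.Unix(), h.Uid, h.Gid, h.Uname == "", h.AccessTime.IsZero())
}
```

Name, mode, and size are left alone on purpose: they describe the content being archived rather than the host that archived it.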
No actionable comments were generated in the recent review.
Walkthrough
This PR normalizes tar archive metadata to ensure deterministic, reproducible container builds. Tar header timestamps and ownership metadata are pinned to Unix epoch and zeroed, eliminating host-dependent variance in generated tar streams. The pattern is applied to both core tar generation and the UID reconciliation build context.
Summary
- tarDirectory now normalizes every tar header (ModTime → Unix epoch, AccessTime/ChangeTime zero, uid/gid 0, uname/gname empty). Wall-clock mtimes from os.WriteFile in synthesized build contexts (e.g. useruid's uid-fix.sh + Dockerfile) were leaking into the tar stream and shifting BuildKit's COPY vertex digest across invocations of byte-identical content.
- useruid.reconcileRemoteUserUID additionally os.Chtimes-pins the two files it writes to the epoch. This is defensive: the synthesized context's reproducibility becomes a local property of the caller rather than only the tar layer.
- runtime/docker/build_test.go: TestTarDirectoryNormalizesMetadata asserts every header field is normalized; TestTarDirectoryDeterministic tars the same content twice with diverging on-disk mtimes and asserts byte-identical streams.

Why it matters
Downstream pipelines that run dc.Up against a blank PVC, snapshot the PVC, and restore the snapshot in session pods were getting full BuildKit cache misses on the second dc.Up. Observed impact on one workspace: COPY vertex digest changed across three runs over the same input (6270201163e7…, f5ed748968f2…, 8c82f115a53e…), containerd snapshotter dir grew 10.2G → 16.5G after one cache-missed Up, and first-session pod startup went from <1s to ~70s.

BuildKit hashes content for cache purposes, so erasing mtime/uid/gid from the tar header doesn't drop information it relies on; it just removes a host-specific perturbation from the digest. Behavior change worth noting: any external consumer that inspects raw layer metadata for original mtimes will no longer see them. None exists in this repo.
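To illustrate the perturbation, the sketch below tars one fixed file in memory and hashes the resulting stream. It is not how BuildKit derives its COPY vertex digest; it only shows that the bytes of the tar stream change with the header mtime and stop changing once the mtime is pinned. The function name `streamDigest`, the file body, and the timestamps are assumptions for the example.

```go
package main

import (
	"archive/tar"
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// streamDigest tars one fixed file whose header carries the given mtime
// and returns the SHA-256 of the tar stream. With normalize set, the mtime
// is replaced by the Unix epoch before the header is written.
func streamDigest(mtimeUnix int64, normalize bool) string {
	body := []byte("FROM scratch\n") // assumed content, identical on every call
	hdr := &tar.Header{
		Typeflag: tar.TypeReg,
		Name:     "Dockerfile",
		Mode:     0o644,
		Size:     int64(len(body)),
		ModTime:  time.Unix(mtimeUnix, 0),
	}
	if normalize {
		hdr.ModTime = time.Unix(0, 0)
	}

	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	if err := tw.WriteHeader(hdr); err != nil {
		panic(err)
	}
	if _, err := tw.Write(body); err != nil {
		panic(err)
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}

	sum := sha256.Sum256(buf.Bytes())
	return hex.EncodeToString(sum[:])
}

func main() {
	// Same content, two different wall-clock mtimes.
	fmt.Println("raw equal:       ", streamDigest(1700000000, false) == streamDigest(1800000000, false))
	fmt.Println("normalized equal:", streamDigest(1700000000, true) == streamDigest(1800000000, true))
}
```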
Test plan
- go test ./... is green locally
- TestTarDirectoryDeterministic fails without the tarDirectory change (verified during development)
- BuildImage twice on the same context and asserts identical layer digests; skipped here since it needs a live daemon

🤖 Generated with Claude Code
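The determinism check in the test plan can be approximated without a Docker daemon. The sketch below is an assumption-laden stand-in, not the repository's test: `tarDir` is a hypothetical substitute for tarDirectory that handles regular files only, and `sameAfterTouch` mimics the "diverging on-disk mtimes" setup by moving a file's mtime with os.Chtimes between two archiving passes.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// tarDir is a hypothetical stand-in for tarDirectory. It archives the
// regular files under dir with normalized headers. filepath.WalkDir visits
// entries in lexical order, so entry order is stable across runs.
func tarDir(dir string) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // sketch: skip directories, symlinks, devices
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(dir, path)
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel)
		// Normalize the host-specific fields.
		hdr.ModTime = time.Unix(0, 0)
		hdr.AccessTime, hdr.ChangeTime = time.Time{}, time.Time{}
		hdr.Uid, hdr.Gid, hdr.Uname, hdr.Gname = 0, 0, "", ""
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		_, err = tw.Write(data)
		return err
	})
	if err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// sameAfterTouch tars a temp dir, moves the file's on-disk mtime, tars
// again, and reports whether the two streams are byte-identical.
func sameAfterTouch() bool {
	dir, err := os.MkdirTemp("", "ctx")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	p := filepath.Join(dir, "Dockerfile")
	if err := os.WriteFile(p, []byte("FROM scratch\n"), 0o644); err != nil {
		panic(err)
	}
	first, err := tarDir(dir)
	if err != nil {
		panic(err)
	}
	later := time.Now().Add(48 * time.Hour)
	if err := os.Chtimes(p, later, later); err != nil {
		panic(err)
	}
	second, err := tarDir(dir)
	if err != nil {
		panic(err)
	}
	return bytes.Equal(first, second)
}

func main() {
	fmt.Println("byte-identical:", sameAfterTouch())
}
```

Comparing raw bytes rather than parsed headers is the stricter check: any field that leaks host state into the stream fails the comparison, including ones the normalizer forgot.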