[CI] Build the test image once and pull it from ECR#5905
Conversation
build.yml rebuilt the test image independently in both test-isaaclab-tasks and test-general (~23 min each, run in parallel and racing the GHA layer cache). Add a single build job that builds the image and pushes it to ECR via the ecr-build-push-pull action (ported from develop), and have the test jobs depend on it and pull the prebuilt image instead of rebuilding. The action resolves the ECR URL from the runner's SSM ecr-cache-url parameter and falls back to a local build when ECR is unavailable, so runners without ECR keep working as before. Preserves main's image (docker/Dockerfile.curobo), test execution, and result handling.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot — PR #5905
Summary
This PR eliminates a redundant ~23-min Docker image build by introducing a dedicated build job that pushes to ECR once, then has both test-isaaclab-tasks and test-general pull the prebuilt image. A new composite action (.github/actions/ecr-build-push-pull/action.yml) encapsulates the build/push/pull logic with a multi-level caching strategy (exact commit tag → deps hash → full build).
🧪 Isaac Lab Expert
Architecture: ✅ Correct build-once/pull pattern
- The job dependency graph (
build → {test-isaaclab-tasks, test-general} → combine-results) ensures the image is always available before test jobs start. - The
cache-tag: cache-main-curobocorrectly namespaces main's layer cache away from develop/beta2. - The action is ported from the already-proven
developworkflow — good reuse.
Race condition risk: ✅ None identified
- The
needs: [build]dependency in both test jobs provides a hard barrier — no test job starts until the build job succeeds. - The previous approach had both test jobs building in parallel racing the GHA layer cache; this eliminates that race entirely.
Runner affinity consideration:
- The
buildjob and test jobs all use[self-hosted, gpu]runners. If the test jobs land on a different runner than the build job, they will pull from ECR (not local cache). This is by design and the action handles it correctly viadocker manifest inspect→ pull → tag. However, if the ECR push in the build job fails silently (see below), the test jobs would fall back to a full local build — which is actually the correct degraded behavior.
🔇 Silent Failure Hunter
1. ECR push failure in build job (Step 7) — Medium risk
- If
docker pushfails in the build job's "Push to ECR" step, it will cause the build job to fail (non-zero exit), which would block test jobs entirely due toneeds: [build]. - ✅ This is correct behavior — a hard failure is better than a silent one.
2. ECR unavailability in test jobs — ✅ Safe
- When test jobs invoke the action and ECR is unavailable, the action resolves no ECR URL and falls back to a full local build (Step 6). The
|| trueguards on IMDS/SSM calls prevent hard failures during resolution. - However, since the
buildjob already pushed the image, the test jobs should always find it via the commit-tagged ECR image (Step 4 manifest check → pull).
3. Docker credential store bypass — ✅ Properly handled
- Step 1 creates a temp config with
credsStore: ""and cleans it up in Step 9 (if: always()).
4. Potential gap: docker buildx imagetools create failure in deps-cache step (Step 5)
- If
docker buildx imagetools create -t "${ECR_IMAGE}" "${DEPS_ECR_IMAGE}"fails, the step would exit non-zero and the build job would fail. This is safe (fails loud).
5. || true patterns in resolution steps — Acceptable
- The
|| trueguards on IMDS token, instance ID, region, and SSM are intentional — they allow graceful degradation on non-EC2 runners.
📊 Test Coverage Analyzer
- No new tests needed. This is a CI-infrastructure-only change. The existing CI run of this PR serves as the integration test (build → pull → test pattern).
- The test plan in the PR description is appropriate: validate YAML, observe the green CI run.
🏁 Verdict: LGTM ✅
Well-structured CI optimization that follows the proven pattern from develop. The build-once/pull architecture is correct, race conditions are eliminated by job dependencies, and silent failures are properly handled with graceful degradation.
Minor suggestions (non-blocking):
- Consider adding a brief comment in
build.ymlnoting thatecr-build-push-pullin test jobs will effectively just pull (since build already pushed), to aid future maintainers reading the workflow. - The
timeout-minutes: 180on thebuildjob matches the test jobs — this is generous for a single build. A tighter timeout (e.g., 60 min) could catch hangs earlier, though this is cosmetic given the existing test job timeouts.
Reviewed at commit: 7c22ca8
1. Summary
build.ymlrebuilds the test Docker image independently in bothtest-isaaclab-tasksandtest-general(~23 min each, in parallel, racing the shared GHA layer cache). This builds the same image twice every run.buildjob that pushes it to ECR, then has the two test jobs depend on it and pull the prebuilt image instead of rebuilding.ecr-build-push-pullaction already used bydevelop/releaseon the same self-hosted runner pool; it resolves the ECR URL from the runner's SSMecr-cache-urlparameter and falls back to a local build when ECR is unavailable, so no runner regresses.2. Changes
.github/actions/ecr-build-push-pull/action.yml— generic drop-in fordocker-buildwith ECR-backed caching (ported verbatim fromdevelop). Builds + pushes the commit-tagged image, or pulls it if it already exists in ECR..github/workflows/build.yml:buildjob that builds and pushes the image to ECR once.test-isaaclab-tasksandtest-generalnowneeds: [build]and replace their per-jobBuild Docker Imagestep with an ECR pull of the sameimage-tag.run-tests, result copy/upload, fork-result checks, andcombine-resultsare unchanged. Job graph:build → {test-isaaclab-tasks, test-general} → combine-results.3. Adapted to main (not a develop bulk-copy)
run-tests/combine-results, and the existing image (docker/Dockerfile.curobo) — this only changes how many times the image is built, not what is built.cache-tag: cache-main-curoboso main's layer cache does not collide with develop/beta2's namespace in the shared ECR repo.4. Expected effect
5. Test plan
build.ymland the new action validate as YAML; job dependency graph isbuild → test jobs → combine-results.buildjob pushes once;test-isaaclab-tasks/test-generalpull the prebuilt image (no rebuild). Draft until that is observed green.Note: targets
maindeliberately —developalready uses this build-once/ECR pattern; this brings main'sbuild.ymlin line.