Skip to content

[CI] Build the test image once and pull it from ECR#5905

Draft
hujc7 wants to merge 1 commit into
isaac-sim:mainfrom
hujc7:jichuanh/main-ci-ecr-build-once
Draft

[CI] Build the test image once and pull it from ECR#5905
hujc7 wants to merge 1 commit into
isaac-sim:mainfrom
hujc7:jichuanh/main-ci-ecr-build-once

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented Jun 1, 2026

1. Summary

  • main's build.yml rebuilds the test Docker image independently in both test-isaaclab-tasks and test-general (~23 min each, in parallel, racing the shared GHA layer cache). This builds the same image twice every run.
  • This change builds the image once in a dedicated build job that pushes it to ECR, then has the two test jobs depend on it and pull the prebuilt image instead of rebuilding.
  • Adopts the ecr-build-push-pull action already used by develop/release on the same self-hosted runner pool; it resolves the ECR URL from the runner's SSM ecr-cache-url parameter and falls back to a local build when ECR is unavailable, so no runner regresses.

2. Changes

  1. New .github/actions/ecr-build-push-pull/action.yml — generic drop-in for docker-build with ECR-backed caching (ported verbatim from develop). Builds + pushes the commit-tagged image, or pulls it if it already exists in ECR.
  2. .github/workflows/build.yml:
    • Added a build job that builds and pushes the image to ECR once.
    • test-isaaclab-tasks and test-general now needs: [build] and replace their per-job Build Docker Image step with an ECR pull of the same image-tag.
    • run-tests, result copy/upload, fork-result checks, and combine-results are unchanged. Job graph: build → {test-isaaclab-tasks, test-general} → combine-results.

3. Adapted to main (not a develop bulk-copy)

  • Keeps main's two-job test structure, env vars, run-tests/combine-results, and the existing image (docker/Dockerfile.curobo) — this only changes how many times the image is built, not what is built.
  • Uses cache-tag: cache-main-curobo so main's layer cache does not collide with develop/beta2's namespace in the shared ECR repo.

4. Expected effect

  • The duplicate ~23-min build collapses to one build + a ~1–2 min pull per test job, and the two jobs no longer race the GHA cache.

5. Test plan

  • build.yml and the new action validate as YAML; job dependency graph is build → test jobs → combine-results.
  • Pre-commit clean.
  • CI run (this PR): the build job pushes once; test-isaaclab-tasks/test-general pull the prebuilt image (no rebuild). Draft until that is observed green.

Note: targets main deliberately — develop already uses this build-once/ECR pattern; this brings main's build.yml in line.

build.yml rebuilt the test image independently in both test-isaaclab-tasks
and test-general (~23 min each, run in parallel and racing the GHA layer
cache). Add a single build job that builds the image and pushes it to ECR
via the ecr-build-push-pull action (ported from develop), and have the test
jobs depend on it and pull the prebuilt image instead of rebuilding.

The action resolves the ECR URL from the runner's SSM ecr-cache-url
parameter and falls back to a local build when ECR is unavailable, so
runners without ECR keep working as before. Preserves main's image
(docker/Dockerfile.curobo), test execution, and result handling.
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot — PR #5905

Summary

This PR eliminates a redundant ~23-min Docker image build by introducing a dedicated build job that pushes to ECR once, then has both test-isaaclab-tasks and test-general pull the prebuilt image. A new composite action (.github/actions/ecr-build-push-pull/action.yml) encapsulates the build/push/pull logic with a multi-level caching strategy (exact commit tag → deps hash → full build).


🧪 Isaac Lab Expert

Architecture: ✅ Correct build-once/pull pattern

  • The job dependency graph (build → {test-isaaclab-tasks, test-general} → combine-results) ensures the image is always available before test jobs start.
  • The cache-tag: cache-main-curobo correctly namespaces main's layer cache away from develop/beta2.
  • The action is ported from the already-proven develop workflow — good reuse.

Race condition risk: ✅ None identified

  • The needs: [build] dependency in both test jobs provides a hard barrier — no test job starts until the build job succeeds.
  • The previous approach had both test jobs building in parallel racing the GHA layer cache; this eliminates that race entirely.

Runner affinity consideration: ⚠️ Minor

  • The build job and test jobs all use [self-hosted, gpu] runners. If the test jobs land on a different runner than the build job, they will pull from ECR (not local cache). This is by design and the action handles it correctly via docker manifest inspect → pull → tag. However, if the ECR push in the build job fails silently (see below), the test jobs would fall back to a full local build — which is actually the correct degraded behavior.

🔇 Silent Failure Hunter

1. ECR push failure in build job (Step 7) — Medium risk

  • If docker push fails in the build job's "Push to ECR" step, it will cause the build job to fail (non-zero exit), which would block test jobs entirely due to needs: [build].
  • ✅ This is correct behavior — a hard failure is better than a silent one.

2. ECR unavailability in test jobs — ✅ Safe

  • When test jobs invoke the action and ECR is unavailable, the action resolves no ECR URL and falls back to a full local build (Step 6). The || true guards on IMDS/SSM calls prevent hard failures during resolution.
  • However, since the build job already pushed the image, the test jobs should always find it via the commit-tagged ECR image (Step 4 manifest check → pull).

3. Docker credential store bypass — ✅ Properly handled

  • Step 1 creates a temp config with credsStore: "" and cleans it up in Step 9 (if: always()).

4. Potential gap: docker buildx imagetools create failure in deps-cache step (Step 5)

  • If docker buildx imagetools create -t "${ECR_IMAGE}" "${DEPS_ECR_IMAGE}" fails, the step would exit non-zero and the build job would fail. This is safe (fails loud).

5. || true patterns in resolution steps — Acceptable

  • The || true guards on IMDS token, instance ID, region, and SSM are intentional — they allow graceful degradation on non-EC2 runners.

📊 Test Coverage Analyzer

  • No new tests needed. This is a CI-infrastructure-only change. The existing CI run of this PR serves as the integration test (build → pull → test pattern).
  • The test plan in the PR description is appropriate: validate YAML, observe the green CI run.

🏁 Verdict: LGTM

Well-structured CI optimization that follows the proven pattern from develop. The build-once/pull architecture is correct, race conditions are eliminated by job dependencies, and silent failures are properly handled with graceful degradation.

Minor suggestions (non-blocking):

  1. Consider adding a brief comment in build.yml noting that ecr-build-push-pull in test jobs will effectively just pull (since build already pushed), to aid future maintainers reading the workflow.
  2. The timeout-minutes: 180 on the build job matches the test jobs — this is generous for a single build. A tighter timeout (e.g., 60 min) could catch hangs earlier, though this is cosmetic given the existing test job timeouts.

Reviewed at commit: 7c22ca8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant