Add Harbor Dockerfile and multi-step support #1865

Draft
gabeorlanski wants to merge 7 commits into PrimeIntellect-ai:main from gabeorlanski:harbor-docker-multistep-support

gabeorlanski commented Jun 24, 2026

Description

Adds Harbor Dockerfile and multi-step task support across the Harbor implementations.

Key changes:

  • Adds shared Harbor parsing, tar, reward, and multi-step aggregation helpers.
  • Supports Harbor multi-step task layouts in harbor-v1 and experimental HarborEnv.
  • Supports Dockerfile-backed Harbor tasks with deterministic local image tags and configurable Dockerfile policy.
  • Adds local Harbor dataset path support for harbor-v1.
  • Records per-step Harbor rewards in rollout trace info.
  • Adds focused unit and rollout-level Docker tests for Harbor scoring and multi-step behavior.
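
For readers unfamiliar with the idea, a deterministic image tag can be derived by hashing the Dockerfile build context, so an unchanged context maps to the same tag and an existing local image can be reused. The sketch below is illustrative only and is not the code in this PR: the helper name `context_image_tag`, the use of SHA-256, and the 12-character digest length are assumptions. Only the `vf-harbor-` prefix is taken from the PR summary.

```python
import hashlib
from pathlib import Path


def context_image_tag(context_dir: Path, prefix: str = "vf-harbor") -> str:
    """Hypothetical helper: derive a stable image tag from a build context."""
    digest = hashlib.sha256()
    # Sort paths so the hash does not depend on filesystem iteration order.
    for path in sorted(p for p in context_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(context_dir).as_posix()
        # Hash the relative path and the bytes, separated by NUL bytes,
        # so renames and content edits both change the digest.
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    # Assumption: a 12-character hex suffix is enough to avoid collisions here.
    return f"{prefix}-{digest.hexdigest()[:12]}"
```

Under this scheme, a caller could check whether an image with the computed tag already exists locally and skip the build when it does; any edit to a file in the context produces a different tag and therefore a fresh build.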

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Ran:

uv run ruff check verifiers/harbor.py verifiers/v1/tasksets/harbor_v1 verifiers/envs/experimental/harbor_env/env.py tests/v1/test_harbor_taskset.py
uv run pytest tests/v1/test_harbor_taskset.py tests/test_harbor_env_mcp.py -q
uv run pytest tests/v1 -q -k 'not e2e'

Note: I did not run the full uv run pytest suite locally because the full v1 e2e matrix includes environment-dependent cases such as Modal.

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

This keeps framework changes out of the PR and confines implementation to Harbor/shared-Harbor code paths plus tests/docs.

Note

Add Dockerfile build support and multi-step task scoring to Harbor taskset

  • Adds DockerfilePolicy to taskset.py with build, ignore, and error modes; build constructs a deterministic local Docker image (tagged vf-harbor-*) from the task's Dockerfile context with content-hash caching.
  • Extends HarborTask and parse_task to carry multi-step metadata (HarborStep list, multi_step_reward_strategy, workdir), with aggregate harness/scoring timeouts computed as sums (unbounded if any step has no timeout).
  • Refactors HarborTaskset.solved and introduces run_verifier/record_step_results to evaluate steps sequentially, support early stopping on min_reward failure, and aggregate results using mean or final strategy into trace.info.
  • Adds shared helpers in verifiers/harbor.py for config loading, reward parsing (scalar and JSON), tarball creation, step info serialization, and prompt synthesis from steps when instruction.md is absent.
  • Mirrors multi-step support in HarborEnv for the verifiers-side evaluation path.
  • Risk: default dockerfile_policy changes to build; existing callers using --taskset.ignore-dockerfile must migrate to --taskset.dockerfile-policy ignore.
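
The timeout and scoring rules summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the `Step` dataclass, `total_timeout`, and `score_steps` are hypothetical names, and the choice to average only over the steps that actually ran after an early stop is an assumption the summary does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Step:
    """Hypothetical stand-in for a single step of a multi-step task."""
    name: str
    verify: Callable[[], float]          # returns this step's reward
    min_reward: Optional[float] = None   # early-stop threshold, if any
    timeout: Optional[float] = None      # seconds; None means unbounded


def total_timeout(steps: Sequence[Step]) -> Optional[float]:
    """Sum per-step timeouts; unbounded (None) if any step has no timeout."""
    if any(step.timeout is None for step in steps):
        return None
    return float(sum(step.timeout for step in steps))


def score_steps(steps: Sequence[Step], strategy: str = "mean") -> Tuple[float, List[float]]:
    """Run steps in order, stop early on a min_reward failure, then aggregate."""
    rewards: List[float] = []
    for step in steps:
        reward = step.verify()
        rewards.append(reward)
        # Stop evaluating later steps once a step misses its threshold.
        if step.min_reward is not None and reward < step.min_reward:
            break
    if not rewards:
        return 0.0, rewards
    if strategy == "mean":
        # Assumption: average over the steps that ran, not all declared steps.
        return sum(rewards) / len(rewards), rewards
    if strategy == "final":
        return rewards[-1], rewards
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Returning the per-step rewards alongside the aggregate mirrors the idea of recording step results in rollout trace info, so a failed early step remains visible even when the `final` strategy reports only the last reward.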

Macroscope summarized fbc6288. (Automatic summaries will resume when PR exits draft mode or review begins).

gabeorlanski force-pushed the harbor-docker-multistep-support branch from fbc6288 to 483b08d on June 24, 2026 22:33