Skip to content

[codex] Support raw image offload in v1 train client#1746

Open
eligotts wants to merge 12 commits into
mainfrom
codex/v1-raw-image-offload
Open

[codex] Support raw image offload in v1 train client#1746
eligotts wants to merge 12 commits into
mainfrom
codex/v1-raw-image-offload

Conversation

@eligotts

@eligotts eligotts commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Design update — inline/offload image storage

This PR now follows the prime-rl multimodal image storage policy:

  • offload: current behavior, rewrite base64 data images to file:// run assets and require file-backed image URLs.
  • inline: keep data:image/...;base64,... URLs in the message payload and validate them without rewriting.

TrainClient now calls the policy-aware image preparation helper, so prime-rl can be the single source of truth via environment/config propagation.

Validation after latest push: uv run pytest tests/v1/test_train_client_multimodal.py -q passed (5 passed). Commit/push hooks also passed (ruff check, ruff format, generated AGENTS/CLAUDE check, ty).

Design update — dropped the None/cache-only image path

This PR and its companions (prime-rl #2836 / verifiers #1746 / renderers #89) no longer use the "send None for already-cached images" mechanism. Every image carries its raw descriptor ref at every slot (current and prior turns); /inference/v1/generate rematerializes each ref from disk every request.

Why: the None path coupled correctness to deployment (LRU cache present, single replica / DP-affinity, no eviction) and surfaced a miss as a hard vLLM EngineDeadError (qwen3-vl mrope dereferences a None image_grid_thw) that the retry net couldn't catch across the engine→API IPC. Dropping it is deployment-agnostic (a miss is impossible) and non-hacky. vLLM's mm_hash encoder cache still skips the expensive GPU re-encode for free — we only forgo the cheap IPC/CPU-reprocess dedup.

Validated: color-codeword (Qwen3-VL-4B) under DP=2, no affinity / no cache reliance: 0 crashes, 0 data=None, multi-turn accumulation correct, reward ~0.84. Also confirmed under TP.

This repo: with every image carrying its ref, no cache miss can occur — removed the retry subsystem (_generate_with_image_ref_retry, _has_descriptor_only_images, _retryable_mm_error_type, _json_error_type, _RETRYABLE_MM_ERROR_TYPES). Rollouts call renderers.client.generate directly. Obsolete retry tests removed.


Original description

Summary

  • tighten v1 multimodal graph serialization around strict raw descriptor sidecars
  • reject processed multimodal payload keys recursively, including nested pixel_values, image_embeds, and image_features
  • update v1 multimodal tests to use strict prime_raw_mm_item envelopes instead of descriptor-only Qwen payloads
  • keep raw image offload and retry behavior aligned with the companion Renderers and Prime-RL PRs

Companion PRs

Notes

  • Draft/WIP: this depends on the renderer generic raw multimodal ref contract in the companion PR.
  • v1 multimodal sidecars intentionally carry raw descriptors only, not processed image tensors or image-processor payloads.
  • Prime/vLLM materialization happens from raw image refs rather than Verifiers-held processor outputs.

Validation

  • Commit hooks: ruff check, ruff format, generated AGENTS/CLAUDE check passed.
  • Push hook: ty (ci parity) passed.
  • End-to-end hosted-style smoke through Prime-RL with /home/ubuntu/verifiers, /home/ubuntu/renderers, and /home/ubuntu/prime-rl-v1-raw-mm-offload completed inference, env rollouts, train batch creation, trainer step 0, and decoded strict trainer-bound raw image refs.

[!NOTE]

Add raw image offload and multimodal bridging support to v1 train client

  • Adds prepare_images_inplace in verifiers/utils/multimodal.py to walk nested message structures, offload image URLs to file-backed run assets, and enforce file:// URLs for multimodal training.
  • TrainClient overrides new prepare_request_body and prepare_messages hooks (defined on Client) to run image normalization before requests are sent.
  • InterceptionServer.handle_request now calls client preparation hooks on incoming requests and simulator-produced messages, surfacing failures as structured errors.
  • Multimodal graph handling is extended: _attribute_mm assigns per-message placeholders, MessageNode validates that only raw descriptors (not processed tensors) are serialized/deserialized, and PendingTurn.previous_multi_modal_data merges prefix multimodal data for turn bridging.
  • TrainClient.get_response no longer blocks bridging on multimodal detection; it passes previous_multi_modal_data into bridge_to_next_turn whenever the renderer is multimodal.
  • Risk: Requests with image URLs that cannot be offloaded to run assets will now raise RuntimeError at request time rather than failing silently later.

Macroscope summarized 9b3e7ee.


Note

Medium Risk
Multimodal training now hard-requires offloadable file:// image URLs and a compatible renderers version; invalid or remote-only images fail at request time. Graph serialization rejects old processed multimodal payloads, so mixed-version rollouts could break until companions land.

Overview
Adds prepare_images_inplace so renderer-backed clients rewrite image parts to file:// run assets (via renderers.mm_store) before tokenization, and fail if images are not file-backed after preparation.

v1 TrainClient implements new prepare_request_body / prepare_messages hooks (also on the base Client); the interception server runs them on incoming bodies and user-simulator messages. Multi-turn bridging no longer skips multimodal prompts: it passes previous_multi_modal_data into bridge_to_next_turn when the renderer is multimodal.

Trace graph multimodal sidecars are raw descriptor dicts (raw_image_uri, hashes, placeholders)—processed tensors (pixel_values, etc.) are rejected on serialize/deserialize. _attribute_mm and branch merge now carry placeholder ranges; PendingTurn.previous_multi_modal_data merges prefix sidecars for bridging.

v0 RendererClient runs the same image prep in to_native_prompt. Legacy v0→v1 rollouts keep live cumulative multi_modal_data on the trajectory when building traces (avoiding JSON-save deltas).

Reviewed by Cursor Bugbot for commit 9b3e7ee. Bugbot is set up for automated code reviews on this repo. Configure here.

@eligotts eligotts force-pushed the codex/v1-raw-image-offload branch from 3f5bb1a to de37650 Compare June 25, 2026 06:40
Comment thread verifiers/v1/clients/train.py
Comment thread verifiers/v1/graph.py Outdated
Comment thread verifiers/v1/clients/train.py Outdated
S1ro1 and others added 3 commits June 27, 2026 00:18
Every image carries its ref, so no cache miss can occur. Removes _generate_with_image_ref_retry / _has_descriptor_only_images / _retryable_mm_error_type / _json_error_type / _RETRYABLE_MM_ERROR_TYPES; rollouts call generate() directly. Obsolete retry tests removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fload

# Conflicts:
#	verifiers/v1/clients/train.py
Comment thread verifiers/v1/utils/multimodal.py Outdated
@eligotts eligotts marked this pull request as ready for review June 29, 2026 16:36

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Want reviews to match your repository better? Bugbot Learning can learn team-specific rules from PR activity. A team admin can enable Learning in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 0b1d73f. Configure here.

result = _offload_image_url(_image_source_url(source), image_dir)
if result is not None:
_set_image_source_url(source, result)
_require_file_image_url(source)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline image URLs always rejected

High Severity

The v1 train path always requires every image_url to become a file:// run asset after preparation, even when offload leaves a data:image/...;base64,... URL unchanged. That conflicts with the intended inline multimodal storage mode where base64 image URLs stay in the message payload, so inline training rollouts fail at request preparation instead of validating in place.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0b1d73f. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed inline mode so this is irrelevant

@macroscopeapp

macroscopeapp Bot commented Jun 29, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

2 blocking correctness issues found. This PR adds new multimodal image handling capabilities with significant new code and runtime behavior changes. Multiple unresolved review comments identify potential bugs including rejected inline images (high severity) and backwards compatibility concerns for existing saved rollouts.

No code changes detected at 9b3e7ee. Prior analysis still applies.

You can customize Macroscope's approvability policy. Learn more.

Comment thread verifiers/v1/graph.py
return False


def _validate_raw_mm_item(item: Any) -> dict[str, Any]:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium v1/graph.py:76

_validate_raw_mm_item now unconditionally rejects processed multimodal payloads containing keys like pixel_values, and deserialize_multi_modal_data runs it on every multi_modal_data field during deserialization. Loading a previously persisted multimodal v1 trace whose sidecars contain pixel_values now raises TypeError instead of round-tripping, breaking backwards compatibility for existing saved rollouts. Consider allowing processed payloads through on the deserialization path (e.g. by skipping the processed-key check in the validator's before path) so old traces can still be loaded.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/v1/graph.py around line 76:

`_validate_raw_mm_item` now unconditionally rejects processed multimodal payloads containing keys like `pixel_values`, and `deserialize_multi_modal_data` runs it on every `multi_modal_data` field during deserialization. Loading a previously persisted multimodal v1 trace whose sidecars contain `pixel_values` now raises `TypeError` instead of round-tripping, breaking backwards compatibility for existing saved rollouts. Consider allowing processed payloads through on the deserialization path (e.g. by skipping the processed-key check in the validator's `before` path) so old traces can still be loaded.

Comment on lines +64 to +67
if value.get("type") == "image_url":
source = value.get("image_url")
if source is not None:
_prepare_image_source(source, image_dir=image_dir)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium utils/multimodal.py:64

prepare_images_inplace skips validation when an image_url part has a missing or None image_url field: lines 65-67 only call _prepare_image_source when source is not None, so the malformed part passes through unchecked. Downstream, ChatDialect.parse_request normalizes it to ImageUrlSource(url=""), forwarding a request with an empty image URL instead of rejecting it. Consider calling _require_file_image_url(value) (or otherwise validating) when source is None so malformed parts are rejected.

Suggested change
if value.get("type") == "image_url":
source = value.get("image_url")
if source is not None:
_prepare_image_source(source, image_dir=image_dir)
if value.get("type") == "image_url":
source = value.get("image_url")
if source is not None:
_prepare_image_source(source, image_dir=image_dir)
else:
_require_file_image_url(value)
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/utils/multimodal.py around lines 64-67:

`prepare_images_inplace` skips validation when an `image_url` part has a missing or `None` `image_url` field: lines 65-67 only call `_prepare_image_source` when `source is not None`, so the malformed part passes through unchecked. Downstream, `ChatDialect.parse_request` normalizes it to `ImageUrlSource(url="")`, forwarding a request with an empty image URL instead of rejecting it. Consider calling `_require_file_image_url(value)` (or otherwise validating) when `source` is `None` so malformed parts are rejected.

…fload

# Conflicts:
#	verifiers/v1/cli/dashboard/eval.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants