Skip to content

feat(speculative): add Qwen3-VL support for DFlash training#1887

Open
skierat wants to merge 1 commit into
NVIDIA:mainfrom
skierat:skierat/qwen3-vl-dflash-training
Open

feat(speculative): add Qwen3-VL support for DFlash training#1887
skierat wants to merge 1 commit into
NVIDIA:mainfrom
skierat:skierat/qwen3-vl-dflash-training

Conversation

@skierat

@skierat skierat commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: new feature

Adds online DFlash training support for Qwen3-VL–style vision-language models.

Changes include:

  • Load VLMs through the Transformers 5 AutoModelForImageTextToText API, while retaining compatibility with the legacy VLM auto-model API.
  • Run the base model through its top-level multimodal forward when image/video inputs are present, ensuring vision embeddings are injected before collecting DFlash target hidden states.
  • Extend VisionLanguageDataCollator to:
    • propagate answer_only_loss, chat-template, and DFlash label-alignment settings;
    • apply VLM_MIN_PIXELS / VLM_MAX_PIXELS processor limits;
    • derive assistant-only masks from ChatML/Llama chat boundaries when processor generation masks are unavailable;
    • enforce the fixed training_seq_len required by DFlash block training.
  • Preserve the existing text-only DFlash path.

Usage

python -m torch.distributed.run \                                      
--nproc_per_node 4 \                                                 
examples/speculative_decoding/main.py \                              
--config modelopt_recipes/general/speculative_decoding/dflash.yaml \
model.model_name_or_path=/path/to/qwen3-vl-model \
model.trust_remote_code=true \
data.data_path=/path/to/train.jsonl \
data.vlm_processor=/path/to/qwen3-vl-model \                        
data.vlm_img_dir=/path/to/image/root \
training.training_seq_len=4096 \
training.answer_only_loss=true \
dflash.dflash_block_size=8 \
dflash.dflash_mask_token_id=151669


### Testing
- git diff --check
- Parsed all modified Python modules successfully.
- Ran iterative multi-node Slurm smoke tests with a Qwen3-VL-family model and mixed multimodal data:
    - validated VLM model loading with Transformers 5;
    - validated distributed initialization, DFlash conversion, and VLM collation paths;
    - identified and addressed processor padding/truncation behavior required by fixed-size DFlash blocks.
This PR remains draft pending a completed end-to-end training smoke test and automated regression coverage.

### Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines (https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (git commit -s -S).
Make sure you read and follow the Security Best Practices (https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded trust_remote_code=True, torch.load(...,
weights_only=False), pickle, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
- Did you write any new necessary tests?: ❌ — automated Qwen3-VL/DFlash regression coverage still needs to be added before review.
- Did you update Changelog (https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ❌ — evaluate and add an entry before marking ready for review if this is considered user-facing speculative-decoding support.
- Did you get Claude approval on this PR?: N/A
### Additional Information
The PR intentionally excludes local Slurm launch scripts, logs, model paths, datasets, and environment-specific configuration.


<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Improved vision-language training support with more consistent chat template handling, answer-only loss behavior, and label shifting.
  * Added broader compatibility when loading vision-language models across different supported Transformers entry points.

* **Bug Fixes**
  * Made JSONL dataset loading more resilient by falling back to manual parsing when automatic loading fails.
  * Improved multimodal training stability by correctly handling image/video inputs and preserving gradient sync in edge cases.
  * Better enforced sequence padding and assistant masking for multimodal batches.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

@skierat skierat requested review from a team as code owners July 2, 2026 14:55
@skierat skierat requested review from cjluo-nv and h-guo18 July 2, 2026 14:55
@copy-pr-bot

copy-pr-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR extends speculative decoding to better support vision-language models (VLMs): dataset loading gains a JSONL fallback, VisionLanguageDataCollator adds marker-derived assistant masking, shift_labels, and pixel-bound configuration; load_vlm_or_llm gains flexible auto-model class resolution; and HFDFlashModel's forward/loss path handles multimodal inputs.

Changes

VLM training and data pipeline updates

Layer / File(s) Summary
Dataset loading and VLM collator core changes
modelopt/torch/utils/plugins/transformers_dataset.py
Adds JSONL fallback in ShardedDataset._load_dataset; reworks VisionLanguageDataCollator with shift_labels, VLM_MIN_PIXELS/VLM_MAX_PIXELS env-based processor config, marker-derived assistant mask discovery, _build_assistant_masks, _pad_sequence_tensors, and updated _apply_chat_template/_process_multimodal_sample.
Collator wiring in example script
examples/speculative_decoding/eagle_utils.py
Passes chat_template, answer_only_loss, and shift_labels to the VLM collator constructor call.
VLM auto-model class resolution
modelopt/torch/speculative/utils.py
Updates load_vlm_or_llm to try AutoModelForVision2Seq, fall back to AutoModelForImageTextToText, then architecture-based lookup, raising ValueError if none match.
Multimodal forward and loss handling in HFDFlashModel
modelopt/torch/speculative/plugins/hf_dflash.py
Routes multimodal inputs through top-level model forward with hidden_states validation; rebuilds the zero-loss dummy tensor from all trainable dflash_module parameters instead of a single weight reference.

Estimated code review effort: 3 (Moderate) | ~35 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Trainer
  participant VisionLanguageDataCollator
  participant Processor
  participant HFDFlashModel

  Trainer->>VisionLanguageDataCollator: collate batch of samples
  VisionLanguageDataCollator->>Processor: from_pretrained (with pixel bounds)
  VisionLanguageDataCollator->>VisionLanguageDataCollator: _apply_chat_template (marker-derived masks)
  VisionLanguageDataCollator->>VisionLanguageDataCollator: _pad_sequence_tensors (train_len)
  VisionLanguageDataCollator->>VisionLanguageDataCollator: _process_multimodal_sample (labels via assistant_masks)
  VisionLanguageDataCollator-->>Trainer: batch with input_ids, labels, pixel_values
  Trainer->>HFDFlashModel: forward(batch)
  HFDFlashModel->>HFDFlashModel: detect multimodal kwargs
  HFDFlashModel->>HFDFlashModel: super().forward (return_dict, output_hidden_states)
  HFDFlashModel-->>Trainer: hidden_states / loss
Loading

Estimated code review effort: 3 (Moderate) | ~35 minutes


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error Two new # nosec comments added in modelopt/torch/utils/plugins/transformers_dataset.py violate SECURITY.md rule prohibiting # nosec as bypass for security checks. Remove # nosec comments and either request exception review from @NVIDIA/modelopt-setup-codeowners or refactor code to avoid Bandit flag.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding Qwen3-VL support for DFlash training.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/speculative/utils.py`:
- Around line 613-620: The model class resolution in utils.py currently prefers
AutoModelForVision2Seq before AutoModelForImageTextToText, which keeps using the
deprecated alias when both are available. Update the fallback order in the model
class lookup so AutoModelForImageTextToText is tried first, then fall back to
AutoModelForVision2Seq, and keep the architecture-based fallback in the same
resolution chain.

In `@modelopt/torch/utils/plugins/transformers_dataset.py`:
- Around line 537-545: The padding logic in `pad_tokenized_messages` only
extends `input_ids` and `attention_mask`, but it also needs to pad
`assistant_masks` when `apply_chat_template()` returns that tensor. Update the
sequence-padding loop near the existing `tokenized_messages.get(key)` handling
to include `assistant_masks` with the same `pad_width`, preserving tensor shape
consistency for the downstream `answer_only_loss` path in
`transformers_dataset`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77e86707-9ddf-4546-8afc-48443831de63

📥 Commits

Reviewing files that changed from the base of the PR and between 9038b71 and 16bad92.

📒 Files selected for processing (4)
  • examples/speculative_decoding/eagle_utils.py
  • modelopt/torch/speculative/plugins/hf_dflash.py
  • modelopt/torch/speculative/utils.py
  • modelopt/torch/utils/plugins/transformers_dataset.py

Comment on lines +613 to +620
# ``AutoModelForVision2Seq`` was renamed in Transformers 5. Prefer the
# legacy alias when present, then use the current auto-model entry point.
model_cls = getattr(transformers, "AutoModelForVision2Seq", None)
if model_cls is None:
model_cls = getattr(transformers, "AutoModelForImageTextToText", None)
if model_cls is None:
architecture = (getattr(model_config, "architectures", None) or [None])[0]
model_cls = getattr(transformers, architecture, None) if architecture else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant file and nearby logic
git ls-files modelopt/torch/speculative/utils.py
wc -l modelopt/torch/speculative/utils.py
sed -n '580,660p' modelopt/torch/speculative/utils.py

# Search for related auto-model selection logic
rg -n "AutoModelForVision2Seq|AutoModelForImageTextToText|architectures|FutureWarning" modelopt/torch/speculative -S

Repository: NVIDIA/Model-Optimizer

Length of output: 4378


🌐 Web query:

Hugging Face AutoModelForVision2Seq deprecated removed in v5 use AutoModelForImageTextToText FutureWarning

💡 Result:

The class AutoModelForVision2Seq was deprecated in earlier versions of the Hugging Face Transformers library and has been officially removed in version 5.0.0 [1][2][3]. To resolve the FutureWarning or import errors resulting from this change, you should update your code to use AutoModelForImageTextToText instead [4][1][5]. This change is part of the broader cleanup and architectural updates introduced in Transformers v5 [6][3]. If you are encountering this in your codebase, replacing all instances of AutoModelForVision2Seq with AutoModelForImageTextToText will restore compatibility with the v5.x ecosystem [1][7].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the Transformers version constraints used by this repo
rg -n "transformers|tf_latest|AutoModelForImageTextToText|AutoModelForVision2Seq" pyproject.toml requirements*.txt noxfile.py modelopt tests -S

Repository: NVIDIA/Model-Optimizer

Length of output: 50378


🌐 Web query:

Transformers 4.56 AutoModelForVision2Seq AutoModelForImageTextToText both available deprecation warning

💡 Result:

In the Hugging Face Transformers library, AutoModelForVision2Seq is deprecated and has been superseded by AutoModelForImageTextToText [1][2][3]. In version 4.56, using AutoModelForVision2Seq triggers a FutureWarning because it is slated for removal in version 5.0 [4][3][5]. The library has explicitly transitioned to AutoModelForImageTextToText as the standard class for handling vision-language (image-text-to-text) models [6][7]. To resolve this deprecation warning, you should update your code to replace any instances of AutoModelForVision2Seq with AutoModelForImageTextToText [1][8]. For example: Instead of: from transformers import AutoModelForVision2Seq model = AutoModelForVision2Seq.from_pretrained("model-name") Use: from transformers import AutoModelForImageTextToText model = AutoModelForImageTextToText.from_pretrained("model-name")

Citations:


Prefer AutoModelForImageTextToText first

AutoModelForVision2Seq is still present in the supported 4.56+ range and emits a FutureWarning, so this order keeps taking the deprecated path whenever both classes exist. Try the new class first, then fall back to the legacy alias for older installs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/speculative/utils.py` around lines 613 - 620, The model class
resolution in utils.py currently prefers AutoModelForVision2Seq before
AutoModelForImageTextToText, which keeps using the deprecated alias when both
are available. Update the fallback order in the model class lookup so
AutoModelForImageTextToText is tried first, then fall back to
AutoModelForVision2Seq, and keep the architecture-based fallback in the same
resolution chain.

Comment on lines +537 to +545
for key, pad_value in (
("input_ids", self.tokenizer.pad_token_id),
("attention_mask", 0),
):
value = tokenized_messages.get(key)
if not isinstance(value, torch.Tensor):
continue
pad = value.new_full((value.shape[0], pad_width), pad_value)
tokenized_messages[key] = torch.cat((value, pad), dim=1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Pad assistant_masks with the other sequence tensors.

When marker-derived masks are not used, apply_chat_template() can return assistant_masks; if the processor also ignored padding="max_length", this helper pads input_ids/attention_mask but leaves assistant_masks short. Line 595 can then fail with a shape mismatch under answer_only_loss=True.

Proposed fix
         for key, pad_value in (
             ("input_ids", self.tokenizer.pad_token_id),
             ("attention_mask", 0),
+            ("assistant_masks", 0),
         ):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for key, pad_value in (
("input_ids", self.tokenizer.pad_token_id),
("attention_mask", 0),
):
value = tokenized_messages.get(key)
if not isinstance(value, torch.Tensor):
continue
pad = value.new_full((value.shape[0], pad_width), pad_value)
tokenized_messages[key] = torch.cat((value, pad), dim=1)
for key, pad_value in (
("input_ids", self.tokenizer.pad_token_id),
("attention_mask", 0),
("assistant_masks", 0),
):
value = tokenized_messages.get(key)
if not isinstance(value, torch.Tensor):
continue
pad = value.new_full((value.shape[0], pad_width), pad_value)
tokenized_messages[key] = torch.cat((value, pad), dim=1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/utils/plugins/transformers_dataset.py` around lines 537 - 545,
The padding logic in `pad_tokenized_messages` only extends `input_ids` and
`attention_mask`, but it also needs to pad `assistant_masks` when
`apply_chat_template()` returns that tensor. Update the sequence-padding loop
near the existing `tokenized_messages.get(key)` handling to include
`assistant_masks` with the same `pad_width`, preserving tensor shape consistency
for the downstream `answer_only_loss` path in `transformers_dataset`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant