feat(speculative): add Qwen3-VL support for DFlash training by skierat · Pull Request #1887 · NVIDIA/Model-Optimizer

skierat · 2026-07-02T14:55:14Z

What does this PR do?

Type of change: new feature

Adds online DFlash training support for Qwen3-VL–style vision-language models.

Changes include:

Load VLMs through the Transformers 5 AutoModelForImageTextToText API, while retaining compatibility with the legacy VLM auto-model API.
Run the base model through its top-level multimodal forward when image/video inputs are present, ensuring vision embeddings are injected before collecting DFlash target hidden states.
Extend VisionLanguageDataCollator to:
- propagate answer_only_loss, chat-template, and DFlash label-alignment settings;
- apply VLM_MIN_PIXELS / VLM_MAX_PIXELS processor limits;
- derive assistant-only masks from ChatML/Llama chat boundaries when processor generation masks are unavailable;
- enforce the fixed training_seq_len required by DFlash block training.
Preserve the existing text-only DFlash path.

Usage

python -m torch.distributed.run \                                      
--nproc_per_node 4 \                                                 
examples/speculative_decoding/main.py \                              
--config modelopt_recipes/general/speculative_decoding/dflash.yaml \
model.model_name_or_path=/path/to/qwen3-vl-model \
model.trust_remote_code=true \
data.data_path=/path/to/train.jsonl \
data.vlm_processor=/path/to/qwen3-vl-model \                        
data.vlm_img_dir=/path/to/image/root \
training.training_seq_len=4096 \
training.answer_only_loss=true \
dflash.dflash_block_size=8 \
dflash.dflash_mask_token_id=151669


### Testing
- git diff --check
- Parsed all modified Python modules successfully.
- Ran iterative multi-node Slurm smoke tests with a Qwen3-VL-family model and mixed multimodal data:
    - validated VLM model loading with Transformers 5;
    - validated distributed initialization, DFlash conversion, and VLM collation paths;
    - identified and addressed processor padding/truncation behavior required by fixed-size DFlash blocks.
This PR remains draft pending a completed end-to-end training smoke test and automated regression coverage.

### Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines (https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (git commit -s -S).
Make sure you read and follow the Security Best Practices (https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded trust_remote_code=True, torch.load(...,
weights_only=False), pickle, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
- Did you write any new necessary tests?: ❌ — automated Qwen3-VL/DFlash regression coverage still needs to be added before review.
- Did you update Changelog (https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ❌ — evaluate and add an entry before marking ready for review if this is considered user-facing speculative-decoding support.
- Did you get Claude approval on this PR?: N/A
### Additional Information
The PR intentionally excludes local Slurm launch scripts, logs, model paths, datasets, and environment-specific configuration.


<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Improved vision-language training support with more consistent chat template handling, answer-only loss behavior, and label shifting.
  * Added broader compatibility when loading vision-language models across different supported Transformers entry points.

* **Bug Fixes**
  * Made JSONL dataset loading more resilient by falling back to manual parsing when automatic loading fails.
  * Improved multimodal training stability by correctly handling image/video inputs and preserving gradient sync in edge cases.
  * Better enforced sequence padding and assistant masking for multimodal batches.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

copy-pr-bot · 2026-07-02T14:55:18Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-07-02T14:56:43Z

📝 Walkthrough

Walkthrough

This PR extends speculative decoding to better support vision-language models (VLMs): dataset loading gains a JSONL fallback, VisionLanguageDataCollator adds marker-derived assistant masking, shift_labels, and pixel-bound configuration; load_vlm_or_llm gains flexible auto-model class resolution; and HFDFlashModel's forward/loss path handles multimodal inputs.

Changes

VLM training and data pipeline updates

Layer / File(s)	Summary
Dataset loading and VLM collator core changes `modelopt/torch/utils/plugins/transformers_dataset.py`	Adds JSONL fallback in `ShardedDataset._load_dataset`; reworks `VisionLanguageDataCollator` with `shift_labels`, `VLM_MIN_PIXELS`/`VLM_MAX_PIXELS` env-based processor config, marker-derived assistant mask discovery, `_build_assistant_masks`, `_pad_sequence_tensors`, and updated `_apply_chat_template`/`_process_multimodal_sample`.
Collator wiring in example script `examples/speculative_decoding/eagle_utils.py`	Passes `chat_template`, `answer_only_loss`, and `shift_labels` to the VLM collator constructor call.
VLM auto-model class resolution `modelopt/torch/speculative/utils.py`	Updates `load_vlm_or_llm` to try `AutoModelForVision2Seq`, fall back to `AutoModelForImageTextToText`, then architecture-based lookup, raising `ValueError` if none match.
Multimodal forward and loss handling in HFDFlashModel `modelopt/torch/speculative/plugins/hf_dflash.py`	Routes multimodal inputs through top-level model forward with hidden_states validation; rebuilds the zero-loss dummy tensor from all trainable `dflash_module` parameters instead of a single weight reference.

Estimated code review effort: 3 (Moderate) | ~35 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Trainer
  participant VisionLanguageDataCollator
  participant Processor
  participant HFDFlashModel

  Trainer->>VisionLanguageDataCollator: collate batch of samples
  VisionLanguageDataCollator->>Processor: from_pretrained (with pixel bounds)
  VisionLanguageDataCollator->>VisionLanguageDataCollator: _apply_chat_template (marker-derived masks)
  VisionLanguageDataCollator->>VisionLanguageDataCollator: _pad_sequence_tensors (train_len)
  VisionLanguageDataCollator->>VisionLanguageDataCollator: _process_multimodal_sample (labels via assistant_masks)
  VisionLanguageDataCollator-->>Trainer: batch with input_ids, labels, pixel_values
  Trainer->>HFDFlashModel: forward(batch)
  HFDFlashModel->>HFDFlashModel: detect multimodal kwargs
  HFDFlashModel->>HFDFlashModel: super().forward (return_dict, output_hidden_states)
  HFDFlashModel-->>Trainer: hidden_states / loss

Estimated code review effort: 3 (Moderate) | ~35 minutes

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	Two new `# nosec` comments added in modelopt/torch/utils/plugins/transformers_dataset.py violate SECURITY.md rule prohibiting `# nosec` as bypass for security checks.	Remove `# nosec` comments and either request exception review from `@NVIDIA/modelopt-setup-codeowners` or refactor code to avoid Bandit flag.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding Qwen3-VL support for DFlash training.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/speculative/utils.py`:
- Around line 613-620: The model class resolution in utils.py currently prefers
AutoModelForVision2Seq before AutoModelForImageTextToText, which keeps using the
deprecated alias when both are available. Update the fallback order in the model
class lookup so AutoModelForImageTextToText is tried first, then fall back to
AutoModelForVision2Seq, and keep the architecture-based fallback in the same
resolution chain.

In `@modelopt/torch/utils/plugins/transformers_dataset.py`:
- Around line 537-545: The padding logic in `pad_tokenized_messages` only
extends `input_ids` and `attention_mask`, but it also needs to pad
`assistant_masks` when `apply_chat_template()` returns that tensor. Update the
sequence-padding loop near the existing `tokenized_messages.get(key)` handling
to include `assistant_masks` with the same `pad_width`, preserving tensor shape
consistency for the downstream `answer_only_loss` path in
`transformers_dataset`.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77e86707-9ddf-4546-8afc-48443831de63

📥 Commits

Reviewing files that changed from the base of the PR and between 9038b71 and 16bad92.

📒 Files selected for processing (4)

examples/speculative_decoding/eagle_utils.py
modelopt/torch/speculative/plugins/hf_dflash.py
modelopt/torch/speculative/utils.py
modelopt/torch/utils/plugins/transformers_dataset.py

coderabbitai · 2026-07-02T15:07:18Z

+        # ``AutoModelForVision2Seq`` was renamed in Transformers 5. Prefer the
+        # legacy alias when present, then use the current auto-model entry point.
+        model_cls = getattr(transformers, "AutoModelForVision2Seq", None)
+        if model_cls is None:
+            model_cls = getattr(transformers, "AutoModelForImageTextToText", None)
+        if model_cls is None:
+            architecture = (getattr(model_config, "architectures", None) or [None])[0]
+            model_cls = getattr(transformers, architecture, None) if architecture else None


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the relevant file and nearby logic git ls-files modelopt/torch/speculative/utils.py wc -l modelopt/torch/speculative/utils.py sed -n '580,660p' modelopt/torch/speculative/utils.py # Search for related auto-model selection logic rg -n "AutoModelForVision2Seq|AutoModelForImageTextToText|architectures|FutureWarning" modelopt/torch/speculative -S

Repository: NVIDIA/Model-Optimizer

Length of output: 4378

🌐 Web query:

Hugging Face AutoModelForVision2Seq deprecated removed in v5 use AutoModelForImageTextToText FutureWarning

💡 Result:

The class AutoModelForVision2Seq was deprecated in earlier versions of the Hugging Face Transformers library and has been officially removed in version 5.0.0 [1][2][3]. To resolve the FutureWarning or import errors resulting from this change, you should update your code to use AutoModelForImageTextToText instead [4][1][5]. This change is part of the broader cleanup and architectural updates introduced in Transformers v5 [6][3]. If you are encountering this in your codebase, replacing all instances of AutoModelForVision2Seq with AutoModelForImageTextToText will restore compatibility with the v5.x ecosystem [1][7].

Citations:

1: Remove references to AutoModelForVision2Seq huggingface/transformers#41513

2: huggingface/transformers@878d60a

3: https://github.com/huggingface/transformers/releases/tag/v5.0.0

4: https://raw.githubusercontent.com/huggingface/transformers/main/MIGRATION_GUIDE_V5.md

5: AutoModelForVision2Seq.from_config() got an unexpected keyword argument 'torch_dtype' huggingface/transformers#40477

6: https://github.com/huggingface/transformers/blob/main/MIGRATION_GUIDE_V5.md

7: Remove references to AutoModelForVision2Seq huggingface/transformers#41513

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find the Transformers version constraints used by this repo rg -n "transformers|tf_latest|AutoModelForImageTextToText|AutoModelForVision2Seq" pyproject.toml requirements*.txt noxfile.py modelopt tests -S

Repository: NVIDIA/Model-Optimizer

Length of output: 50378

🌐 Web query:

Transformers 4.56 AutoModelForVision2Seq AutoModelForImageTextToText both available deprecation warning

💡 Result:

In the Hugging Face Transformers library, AutoModelForVision2Seq is deprecated and has been superseded by AutoModelForImageTextToText [1][2][3]. In version 4.56, using AutoModelForVision2Seq triggers a FutureWarning because it is slated for removal in version 5.0 [4][3][5]. The library has explicitly transitioned to AutoModelForImageTextToText as the standard class for handling vision-language (image-text-to-text) models [6][7]. To resolve this deprecation warning, you should update your code to replace any instances of AutoModelForVision2Seq with AutoModelForImageTextToText [1][8]. For example: Instead of: from transformers import AutoModelForVision2Seq model = AutoModelForVision2Seq.from_pretrained("model-name") Use: from transformers import AutoModelForImageTextToText model = AutoModelForImageTextToText.from_pretrained("model-name")

Citations:

1: Remove references to AutoModelForVision2Seq huggingface/transformers#41513

2: huggingface/transformers@878d60a

3: huggingface/transformers@cb54ce4

4: Warning when using the "image-to-text" pipeline huggingface/transformers#41509

5: AutoModelForVision2Seq.from_config() got an unexpected keyword argument 'torch_dtype' huggingface/transformers#40477

6: https://huggingface.co/docs/transformers/v4.56.0/tasks/image_text_to_text

7: Add auto model for image-text-to-text huggingface/transformers#32472

8: Remove references to AutoModelForVision2Seq huggingface/transformers#41513

Prefer AutoModelForImageTextToText first

AutoModelForVision2Seq is still present in the supported 4.56+ range and emits a FutureWarning, so this order keeps taking the deprecated path whenever both classes exist. Try the new class first, then fall back to the legacy alias for older installs.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/speculative/utils.py` around lines 613 - 620, The model class resolution in utils.py currently prefers AutoModelForVision2Seq before AutoModelForImageTextToText, which keeps using the deprecated alias when both are available. Update the fallback order in the model class lookup so AutoModelForImageTextToText is tried first, then fall back to AutoModelForVision2Seq, and keep the architecture-based fallback in the same resolution chain.

coderabbitai · 2026-07-02T15:07:18Z

+        for key, pad_value in (
+            ("input_ids", self.tokenizer.pad_token_id),
+            ("attention_mask", 0),
+        ):
+            value = tokenized_messages.get(key)
+            if not isinstance(value, torch.Tensor):
+                continue
+            pad = value.new_full((value.shape[0], pad_width), pad_value)
+            tokenized_messages[key] = torch.cat((value, pad), dim=1)


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Pad assistant_masks with the other sequence tensors.

When marker-derived masks are not used, apply_chat_template() can return assistant_masks; if the processor also ignored padding="max_length", this helper pads input_ids/attention_mask but leaves assistant_masks short. Line 595 can then fail with a shape mismatch under answer_only_loss=True.

Proposed fix

for key, pad_value in ( ("input_ids", self.tokenizer.pad_token_id), ("attention_mask", 0), + ("assistant_masks", 0), ):

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for key, pad_value in (

("input_ids", self.tokenizer.pad_token_id),

("attention_mask", 0),

):

value = tokenized_messages.get(key)

if not isinstance(value, torch.Tensor):

continue

pad = value.new_full((value.shape[0], pad_width), pad_value)

tokenized_messages[key] = torch.cat((value, pad), dim=1)

for key, pad_value in (

("input_ids", self.tokenizer.pad_token_id),

("attention_mask", 0),

("assistant_masks", 0),

):

value = tokenized_messages.get(key)

if not isinstance(value, torch.Tensor):

continue

pad = value.new_full((value.shape[0], pad_width), pad_value)

tokenized_messages[key] = torch.cat((value, pad), dim=1)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/utils/plugins/transformers_dataset.py` around lines 537 - 545, The padding logic in `pad_tokenized_messages` only extends `input_ids` and `attention_mask`, but it also needs to pad `assistant_masks` when `apply_chat_template()` returns that tensor. Update the sequence-padding loop near the existing `tokenized_messages.get(key)` handling to include `assistant_masks` with the same `pad_width`, preserving tensor shape consistency for the downstream `answer_only_loss` path in `transformers_dataset`.

feat(speculative): add Qwen3-VL support for DFlash training

16bad92

skierat requested review from a team as code owners July 2, 2026 14:55

skierat requested review from cjluo-nv and h-guo18 July 2, 2026 14:55

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(speculative): add Qwen3-VL support for DFlash training#1887

feat(speculative): add Qwen3-VL support for DFlash training#1887
skierat wants to merge 1 commit into
NVIDIA:mainfrom
skierat:skierat/qwen3-vl-dflash-training

skierat commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Pre-merge checks failed

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

skierat commented Jul 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Uh oh!

copy-pr-bot Bot commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Pre-merge checks failed

❌ Failed checks (1 error)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

skierat commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading