feat(speculative): add Qwen3-VL support for DFlash training#1887
feat(speculative): add Qwen3-VL support for DFlash training#1887skierat wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR extends speculative decoding to better support vision-language models (VLMs): dataset loading gains a JSONL fallback, VisionLanguageDataCollator adds marker-derived assistant masking, shift_labels, and pixel-bound configuration; load_vlm_or_llm gains flexible auto-model class resolution; and HFDFlashModel's forward/loss path handles multimodal inputs. ChangesVLM training and data pipeline updates
Estimated code review effort: 3 (Moderate) | ~35 minutes Sequence Diagram(s)sequenceDiagram
participant Trainer
participant VisionLanguageDataCollator
participant Processor
participant HFDFlashModel
Trainer->>VisionLanguageDataCollator: collate batch of samples
VisionLanguageDataCollator->>Processor: from_pretrained (with pixel bounds)
VisionLanguageDataCollator->>VisionLanguageDataCollator: _apply_chat_template (marker-derived masks)
VisionLanguageDataCollator->>VisionLanguageDataCollator: _pad_sequence_tensors (train_len)
VisionLanguageDataCollator->>VisionLanguageDataCollator: _process_multimodal_sample (labels via assistant_masks)
VisionLanguageDataCollator-->>Trainer: batch with input_ids, labels, pixel_values
Trainer->>HFDFlashModel: forward(batch)
HFDFlashModel->>HFDFlashModel: detect multimodal kwargs
HFDFlashModel->>HFDFlashModel: super().forward (return_dict, output_hidden_states)
HFDFlashModel-->>Trainer: hidden_states / loss
Estimated code review effort: 3 (Moderate) | ~35 minutes Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/speculative/utils.py`:
- Around line 613-620: The model class resolution in utils.py currently prefers
AutoModelForVision2Seq before AutoModelForImageTextToText, which keeps using the
deprecated alias when both are available. Update the fallback order in the model
class lookup so AutoModelForImageTextToText is tried first, then fall back to
AutoModelForVision2Seq, and keep the architecture-based fallback in the same
resolution chain.
In `@modelopt/torch/utils/plugins/transformers_dataset.py`:
- Around line 537-545: The padding logic in `pad_tokenized_messages` only
extends `input_ids` and `attention_mask`, but it also needs to pad
`assistant_masks` when `apply_chat_template()` returns that tensor. Update the
sequence-padding loop near the existing `tokenized_messages.get(key)` handling
to include `assistant_masks` with the same `pad_width`, preserving tensor shape
consistency for the downstream `answer_only_loss` path in
`transformers_dataset`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 77e86707-9ddf-4546-8afc-48443831de63
📒 Files selected for processing (4)
examples/speculative_decoding/eagle_utils.pymodelopt/torch/speculative/plugins/hf_dflash.pymodelopt/torch/speculative/utils.pymodelopt/torch/utils/plugins/transformers_dataset.py
| # ``AutoModelForVision2Seq`` was renamed in Transformers 5. Prefer the | ||
| # legacy alias when present, then use the current auto-model entry point. | ||
| model_cls = getattr(transformers, "AutoModelForVision2Seq", None) | ||
| if model_cls is None: | ||
| model_cls = getattr(transformers, "AutoModelForImageTextToText", None) | ||
| if model_cls is None: | ||
| architecture = (getattr(model_config, "architectures", None) or [None])[0] | ||
| model_cls = getattr(transformers, architecture, None) if architecture else None |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant file and nearby logic
git ls-files modelopt/torch/speculative/utils.py
wc -l modelopt/torch/speculative/utils.py
sed -n '580,660p' modelopt/torch/speculative/utils.py
# Search for related auto-model selection logic
rg -n "AutoModelForVision2Seq|AutoModelForImageTextToText|architectures|FutureWarning" modelopt/torch/speculative -SRepository: NVIDIA/Model-Optimizer
Length of output: 4378
🌐 Web query:
Hugging Face AutoModelForVision2Seq deprecated removed in v5 use AutoModelForImageTextToText FutureWarning
💡 Result:
The class AutoModelForVision2Seq was deprecated in earlier versions of the Hugging Face Transformers library and has been officially removed in version 5.0.0 [1][2][3]. To resolve the FutureWarning or import errors resulting from this change, you should update your code to use AutoModelForImageTextToText instead [4][1][5]. This change is part of the broader cleanup and architectural updates introduced in Transformers v5 [6][3]. If you are encountering this in your codebase, replacing all instances of AutoModelForVision2Seq with AutoModelForImageTextToText will restore compatibility with the v5.x ecosystem [1][7].
Citations:
- 1: Remove references to AutoModelForVision2Seq huggingface/transformers#41513
- 2: huggingface/transformers@878d60a
- 3: https://github.com/huggingface/transformers/releases/tag/v5.0.0
- 4: https://raw.githubusercontent.com/huggingface/transformers/main/MIGRATION_GUIDE_V5.md
- 5: AutoModelForVision2Seq.from_config() got an unexpected keyword argument 'torch_dtype' huggingface/transformers#40477
- 6: https://github.com/huggingface/transformers/blob/main/MIGRATION_GUIDE_V5.md
- 7: Remove references to AutoModelForVision2Seq huggingface/transformers#41513
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find the Transformers version constraints used by this repo
rg -n "transformers|tf_latest|AutoModelForImageTextToText|AutoModelForVision2Seq" pyproject.toml requirements*.txt noxfile.py modelopt tests -SRepository: NVIDIA/Model-Optimizer
Length of output: 50378
🌐 Web query:
Transformers 4.56 AutoModelForVision2Seq AutoModelForImageTextToText both available deprecation warning
💡 Result:
In the Hugging Face Transformers library, AutoModelForVision2Seq is deprecated and has been superseded by AutoModelForImageTextToText [1][2][3]. In version 4.56, using AutoModelForVision2Seq triggers a FutureWarning because it is slated for removal in version 5.0 [4][3][5]. The library has explicitly transitioned to AutoModelForImageTextToText as the standard class for handling vision-language (image-text-to-text) models [6][7]. To resolve this deprecation warning, you should update your code to replace any instances of AutoModelForVision2Seq with AutoModelForImageTextToText [1][8]. For example: Instead of: from transformers import AutoModelForVision2Seq model = AutoModelForVision2Seq.from_pretrained("model-name") Use: from transformers import AutoModelForImageTextToText model = AutoModelForImageTextToText.from_pretrained("model-name")
Citations:
- 1: Remove references to AutoModelForVision2Seq huggingface/transformers#41513
- 2: huggingface/transformers@878d60a
- 3: huggingface/transformers@cb54ce4
- 4: Warning when using the "image-to-text" pipeline huggingface/transformers#41509
- 5: AutoModelForVision2Seq.from_config() got an unexpected keyword argument 'torch_dtype' huggingface/transformers#40477
- 6: https://huggingface.co/docs/transformers/v4.56.0/tasks/image_text_to_text
- 7: Add auto model for image-text-to-text huggingface/transformers#32472
- 8: Remove references to AutoModelForVision2Seq huggingface/transformers#41513
Prefer AutoModelForImageTextToText first
AutoModelForVision2Seq is still present in the supported 4.56+ range and emits a FutureWarning, so this order keeps taking the deprecated path whenever both classes exist. Try the new class first, then fall back to the legacy alias for older installs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/speculative/utils.py` around lines 613 - 620, The model class
resolution in utils.py currently prefers AutoModelForVision2Seq before
AutoModelForImageTextToText, which keeps using the deprecated alias when both
are available. Update the fallback order in the model class lookup so
AutoModelForImageTextToText is tried first, then fall back to
AutoModelForVision2Seq, and keep the architecture-based fallback in the same
resolution chain.
| for key, pad_value in ( | ||
| ("input_ids", self.tokenizer.pad_token_id), | ||
| ("attention_mask", 0), | ||
| ): | ||
| value = tokenized_messages.get(key) | ||
| if not isinstance(value, torch.Tensor): | ||
| continue | ||
| pad = value.new_full((value.shape[0], pad_width), pad_value) | ||
| tokenized_messages[key] = torch.cat((value, pad), dim=1) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Pad assistant_masks with the other sequence tensors.
When marker-derived masks are not used, apply_chat_template() can return assistant_masks; if the processor also ignored padding="max_length", this helper pads input_ids/attention_mask but leaves assistant_masks short. Line 595 can then fail with a shape mismatch under answer_only_loss=True.
Proposed fix
for key, pad_value in (
("input_ids", self.tokenizer.pad_token_id),
("attention_mask", 0),
+ ("assistant_masks", 0),
):📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for key, pad_value in ( | |
| ("input_ids", self.tokenizer.pad_token_id), | |
| ("attention_mask", 0), | |
| ): | |
| value = tokenized_messages.get(key) | |
| if not isinstance(value, torch.Tensor): | |
| continue | |
| pad = value.new_full((value.shape[0], pad_width), pad_value) | |
| tokenized_messages[key] = torch.cat((value, pad), dim=1) | |
| for key, pad_value in ( | |
| ("input_ids", self.tokenizer.pad_token_id), | |
| ("attention_mask", 0), | |
| ("assistant_masks", 0), | |
| ): | |
| value = tokenized_messages.get(key) | |
| if not isinstance(value, torch.Tensor): | |
| continue | |
| pad = value.new_full((value.shape[0], pad_width), pad_value) | |
| tokenized_messages[key] = torch.cat((value, pad), dim=1) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/utils/plugins/transformers_dataset.py` around lines 537 - 545,
The padding logic in `pad_tokenized_messages` only extends `input_ids` and
`attention_mask`, but it also needs to pad `assistant_masks` when
`apply_chat_template()` returns that tensor. Update the sequence-padding loop
near the existing `tokenized_messages.get(key)` handling to include
`assistant_masks` with the same `pad_width`, preserving tensor shape consistency
for the downstream `answer_only_loss` path in `transformers_dataset`.
What does this PR do?
Type of change: new feature
Adds online DFlash training support for Qwen3-VL–style vision-language models.
Changes include:
AutoModelForImageTextToTextAPI, while retaining compatibility with the legacy VLM auto-model API.VisionLanguageDataCollatorto:answer_only_loss, chat-template, and DFlash label-alignment settings;VLM_MIN_PIXELS/VLM_MAX_PIXELSprocessor limits;training_seq_lenrequired by DFlash block training.Usage