    # Parameter / memory note: with the default 7B-dense-shaped backbone
    # (dim=4096, n_layers=32, ffn ~14336) and 8 SwiGLU experts per layer
    # (4 image + 4 text), total params is much larger than dense 7B. Use
    # FSDP=4 + activation_checkpointing="full" to fit on 4x H200. For a
configs/train/vlm_7b_moma.toml gives three conflicting messages on activation_checkpointing:
- L22 (header note): Use FSDP=4 + activation_checkpointing="full" to fit on 4x H200 - recommends it.
- L84 (just above the field): AC=full is currently a no-op for MoMa - says it does nothing.
- L89 (just above the field): Required given the per-layer expert duplication on 4x H200 - says it's mandatory.
A user reading top-down is told to enable AC (L22), then that it's a no-op (L84), then that it's required (L89).
Suggest:
- Reword L22 to flag the no-op and recommend reducing moma_experts_per_modality on 4x H200 instead.
- Reword L89's "Required" to something like "Intended once apply_ac is refactored; currently inert".
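To make the suggestion about moma_experts_per_modality concrete, here is a rough estimate of how the expert FFN weight count scales with that setting. It is a sketch under stated assumptions: it uses the dim, n_layers and ffn values from the header note (ffn is quoted as "~14336"), assumes each SwiGLU expert is three bias-free dim x ffn projections, and ignores attention, router, embedding and vision-encoder weights. The function names are hypothetical and not from the repository.

```python
# Back-of-the-envelope count of expert FFN weights only.
# Assumption: a SwiGLU FFN is three dim x ffn projections (gate, up, down), no biases.
# Attention, router, embedding and vision-encoder weights are deliberately left out.


def swiglu_params(dim: int, ffn: int) -> int:
    """Weights in one SwiGLU expert under the assumption above."""
    return 3 * dim * ffn


def expert_ffn_params(
    dim: int,
    ffn: int,
    n_layers: int,
    experts_per_modality: int,
    n_modalities: int = 2,  # image + text, as in the header note
) -> int:
    """Total expert FFN weights across all layers and modality groups."""
    experts_per_layer = experts_per_modality * n_modalities
    return swiglu_params(dim, ffn) * experts_per_layer * n_layers


if __name__ == "__main__":
    for k in (4, 2, 1):
        total = expert_ffn_params(
            dim=4096, ffn=14336, n_layers=32, experts_per_modality=k
        )
        print(f"experts_per_modality={k}: {total / 1e9:.2f} B expert FFN params")
```

With 4 experts per modality this gives roughly 45.1 B expert weights, which is close to the 46.87 B total reported in the PR's testing notes (the remainder is presumably attention, embeddings and other non-expert weights). Halving experts_per_modality halves the expert term, so it is the main lever on parameter memory while AC=full has no effect.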
        "MoMa + Expert Parallelism is not supported in v1. Per-modality "
        "expert groups need EP-aware dispatch that is not yet wired."
    )
    if self.train.compile_model:
The MoMa branch of JobConfig.validate (job.py:225-247) already warns on compile_model (lines 241-247). The AC=full no-op fits the same precedent but is silent: a user authoring a fresh MoMa config without reading vlm_7b_moma.toml's NOTE block gets no warning that activation checkpointing has no effect.
Add a sibling warning right after the compile block (import logging is already in scope there):
    if self.train.activation_checkpointing == "full":
        logging.getLogger(__name__).warning(
            "AC=full is currently a no-op for MoMa (apply_ac matches "
            "TransformerBlock only; MoMaBlock is a sibling nn.Module). "
            "Use ac='selective' (wraps the Attention submodule, still works) "
            "or reduce moma_experts_per_modality until the apply_ac refactor lands."
        )

Trigger only on "full", not != "none": apply_ac's selective branch (parallel.py:120) uses isinstance(m, Attention), and MoMaBlock.attention is a standard Attention (moma.py:401), so selective mode does wrap MoMa correctly. Only full is broken.
Could ride on the upcoming apply_ac refactor PR, or land as a small patch here.
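The reason "full" is inert while "selective" still works can be shown with a small stand-in model. This is a sketch only: the classes below are plain-Python substitutes that borrow the names from this comment, and modules_to_checkpoint is a hypothetical helper that assumes apply_ac selects modules purely by isinstance, as described above. It is not the repository's apply_ac.

```python
# Stand-in classes only: none of this is the repository's code. The names mirror
# those in the review comment so the isinstance logic is easy to follow.


class Module:
    """Tiny substitute for nn.Module that can enumerate its submodules."""

    def children(self):
        return [v for v in vars(self).values() if isinstance(v, Module)]

    def modules(self):
        yield self
        for child in self.children():
            yield from child.modules()


class Attention(Module):
    pass


class TransformerBlock(Module):
    def __init__(self):
        self.attention = Attention()


class MoMaBlock(Module):  # a sibling of TransformerBlock, not a subclass
    def __init__(self):
        self.attention = Attention()


class Model(Module):
    def __init__(self, block_cls, n_layers=2):
        for i in range(n_layers):
            setattr(self, f"layer_{i}", block_cls())


def modules_to_checkpoint(model, mode):
    """Modules an isinstance-based apply_ac would wrap (assumed behaviour)."""
    if mode == "full":
        target = TransformerBlock  # MoMaBlock never matches this check
    elif mode == "selective":
        target = Attention  # matches MoMaBlock.attention as well
    else:
        return []
    return [m for m in model.modules() if isinstance(m, target)]


if __name__ == "__main__":
    moma = Model(MoMaBlock)
    print("full:", len(modules_to_checkpoint(moma, "full")))
    print("selective:", len(modules_to_checkpoint(moma, "selective")))
```

In this toy setup a MoMa model yields no modules for "full" and one Attention per layer for "selective", which is why the warning only needs to fire on "full".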
I tested the MoMa architecture on two H100 GPUs and it works fine. I added a few minor comments on the PR.
Summary
- MoMa (arch = "moma"), following Lin et al. 2024 (arXiv:2407.21770). Single shared Q/K/V/O attention + per-modality MoE FFN groups with expert-choice + Sigmoid routing.
- Two-level routing: modality_ids (level 1, reusing MoT's existing mechanism), then learned expert-choice + Sigmoid within each modality group (level 2, with Gumbel-Sigmoid noise per paper Eq. 5).
Deferred to v2
- torch.compile support: modality_ids scatter/gather + EC top-k currently cause graph breaks; JobConfig.validate emits a warning.
Testing
- ruff format: clean (148 files unchanged)
- ruff check: All checks passed
- pytest tests/unit/: 1342 passed, 2 skipped (~180s)
- pytest tests/integration/: 77 passed on CUDA (~44s)
- torchrun --nproc_per_node=2 -m pytest tests/distributed/test_vlm_*_fsdp.py: 30 passed (~38s)
- torchrun --nproc_per_node=2 -m pytest tests/distributed/test_fsdp.py test_checkpoint.py: 17 passed, 1 skipped (regression sanity on non-VLM paths)
- uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/vlm_7b_moma.toml: 46.87 B params total, loss volatile through 5-step warmup as expected, ~9.8k tok/s steady-state, ~105 GB / 140 GB per GPU
Closes #
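For readers unfamiliar with the routing named in the Summary, the two levels can be sketched in plain Python. This is an illustration, not the PR's implementation: the function names are hypothetical, logits[t][e] is assumed to be the router logit of token t for expert e of that token's own modality group, and the noise uses a common Gumbel-Sigmoid form (the logit plus the difference of two independent Gumbel samples, passed through a sigmoid), which is assumed here rather than checked against Eq. 5 of the paper.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gumbel(rng: random.Random) -> float:
    """Sample standard Gumbel noise by inverse transform."""
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def expert_choice_route(logits, capacity, rng=None):
    """Level 2: each expert keeps its `capacity` highest-scoring tokens.

    Returns {expert: [(token_index, gate), ...]}, where gate is the sigmoid score.
    If rng is given, Gumbel-Sigmoid noise perturbs the logits (training-style).
    """
    n_experts = len(logits[0])
    assignment = {}
    for e in range(n_experts):
        scores = []
        for t, row in enumerate(logits):
            z = row[e]
            if rng is not None:
                z = z + gumbel(rng) - gumbel(rng)
            scores.append((sigmoid(z), t))
        scores.sort(key=lambda s: s[0], reverse=True)
        assignment[e] = [(t, g) for g, t in scores[:capacity]]
    return assignment


def route_by_modality(logits, modality_ids, capacity, rng=None):
    """Level 1: split tokens by modality id, then run expert choice per group."""
    out = {}
    for modality in sorted(set(modality_ids)):
        idx = [t for t, m in enumerate(modality_ids) if m == modality]
        group = expert_choice_route([logits[t] for t in idx], capacity, rng)
        # Map group-local token positions back to positions in the full sequence.
        out[modality] = {
            e: [(idx[t], g) for t, g in picks] for e, picks in group.items()
        }
    return out


if __name__ == "__main__":
    logits = [[2.0, -1.0], [0.5, 3.0], [-2.0, 0.0], [1.0, -3.0]]
    modality_ids = [0, 0, 1, 1]  # e.g. 0 = text, 1 = image (labels are illustrative)
    print(route_by_modality(logits, modality_ids, capacity=1))
```

Because the experts pick tokens (rather than tokens picking experts), every expert processes exactly `capacity` tokens whenever its modality group has at least that many, which keeps load balanced by construction. The data-dependent top-k and the gather back to sequence positions are the kind of operations the Summary cites as causing graph breaks under torch.compile.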