Add MoMa (Mixture of Modality-Aware Experts) VLM architecture by amazloumi · Pull Request #107 · KempnerInstitute/KempnerForge

amazloumi · 2026-05-18T21:01:24Z

Summary

Adds paper-faithful MoMa as the 4th VLM architecture (arch = "moma"), following Lin et al. 2024 (arXiv:2407.21770). Single shared Q/K/V/O attention + per-modality MoE FFN groups with expert-choice + Sigmoid routing.
Token routing is two-stage: deterministic by modality_ids (level 1, reusing MoT's existing mechanism), then learned expert-choice + Sigmoid within each modality group (level 2, with Gumbel-Sigmoid noise per paper Eq. 5).
v1 supports training only — expert-choice routing is non-causal; auxiliary routers for inference (paper §2.4) are deferred to v2.
Deviation from the paper: blocks use pre-norm rather than the paper's Swin post-norm.
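For readers unfamiliar with the two-stage scheme, here is a minimal pure-Python sketch of it. It is not the code in kempnerforge/model/moma.py: the function name route, the capacity parameter, and the data layout are made up for illustration. The noise term is the common logistic form of Gumbel-Sigmoid (difference of two Gumbel samples added to the logit); whether it matches the paper's Eq. 5 term for term is not checked here.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gumbel(rng: random.Random) -> float:
    # Standard Gumbel(0, 1) sample via inverse transform; clamp to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def route(logits, modality_ids, experts_per_modality, capacity, rng=None):
    """Hypothetical two-level router.

    logits[t][e] is the router logit of token t for expert e of its own
    modality group. Returns {(modality, expert): [(token_index, gate), ...]}.
    """
    # Level 1: deterministic split by modality id (nothing is learned here).
    groups = {}
    for t, m in enumerate(modality_ids):
        groups.setdefault(m, []).append(t)

    assignment = {}
    for m, tokens in groups.items():
        for e in range(experts_per_modality):
            scored = []
            for t in tokens:
                z = logits[t][e]
                if rng is not None:  # training-time noise (assumed form)
                    z += gumbel(rng) - gumbel(rng)
                scored.append((sigmoid(z), t))
            # Level 2: expert choice. The expert, not the token, does the
            # picking: it keeps its `capacity` highest-scoring tokens.
            scored.sort(key=lambda p: p[0], reverse=True)
            assignment[(m, e)] = [(t, g) for g, t in scored[:capacity]]
    return assignment


modality_ids = [0, 0, 0, 1, 1, 1]  # e.g. 0 = text, 1 = image
logits = [[2.0, -1.0], [0.5, 0.3], [-1.0, 1.5],
          [0.0, 2.0], [1.0, -2.0], [3.0, 0.1]]
out = route(logits, modality_ids, experts_per_modality=2, capacity=2)
print([t for t, _ in out[(0, 0)]])  # [0, 1]
print([t for t, _ in out[(1, 0)]])  # [5, 4]
```

Because each expert picks tokens independently, a token can be taken by several experts (token 1 above) or by none. The Sigmoid gate scores each token-expert pair on its own, with no normalization across experts.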

Deferred to v2

Auxiliary routers for inference causality (paper §2.4) — required for autoregressive generation under EC routing.
Upcycling helper (1t1i seed → multi-expert MoMa, paper §2.5) — paper reports a 1.16–1.2× speedup from this staged-training trick.
torch.compile support — modality_ids scatter/gather + EC top-k currently cause graph breaks; JobConfig.validate emits a warning.
Expert Parallelism (rejected in v1 — per-modality expert groups need EP-aware dispatch that isn't wired yet).
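To see why expert-choice routing is non-causal, and hence why the auxiliary routers in the first item are needed for generation, consider this toy sketch. The scores are invented and expert_choice is a hypothetical helper, not a function from this PR.

```python
def expert_choice(scores, capacity):
    """Return the indices of the `capacity` highest-scoring tokens, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:capacity])


# Router scores for one expert over a 3-token prefix ...
prefix_scores = [0.9, 0.4, 0.7]
# ... and over the same prefix extended by two later tokens.
full_scores = prefix_scores + [0.8, 0.95]

picked_prefix = expert_choice(prefix_scores, capacity=2)
picked_full = expert_choice(full_scores, capacity=2)
print(picked_prefix)  # [0, 2]
print(picked_full)    # [0, 4]
```

Token 2 is routed to the expert when only the prefix is visible, but loses its slot once later tokens arrive. During autoregressive decoding the later tokens do not exist yet, so the training-time selection cannot be reproduced without a separate per-token predictor.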

Testing

ruff format — clean (148 files unchanged)
ruff check — All checks passed
pytest tests/unit/ — 1342 passed, 2 skipped (~180s)
pytest tests/integration/ — 77 passed on CUDA (~44s)
torchrun --nproc_per_node=2 -m pytest tests/distributed/test_vlm_*_fsdp.py — 30 passed (~38s)
torchrun --nproc_per_node=2 -m pytest tests/distributed/test_fsdp.py test_checkpoint.py — 17 passed, 1 skipped (regression sanity on non-VLM paths)
End-to-end smoke on 4× H200: uv run torchrun --nproc_per_node=4 scripts/train.py configs/train/vlm_7b_moma.toml — 46.87 B params total, loss volatile through 5-step warmup as expected, ~9.8k tok/s steady-state, ~105 GB / 140 GB per GPU
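As a rough sanity check on the 46.87 B figure, the expert FFN parameters alone can be estimated from the shapes quoted in the config header note (dim=4096, n_layers=32, ffn ~14336, 8 SwiGLU experts per layer). This sketch assumes bias-free SwiGLU with three projection matrices per expert (gate, up, down) and takes ffn as exactly 14336. Both are assumptions, not values read from the code.

```python
dim, n_layers, ffn_dim = 4096, 32, 14336
experts_per_layer = 8  # 4 image + 4 text

# One SwiGLU FFN: gate and up projections (dim x ffn_dim each) plus down (ffn_dim x dim).
swiglu_expert = 3 * dim * ffn_dim

expert_total = swiglu_expert * experts_per_layer * n_layers
dense_ffn_total = swiglu_expert * n_layers  # same backbone with a single FFN per layer

print(f"{expert_total / 1e9:.2f} B")     # 45.10 B
print(f"{dense_ffn_total / 1e9:.2f} B")  # 5.64 B
```

Under these assumptions the experts account for about 45.1 B of the reported 46.87 B. The remaining ~1.8 B would be attention, embeddings, routers, and any vision-side parameters.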

Closes #

codecov · 2026-05-18T21:04:46Z

Codecov Report

❌ Patch coverage is 94.35028% with 10 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
kempnerforge/config/job.py	11.11%	7 Missing and 1 partial ⚠️
kempnerforge/model/moma.py	97.97%	1 Missing and 1 partial ⚠️

Files with missing lines	Coverage Δ
kempnerforge/config/vlm.py	`100.00% <100.00%> (ø)`
kempnerforge/model/transformer.py	`94.41% <100.00%> (+0.74%)`	⬆️
kempnerforge/model/vlm.py	`98.94% <100.00%> (+0.13%)`	⬆️
kempnerforge/model/moma.py	`97.97% <97.97%> (ø)`
kempnerforge/config/job.py	`85.41% <11.11%> (-7.77%)`	⬇️


mmshad · 2026-05-21T15:55:58Z

+# Parameter / memory note: with the default 7B-dense-shaped backbone
+# (dim=4096, n_layers=32, ffn ~14336) and 8 SwiGLU experts per layer
+# (4 image + 4 text), total params is much larger than dense 7B. Use
+# FSDP=4 + activation_checkpointing="full" to fit on 4x H200. For a


configs/train/vlm_7b_moma.toml gives three conflicting messages on activation_checkpointing:

L22 (header note): Use FSDP=4 + activation_checkpointing="full" to fit on 4x H200 - recommends it.

L84 (just above the field): AC=full is currently a no-op for MoMa - says it does nothing.

L89 (just above the field): Required given the per-layer expert duplication on 4x H200 - says it's mandatory.

A user reading top-down is told to enable AC, then that it's inert, then that it's required.

Suggest:

Reword L22 to flag the no-op and recommend reducing moma_experts_per_modality on 4x H200 instead.

Reword L89's "Required" to something like "Intended once apply_ac is refactored; currently inert".

mmshad · 2026-05-21T16:04:05Z

+                        "MoMa + Expert Parallelism is not supported in v1. Per-modality "
+                        "expert groups need EP-aware dispatch that is not yet wired."
+                    )
+                if self.train.compile_model:


The MoMa branch of JobConfig.validate (job.py:225-247) already warns on compile_model (lines 241-247). The AC=full no-op fits the same precedent but is silent: a user authoring a fresh MoMa config without reading vlm_7b_moma.toml's NOTE block gets no warning that activation checkpointing has no effect.

Add a sibling warning right after the compile block (import logging is already in scope there):

if self.train.activation_checkpointing == "full":
    logging.getLogger(__name__).warning(
        "AC=full is currently a no-op for MoMa (apply_ac matches "
        "TransformerBlock only; MoMaBlock is a sibling nn.Module). "
        "Use ac='selective' (wraps the Attention submodule, still works) "
        "or reduce moma_experts_per_modality until the apply_ac refactor lands."
    )

Trigger only on "full", not != "none": apply_ac's selective branch (parallel.py:120) uses isinstance(m, Attention), and MoMaBlock.attention is a standard Attention (moma.py:401), so selective mode does wrap MoMa correctly. Only full is broken.
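The type-matching behavior described above can be illustrated with stand-in classes. This is a sketch of the matching logic only: Module, Model, and ac_targets are hypothetical, and nothing here is the real apply_ac from parallel.py or a real nn.Module.

```python
class Module:
    """Tiny stand-in for nn.Module: walks attributes that are Modules."""

    def modules(self):
        yield self
        for value in vars(self).values():
            if isinstance(value, Module):
                yield from value.modules()


class Attention(Module):
    pass


class TransformerBlock(Module):
    def __init__(self):
        self.attention = Attention()


class MoMaBlock(Module):  # sibling of TransformerBlock, not a subclass
    def __init__(self):
        self.attention = Attention()


class Model(Module):
    def __init__(self, block_cls, n_layers):
        for i in range(n_layers):
            setattr(self, f"layer{i}", block_cls())


def ac_targets(model, mode):
    """Return the modules a type-matching AC pass would wrap (assumed logic)."""
    if mode == "full":
        return [m for m in model.modules() if isinstance(m, TransformerBlock)]
    if mode == "selective":
        return [m for m in model.modules() if isinstance(m, Attention)]
    return []


moma = Model(MoMaBlock, n_layers=2)
dense = Model(TransformerBlock, n_layers=2)
print(len(ac_targets(dense, "full")))      # 2
print(len(ac_targets(moma, "full")))       # 0
print(len(ac_targets(moma, "selective")))  # 2
```

With this matching rule, "full" finds nothing to wrap in a MoMa model, while "selective" still finds every Attention submodule.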

Could ride on the upcoming apply_ac refactor PR, or land as a small patch here.
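If the warning lands, a unit test along these lines would also cover the new branch. The sketch is self-contained and uses a stand-in function, warn_on_inert_ac, in place of JobConfig.validate. The logger name, function signature, and handler class are all hypothetical.

```python
import logging

logger = logging.getLogger("moma_validate_sketch")
logger.propagate = False  # keep the sketch quiet on stderr


class ListHandler(logging.Handler):
    """Collects formatted messages so the test can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def warn_on_inert_ac(arch: str, activation_checkpointing: str) -> None:
    # Stand-in for the proposed branch: warn only on "full", not on != "none".
    if arch == "moma" and activation_checkpointing == "full":
        logger.warning("AC=full is currently a no-op for MoMa")


handler = ListHandler()
logger.addHandler(handler)

warn_on_inert_ac("moma", "selective")  # selective still wraps Attention: silent
warn_on_inert_ac("moma", "none")       # nothing requested: silent
silent_count = len(handler.messages)

warn_on_inert_ac("moma", "full")       # inert combination: one warning
print(silent_count, len(handler.messages))  # 0 1
```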

mmshad · 2026-05-21T17:57:21Z

I tested the MoMa architecture on two H100 GPUs and it works fine. Added a few minor comments on the PR.

Add MoMa (Mixture of Modality-Aware Experts) VLM architecture

d4e2b56

amazloumi requested review from Naeemkh and mmshad May 18, 2026 21:01

amazloumi added 3 commits May 20, 2026 10:23

Merge remote-tracking branch 'origin/main' into moma-arch

bf13b36

MoMa post-review: validate modality_ids, expose expert counts, note A…

d6f72ee

…C no-op

fixing pyright check

a75c6f8

mmshad reviewed May 21, 2026


mmshad approved these changes May 21, 2026


amazloumi wants to merge 4 commits into main from moma-arch
