KernelPilot is a Kernel Design Agents style prompt repository for B200 and H200
GPU kernel experiments, currently focused on SGLang diffusion (non-gemm,
non-attention) kernels and a couple of legacy reference tasks. Each optimization
target lives in a self-contained kernels/<task>/ folder, and each run starts
from an isolated git worktree with official Humanize commands.
This repository does not vendor or patch Humanize. Install and use the official Humanize runtime in your agent environment, then launch one kernel task at a time from the scripts in this repo.
external/KernelWiki/ # upstream kernel evidence and PR knowledge
external/ncu-report-skill/ # Nsight Compute profiling workflow
kernels/
b200_int8_scaled_mm__m64_n2048_k2048_bias/
b200_fa4_mha__bf16_head128_total32768/
{b200,h200}_diffusion_qknorm_rope__multi_shape/
{b200,h200}_diffusion_norm_infer__multi_shape/
{b200,h200}_diffusion_group_norm_silu__multi_shape/
{b200,h200}_diffusion_rotary_embedding__multi_shape/
{b200,h200}_diffusion_fuse_scale_shift__multi_shape/
{b200,h200}_diffusion_cutedsl_norm_tanh_mul_add__multi_shape/
{b200,h200}_diffusion_cutedsl_norm_scale_shift__multi_shape/
scripts/
launch_kda_kernel_task.sh
launch_kernels/
k01_b200_int8_scaled_mm.sh
k02_b200_fa4_mha.sh
k03_b200_diffusion_qknorm_rope__multi_shape.sh
k04_h200_diffusion_qknorm_rope__multi_shape.sh
...
k16_h200_diffusion_cutedsl_norm_scale_shift__multi_shape.sh
Each kernel folder owns its prompt, interface contract, benchmark scaffold, correctness scaffold, candidate ledger, benchmark ledger, notes, profiles, and NCU reports. Nothing in the workflow depends on a repo-local Humanize copy.
Clone with submodules:
git clone --recurse-submodules https://github.com/BBuf/kernel-pilot.git
cd kernel-pilotFor an existing checkout:
git submodule update --init --recursiveInstall official Humanize separately, following the upstream Humanize instructions used by your Claude Code environment. After installation, Claude Code should expose commands such as:
/humanize:gen-plan
/humanize:start-rlcr-loop
Use docs/ghostty_claude_code_workflow.md
for the cold-terminal flow that opens four parallel Claude Code RLCR panes
against the diffusion task queue, including the four-pane first wave,
prerequisites (the three local ion-* skills + HF_TOKEN + remote
cat.png), launcher invocation pattern, Humanize gen-plan / start-rlcr-loop
prep block, and per-task prompt cards.
Once a KDA task lands a promoted candidate, run
python3 scripts/export_kda_kernels/export.py <task-slug>to copy the task's src/ into
kda_kernels/diffusion/<family>/_impls/<arch>/ and rewire the matching
kda_kernels/diffusion/<family>/ package to route through an
architecture-aware dispatcher (flipping KDA_OPTIMIZED_<fn> = True). Exporting
both the B200 and H200 task slugs for the same family keeps both
implementations; runtime dispatch selects B200 for CUDA capability (10, 0) and
H200 for (9, 0). Activating every promoted kernel inside an sglang checkout
is one command:
export PYTHONPATH=/path/to/kernel-pilot:$PYTHONPATH
cd /path/to/sglang
git apply /path/to/kernel-pilot/patches/sglang_kda_kernels.patchThe patch adds 16 lines at the end of python/sglang/__init__.py that try
import kda_kernels; kda_kernels.install(). Functions still on the baseline
(not yet promoted via export.py) are untouched, so partial promotion is
safe. Inspect, undo at runtime, or revert the patch any time:
import kda_kernels
print(kda_kernels.status()) # currently-swapped sglang paths
kda_kernels.uninstall() # restore baseline without removing patchSee kda_kernels/README.md,
patches/README.md, and
scripts/export_kda_kernels/README.md
for the full contract (the EXPORTS = {...} rule each task's
src/register.py follows, revert flow, per-function KDA_TASK_<fn> /
KDA_COMMIT_<fn> / KDA_DATE_<fn> / KDA_SPEEDUP_<fn> stamps,
KDA_ARCHES_<fn> architecture coverage, and patch-regeneration when upstream
sglang edits __init__.py).
| Kernel | Goal | Launcher |
|---|---|---|
b200_int8_scaled_mm__m64_n2048_k2048_bias |
Optimize SGLang int8_scaled_mm for M=64, N=2048, K=2048, fp16 output, bias=true on B200. |
scripts/launch_kernels/k01_b200_int8_scaled_mm.sh |
b200_fa4_mha__bf16_head128_total32768 |
Build a standalone BF16 forward-only MHA kernel and compare against FlashAttention-4 on B200. | scripts/launch_kernels/k02_b200_fa4_mha.sh |
These tasks cover the SGLang non-gemm/non-attention diffusion kernels under
python/sglang/jit_kernel/diffusion/. Each task's shape table is captured from
accepted live SGLang diffusion benchmark preset runs on the corresponding remote
GPU box. Presets only contribute shapes when the native SGLang run finishes with
a valid denoise/refinement perf dump, so gated or failed runs are excluded from
the workload tables. The preset sweep runs on ion-b200 for the B200 variants
and on ion8-h200 / ion9-h200 for the H200 variants.
Each task's prompt.md carries the full shape table and explicitly allows
shape-bucketed dispatchers, per-bucket configs, and per-bucket kernel variants
(in the style of b200_fa4_mha__bf16_head128_total32768).
| Kernel | Arch | Goal | Launcher |
|---|---|---|---|
b200_diffusion_qknorm_rope__multi_shape |
B200 | Fused in-place QKNorm + RoPE (CUDA) across all diffusion preset shapes. | scripts/launch_kernels/k03_b200_diffusion_qknorm_rope__multi_shape.sh |
h200_diffusion_qknorm_rope__multi_shape |
H200 | Same as B200 variant, H200 target. | scripts/launch_kernels/k04_h200_diffusion_qknorm_rope__multi_shape.sh |
b200_diffusion_norm_infer__multi_shape |
B200 | Inference-only LN/RMSN baseline (norm_infer and triton_one_pass_rms_norm). |
scripts/launch_kernels/k05_b200_diffusion_norm_infer__multi_shape.sh |
h200_diffusion_norm_infer__multi_shape |
H200 | Same as B200 variant, H200 target. | scripts/launch_kernels/k06_h200_diffusion_norm_infer__multi_shape.sh |
b200_diffusion_group_norm_silu__multi_shape |
B200 | Fused GroupNorm + SiLU across image (2D/3D) and video (3D/5D) VAE inputs. | scripts/launch_kernels/k07_b200_diffusion_group_norm_silu__multi_shape.sh |
h200_diffusion_group_norm_silu__multi_shape |
H200 | Same as B200 variant, H200 target. | scripts/launch_kernels/k08_h200_diffusion_group_norm_silu__multi_shape.sh |
b200_diffusion_rotary_embedding__multi_shape |
B200 | Standard RoPE and LTX-2 split RoPE across all diffusion preset token counts. | scripts/launch_kernels/k09_b200_diffusion_rotary_embedding__multi_shape.sh |
h200_diffusion_rotary_embedding__multi_shape |
H200 | Same as B200 variant, H200 target. | scripts/launch_kernels/k10_h200_diffusion_rotary_embedding__multi_shape.sh |
b200_diffusion_fuse_scale_shift__multi_shape |
B200 | Triton fused scale-shift modulation (fuse_scale_shift_kernel + dual-modulation Z-Image variants). |
scripts/launch_kernels/k11_b200_diffusion_fuse_scale_shift__multi_shape.sh |
h200_diffusion_fuse_scale_shift__multi_shape |
H200 | Same as B200 variant, H200 target. | scripts/launch_kernels/k12_h200_diffusion_fuse_scale_shift__multi_shape.sh |
b200_diffusion_cutedsl_norm_tanh_mul_add__multi_shape |
B200 | CuTe-DSL fused_norm_tanh_mul_add (Z-Image residual modulation) and the second-norm scale combined kernel. |
scripts/launch_kernels/k13_b200_diffusion_cutedsl_norm_tanh_mul_add__multi_shape.sh |
h200_diffusion_cutedsl_norm_tanh_mul_add__multi_shape |
H200 | Same as B200 variant, H200 target. | scripts/launch_kernels/k14_h200_diffusion_cutedsl_norm_tanh_mul_add__multi_shape.sh |
b200_diffusion_cutedsl_norm_scale_shift__multi_shape |
B200 | CuTe-DSL fused_norm_scale_shift (and fused_scale_residual_norm_scale_shift) across all diffusion presets. |
scripts/launch_kernels/k15_b200_diffusion_cutedsl_norm_scale_shift__multi_shape.sh |
h200_diffusion_cutedsl_norm_scale_shift__multi_shape |
H200 | Same as B200 variant, H200 target. | scripts/launch_kernels/k16_h200_diffusion_cutedsl_norm_scale_shift__multi_shape.sh |
Run one kernel launcher from the repository root:
scripts/launch_kernels/k01_b200_int8_scaled_mm.shor:
scripts/launch_kernels/k02_b200_fa4_mha.shThe launcher creates a task-specific worktree, enters the selected kernel
folder, sets CLAUDE_PROJECT_DIR to that folder, and creates
.humanize/kernel-agent/draft.md from the local prompt.md.
The selected KDA_BASE_BRANCH must already contain the target kernel folder.
After editing these task definitions, commit them first or set KDA_BASE_BRANCH
to a ref that contains them.
Inside Claude Code, run the two commands printed by the launcher:
/humanize:gen-plan --input .humanize/kernel-agent/draft.md --output .humanize/kernel-agent/refined-plan.md --direct
/humanize:start-rlcr-loop .humanize/kernel-agent/refined-plan.md --skip-quiz --claude-answer-codex --max 12 --codex-model gpt-5.5:high --codex-timeout 5400 --base-branch <printed-review-base>
If Codex review appears stuck for a long time with a status like
running stop hook while tokens keep increasing, check the local Humanize
runtime before blaming the kernel task. Older Humanize hook scripts may launch
nested Codex review with only --disable codex_hooks. Newer Codex builds use
the hooks feature name, so that does not actually disable hooks; the nested
review can re-enter the Humanize Stop hook and loop inside the reviewer.
Avoid this before launching a large RLCR run:
- Keep the official
--codex-timeout 5400unless a task has a specific reason to fail faster or run longer. - Use an updated Humanize runtime whose nested
codex execcalls disable all known hook feature names:--disable hooks --disable plugin_hooks --disable codex_hooks. - On macOS, ensure the Humanize portable timeout kills the whole child process
group, not only the direct
codexparent. Otherwise timed-out reviewer descendants can keep running. - If hook files were edited locally, re-trust the Codex hook when Codex asks.
Useful overrides:
KDA_BASE_BRANCH=main # base ref for the task worktree
KDA_WORKTREE_BASE=/path/to/kernel-worktrees # parent directory for worktrees
KDA_RUN_ID=my-run # stable branch/worktree suffix
KDA_NO_CLAUDE=1 # prepare the worktree only
CLAUDE_BIN=claude # Claude executable
CLAUDE_MODEL=opus # Claude model flag
CLAUDE_EFFORT=max # Claude effort flag
HUMANIZE_CODEX_BYPASS_SANDBOX=true # default; forwarded into Claude's env so
# Codex in the RLCR loop skips per-call
# sandbox/approval prompts. Set anything
# other than true|1 to re-enable.prompt.md # source task prompt
interface.md # expected candidate wrapper/register contract
benchmark.py # local timing scaffold
benchmark.csv # append-only benchmark evidence ledger
solutions.jsonl # append-only candidate lineage ledger
src/ # candidate implementation entry point
tests/test_correctness.py # correctness scaffold
docs/ # plan notes, source notes, run logs
profile/ # profiler traces
ncu/ # Nsight Compute reports
The Humanize plan should recover K/R/W from prompt.md before implementation:
K: kernel semantics and callsite contractR: correctness oracle and baseline pathW: workload shapes and benchmark methodology
Keep generated .humanize* state untracked.
Validate the external knowledge submodule when updating it:
cd external/KernelWiki
python3 scripts/validate.pyCheck launcher syntax after edits:
bash -n scripts/launch_kda_kernel_task.sh scripts/launch_kernels/*.shUpdate external skills with:
git submodule update --remote external/KernelWiki
git submodule update --remote external/ncu-report-skillBBuf/kernel-design-agents: KernelPilot reuses this repository's kernel-design-agent method and workflow.BBuf/kernel-design-agent-with-sglang-omini: KernelPilot follows itskernels/folder style and per-kernel launch script pattern.PolyArch/humanize: official Humanize runtime used by the launch flow.BBuf/KernelWiki: upstream kernel evidence and prior-art source used by the prompts.DongyunZou/ncu-report-skill: Nsight Compute profiling workflow used when profiler evidence is needed.