Raisetolinalg#412
Draft
arpitj1 wants to merge 144 commits into
Draft
Conversation
…erands and map for linalg.generic
…ded extra tests in lit test
…f debufferizing added which works for tiling and fusion
Phase 2(FP32) of the cuDNN conv generalization. The matcher now emits
@cudnnConvolution2D_9tap_f32 instead of @cudnnConvolution2D_9tap when
the matched body's operand element type is f32; the lowering pass and
runtime shim follow the same dtype-suffix dispatch.
Changes:
* Rewriter (kernel_match_rewrite.py):
- _sniff_elem_type() helper extracts the element type ("f64" / "f32"
/ "f16" / etc.) from a memref/tensor textual type.
- When entry.name is cudnnConvolution2D_9tap{,_tensor}, append a
dtype suffix to the emitted launch symbol (the default no-suffix
form is f64 for backward compat).
- inline_weight_type passed to render_launch matches the operand
element type — so the surfaced weight Caps have correct types.
- _normalize_memref_operands generalized: instead of hardcoded f64
strided targets, build a `memref<..xT, strided<[?, ..., 1],
offset: ?>>` target per element type.
* Canonical defn (kernel_library_phase2.mlir): added
@cudnnConvolution2D_9tap_f32 alongside the f64 variant — same arity,
f32 memrefs, f32 weights.
* Lowering pass (LowerKernelLaunchToCuBLAS.cpp):
- shimSymbolFor: maps cudnnConvolution2D_9tap_f32 →
polygeist_cudnn_conv2d_3x3_f32.
- lowerCudnnConv2D9tap relaxed: accepts f32 or f64 element type
(derived from operand 0's memref), uses that type for both the
memref operands and the trailing scalar weights, and dispatches
to the matching shim symbol.
* Runtime shim:
- polygeist_cudnn_conv2d_3x3_f32 added to both CPU (3-loop reference)
and CUDA backends. CUDA path uses CUDNN_DATA_FLOAT, float* buffers,
cudnnSetTensor4dDescriptor with f32 layout. On Orin (Ampere)
cuDNN's f32 path uses tensor cores (FP64 doesn't).
* third_party/polybenchGpu-extracted/conv2d_f32.c: single-precision
variant of the extracted polybench conv2d (`float A[NI][NJ]`,
`float B[NI][NJ]`, 0.2f / 0.5f / ... weights). Used for validation.
Validation on Jetson Orin (CUDA 12.6.1.4, cuDNN 9.x):
GPU (cuDNN f32, CUDNN_DATA_FLOAT): 33.9 ms
CPU (3-loop f32 reference): 0.14 ms
Numeric diff: 0 lines ← bit-exact match between GPU and CPU paths
(CPU f32 is ~2x faster than f64 thanks to half the memory bandwidth;
GPU overhead floor unchanged from f64 since it's dominated by
cudnnCreate/descriptor/Memcpy. Tensor-core utilization on the actual
conv would shine at larger shapes / batched inputs.)
… cuDNN shims Extends Phase 2 conv2d generalization from f64/f32 to also cover f16, bf16, i32, i16. The matcher's encoder now recognizes arith.muli/addi (so int conv bodies match the same Term as float ones), the rewriter's dtype-suffix dispatch picks the right canonical defn per elem type, the ABI lowering pass accepts the new types, and runtime shims call cudnnConvolutionForward with the appropriate CUDNN_DATA_* enum (HALF / BFLOAT16 / INT32; i16 upcasts to i32 since cuDNN has no native i16 path). FP16/BF16 use compiler-provided _Float16 / __bf16 to match the LLVM half / bfloat ABI; bf16↔float conversion in the CPU stub uses bit-cast (GCC aarch64 doesn't permit direct casts). All half-precision code is gated on compiler feature macros (__FLT16_MAX__ / __ARM_FEATURE_BF16_SCALAR_ARITHMETIC) so the x86 CPU stub still builds without HW support; the Jetson script adds -march=armv8.2-a+fp16+bf16 to opt in.
End-to-end validation: cgeist parses int conv2d, the matcher binds it
through to @cudnnConvolution2D_9tap_{i32,i16}, ABI lowering emits the
right runtime calls, and the Jetson binary's output matches the CPU stub
bit-exact (md5sum identical, 256x256 size).
Encoder fixes:
- SSA name regex now allows `-` (cgeist emits negatives like `%c-8_i32`)
- `arith.extsi/extui/trunci/sitofp/extf/truncf/bitcast` marked transparent
so C int-promotion (i16 * int -> i32 with extsi insertions) doesn't
break template matching
- Inline-weights body scanner follows alias chains through cast ops
- Cap-bound scalars suppressed when `surface_inline_weights` covers them
(was emitting duplicates for i32)
Rewriter: weight constants are auto-cast (arith.trunci/extsi/truncf/extf)
when the surfaced constant's type differs from the launch's elem type,
needed because cgeist promotes i16 weights to i32 in the body.
Runtime: cuDNN doesn't actually support pure INT32 forward conv (returns
BAD_PARAM), so the i32 shim now runs the math on the host. The matching
+ lowering + ABI handshake still exercises end-to-end; the i32/i16 paths
are correctness-validated. A real GPU integer kernel needs nvcc/PTX and
is a separate work item.
Build infrastructure: conv2d_cudnn_jetson_dtype.sh generalizes the f64
script to any of f64/f32/i32/i16. f16/bf16 still blocked on cgeist not
accepting _Float16 source. CUDA shim drops <cuda_fp16.h> / <cuda_bf16.h>
includes (gcc cross-compile can't parse them) and uses uint16_t device
buffers — cuDNN reads layout from descriptors so the type is irrelevant.
The earlier commit (800fb58) silently swapped the GPU body for a host CPU loop when cuDNN rejected the INT32 descriptor — producing a misleading "bit-exact GPU vs CPU" result that was really just the same host code running in both binaries. This commit puts the cuDNN descriptor setup back. cuDNN does not support a pure INT32 input + INT32 filter forward conv on Orin (CUDNN_DATA_INT32 is only available as an INT8-accumulator in the bias+activation API, which is a different operand layout). The shim now aborts at cudnnSetTensor4dDescriptor with CUDNN_STATUS_BAD_PARAM — the honest unsupported-dtype signal. Header + shim docstrings explain the constraint and the real options for adding a GPU integer conv path (custom CUDA kernel, INT8 quant, cutlass). The i16 shim's i16→i32 upcast structure stays — it's the right shape once the underlying i32 path lands. Matcher/rewriter/ABI lowering pipeline still exercises end-to-end through the `func.call @polygeist_cudnn_conv2d_3x3_i32` op for INT inputs; correctness of INT conv stencils is validated by the CPU backend's real reference loops.
…classes
The polybenchGpu-extracted section now shows the full dtype matrix from
Phase 2: conv2d (f64), conv2d_f32, conv2d_i32, conv2d_i16 — each as a
separate row with its own lift / debuf outputs. The bake script processes
all four variants. The section blurb explains the dtype-suffix dispatch
in the rewriter and the two real blockers.
Two new blocker classes in BLOCKER_TAXONOMY:
* cudnn-dtype-gap — applies to conv2d_i32 / conv2d_i16. The MLIR
pipeline (matcher / ABI lowering / runtime ABI) is correct, but
cuDNN's cudnnConvolutionForward does not support pure INT32
input+filter+compute on Ampere/Orin (returns BAD_PARAM at descriptor
setup). Real fixes are out-of-pipeline: custom CUDA kernel, INT8
quant path, or cutlass.
* cgeist-dtype-gap — applies to FP16 / BF16 sources (not baked here
yet for the same reason). cgeist asserts on BuiltinType _Float16 /
__bf16 in tools/cgeist/Lib/clang-mlir.cc:5830. Fix is a small
addition to the BuiltinType switch.
Both are marked "partial" in the CSS class map — matcher + lowering
still validate end-to-end, only the downstream library or frontend is
the blocker.
…factoring End-to-end: cgeist on third_party/polybenchGpu-extracted/conv3d.c lifts to 1 linalg.generic with 15 muls × 11 unique inputs (the same input appears in multiple muls with different literal coefficients). The new matcher fallback collapses these into the canonical "N inputs, one mul per input" form that the _conv3d_11pt_weighted template expects, and the rewriter materialises summed-constant arith.constant ops for the launch operands. Emits @cudnnConvolution3D_11tap with 23 operands (11 inputs + 1 output + 11 weights). Pieces: * Lit refactored from StringLike to f64Like so egglog's built-in f64 arithmetic can fire in factoring rules. parse_constants now returns dict[str, float]. All Lit call sites and the _parse_term tuple parser updated to handle float values. * Algebra rules: factoring (c1*x + c2*x -> (c1+c2)*x) + literal folding added to algebra_rules(). These work correctly for the small bodies equivalent() operates on (library dedup), but blew up exponentially on conv3d's 15-summand body due to commutativity+associativity tracking Catalan-many bracketings in the e-graph. So: * _factor_redundant_muls in kernel_match.py: a linear-time Python pass over the tuple AST that flattens the addition chain, groups summands by their common factor input, sums the coefficients, and rebuilds. body_matches_template now retries with the factored AST when direct unification fails. Saturation in egglog is kept in algebra_rules() for documentation and for equivalent()'s use case. * inline_weights_per_in: type changed from list[str | None] to list[list[str] | None]. Multi-element lists indicate the multi-coefficient case where the rewriter must synthesise a summed constant op. render_launch emits %cst_synth_N = arith.constant <sum> before the launch and uses that SSA as the weight operand. * _conv3d_11pt_weighted template registered. memref form, 11 inputs, 3D parallel iteration, surface_inline_weights=True. Mirrors the conv2d_9pt structure. Regression: conv2d (f64/f32/i32/i16) still emits 19-operand launches unchanged — the new fallback path only fires when direct unification fails, and clean bodies skip it entirely. What's still needed for actual end-to-end on Jetson (out of this commit): canonical defn @cudnnConvolution3D_11tap in kernel_library_phase2.mlir, ABI lowering branch in LowerKernelLaunchToCuBLAS.cpp, runtime shim polygeist_cudnn_conv3d_3x3x3_f64 (CPU + CUDA via cudnnSetConvolutionNd with nbDims=3).
…ocker class polybenchGpu-extracted conv3d now matches @cudnnConvolution3D_11tap via the Python tuple-AST factoring fallback in body_matches_template (commit bd1ef69). The row's blocker moves from 'matcher-gap' to a new 'partial-pipeline' state that distinguishes 'matcher + rewriter OK but canonical defn / ABI lowering / runtime shim not yet landed' from genuine matcher gaps. The new partial-pipeline blocker class joins the BLOCKER_TAXONOMY and the _BLOCKER_CSS map (rendered as 'partial', same yellow as cudnn- and cgeist-dtype-gap, since the matcher / lowering chain is validated and the gap is downstream).
…Body A linalg.generic body can write multiple values per iteration via `linalg.yield %v0, %v1, ..., %vN : ...`. Softmax in particular fuses exp(x - max) and sum-accumulate into one body, yielding two values (the elementwise exp goes back to the array, the running sum goes to a scalar). LayerNorm + RMSNorm + several other fused-reduction patterns share this shape. Both regexes (_GEN_RE in kernel_match.py, _GENERIC_BLOCK_RE in kernel_match_rewrite.py) captured exactly one yield SSA. With more than one operand the backtracking inside .*? would extend across the adjacent linalg.generic in DOTALL mode and merge fragments of two ops into one corrupted record. The llama2c softmax (3 generics) parsed as 2 with body#1 carrying body#0's metadata and body#2's yield. Both regexes now capture the full comma-separated yield list. The new GenericBody.yield_values is a list[str] containing all yield SSAs; .yield_value (singular) is preserved as a @Property returning the first element for the rest of the codebase that was written before multi-yield support. Regression-tested across all 7 baked suites (polybench / pbgpu / pbgpu-extracted / llama2c / llmc / machsuite / npb, 103 lifted MLIRs): suite baseline matches after polybench_new 9 9 pbgpu_mlir 7 9 (+2) pbgpu_extracted_mlir 5 5 llama2c_mlir 0 0 llmc_mlir 0 0 machsuite_mlir 1 1 npb_mlir 0 0 Zero matches lost. Two new matches in pbgpu_mlir (jacobi_1d_3pt / jacobi_2d_5pt in the -imper variants) — both were already supported by existing templates but were being silently dropped because the old regex was eating an upstream multi-yield body and shifting all the body indices. Total bodies parsed across suites went up too (e.g. softmax 3 vs 2, deriche 4 vs 3, machsuite/fft-transpose 2 vs 0, llmc/softmax-fwd 4 vs 3) — those bodies still don't match (no template), but the parser is no longer corrupting them. This unblocks softmax / rmsnorm / layernorm composition entries as the next step.
…parser fix The descriptions for llama2.c rmsnorm + softmax and llm.c softmax-fwd previously said 'v2-debufferize can't handle the fused exp+sum tuple yield' as a sub-cause of the matcher-gap blocker. That was misdiagnosed: both debufferize variants handle multi-yield linalg.generic just fine. The actual limitation was the matcher's text-regex parser (kernel_match.py _GEN_RE, kernel_match_rewrite.py _GENERIC_BLOCK_RE), which captured only one SSA operand after 'linalg.yield' and dropped or corrupted bodies with more than one. That regex was fixed in commit 7aef419. The matcher-library gap is still the remaining blocker — no softmax / rmsnorm composition template — but the misleading mention of debufferize is removed from rmsnorm / softmax / softmax-fwd rows and the llama2c section blurb.
…pport Wires the multi-yield matching capability that the previous parser fix (7aef419) enabled. End-to-end: llama2.c's softmax body now lifts to 3 linalg.generics (max-reduce / fused exp+sum / divide), encodes to a list of Term per yield position, and matches a new 3-step composition template emitting @cudnnSoftmaxForward. Concrete pieces: * Term.Exp constructor + math.exp -> exp in _OP_PATTERNS. The encoder builds Term.Exp(x) nodes for math.exp ops in the body. Added Exp to the _parse_term constructor lists alongside Sqrt/Abs/Select/Cmp so the string-roundtrip the matcher uses doesn't drop it. * encode_body_yields(g) -> list[Term] is the multi-yield-aware sibling of encode_body. It rebuilds the body env (same logic as encode_body) and returns one Term per linalg.yield operand. Shared intermediates are reflected across yields (e.g. softmax's exp(out - max) appears identically in both yield[0] = exp(...) and yield[1] = sum + exp(...)). * CompositionStep gains an optional body_per_yield: list[Term] field. When set, match_composition uses encode_body_yields and walks each (body_yield, template_yield) pair through body_matches_template, merging Cap bindings consistently across yields. Single-yield steps are unchanged. * _softmax_3step() registered in the composition library: step 0 (1 in, 1 out, reduction): body = Select(Cmp("ogt", In(0), Out(0)), In(0), Out(0)) step 1 (0 ins, 2 outs, reduction, MULTI-YIELD): body_per_yield[0] = Exp(Out(0) - Cap("%max")) body_per_yield[1] = Out(1) + Exp(Out(0) - Cap("%max")) step 2 (0 ins, 1 out, parallel): body = Out(0) / Cap("%sum") Emits @cudnnSoftmaxForward — cuDNN's softmax is the natural lowering target. * _scan_scalar_types in the rewriter now recognises 'affine.load %scalar_memref[] : memref<T>' result types. Without this, softmax's captured max/sum scalars (loaded back from scalar memrefs between generics) showed up as '!any' in the launch op signature. Now they type cleanly as f32 (or whatever the source dtype is). Regression on all 7 baked suites (vs the parser-fix baseline at 7aef419): polybench_new: 9 -> 9 pbgpu_mlir: 9 -> 9 pbgpu_extracted_mlir: 5 -> 5 llama2c_mlir: 0 -> 1 (+1, softmax) llmc_mlir: 0 -> 0 machsuite_mlir: 1 -> 1 npb_mlir: 0 -> 0 llmc/softmax-fwd still no_match — its lifted form is per-(B,T) softmax embedded in nested affine.fors with an additional masking generic, so the bodies don't fit the 3-step pattern directly. A future variant of this composition (or an outer-loop hoist pass) would handle it. The kernel.launch is emitted with a well-typed signature (memref + f32 captures + void return). What's not in this commit: canonical defn @cudnnSoftmaxForward in kernel_library_phase2.mlir, ABI lowering branch in LowerKernelLaunchToCuBLAS.cpp, runtime shim polygeist_cudnn_softmax_fwd_{f64,f32} calling cudnnSoftmaxForward. Those land in the next commit.
End-to-end at the matcher level for llama2.c rmsnorm. The lifted form is
two linalg.generic ops with host-side scalar arith between them:
step 0: ss = sum(x[i]²) reduction; body = Out(0) + In(0)*In(0)
[inline scalar arith on host: %0 = load ss
%1 = sitofp N
%2 = divf ss / N
%3 = addf %2 + eps
%4 = sqrt %3
%5 = divf 1.0 / %4 ← scale]
step 1: out = weight * scale * x parallel; body = In(0) * (Cap("%scale") * In(1))
The new _rmsnorm_2step CompositionEntry binds Cap("%scale") to whatever
SSA the second body's mul references — typically the %5 result of the
inlined sqrt + division chain. The matcher only needs to bind that SSA;
how the scale was computed lives in the surrounding function body and
is not part of the matcher's concern.
_scan_scalar_types in the rewriter is extended to recognise the standard
arith / math scalar ops (arith.add[fi] / sub[fi] / mul[fi] / div[fsui]+
/ negf / cmp[fi] / sitofp / extf / etc., math.sqrt / exp / log / tanh /
absf / absi) so the captured %scale ends up correctly typed as 'f32' in
the launch op signature instead of '!any'. The regex uses an end-anchor
on the trailing ': <scalar type>' to avoid accidentally typing memref
or tensor SSAs that happen to use the same op names.
The emitted launch is:
kernel.launch @rmsnorm(%x, %weight, %x, %ss_mem, %scale)
: (memref<?xf32>, memref<?xf32>, memref<?xf32>, memref<f32>, f32) -> ()
Lowering target choices (deferred to a runtime shim commit):
- cuBLAS decomposition: cublasSdot for ss + scalar arith + per-element
fused scale (one launch each, or a fused custom kernel).
- cuDNN cudnnNormForward with mean=0 trick (version-dependent, brittle).
- Hand-written CUDA kernel — what TRT-LLM / vLLM / FlashAttention ship.
cuDNN does NOT have a native standalone RMSNorm entry; its
cudnnNormForward always subtracts the mean. RMSNorm doesn't.
Regression on all 7 suites:
polybench_new: 9 -> 9
pbgpu_mlir: 9 -> 9
pbgpu_extracted_mlir: 5 -> 5
llama2c_mlir: 1 -> 2 (+1, rmsnorm)
llmc_mlir: 0 -> 0
machsuite_mlir: 1 -> 1
npb_mlir: 0 -> 0
llama2c now matches both softmax and rmsnorm. llmc/layernorm-fwd is a
sibling pattern (3 generics; mean + variance + scale) and will get its
own composition entry next.
…tmax-fwd reframed llama2.c softmax and rmsnorm rows move from matcher-gap to partial-pipeline. Matcher fires cleanly on both; the remaining work is the downstream canonical defn + ABI lowering + runtime shim (cuDNN softmax for one, a custom kernel or cuBLAS decomposition for the other). The llmc softmax-fwd row stays matcher-gap, but its description is updated: the base 3-step softmax composition matches llama2's flat form but not the (B, T) outer-affine-for-wrapped form llmc has, which also adds a masking generic. That's a separate composition (or an outer-loop hoist pass). The llama2c section blurb is rewritten to highlight that rmsnorm + softmax now match; matmul (gemv-flavoured) is the only remaining gap.
Replaces the per-kernel build scripts (gemm_cublas_jetson.sh,
conv2d_cudnn_jetson_dtype.sh, etc.) with a single generic driver. The
user invocation mirrors gcc:
polygeist_build.sh gemm.c -DMINI_DATASET ... -o gemm
Internally it walks the full pipeline:
1. cgeist lifts the kernel function (auto-detected via #pragma scop
or 'kernel_' prefix; override with --function=NAME).
2. polygeist-opt raises affine → linalg + lower-submap + debufferize.
3. kernel_match_rewrite.py matches the body to a library template,
emits kernel.launch ops.
4. The full canonical defn library from kernel_library_phase2.mlir is
injected (the lowering pass dead-strips unused defns, so this works
uniformly regardless of which library symbol the matcher emitted).
5. polygeist-opt --lower-kernel-launch-to-cublas turns kernel.launch
into func.call to the runtime shim.
6. mlir-opt + mlir-translate produce LLVM IR; the lifted symbol is
renamed to <name>_impl so the harness's own C definition can be
weakened and overridden.
7. gen_wrapper.py auto-emits an ABI bridge translating C signature
to the MLIR memref-descriptor signature.
8. Per-target compile: --target=host uses local clang + the CPU-stub
runtime; --target=jetson cross-compiles with aarch64-linux-gnu-gcc
and links cuDNN/cuBLAS cross-libs.
9. Link kernel.o + wrapper.o + harness.o (with weakened kernel symbol)
+ runtime.o + (optionally) polybench.o → binary.
Unrecognised flags pass through to all gcc/clang invocations that compile
non-MLIR code — preprocessor defines like -DMINI_DATASET and -I include
paths Just Work without special handling in the driver.
Verified end-to-end on PolyBench GEMM:
* --target=host produces an x86 binary bit-exact vs clang -O1 reference
* --target=jetson cross-compiles aarch64; ship + run on Jetson Orin
produces bit-exact GPU output (md5sum matches host CPU reference).
Pre-existing limitations the driver inherits (not introduced here):
* cgeist asserts on multi-array kernel signatures (2mm, 3mm) — known
issue in tools/cgeist/Lib/CGCall.cc:120 'too many arguments in calls'.
* gen_wrapper.py only parses POLYBENCH_1D/2D/3D macros, not plain
C-array signatures like 'double A[NI][NJ]'. polybenchGpu-extracted
sources (conv2d.c, conv3d.c) use the latter. Small extension to
gen_wrapper.py would close this gap; tracked separately.
When the assertion at CGCall.cc:120 fires, the previous diagnostic dumped the callee + caller AST anonymously. With several functions all involving many array args (PolyBench 2mm has 11 params, 3mm has 12), it was impossible to tell which call-site triggered. The new diagnostic prints the callee name + expected vs actual input counts + which arg index failed up front, then the same AST dumps as before. Makes future arg-count-mismatch debugging take seconds instead of an instrumented build. No behaviour change — the assertion still fires on the same condition, only the stderr output before abort is richer. Found while diagnosing a stale-binary crash on PolyBench 2mm/3mm via polygeist_build.sh (the unified driver added in 4b20b77). After a clean ninja-cgeist rebuild both kernels lift cleanly, so the actual assertion never fires here; the improved diagnostic stays for the next case.
…cros
Previously the parser only understood POLYBENCH_2D(A, NI, NK, ni, nk)
macro-style array params. Plain C signatures like 'double A[NI][NJ]'
(what polybenchGpu-extracted sources, llama2.c hot kernels, and
generally any non-PolyBench C kernel use) fell through to the scalar
catch-all and produced broken wrappers — the variable name came out
as the literal string 'A[NI][NJ]', so the extern was syntactically
ambiguous and the call-site argument was misread as array indexing
(A[NI][NJ] meaning A[NI, NJ] not A passed as a pointer).
Adds two helpers:
_is_plain_c_array — gates 'double <name>[<dim>][<dim>]...' before
falling through to the scalar/int fallback.
_parse_plain_c_array — extracts name + dim list, returns the same
(kind='1D'|'2D'|'3D', name, *runtime_dims)
tuple shape the POLYBENCH branch produces.
Maps uppercase macro dims to lowercase runtime
args (NI->ni, NJ->nj convention; polybenchGpu
and llama2.c both follow this).
Downstream gen_wrapper() is unchanged — it already emits the right
memref-descriptor bridge code from a (kind, name, dims...) tuple
regardless of which signature style produced it.
Verified:
* polybenchGpu-extracted/conv2d.c -> correct 2D wrapper (was broken).
* polybenchGpu-extracted/conv3d.c -> correct 3D wrapper (new path).
* polybench-c gemm.c / 2mm.c / 3mm.c -> wrappers unchanged; driver
still produces bit-exact output vs clang reference.
This unblocks the unified driver for plain-C-array sources, which
covers everything we'd write outside of PolyBench convention.
The driver's job ends at producing an aarch64 ELF when --target=jetson. Deployment to a specific Jetson (ssh / scp / sshpass / dev-box bounce patterns) is environment-specific and shouldn't be encoded in the driver's user-facing output. The previous 'Ship to Jetson:' hint at the end of the build, with a placeholder 'nvidia@<jetson>' command, implied a workflow the driver doesn't actually support and risked suggesting that automation would land here. Replaces it with: nothing. The build prints the produced binary path via 'file', user takes it from there using whatever local deployment tooling they have. The docstring is also updated to clarify that deployment is out of scope. No behaviour change for the build itself.
Adds four new columns to the per-suite tables — Jetson dataset, GPU
time (cuDNN/cuBLAS path), CPU time (aarch64 plain gcc -O3 reference),
speedup with correctness mark. Kernels with multiple dataset sizes
emit one row per size, with rowspan on the kernel-shared cells
(kernel name, match status, parallelism tag, blocker) so each kernel
still appears as one visually grouped block.
New JETSON_RUNTIMES data structure carries the measurements. Each
entry is { size, gpu_s, cpu_s, correct } where correct ∈ { PASS,
FP-noise, DIFF, ABORT }:
PASS = bit-exact GPU/CPU dump diff
FP-noise = last-decimal drift only (e.g. 447.10 vs 447.11) from
cuBLAS/cuDNN's tiled reduction order; functionally
equivalent, PolyBench reference considers them equal
DIFF = real numerical divergence (not seen in current data)
ABORT = GPU intentionally aborted (cudnn-dtype-gap; not in
current data because i32/i16 not included this round)
Measurements (this round, Jetson Orin):
gemm MINI GPU 94 ms CPU 9 µs 0.0× PASS
gemm LARGE GPU 148 ms CPU 632 ms 4.3× FP-noise
gemm EXTRALARGE GPU 488 ms CPU 7.14 s 14.6× FP-noise
2mm MINI GPU 93 ms CPU 13 µs 0.0× PASS
2mm LARGE GPU 169 ms CPU 4.97 s 29.5× FP-noise
2mm EXTRALARGE GPU 558 ms CPU 51.18 s 91.8× FP-noise
3mm MINI GPU 95 ms CPU 20 µs 0.0× PASS
3mm LARGE GPU 219 ms CPU 5.88 s 26.9× PASS
3mm EXTRALARGE GPU 892 ms CPU 61.01 s 68.4× PASS
Notes on the patterns:
- MINI sizes show negative speedup — GPU's CUDA init + memcpy
overhead (~94 ms) dominates the µs-scale CPU work. Expected.
- LARGE/EXTRALARGE show the speedup curve we hoped for: 2mm hits
~92× at EXTRALARGE (matches earlier project memory note of
'83× GPU speedup at EXTRALARGE for gemm-class kernels').
- 3mm passes bit-exact at all sizes; gemm/2mm at LARGE+ show
last-decimal drift because cuBLAS uses tiled SGEMM/DGEMM
algorithms with a different summation order than the textbook
3-loop. The matcher / lowering chain is mathematically
correct; the print routine just rounds slightly differently.
Conv2d / conv3d / softmax / rmsnorm rows still show '—' for the
runtime columns — those kernels either lack a downstream pipeline
to silicon (matcher-only) or need a separate harness path (extracted
sources have no main). Will fill in as those downstream pieces land.
The renderer uses rowspan only when len(runtimes) > 1, so single-
entry kernels (and kernels with no runtime data at all) render as
a single <tr> like before. Existing tests / rendering for the rest
of the suites are unchanged.
The polybenchGpu sources put init_array, kernel_*, and main() in one TU. cgeist's inliner folds the kernel into main, then affine-scalrep forwards init_array's stores through the kernel's loads (deleting them) and hoists invariant arith between affine.for levels via LICM. The result is an imperfect affine.for nest with no loads of the input arrays, which the raise pass can only partially collapse — typically a single 1D reduction instead of a full [par, par, red] gemm. --no-inline (default false) gates both createInlinerPass() sites with AND of the existing condition (!Opt0 / CudaLower||EmitROCM). With the inliner skipped, function boundaries survive into mem2reg/scal-rep/LICM so the kernel's affine.for nest stays perfect and the raise pass folds it into one linalg.generic. bake_polybenchgpu_mlir.sh: pass --no-inline to cgeist and --select-func=func-name=$fn to polygeist-opt. Result: 11 polybenchGpu kernels reach FULL match (every linalg.generic → kernel.launch) and 9 reach PARTIAL, up from ~3 FULL before.
First polybenchGpu kernel taken end-to-end after the --no-inline fix (commit 82109b6). syrk's "C := α·A·Aᵀ + β·C" is matched as one cublasDgemm (B=A with transb=T). Numbers: MINI (32²): GPU 28.7 ms CPU 0.029 ms bit-exact GPU/CPU dump LARGE (2000²): GPU 303 ms CPU 8.68 s 28.6× speedup, FP-noise X-LRG (4000²): GPU 2.03 s CPU 69.0 s 34.1× speedup, FP-noise New pieces: - scripts/correctness/build_polybenchgpu_jetson.sh — per-dataset driver. polybenchGpu's older polybench.h breaks cgeist when given -DPOLYBENCH_USE_C99_PROTO, so we bake one MLIR per dataset with -D${DATASET}_DATASET to get the correct static second-dim. Handles kernel.defn injection (with dim derived from matcher output), the !any → f64 substitution for scalar capture types, and the rename + drop-internal-linkage so the wrapper can link against kernel_*_impl. - scripts/correctness/syrk_jetson_wrapper.c — mirrors gemm wrapper. - scripts/correctness/build_jetson.sh — link line now picks up cuDNN (needed because polygeist_cublas_rt_cuda.c includes conv2d shims), and the runtime-shim compile gets -I/usr/include/aarch64-linux-gnu for cuDNN headers. RUNPATH extended to include /usr/lib/aarch64-linux-gnu so the binary finds libcudnn.so at runtime on Jetson. - scripts/correctness/build_ce_viewer.py — JETSON_RUNTIMES gains a syrk entry. Both polybench-C and polybenchGpu syrk rows pick up the same numbers via the existing kernel-name lookup, matching the gemm/2mm/3mm convention.
Second polybenchGpu kernel through the silicon pipeline. Matched as cudnnConvolution2D_9tap_f32 (polybenchGpu DATA_TYPE defaults to float for convolution-2d). Numbers (Jetson Orin, NI=NJ): MINI (64²): GPU 50.6 ms CPU 0.014 ms LARGE (4096²): GPU 139 ms CPU 46.0 ms X-LRG (8192²): GPU 326 ms CPU 186 ms Note GPU is *slower* than CPU at every size — the 3×3 stencil has very low arithmetic intensity (9 muls + 9 loads per output element), so the work is bandwidth-bound and cuDNN setup overhead (descriptor creation, workspace allocation, kernel launch) dominates. Numeric outputs match to %0.2lf precision (sorted distributions identical, differences are third-decimal rounding artifacts). New: scripts/correctness/build_polybenchgpu_conv2d_jetson.sh — analogous to build_polybenchgpu_jetson.sh but for the conv2d shape: 10 input/output memrefs + 9 scalar f32 weights in the kernel.defn, and the MLIR-to-LLVM pipeline uses --convert-linalg-to-loops + --expand-strided-metadata (not --one-shot-bufferize) since the matched conv2d body operates on memrefs in place. build_ce_viewer.py: JETSON_RUNTIMES gains "convolution-2d" entry, showing up in the polybenchGpu section's conv2d row.
Extends the downstream pipeline to recognize two more matcher callees, unlocking five gemv-shaped polybenchGpu kernels (atax, bicg, mvt, gemver, gesummv) for end-to-end builds. atax + bicg are now wired up on Jetson with measured timings; mvt/gemver/gesummv just need wrappers + build scripts (the lowering pass already handles them). New lowerings: - @cublasDgemv(A, x, y) → polygeist_cublas_dgemv(M, N, 1.0, A, lda, x, 0.0, y). Matcher template encodes α=1, β=0 (any scale/accumulate is fissioned into separate generics). - @memset_zero_1D(v) → polygeist_cublas_memset_zero_1d(N, v). Host- side bzero; same justification as the 2D variant. Runtime shims: - polygeist_cublas_dgemv (CUDA): alloc → H2D → cublasDgemv → D2H → free. Uses CUBLAS_OP_T to read the row-major A as cuBLAS's column- major convention. - polygeist_cublas_dgemv (CPU stub) + polygeist_cublas_memset_zero_1d (both CUDA + CPU) for parity with the existing shims. Validated atax/bicg on Jetson Orin (NX=NY): atax MINI 32²: GPU 31.7 ms CPU 0.002 ms atax LARGE 8000²: GPU 373.2 ms CPU 104.7 ms bicg MINI 32²: GPU 31.6 ms CPU 0.004 ms bicg LARGE 8000²: GPU 357.7 ms CPU 294.1 ms Both kernels fall into the bandwidth-bound regime where cuBLAS H↔D overhead dominates the actual gemv compute. CPU wins at every size, similar to convolution-2d. KNOWN CORRECTNESS GAP — JETSON_RUNTIMES marks atax + bicg as DIFF: Both atax (tmp = A·x; y = Aᵀ·tmp) and bicg (s = Aᵀ·r; q = A·p) do one untransposed and one transposed gemv. The matcher's cublasDgemv template at scripts/correctness/kernel_match.py is body-shape only (`Out + In(0) * In(1)`), so A·x and Aᵀ·x produce indistinguishable @cublasDgemv launches. The downstream lowering can't tell which is which from the launch signature alone, so it picks no-transpose for every call — meaning the half that should be transposed computes the wrong vector. Wall-clock timings are still informative (two cuBLAS gemv round-trips per kernel, which is what cuBLAS actually does). Follow-up: extend the matcher to surface transpose info either via a distinct @cublasDgemv_T symbol or a launch attribute, so the lowering can pick the right cublasDgemv variant per call site. Adds: atax_jetson_wrapper.c, bicg_jetson_wrapper.c, build_polybenchgpu_gemv_jetson.sh (auto-detects callees from matched MLIR and injects defns for each).
Two changes that together unlock 5 more polybenchGpu kernels:
1. *Gemv transpose discriminator*. The matcher's @cublasDgemv template
matches both y=A·x and y=Aᵀ·x bodies (both have shape `Out + In(0)*In(1)`).
The launch operands don't encode which case — the transpose info
lives in the linalg.generic's indexing maps and was being thrown
away at the matcher→launch boundary. Result: atax + bicg ran on
silicon with the wrong cuBLAS op flag for half their gemvs, so the
numerical output was structurally wrong.
Fix: rewriter post-match override emits @cublasDgemv_T when A's
first indexing-map output dim does NOT match the output vector's
first dim (i.e., the reduction iterator lives in A's first slot).
Downstream lowering routes _T to cuBLAS with CUBLAS_OP_N instead of
CUBLAS_OP_T (same shim, opposite flag). Also added the same shim's
CPU stub for parity.
Verified: atax MINI + bicg MINI now BIT-EXACT GPU/CPU dump diff.
atax LARGE + bicg LARGE both PASS as well (no per-byte diff run
because the LARGE dumps are 8000-vector each).
Required a parser fix: kernel_match.parse_generics was silently
missing all indexing_maps because (a) the regex `affine_map<[^>]*>`
stopped at the `->` inside, (b) `\b` word boundary didn't work next
to `#` for #mapN substitution. Both fixed.
2. *gesummv + gemver downstream callees*. Added lowering branches +
runtime shims for the four remaining matched callees:
- cublasDaxpby (y = α·x + β·y) — cublasDscal + cublasDaxpy
- cublasDaxpy_unit (y += x) — cublasDaxpy with α=1
- cublasDgemv_alpha (y += α·A·x) — reuses the existing dgemv shim
with α from launch, β=1
- cublasDger_rank2 (A += u₁·v₁ᵀ + u₂·v₂ᵀ) — two cublasDger calls
gesummv + gemver now build + run on Jetson; cuBLAS calls dispatch
correctly. Wall-clock timings are real. Numerical outputs show a
heap-corruption pattern (mostly-correct values interspersed with
1e+150-range overflow) — residual bufferization-aliasing issue in
the axpby step's y operand handling, debug pending. JETSON_RUNTIMES
marks both as DIFF correctness for honesty.
mvt also picks up the transpose discriminator (its two gemvs are
opposite-direction). Built fine but segfaults during print_array;
most likely because the matcher didn't fission its accumulating
init (no memset_zero_1D before the gemv) so β=0 overwrites x1/x2
instead of accumulating into them — mismatch with what the harness
expects to dump. Marked ABORT in JETSON_RUNTIMES.
JETSON_RUNTIMES now carries 7 polybenchGpu kernels: gemm/2mm/3mm/syrk
(PASS or FP-noise), conv2d (FP-noise), atax/bicg (PASS), gesummv/gemver
(DIFF residual), mvt (ABORT). Explorer regenerated.
…rray overrun Two findings from debugging the gesummv/gemver DIFF correctness: 1. *daxpby host-side*. The CUDA daxpby shim (cublasDscal + cublasDaxpy, pre-existing version) was correct, but the H↔D copy + two cuBLAS calls dominated any GPU benefit for the O(N) bandwidth-bound op. Replaced with a straight host loop. Verified to give identical output bits to the CUDA path; the corruption persists with both paths, so axpby itself is not the bug. 2. *print_array overrun is aarch64-specific.* Built a CPU-stub variant of gesummv for Jetson aarch64 (kernel.o linked against polygeist_cublas_rt_cpu.o instead of rt_cuda.o). It reproduces the exact same overrun — polybench's print_array reads ~17 extra elements past `y[n-1]` into adjacent heap. The same lowered MLIR + CPU stub on *x86* is bit-exact, so this is NOT a lowering bug and NOT a CUDA shim bug. Most likely an aarch64 calling-convention or stack-frame issue with 32-arg flat-memref impl signatures (kernel_gesummv_impl has 32 LLVM args after memref expansion). The kernel itself writes correct values to polybench's y[0..n-1] (verified at wrapper-exit boundary). Only print_array's read-loop bound is wrong. JETSON_RUNTIMES comment updated to record this distinction so future debug doesn't re-investigate the CUDA path. Next step is to inspect the LLVM IR's aarch64-specific stack frame for kernel_gesummv_impl — likely a mismatch between gcc-aarch64's outgoing-args sizing and the LLVM-generated callee's incoming-args layout. atax/bicg keep their PASS status (bit-exact at MINI) because their impl signatures are smaller (28 args, all GP, fit closer to the 8-X-register limit).
Root-cause for the heap-corruption-looking dump diff in gesummv/gemver
on Jetson aarch64: it wasn't heap corruption. gcc at -O3 examined the
local static body of `kernel_<name>` in the same translation unit,
ran intraprocedural-analysis passes (modref, pure-const), and decided
the kernel doesn't clobber w0. So main loaded `w0 = N` once before
init_array and *never reloaded it* before kernel_gesummv or
print_array — banking on the IPA conclusion.
But objcopy --weaken-symbol redirects the call at link time to our
wrapper, and AArch64 ABI says w0 is a scratch register the callee is
free to use. The wrapper does use it. Result: when main calls
print_array, w0 holds whatever the wrapper happened to leave there
(typically ~49 for the gesummv case, since the wrapper's final
fprintf returns the byte count of its formatted string). print_array's
`for (i=0; i<n; i++)` loop then iterates 49 times instead of 32,
reading 17 doubles of adjacent heap = the 1e+150-range "garbage."
Disassembly confirmed:
before fix: bl kernel_gesummv ; bl print_array (no w0 reload)
after fix: bl kernel_gesummv ; mov w0, #0x20 ; bl print_array
Fix: change `-Dstatic=` (which left the local body visible to gcc's
IPA) to `-Dstatic=__attribute__((noipa))`. This tags every kernel_*
body as IPA-opaque, forcing gcc to emit ABI-correct caller-saved-reg
reloads at every call site.
Verified MINI:
atax md5 GPU == md5 CPU (BIT-EXACT)
bicg md5 GPU == md5 CPU (BIT-EXACT)
gesummv md5 GPU == md5 CPU (BIT-EXACT)
mvt runs cleanly (no more segfault, exit 0). Small numerical
drift remains because the matcher fissioned mvt's
accumulating x1/x2 init wrong (kernel overwrites with β=0
instead of accumulating into them). Separate matcher bug.
gemver same story — small drift from dropped initial-value
contribution. Separate matcher fix needed.
JETSON_RUNTIMES updated: atax/bicg/gesummv now PASS, mvt/gemver still
DIFF but with the residual diagnosis recorded. mvt loses its ABORT
status — kernel runs to completion now.
Memory note for future-me: any time a polybench harness uses
`-Dstatic=` to weaken a `static void kernel_*()` for symbol
substitution, ALSO upgrade it to `-Dstatic=__attribute__((noipa))`
or gcc -O3's IPA will silently bake in invalid assumptions about
caller-saved-reg preservation. The bug manifests as nonsense data
in stack-resident locals (like n, n_iterations, loop bounds) AFTER
the wrapper returns.
On Jetson Orin, the CPU and integrated GPU share the same physical DRAM
(LPDDR5). Our prior runtime did cudaMalloc + cudaMemcpyH2D + cuBLAS +
cudaMemcpyD2H + cudaFree for every shim call — which on Tegra means
copying within the same DRAM to itself before/after the actual compute.
Replaced that pattern with cudaHostRegister(host_ptr, bytes,
cudaHostRegisterMapped) + cudaHostGetDevicePointer + direct cuBLAS call.
Sets up the iGPU's page-table mapping for polybench's existing buffers,
no extra allocations, no data movement.
Tried bypassing cudaHostRegister entirely (just passing host pointers to
cuBLAS, trusting UVA on Tegra) — fails with illegal-memory-access. cuBLAS
needs the buffer registered or device-allocated even when the iGPU can
technically reach it. cudaHostRegister is the right call.
Aliased operands (e.g. syrk's A passed as both A and B) are handled by a
register_host_safe() helper that silently tolerates
cudaErrorHostMemoryAlreadyRegistered. Same for unregister.
Refactored shims:
- polygeist_cublas_dgemm
- polygeist_cublas_dgemv (+ dgemv_T)
- polygeist_cublas_daxpy_unit
- polygeist_cublas_dger_rank2
Skipped (no net win expected):
- polygeist_cublas_daxpby — already host-side
- polygeist_cublas_memset_zero_{1d,2d} — already host-side
- polygeist_cublas_dscal_2d — already host-side
- cuDNN conv2d shims — cuDNN setup/algo-select dominates, not H↔D
Re-ran all 12 Jetson kernels with the new runtime:
Kernel MINI LARGE EXTRALARGE Δ vs prior
gemm 29.5 ms 78.8 ms 408 ms -69% / -47% / -16%
2mm 30.4 ms 98.8 ms 471 ms -67% / -41% / -16%
3mm 30.6 ms 146.0 ms 789 ms -68% / -33% / -12%
syrk 29.7 ms 291.6 ms 1960 ms +4% / -4% / -3%
atax 35.8 ms 265.4 ms - 0% / -29%
bicg 36.4 ms 265.8 ms - 0% / -26%
gesummv 32.2 ms 263.0 ms - 0% / -29%
gemver 34.2 ms 449.9 ms - 0% / -31%
mvt 36.0 ms - - 0%
Pattern: MINI gemm-family sees ~3× speedup (almost all of the prior
~94 ms was H↔D); LARGE for bandwidth-bound gemv kernels gets ~25-30%
(the cuBLAS work is roughly bandwidth-limited, so eliminating one
DRAM round-trip helps). LARGE/XLARGE for compute-bound gemm sees
smaller relative gains because the cuBLAS dgemm time dominates.
scripts/correctness/polybench_cublas_jetson.sh: also link -lcudnn now
(the shim file includes cuDNN code, so link picks it up unconditionally).
Correctness re-verified bit-exact at MINI for atax/bicg/gesummv/syrk
via md5 of GPU dump against CPU dump.
cudaHostRegister has real cost on Jetson — page-table setup for the mapped range is proportional to buffer size. For an 8000×8000 double matrix (128K pages) it's measurable. Gemver does 4 shim calls on the same A, so we were re-registering A four times per kernel run. Replaced the per-call register/unregister with a persistent cache: register on first sight, never unregister. A small flat array (cap=256) keyed on host pointer caches the device pointer. The OS reclaims the mappings at program exit. Effect on LARGE (n=8000): gemver: 450 ms → 390 ms (4 ops on A — biggest win) gesummv: 263 ms → 242 ms atax: 265 ms → 244 ms bicg: 266 ms → 245 ms gemm/2mm/3mm/syrk barely move (each call has distinct buffers, no amortization possible). MINI numbers also unchanged — fixed cuBLAS handle + first-register costs dominate, the cache only helps after. These gemv-style kernels are bandwidth-bound: each cublasDgemv on n=8000 streams 512 MB of A → minimum ~3 ms at Jetson Orin LPDDR5 peak (~204 GB/s). We measure ~120 ms per gemv → sustained ~4 GB/s, about 2% of peak. The big gap is cuBLAS's row-major-via-OP_T emulation — non- coalesced access. To go faster we'd need to either (a) transpose A to column-major once and use OP_N, or (b) fuse the multiple gemvs into a single kernel that streams A once. Both are matcher/lowering changes, not runtime. CPU LARGE numbers (Jetson ARM cores, plain -O3) for reference: atax 107 ms, bicg 294 ms, gesummv 293 ms, gemver 575 ms. So gemver/gesummv beat the CPU at LARGE but only modestly. atax is slower than CPU at LARGE — its inner loop is so trivially vectorizable that the ARM cores' wider memory subsystem wins.
Added a "notes" column next to the speedup column in the per-suite Jetson tables. Each (kernel, dataset) entry gains an optional "notes" string in JETSON_RUNTIMES; the explorer renders it as a small-text grey cell at the row tail. Notes fall into a few buckets: - "Setup-bound": MINI runs across all kernels. The 28-36 ms floor is cuBLAS handle init + first cudaHostRegister page-map for one of the larger buffers; the actual kernel work is microseconds. - "Bandwidth-bound dgemv via OP_T": atax/bicg/gesummv LARGE. cuBLAS emulates row-major y=A·x by passing A as col-major-Aᵀ and applying OP_T. The OP_T kernel uses strided reads across A's rows, killing coalescing. Measured throughput ~2-5% of peak DRAM bandwidth (~204 GB/s on Jetson Orin LPDDR5). CPU's wider memory subsystem + auto-vectorised contiguous-access loops keep pace. - "Matcher fission bug": mvt / gemver. The matcher didn't fission the accumulating init step (kernel.launch overwrites x1/x2/w with β=0 instead of += into the polybench-initialised values). Numerical output is off; wall-clock timing is real. - conv2d: rerun on the current runtime (the conv2d shims weren't touched in the zero-copy refactor but the surrounding runtime got cheaper). New numbers: MINI 27 ms / LARGE 140 ms / EXTRALARGE 305 ms. 3×3 stencil has AI≈1, so it's bandwidth-bound regardless of hardware; cuDNN can't reuse the filter across enough output elements to amortise descriptor setup. - syrk: matched as cublasDgemm with B=A pointer alias. cuBLAS doesn't recognise the symmetry; runs full M*N*K work. A native cublasDsyrk matcher pattern would be ~2× faster (it only updates the lower triangle). No runtime changes. Just metadata + a column.
Adds the darknet (pjreddie/darknet) third-party clone as a fifth
benchmark suite in the IR explorer. The "kernels" are individual .c
files in src/; the bake runs cgeist + raise + match on each.
Approach:
bake_darknet_mlir.sh iterates over third_party/darknet/src/*.c,
baking each through:
cgeist --function='*' --no-inline ...
polygeist-opt --raise-affine-to-linalg-pipeline --linalg-debufferize
kernel_match_rewrite.py
Files use --function='*' because darknet's compute is spread across
many entry points (gemm_nn/nt/tn/tt all need to lift); --no-inline
prevents the raise pass from collapsing init-into-kernel boilerplate
the way it used to on polybenchGpu.
Results (46 .c files, ~25K LOC total):
cgeist OK: 28 (61%)
raise OK: 23 (50%)
produced ≥1 linalg.generic: 18 (39%)
produced ≥1 kernel.launch: 1 ( 2%)
The 1 file that matches: src/gemm.c (6 launches across gemm_nn / nt /
tn / tt / bin). The 17 raise-OK-but-no-match files are an actionable
list of missing matcher templates: pooling (avg/max), batchnorm, LRN,
residual-add, GRU/LSTM gates, transposed conv, locally-connected, dense
+ bias, softmax-with-control-flow, l2norm. The 18 cgeist-fails are
mostly framework code (parser, image, data, network) with no compute.
darknet's actual production hot path is gemm_nn (TA=TB=0). The matcher
hits it as @cublasDaxpy (the inner loop has the scalar-hoisted axpy
shape) but doesn't compose the outer two loops back up into gemm.
gemm_nt and gemm_tt use the conventional sum-accumulator form and do
match as @cublasDgemm_alpha_only. Fixing gemm_nn composition is a
high-value matcher follow-up — it would auto-cover every conv layer
darknet runs at inference time (since every conv goes through gemm_nn
via im2col).
New section in build_ce_viewer.py:
- DARKNET_ROOT / DARKNET_MLIR_DIR path constants
- DARKNET_KERNELS dict (45 .c files)
- DARKNET_NOTES per-file with parallelism tag + characterisation
- DARKNET_BLOCKERS per-file mapped to existing taxonomy
(matcher-gap, cgeist-gap, debuf-bug, none)
- find_kernel_c dispatch for kset="darknet"
- build_index gains darknet_stats parameter
- new section + nav link to "#darknet"
The third_party/darknet/ clone itself is NOT committed (it's a vendored
upstream, would bloat the repo to ~25K LOC for the framework + cfgs).
The bake script's PATH is hardcoded so a fresh clone reproduces the
results.
…end on Jetson Orin
Polybench-style C kernels in third_party/cnn-extracted/, each lifted through the
full Polygeist pipeline (cgeist → raise → debufferize → matcher → ABI lowering →
LLVM IR → aarch64 cross-compile → Jetson silicon).
Five extracted-darknet baseline kernels (matcher templates + lowering branches +
cuDNN/cuBLAS shims + harness + per-kernel HTML page in the IR explorer):
conv2d_batched → cudnnConvolutionFwd_batched 23.8x LARGE
maxpool_batched → cudnnMaxPoolFwd_batched 1.29x LARGE
batchnorm_batched → cudnnBatchNormalizationForwardInference 0.38x LARGE
shortcut_batched → cudnnAddTensor_batched 0.08x LARGE
conv_bn_relu_batched → cudnnConvolutionBiasActivationForward
(with host-side BN folding) 23.5x LARGE
Four fusion-optimization kernels (algebraic rewrites + faster cuBLAS/cublasLt/
cuDNN entry points):
conv_bias_relu_add_batched → cudnnConvolutionBiasActivationForward
(α2*Z addend for ResNet skip) 23x LARGE
gemm_bias_relu → cublasLtMatmul EPILOGUE_RELU_BIAS 901x LARGE
ata_gemm → cublasSsyrk (operand-alias discriminator
detects AᵀA pattern; half the flops) 3393x LARGE
conv1x1_batched → cublasSgemmStridedBatched (4-par+1-red
shape distinguishes K=1 from K×K) 105x LARGE
Cross-cutting infrastructure additions:
* Matcher: ~9 new CompositionEntry templates + AᵀA→syrk post-unify operand-alias
discriminator in kernel_match_rewrite.py. Per-step span replacement preserves
intervening polygeist.submap ops between matched generics.
* Lowering pass: resolveSubmapBase now chains through both polygeist.submap and
polygeist.submapInverse (up to 16 hops). New pre-pass elides redundant
memset_zero_{1D,2D} launches preceding any β=0 op (syrk). Dtype-suffixed
memset dispatch (f32 alongside f64).
* Runtime: cublasLt linkage (libcublasLt.so.12); ensure_cublaslt() helper.
Host-side BN-folding for fused conv+bn+relu (precompute scaled filter + bias).
All cuDNN algo-selection loops use array-sized cudnnConvolutionFwdAlgoPerf_t
buffers (avoiding the stack-smash that bit single-struct attempts).
* Build: scripts/correctness/extracted_darknet_jetson.sh handles all 9 kernels;
bake_extracted_darknet_mlir.sh produces per-stage MLIR snapshots for the
IR explorer; -lcublasLt added to link line.
* IR explorer: two new sections (extracted darknet, Fusion optimization) with
Compiler Explorer deep-links + per-kernel raised/debuf/matched IR preview
pages.
All four fusion optimizations are 100% bit-exact (or FP-noise within 1e-4 print
precision); LARGE speedups range 23x→3393x over the CPU 3-loop reference on the
Jetson Orin (Tegra Ampere, FP32, cuDNN 9.x, CUDA 12.6).
…nv2d + 4 image filters on Jetson Orin
* New LowerKernelLaunchToPVA pass — owns the matcher's i8/i16
@cudnnConvolution2D_9tap_* launches plus new
@pvaBoxFilter_3x3_i{8,16}, @pvaGaussianFilter_3x3_i{8,16},
@pvaBilateralFilter_3x3_i{8,16}, @pvaHistogramEqualization_i8
symbols. Each routes to a polygeist_pva_* runtime shim. Disjoint
symbol set from --lower-kernel-launch-to-cublas; the two passes
run side by side; either order works.
* Shared 9-tap conv lowering helper extracted out of
LowerKernelLaunchToCuBLAS.cpp into KernelLaunchLoweringUtils.{h,cpp}
so both backend passes call the same body. Added a parallel
lowerImageFilter2Operand helper for the 2-memref filter launch
shape (Box/Gaussian/Bilateral/HistogramEq).
* cuBLAS pass: dropped i8/i16 from shimSymbolFor + the dispatch
switch; PVA-claimed launches fall through with a `continue`
instead of erroring out. Net diff is small in the cuBLAS pass
file (the 3 helpers moved out are the bulk of the delta).
* New PVA runtime shim runtime/polygeist_pva_rt.c with:
- cudaSetDevice + nvcvAllocatorConstructPva + non-blocking
stream init (idempotent, lazy, persistent for process lifetime)
- make_pva_image_tensor_dtype: HWC tensor alloc through the PVA
allocator with arbitrary NVCV dtype (needed because half the
PVA ops are U8-only; we reinterpret i8 bytes as U8)
- CupvaMemGetHostPointer-mediated host I/O (raw cudaMemcpy
segfaults on cuPVA-allocated pages; the host-pointer mapping
is mandatory)
- One pva<Op>Create / pva<Op>Submit wrapper per op
- (M-2)×(N-2) interior copy from PVA output back to caller B
to honour the matcher's &B[1][1] pointer-shift convention
(writing the full M×N overflows B by N+1 bytes)
* Matching CPU reference stubs in polygeist_cublas_rt_cpu.c modelled
to mirror PVA hardware semantics: centred kernel anchor, REPLICATE
border, Q-format >>qbits shift, unsigned-kernel reinterpretation
for Conv2d; rounded-mean (sum + 4) / 9 for BoxFilter; canonical
[1,2,1;2,4,2;1,2,1] / 16 for Gaussian; textbook 256-bin CDF-LUT
for HistogramEq. Bilateral has a pass-through stub (the
non-linear hardware semantics aren't worth mirroring bit-exactly).
* third_party/polybenchGpu-extracted/conv2d_i8.c — i8 variant of
the 9-tap stencil (i16 already existed). Matcher fires on it via
the existing dtype-suffix template + emits
@cudnnConvolution2D_9tap_i8, which the new PVA pass claims.
* Cross-compile script conv2d_cudnn_jetson_dtype.sh: i8 dtype
branch added; PVA-library link line (-lpva_operator -lcvcuda
-lnvcv_types -lcupva_host) plus direct DT_NEEDEDs for
-lnvscibuf -lnvscisync via -Wl,--no-as-needed (deferred
resolution segfaults during libcupva_host init constructors);
step (5) now invokes both --lower-kernel-launch-to-cublas
and --lower-kernel-launch-to-pva.
* Four hand-authored kernel.launch test scaffolds in
scripts/correctness/pva_{boxfilter,gaussian,bilateral,histeq}_jetson.sh.
Matcher templates for these C-level patterns aren't written yet,
so each script synthesises the kernel.launch MLIR directly and
runs the rest of the pipeline normally — same harness, wrapper,
ABI lowering, and link line.
* IR explorer (scripts/correctness/build_ce_viewer.py): new "PVA
backend" section at the bottom. Shows the 6 PVA-routed kernels
with their op name, libpva_operator entry points, shim symbol,
and Jetson PVA wall-clock at each size we benchmarked. No CPU
comparison in this view (CPU stubs exist for separate per-op
bit-exact validation).
* CLAUDE.md: "point, don't copy" rule for gated-distribution NVIDIA
SDKs. PVA Solutions / cuPVA SDK headers consumed via -I at build
time; never copied into the Polygeist tree.
End-to-end silicon validation on Jetson Orin: bit-exact PVA-vs-CPU
diff for Conv2d i8/i16, BoxFilter, Gaussian, and HistogramEq at 256².
Bilateral runs cleanly; visual spot-check only (non-linear).
Conv2d at 10240×10240: PVA 216 ms vs CPU 499 ms (2.3× speedup for i8).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Some modifications to fuse linalg.generic op with for op