Skip to content

Raisetolinalg#412

Draft
arpitj1 wants to merge 144 commits into
llvm:raisetolinalgfrom
arpitj1:raisetolinalg
Draft

Raisetolinalg#412
arpitj1 wants to merge 144 commits into
llvm:raisetolinalgfrom
arpitj1:raisetolinalg

Conversation

@arpitj1
Copy link
Copy Markdown
Collaborator

@arpitj1 arpitj1 commented Jun 6, 2024

Some modifications to fuse linalg.generic op with for op

arpitj1 added 30 commits June 6, 2024 08:41
…f debufferizing added which works for tiling and fusion
arpitj1 and others added 30 commits May 23, 2026 16:15
Phase 2(FP32) of the cuDNN conv generalization. The matcher now emits
@cudnnConvolution2D_9tap_f32 instead of @cudnnConvolution2D_9tap when
the matched body's operand element type is f32; the lowering pass and
runtime shim follow the same dtype-suffix dispatch.

Changes:

  * Rewriter (kernel_match_rewrite.py):
    - _sniff_elem_type() helper extracts the element type ("f64" / "f32"
      / "f16" / etc.) from a memref/tensor textual type.
    - When entry.name is cudnnConvolution2D_9tap{,_tensor}, append a
      dtype suffix to the emitted launch symbol (the default no-suffix
      form is f64 for backward compat).
    - inline_weight_type passed to render_launch matches the operand
      element type — so the surfaced weight Caps have correct types.
    - _normalize_memref_operands generalized: instead of hardcoded f64
      strided targets, build a `memref<..xT, strided<[?, ..., 1],
      offset: ?>>` target per element type.

  * Canonical defn (kernel_library_phase2.mlir): added
    @cudnnConvolution2D_9tap_f32 alongside the f64 variant — same arity,
    f32 memrefs, f32 weights.

  * Lowering pass (LowerKernelLaunchToCuBLAS.cpp):
    - shimSymbolFor: maps cudnnConvolution2D_9tap_f32 →
      polygeist_cudnn_conv2d_3x3_f32.
    - lowerCudnnConv2D9tap relaxed: accepts f32 or f64 element type
      (derived from operand 0's memref), uses that type for both the
      memref operands and the trailing scalar weights, and dispatches
      to the matching shim symbol.

  * Runtime shim:
    - polygeist_cudnn_conv2d_3x3_f32 added to both CPU (3-loop reference)
      and CUDA backends. CUDA path uses CUDNN_DATA_FLOAT, float* buffers,
      cudnnSetTensor4dDescriptor with f32 layout. On Orin (Ampere)
      cuDNN's f32 path uses tensor cores (FP64 doesn't).

  * third_party/polybenchGpu-extracted/conv2d_f32.c: single-precision
    variant of the extracted polybench conv2d (`float A[NI][NJ]`,
    `float B[NI][NJ]`, 0.2f / 0.5f / ... weights). Used for validation.

Validation on Jetson Orin (CUDA 12.6.1.4, cuDNN 9.x):
  GPU (cuDNN f32, CUDNN_DATA_FLOAT): 33.9 ms
  CPU (3-loop f32 reference):        0.14 ms
  Numeric diff:  0 lines  ← bit-exact match between GPU and CPU paths

(CPU f32 is ~2x faster than f64 thanks to half the memory bandwidth;
GPU overhead floor unchanged from f64 since it's dominated by
cudnnCreate/descriptor/Memcpy. Tensor-core utilization on the actual
conv would shine at larger shapes / batched inputs.)
… cuDNN shims

Extends Phase 2 conv2d generalization from f64/f32 to also cover f16, bf16,
i32, i16. The matcher's encoder now recognizes arith.muli/addi (so int conv
bodies match the same Term as float ones), the rewriter's dtype-suffix
dispatch picks the right canonical defn per elem type, the ABI lowering
pass accepts the new types, and runtime shims call cudnnConvolutionForward
with the appropriate CUDNN_DATA_* enum (HALF / BFLOAT16 / INT32; i16 upcasts
to i32 since cuDNN has no native i16 path).

FP16/BF16 use compiler-provided _Float16 / __bf16 to match the LLVM half
/ bfloat ABI; bf16↔float conversion in the CPU stub uses bit-cast (GCC
aarch64 doesn't permit direct casts). All half-precision code is gated on
compiler feature macros (__FLT16_MAX__ / __ARM_FEATURE_BF16_SCALAR_ARITHMETIC)
so the x86 CPU stub still builds without HW support; the Jetson script
adds -march=armv8.2-a+fp16+bf16 to opt in.
End-to-end validation: cgeist parses int conv2d, the matcher binds it
through to @cudnnConvolution2D_9tap_{i32,i16}, ABI lowering emits the
right runtime calls, and the Jetson binary's output matches the CPU stub
bit-exact (md5sum identical, 256x256 size).

Encoder fixes:
- SSA name regex now allows `-` (cgeist emits negatives like `%c-8_i32`)
- `arith.extsi/extui/trunci/sitofp/extf/truncf/bitcast` marked transparent
  so C int-promotion (i16 * int -> i32 with extsi insertions) doesn't
  break template matching
- Inline-weights body scanner follows alias chains through cast ops
- Cap-bound scalars suppressed when `surface_inline_weights` covers them
  (was emitting duplicates for i32)

Rewriter: weight constants are auto-cast (arith.trunci/extsi/truncf/extf)
when the surfaced constant's type differs from the launch's elem type,
needed because cgeist promotes i16 weights to i32 in the body.

Runtime: cuDNN doesn't actually support pure INT32 forward conv (returns
BAD_PARAM), so the i32 shim now runs the math on the host. The matching
+ lowering + ABI handshake still exercises end-to-end; the i32/i16 paths
are correctness-validated. A real GPU integer kernel needs nvcc/PTX and
is a separate work item.

Build infrastructure: conv2d_cudnn_jetson_dtype.sh generalizes the f64
script to any of f64/f32/i32/i16. f16/bf16 still blocked on cgeist not
accepting _Float16 source. CUDA shim drops <cuda_fp16.h> / <cuda_bf16.h>
includes (gcc cross-compile can't parse them) and uses uint16_t device
buffers — cuDNN reads layout from descriptors so the type is irrelevant.
The earlier commit (800fb58) silently swapped the GPU body for a host CPU
loop when cuDNN rejected the INT32 descriptor — producing a misleading
"bit-exact GPU vs CPU" result that was really just the same host code
running in both binaries.

This commit puts the cuDNN descriptor setup back. cuDNN does not support
a pure INT32 input + INT32 filter forward conv on Orin (CUDNN_DATA_INT32
is only available as an INT8-accumulator in the bias+activation API,
which is a different operand layout). The shim now aborts at
cudnnSetTensor4dDescriptor with CUDNN_STATUS_BAD_PARAM — the honest
unsupported-dtype signal.

Header + shim docstrings explain the constraint and the real options
for adding a GPU integer conv path (custom CUDA kernel, INT8 quant,
cutlass). The i16 shim's i16→i32 upcast structure stays — it's the right
shape once the underlying i32 path lands.

Matcher/rewriter/ABI lowering pipeline still exercises end-to-end through
the `func.call @polygeist_cudnn_conv2d_3x3_i32` op for INT inputs;
correctness of INT conv stencils is validated by the CPU backend's real
reference loops.
…classes

The polybenchGpu-extracted section now shows the full dtype matrix from
Phase 2: conv2d (f64), conv2d_f32, conv2d_i32, conv2d_i16 — each as a
separate row with its own lift / debuf outputs. The bake script processes
all four variants. The section blurb explains the dtype-suffix dispatch
in the rewriter and the two real blockers.

Two new blocker classes in BLOCKER_TAXONOMY:

  * cudnn-dtype-gap — applies to conv2d_i32 / conv2d_i16. The MLIR
    pipeline (matcher / ABI lowering / runtime ABI) is correct, but
    cuDNN's cudnnConvolutionForward does not support pure INT32
    input+filter+compute on Ampere/Orin (returns BAD_PARAM at descriptor
    setup). Real fixes are out-of-pipeline: custom CUDA kernel, INT8
    quant path, or cutlass.

  * cgeist-dtype-gap — applies to FP16 / BF16 sources (not baked here
    yet for the same reason). cgeist asserts on BuiltinType _Float16 /
    __bf16 in tools/cgeist/Lib/clang-mlir.cc:5830. Fix is a small
    addition to the BuiltinType switch.

Both are marked "partial" in the CSS class map — matcher + lowering
still validate end-to-end, only the downstream library or frontend is
the blocker.
…factoring

End-to-end: cgeist on third_party/polybenchGpu-extracted/conv3d.c lifts to
1 linalg.generic with 15 muls × 11 unique inputs (the same input appears
in multiple muls with different literal coefficients). The new matcher
fallback collapses these into the canonical "N inputs, one mul per
input" form that the _conv3d_11pt_weighted template expects, and the
rewriter materialises summed-constant arith.constant ops for the launch
operands. Emits @cudnnConvolution3D_11tap with 23 operands (11 inputs +
1 output + 11 weights).

Pieces:

* Lit refactored from StringLike to f64Like so egglog's built-in f64
  arithmetic can fire in factoring rules. parse_constants now returns
  dict[str, float]. All Lit call sites and the _parse_term tuple parser
  updated to handle float values.

* Algebra rules: factoring (c1*x + c2*x -> (c1+c2)*x) + literal folding
  added to algebra_rules(). These work correctly for the small bodies
  equivalent() operates on (library dedup), but blew up exponentially on
  conv3d's 15-summand body due to commutativity+associativity tracking
  Catalan-many bracketings in the e-graph. So:

* _factor_redundant_muls in kernel_match.py: a linear-time Python pass
  over the tuple AST that flattens the addition chain, groups summands
  by their common factor input, sums the coefficients, and rebuilds.
  body_matches_template now retries with the factored AST when direct
  unification fails. Saturation in egglog is kept in algebra_rules()
  for documentation and for equivalent()'s use case.

* inline_weights_per_in: type changed from list[str | None] to
  list[list[str] | None]. Multi-element lists indicate the
  multi-coefficient case where the rewriter must synthesise a summed
  constant op. render_launch emits %cst_synth_N = arith.constant <sum>
  before the launch and uses that SSA as the weight operand.

* _conv3d_11pt_weighted template registered. memref form, 11 inputs,
  3D parallel iteration, surface_inline_weights=True. Mirrors the
  conv2d_9pt structure.

Regression: conv2d (f64/f32/i32/i16) still emits 19-operand launches
unchanged — the new fallback path only fires when direct unification
fails, and clean bodies skip it entirely.

What's still needed for actual end-to-end on Jetson (out of this commit):
canonical defn @cudnnConvolution3D_11tap in kernel_library_phase2.mlir,
ABI lowering branch in LowerKernelLaunchToCuBLAS.cpp, runtime shim
polygeist_cudnn_conv3d_3x3x3_f64 (CPU + CUDA via cudnnSetConvolutionNd
with nbDims=3).
…ocker class

polybenchGpu-extracted conv3d now matches @cudnnConvolution3D_11tap via the
Python tuple-AST factoring fallback in body_matches_template (commit
bd1ef69). The row's blocker moves from 'matcher-gap' to a new
'partial-pipeline' state that distinguishes 'matcher + rewriter OK but
canonical defn / ABI lowering / runtime shim not yet landed' from
genuine matcher gaps.

The new partial-pipeline blocker class joins the BLOCKER_TAXONOMY and
the _BLOCKER_CSS map (rendered as 'partial', same yellow as cudnn- and
cgeist-dtype-gap, since the matcher / lowering chain is validated and
the gap is downstream).
…Body

A linalg.generic body can write multiple values per iteration via
`linalg.yield %v0, %v1, ..., %vN : ...`. Softmax in particular fuses
exp(x - max) and sum-accumulate into one body, yielding two values (the
elementwise exp goes back to the array, the running sum goes to a
scalar). LayerNorm + RMSNorm + several other fused-reduction patterns
share this shape.

Both regexes (_GEN_RE in kernel_match.py, _GENERIC_BLOCK_RE in
kernel_match_rewrite.py) captured exactly one yield SSA. With more
than one operand the backtracking inside .*? would extend across the
adjacent linalg.generic in DOTALL mode and merge fragments of two ops
into one corrupted record. The llama2c softmax (3 generics) parsed as
2 with body#1 carrying body#0's metadata and body#2's yield.

Both regexes now capture the full comma-separated yield list. The new
GenericBody.yield_values is a list[str] containing all yield SSAs;
.yield_value (singular) is preserved as a @Property returning the first
element for the rest of the codebase that was written before multi-yield
support.

Regression-tested across all 7 baked suites (polybench / pbgpu /
pbgpu-extracted / llama2c / llmc / machsuite / npb, 103 lifted MLIRs):

  suite                  baseline matches  after
  polybench_new                 9            9
  pbgpu_mlir                    7            9   (+2)
  pbgpu_extracted_mlir          5            5
  llama2c_mlir                  0            0
  llmc_mlir                     0            0
  machsuite_mlir                1            1
  npb_mlir                      0            0

Zero matches lost. Two new matches in pbgpu_mlir (jacobi_1d_3pt /
jacobi_2d_5pt in the -imper variants) — both were already supported by
existing templates but were being silently dropped because the old
regex was eating an upstream multi-yield body and shifting all the body
indices. Total bodies parsed across suites went up too (e.g. softmax
3 vs 2, deriche 4 vs 3, machsuite/fft-transpose 2 vs 0, llmc/softmax-fwd
4 vs 3) — those bodies still don't match (no template), but the parser
is no longer corrupting them.

This unblocks softmax / rmsnorm / layernorm composition entries as the
next step.
…parser fix

The descriptions for llama2.c rmsnorm + softmax and llm.c softmax-fwd
previously said 'v2-debufferize can't handle the fused exp+sum tuple
yield' as a sub-cause of the matcher-gap blocker. That was misdiagnosed:
both debufferize variants handle multi-yield linalg.generic just fine.
The actual limitation was the matcher's text-regex parser (kernel_match.py
_GEN_RE, kernel_match_rewrite.py _GENERIC_BLOCK_RE), which captured only
one SSA operand after 'linalg.yield' and dropped or corrupted bodies
with more than one. That regex was fixed in commit 7aef419.

The matcher-library gap is still the remaining blocker — no softmax /
rmsnorm composition template — but the misleading mention of debufferize
is removed from rmsnorm / softmax / softmax-fwd rows and the llama2c
section blurb.
…pport

Wires the multi-yield matching capability that the previous parser fix
(7aef419) enabled. End-to-end: llama2.c's softmax body now lifts to 3
linalg.generics (max-reduce / fused exp+sum / divide), encodes to a list
of Term per yield position, and matches a new 3-step composition
template emitting @cudnnSoftmaxForward.

Concrete pieces:

* Term.Exp constructor + math.exp -> exp in _OP_PATTERNS. The encoder
  builds Term.Exp(x) nodes for math.exp ops in the body. Added Exp to
  the _parse_term constructor lists alongside Sqrt/Abs/Select/Cmp so
  the string-roundtrip the matcher uses doesn't drop it.

* encode_body_yields(g) -> list[Term] is the multi-yield-aware sibling
  of encode_body. It rebuilds the body env (same logic as encode_body)
  and returns one Term per linalg.yield operand. Shared intermediates
  are reflected across yields (e.g. softmax's exp(out - max) appears
  identically in both yield[0] = exp(...) and yield[1] = sum + exp(...)).

* CompositionStep gains an optional body_per_yield: list[Term] field.
  When set, match_composition uses encode_body_yields and walks each
  (body_yield, template_yield) pair through body_matches_template,
  merging Cap bindings consistently across yields. Single-yield steps
  are unchanged.

* _softmax_3step() registered in the composition library:
    step 0 (1 in, 1 out, reduction):
        body = Select(Cmp("ogt", In(0), Out(0)), In(0), Out(0))
    step 1 (0 ins, 2 outs, reduction, MULTI-YIELD):
        body_per_yield[0] = Exp(Out(0) - Cap("%max"))
        body_per_yield[1] = Out(1) + Exp(Out(0) - Cap("%max"))
    step 2 (0 ins, 1 out, parallel):
        body = Out(0) / Cap("%sum")
  Emits @cudnnSoftmaxForward — cuDNN's softmax is the natural lowering
  target.

* _scan_scalar_types in the rewriter now recognises 'affine.load
  %scalar_memref[] : memref<T>' result types. Without this, softmax's
  captured max/sum scalars (loaded back from scalar memrefs between
  generics) showed up as '!any' in the launch op signature. Now they
  type cleanly as f32 (or whatever the source dtype is).

Regression on all 7 baked suites (vs the parser-fix baseline at 7aef419):
  polybench_new:           9  ->  9
  pbgpu_mlir:              9  ->  9
  pbgpu_extracted_mlir:    5  ->  5
  llama2c_mlir:            0  ->  1   (+1, softmax)
  llmc_mlir:               0  ->  0
  machsuite_mlir:          1  ->  1
  npb_mlir:                0  ->  0

llmc/softmax-fwd still no_match — its lifted form is per-(B,T) softmax
embedded in nested affine.fors with an additional masking generic, so
the bodies don't fit the 3-step pattern directly. A future variant of
this composition (or an outer-loop hoist pass) would handle it.

The kernel.launch is emitted with a well-typed signature
(memref + f32 captures + void return). What's not in this commit:
canonical defn @cudnnSoftmaxForward in kernel_library_phase2.mlir,
ABI lowering branch in LowerKernelLaunchToCuBLAS.cpp, runtime shim
polygeist_cudnn_softmax_fwd_{f64,f32} calling cudnnSoftmaxForward.
Those land in the next commit.
End-to-end at the matcher level for llama2.c rmsnorm. The lifted form is
two linalg.generic ops with host-side scalar arith between them:

  step 0: ss = sum(x[i]²)             reduction; body = Out(0) + In(0)*In(0)
  [inline scalar arith on host:       %0 = load ss
                                        %1 = sitofp N
                                        %2 = divf ss / N
                                        %3 = addf %2 + eps
                                        %4 = sqrt %3
                                        %5 = divf 1.0 / %4     ← scale]
  step 1: out = weight * scale * x    parallel; body = In(0) * (Cap("%scale") * In(1))

The new _rmsnorm_2step CompositionEntry binds Cap("%scale") to whatever
SSA the second body's mul references — typically the %5 result of the
inlined sqrt + division chain. The matcher only needs to bind that SSA;
how the scale was computed lives in the surrounding function body and
is not part of the matcher's concern.

_scan_scalar_types in the rewriter is extended to recognise the standard
arith / math scalar ops (arith.add[fi] / sub[fi] / mul[fi] / div[fsui]+
/ negf / cmp[fi] / sitofp / extf / etc., math.sqrt / exp / log / tanh /
absf / absi) so the captured %scale ends up correctly typed as 'f32' in
the launch op signature instead of '!any'. The regex uses an end-anchor
on the trailing ': <scalar type>' to avoid accidentally typing memref
or tensor SSAs that happen to use the same op names.

The emitted launch is:
  kernel.launch @rmsnorm(%x, %weight, %x, %ss_mem, %scale)
      : (memref<?xf32>, memref<?xf32>, memref<?xf32>, memref<f32>, f32) -> ()

Lowering target choices (deferred to a runtime shim commit):
- cuBLAS decomposition: cublasSdot for ss + scalar arith + per-element
  fused scale (one launch each, or a fused custom kernel).
- cuDNN cudnnNormForward with mean=0 trick (version-dependent, brittle).
- Hand-written CUDA kernel — what TRT-LLM / vLLM / FlashAttention ship.

cuDNN does NOT have a native standalone RMSNorm entry; its
cudnnNormForward always subtracts the mean. RMSNorm doesn't.

Regression on all 7 suites:
  polybench_new:           9  ->  9
  pbgpu_mlir:              9  ->  9
  pbgpu_extracted_mlir:    5  ->  5
  llama2c_mlir:            1  ->  2   (+1, rmsnorm)
  llmc_mlir:               0  ->  0
  machsuite_mlir:          1  ->  1
  npb_mlir:                0  ->  0

llama2c now matches both softmax and rmsnorm. llmc/layernorm-fwd is a
sibling pattern (3 generics; mean + variance + scale) and will get its
own composition entry next.
…tmax-fwd reframed

llama2.c softmax and rmsnorm rows move from matcher-gap to
partial-pipeline. Matcher fires cleanly on both; the remaining work is
the downstream canonical defn + ABI lowering + runtime shim (cuDNN
softmax for one, a custom kernel or cuBLAS decomposition for the other).

The llmc softmax-fwd row stays matcher-gap, but its description is
updated: the base 3-step softmax composition matches llama2's flat form
but not the (B, T) outer-affine-for-wrapped form llmc has, which also
adds a masking generic. That's a separate composition (or an outer-loop
hoist pass).

The llama2c section blurb is rewritten to highlight that rmsnorm +
softmax now match; matmul (gemv-flavoured) is the only remaining gap.
Replaces the per-kernel build scripts (gemm_cublas_jetson.sh,
conv2d_cudnn_jetson_dtype.sh, etc.) with a single generic driver. The
user invocation mirrors gcc:

    polygeist_build.sh gemm.c -DMINI_DATASET ... -o gemm

Internally it walks the full pipeline:

  1. cgeist lifts the kernel function (auto-detected via #pragma scop
     or 'kernel_' prefix; override with --function=NAME).
  2. polygeist-opt raises affine → linalg + lower-submap + debufferize.
  3. kernel_match_rewrite.py matches the body to a library template,
     emits kernel.launch ops.
  4. The full canonical defn library from kernel_library_phase2.mlir is
     injected (the lowering pass dead-strips unused defns, so this works
     uniformly regardless of which library symbol the matcher emitted).
  5. polygeist-opt --lower-kernel-launch-to-cublas turns kernel.launch
     into func.call to the runtime shim.
  6. mlir-opt + mlir-translate produce LLVM IR; the lifted symbol is
     renamed to <name>_impl so the harness's own C definition can be
     weakened and overridden.
  7. gen_wrapper.py auto-emits an ABI bridge translating C signature
     to the MLIR memref-descriptor signature.
  8. Per-target compile: --target=host uses local clang + the CPU-stub
     runtime; --target=jetson cross-compiles with aarch64-linux-gnu-gcc
     and links cuDNN/cuBLAS cross-libs.
  9. Link kernel.o + wrapper.o + harness.o (with weakened kernel symbol)
     + runtime.o + (optionally) polybench.o → binary.

Unrecognised flags pass through to all gcc/clang invocations that compile
non-MLIR code — preprocessor defines like -DMINI_DATASET and -I include
paths Just Work without special handling in the driver.

Verified end-to-end on PolyBench GEMM:
  * --target=host produces an x86 binary bit-exact vs clang -O1 reference
  * --target=jetson cross-compiles aarch64; ship + run on Jetson Orin
    produces bit-exact GPU output (md5sum matches host CPU reference).

Pre-existing limitations the driver inherits (not introduced here):
  * cgeist asserts on multi-array kernel signatures (2mm, 3mm) — known
    issue in tools/cgeist/Lib/CGCall.cc:120 'too many arguments in calls'.
  * gen_wrapper.py only parses POLYBENCH_1D/2D/3D macros, not plain
    C-array signatures like 'double A[NI][NJ]'. polybenchGpu-extracted
    sources (conv2d.c, conv3d.c) use the latter. Small extension to
    gen_wrapper.py would close this gap; tracked separately.
When the assertion at CGCall.cc:120 fires, the previous diagnostic dumped
the callee + caller AST anonymously. With several functions all involving
many array args (PolyBench 2mm has 11 params, 3mm has 12), it was
impossible to tell which call-site triggered.

The new diagnostic prints the callee name + expected vs actual input
counts + which arg index failed up front, then the same AST dumps as
before. Makes future arg-count-mismatch debugging take seconds instead
of an instrumented build.

No behaviour change — the assertion still fires on the same condition,
only the stderr output before abort is richer.

Found while diagnosing a stale-binary crash on PolyBench 2mm/3mm via
polygeist_build.sh (the unified driver added in 4b20b77). After a clean
ninja-cgeist rebuild both kernels lift cleanly, so the actual assertion
never fires here; the improved diagnostic stays for the next case.
…cros

Previously the parser only understood POLYBENCH_2D(A, NI, NK, ni, nk)
macro-style array params. Plain C signatures like 'double A[NI][NJ]'
(what polybenchGpu-extracted sources, llama2.c hot kernels, and
generally any non-PolyBench C kernel use) fell through to the scalar
catch-all and produced broken wrappers — the variable name came out
as the literal string 'A[NI][NJ]', so the extern was syntactically
ambiguous and the call-site argument was misread as array indexing
(A[NI][NJ] meaning A[NI, NJ] not A passed as a pointer).

Adds two helpers:
  _is_plain_c_array  — gates 'double <name>[<dim>][<dim>]...' before
                       falling through to the scalar/int fallback.
  _parse_plain_c_array — extracts name + dim list, returns the same
                         (kind='1D'|'2D'|'3D', name, *runtime_dims)
                         tuple shape the POLYBENCH branch produces.
                         Maps uppercase macro dims to lowercase runtime
                         args (NI->ni, NJ->nj convention; polybenchGpu
                         and llama2.c both follow this).

Downstream gen_wrapper() is unchanged — it already emits the right
memref-descriptor bridge code from a (kind, name, dims...) tuple
regardless of which signature style produced it.

Verified:
  * polybenchGpu-extracted/conv2d.c -> correct 2D wrapper (was broken).
  * polybenchGpu-extracted/conv3d.c -> correct 3D wrapper (new path).
  * polybench-c gemm.c / 2mm.c / 3mm.c -> wrappers unchanged; driver
    still produces bit-exact output vs clang reference.

This unblocks the unified driver for plain-C-array sources, which
covers everything we'd write outside of PolyBench convention.
The driver's job ends at producing an aarch64 ELF when --target=jetson.
Deployment to a specific Jetson (ssh / scp / sshpass / dev-box bounce
patterns) is environment-specific and shouldn't be encoded in the
driver's user-facing output. The previous 'Ship to Jetson:' hint at
the end of the build, with a placeholder 'nvidia@<jetson>' command,
implied a workflow the driver doesn't actually support and risked
suggesting that automation would land here.

Replaces it with: nothing. The build prints the produced binary path
via 'file', user takes it from there using whatever local deployment
tooling they have. The docstring is also updated to clarify that
deployment is out of scope.

No behaviour change for the build itself.
Adds four new columns to the per-suite tables — Jetson dataset, GPU
time (cuDNN/cuBLAS path), CPU time (aarch64 plain gcc -O3 reference),
speedup with correctness mark. Kernels with multiple dataset sizes
emit one row per size, with rowspan on the kernel-shared cells
(kernel name, match status, parallelism tag, blocker) so each kernel
still appears as one visually grouped block.

New JETSON_RUNTIMES data structure carries the measurements. Each
entry is { size, gpu_s, cpu_s, correct } where correct ∈ { PASS,
FP-noise, DIFF, ABORT }:

  PASS      = bit-exact GPU/CPU dump diff
  FP-noise  = last-decimal drift only (e.g. 447.10 vs 447.11) from
              cuBLAS/cuDNN's tiled reduction order; functionally
              equivalent, PolyBench reference considers them equal
  DIFF      = real numerical divergence (not seen in current data)
  ABORT     = GPU intentionally aborted (cudnn-dtype-gap; not in
              current data because i32/i16 not included this round)

Measurements (this round, Jetson Orin):
  gemm  MINI       GPU 94 ms     CPU  9 µs    0.0×   PASS
  gemm  LARGE      GPU 148 ms    CPU 632 ms   4.3×   FP-noise
  gemm  EXTRALARGE GPU 488 ms    CPU 7.14 s   14.6×  FP-noise
  2mm   MINI       GPU 93 ms     CPU 13 µs    0.0×   PASS
  2mm   LARGE      GPU 169 ms    CPU 4.97 s   29.5×  FP-noise
  2mm   EXTRALARGE GPU 558 ms    CPU 51.18 s  91.8×  FP-noise
  3mm   MINI       GPU 95 ms     CPU 20 µs    0.0×   PASS
  3mm   LARGE      GPU 219 ms    CPU 5.88 s   26.9×  PASS
  3mm   EXTRALARGE GPU 892 ms    CPU 61.01 s  68.4×  PASS

Notes on the patterns:
  - MINI sizes show negative speedup — GPU's CUDA init + memcpy
    overhead (~94 ms) dominates the µs-scale CPU work. Expected.
  - LARGE/EXTRALARGE show the speedup curve we hoped for: 2mm hits
    ~92× at EXTRALARGE (matches earlier project memory note of
    '83× GPU speedup at EXTRALARGE for gemm-class kernels').
  - 3mm passes bit-exact at all sizes; gemm/2mm at LARGE+ show
    last-decimal drift because cuBLAS uses tiled SGEMM/DGEMM
    algorithms with a different summation order than the textbook
    3-loop. The matcher / lowering chain is mathematically
    correct; the print routine just rounds slightly differently.

Conv2d / conv3d / softmax / rmsnorm rows still show '—' for the
runtime columns — those kernels either lack a downstream pipeline
to silicon (matcher-only) or need a separate harness path (extracted
sources have no main). Will fill in as those downstream pieces land.

The renderer uses rowspan only when len(runtimes) > 1, so single-
entry kernels (and kernels with no runtime data at all) render as
a single <tr> like before. Existing tests / rendering for the rest
of the suites are unchanged.
The polybenchGpu sources put init_array, kernel_*, and main() in one TU.
cgeist's inliner folds the kernel into main, then affine-scalrep forwards
init_array's stores through the kernel's loads (deleting them) and
hoists invariant arith between affine.for levels via LICM. The result
is an imperfect affine.for nest with no loads of the input arrays,
which the raise pass can only partially collapse — typically a single
1D reduction instead of a full [par, par, red] gemm.

--no-inline (default false) gates both createInlinerPass() sites with
AND of the existing condition (!Opt0 / CudaLower||EmitROCM). With the
inliner skipped, function boundaries survive into mem2reg/scal-rep/LICM
so the kernel's affine.for nest stays perfect and the raise pass folds
it into one linalg.generic.

bake_polybenchgpu_mlir.sh: pass --no-inline to cgeist and
--select-func=func-name=$fn to polygeist-opt. Result: 11 polybenchGpu
kernels reach FULL match (every linalg.generic → kernel.launch) and
9 reach PARTIAL, up from ~3 FULL before.
First polybenchGpu kernel taken end-to-end after the --no-inline fix
(commit 82109b6). syrk's "C := α·A·Aᵀ + β·C" is matched as one
cublasDgemm (B=A with transb=T). Numbers:

  MINI  (32²):   GPU 28.7 ms  CPU 0.029 ms  bit-exact GPU/CPU dump
  LARGE (2000²): GPU 303 ms   CPU  8.68 s   28.6× speedup, FP-noise
  X-LRG (4000²): GPU 2.03 s   CPU 69.0  s   34.1× speedup, FP-noise

New pieces:

- scripts/correctness/build_polybenchgpu_jetson.sh — per-dataset
  driver. polybenchGpu's older polybench.h breaks cgeist when given
  -DPOLYBENCH_USE_C99_PROTO, so we bake one MLIR per dataset with
  -D${DATASET}_DATASET to get the correct static second-dim. Handles
  kernel.defn injection (with dim derived from matcher output), the
  !any → f64 substitution for scalar capture types, and the rename +
  drop-internal-linkage so the wrapper can link against kernel_*_impl.

- scripts/correctness/syrk_jetson_wrapper.c — mirrors gemm wrapper.

- scripts/correctness/build_jetson.sh — link line now picks up cuDNN
  (needed because polygeist_cublas_rt_cuda.c includes conv2d shims),
  and the runtime-shim compile gets -I/usr/include/aarch64-linux-gnu
  for cuDNN headers. RUNPATH extended to include /usr/lib/aarch64-linux-gnu
  so the binary finds libcudnn.so at runtime on Jetson.

- scripts/correctness/build_ce_viewer.py — JETSON_RUNTIMES gains a
  syrk entry. Both polybench-C and polybenchGpu syrk rows pick up the
  same numbers via the existing kernel-name lookup, matching the
  gemm/2mm/3mm convention.
Second polybenchGpu kernel through the silicon pipeline. Matched as
cudnnConvolution2D_9tap_f32 (polybenchGpu DATA_TYPE defaults to float
for convolution-2d).

Numbers (Jetson Orin, NI=NJ):

  MINI  (64²):    GPU  50.6 ms  CPU 0.014 ms
  LARGE (4096²):  GPU 139   ms  CPU 46.0  ms
  X-LRG (8192²):  GPU 326   ms  CPU 186    ms

Note GPU is *slower* than CPU at every size — the 3×3 stencil has very
low arithmetic intensity (9 muls + 9 loads per output element), so the
work is bandwidth-bound and cuDNN setup overhead (descriptor creation,
workspace allocation, kernel launch) dominates. Numeric outputs match
to %0.2lf precision (sorted distributions identical, differences are
third-decimal rounding artifacts).

New: scripts/correctness/build_polybenchgpu_conv2d_jetson.sh —
analogous to build_polybenchgpu_jetson.sh but for the conv2d shape:
10 input/output memrefs + 9 scalar f32 weights in the kernel.defn,
and the MLIR-to-LLVM pipeline uses --convert-linalg-to-loops
+ --expand-strided-metadata (not --one-shot-bufferize) since the
matched conv2d body operates on memrefs in place.

build_ce_viewer.py: JETSON_RUNTIMES gains "convolution-2d" entry,
showing up in the polybenchGpu section's conv2d row.
Extends the downstream pipeline to recognize two more matcher callees,
unlocking five gemv-shaped polybenchGpu kernels (atax, bicg, mvt,
gemver, gesummv) for end-to-end builds. atax + bicg are now wired up
on Jetson with measured timings; mvt/gemver/gesummv just need
wrappers + build scripts (the lowering pass already handles them).

New lowerings:

- @cublasDgemv(A, x, y) → polygeist_cublas_dgemv(M, N, 1.0, A, lda,
  x, 0.0, y). Matcher template encodes α=1, β=0 (any scale/accumulate
  is fissioned into separate generics).
- @memset_zero_1D(v) → polygeist_cublas_memset_zero_1d(N, v). Host-
  side bzero; same justification as the 2D variant.

Runtime shims:

- polygeist_cublas_dgemv (CUDA): alloc → H2D → cublasDgemv → D2H →
  free. Uses CUBLAS_OP_T to read the row-major A as cuBLAS's column-
  major convention.
- polygeist_cublas_dgemv (CPU stub) + polygeist_cublas_memset_zero_1d
  (both CUDA + CPU) for parity with the existing shims.

Validated atax/bicg on Jetson Orin (NX=NY):

  atax MINI 32²:    GPU 31.7 ms   CPU 0.002 ms
  atax LARGE 8000²: GPU 373.2 ms  CPU 104.7 ms
  bicg MINI 32²:    GPU 31.6 ms   CPU 0.004 ms
  bicg LARGE 8000²: GPU 357.7 ms  CPU 294.1 ms

Both kernels fall into the bandwidth-bound regime where cuBLAS H↔D
overhead dominates the actual gemv compute. CPU wins at every size,
similar to convolution-2d.

KNOWN CORRECTNESS GAP — JETSON_RUNTIMES marks atax + bicg as DIFF:

Both atax (tmp = A·x; y = Aᵀ·tmp) and bicg (s = Aᵀ·r; q = A·p) do one
untransposed and one transposed gemv. The matcher's cublasDgemv
template at scripts/correctness/kernel_match.py is body-shape only
(`Out + In(0) * In(1)`), so A·x and Aᵀ·x produce indistinguishable
@cublasDgemv launches. The downstream lowering can't tell which is
which from the launch signature alone, so it picks no-transpose for
every call — meaning the half that should be transposed computes the
wrong vector. Wall-clock timings are still informative (two cuBLAS
gemv round-trips per kernel, which is what cuBLAS actually does).

Follow-up: extend the matcher to surface transpose info either via a
distinct @cublasDgemv_T symbol or a launch attribute, so the lowering
can pick the right cublasDgemv variant per call site.

Adds: atax_jetson_wrapper.c, bicg_jetson_wrapper.c,
build_polybenchgpu_gemv_jetson.sh (auto-detects callees from matched
MLIR and injects defns for each).
Two changes that together unlock 5 more polybenchGpu kernels:

1. *Gemv transpose discriminator*. The matcher's @cublasDgemv template
   matches both y=A·x and y=Aᵀ·x bodies (both have shape `Out + In(0)*In(1)`).
   The launch operands don't encode which case — the transpose info
   lives in the linalg.generic's indexing maps and was being thrown
   away at the matcher→launch boundary. Result: atax + bicg ran on
   silicon with the wrong cuBLAS op flag for half their gemvs, so the
   numerical output was structurally wrong.

   Fix: rewriter post-match override emits @cublasDgemv_T when A's
   first indexing-map output dim does NOT match the output vector's
   first dim (i.e., the reduction iterator lives in A's first slot).
   Downstream lowering routes _T to cuBLAS with CUBLAS_OP_N instead of
   CUBLAS_OP_T (same shim, opposite flag). Also added the same shim's
   CPU stub for parity.

   Verified: atax MINI + bicg MINI now BIT-EXACT GPU/CPU dump diff.
   atax LARGE + bicg LARGE both PASS as well (no per-byte diff run
   because the LARGE dumps are 8000-vector each).

   Required a parser fix: kernel_match.parse_generics was silently
   missing all indexing_maps because (a) the regex `affine_map<[^>]*>`
   stopped at the `->` inside, (b) `\b` word boundary didn't work next
   to `#` for #mapN substitution. Both fixed.

2. *gesummv + gemver downstream callees*. Added lowering branches +
   runtime shims for the four remaining matched callees:

   - cublasDaxpby      (y = α·x + β·y) — cublasDscal + cublasDaxpy
   - cublasDaxpy_unit  (y += x)        — cublasDaxpy with α=1
   - cublasDgemv_alpha (y += α·A·x)    — reuses the existing dgemv shim
                                          with α from launch, β=1
   - cublasDger_rank2  (A += u₁·v₁ᵀ + u₂·v₂ᵀ) — two cublasDger calls

   gesummv + gemver now build + run on Jetson; cuBLAS calls dispatch
   correctly. Wall-clock timings are real. Numerical outputs show a
   heap-corruption pattern (mostly-correct values interspersed with
   1e+150-range overflow) — residual bufferization-aliasing issue in
   the axpby step's y operand handling, debug pending. JETSON_RUNTIMES
   marks both as DIFF correctness for honesty.

   mvt also picks up the transpose discriminator (its two gemvs are
   opposite-direction). Built fine but segfaults during print_array;
   most likely because the matcher didn't fission its accumulating
   init (no memset_zero_1D before the gemv) so β=0 overwrites x1/x2
   instead of accumulating into them — mismatch with what the harness
   expects to dump. Marked ABORT in JETSON_RUNTIMES.

JETSON_RUNTIMES now carries 7 polybenchGpu kernels: gemm/2mm/3mm/syrk
(PASS or FP-noise), conv2d (FP-noise), atax/bicg (PASS), gesummv/gemver
(DIFF residual), mvt (ABORT). Explorer regenerated.
…rray overrun

Two findings from debugging the gesummv/gemver DIFF correctness:

1. *daxpby host-side*. The CUDA daxpby shim (cublasDscal + cublasDaxpy,
   pre-existing version) was correct, but the H↔D copy + two cuBLAS
   calls dominated any GPU benefit for the O(N) bandwidth-bound op.
   Replaced with a straight host loop. Verified to give identical
   output bits to the CUDA path; the corruption persists with both
   paths, so axpby itself is not the bug.

2. *print_array overrun is aarch64-specific.* Built a CPU-stub variant
   of gesummv for Jetson aarch64 (kernel.o linked against
   polygeist_cublas_rt_cpu.o instead of rt_cuda.o). It reproduces the
   exact same overrun — polybench's print_array reads ~17 extra
   elements past `y[n-1]` into adjacent heap. The same lowered MLIR +
   CPU stub on *x86* is bit-exact, so this is NOT a lowering bug and
   NOT a CUDA shim bug. Most likely an aarch64 calling-convention or
   stack-frame issue with 32-arg flat-memref impl signatures
   (kernel_gesummv_impl has 32 LLVM args after memref expansion).

   The kernel itself writes correct values to polybench's y[0..n-1]
   (verified at wrapper-exit boundary). Only print_array's read-loop
   bound is wrong.

JETSON_RUNTIMES comment updated to record this distinction so future
debug doesn't re-investigate the CUDA path. Next step is to inspect
the LLVM IR's aarch64-specific stack frame for kernel_gesummv_impl —
likely a mismatch between gcc-aarch64's outgoing-args sizing and the
LLVM-generated callee's incoming-args layout.

atax/bicg keep their PASS status (bit-exact at MINI) because their
impl signatures are smaller (28 args, all GP, fit closer to the
8-X-register limit).
Root-cause for the heap-corruption-looking dump diff in gesummv/gemver
on Jetson aarch64: it wasn't heap corruption. gcc at -O3 examined the
local static body of `kernel_<name>` in the same translation unit,
ran intraprocedural-analysis passes (modref, pure-const), and decided
the kernel doesn't clobber w0. So main loaded `w0 = N` once before
init_array and *never reloaded it* before kernel_gesummv or
print_array — banking on the IPA conclusion.

But objcopy --weaken-symbol redirects the call at link time to our
wrapper, and AArch64 ABI says w0 is a scratch register the callee is
free to use. The wrapper does use it. Result: when main calls
print_array, w0 holds whatever the wrapper happened to leave there
(typically ~49 for the gesummv case, since the wrapper's final
fprintf returns the byte count of its formatted string). print_array's
`for (i=0; i<n; i++)` loop then iterates 49 times instead of 32,
reading 17 doubles of adjacent heap = the 1e+150-range "garbage."

Disassembly confirmed:
  before fix:  bl kernel_gesummv ; bl print_array  (no w0 reload)
  after fix:   bl kernel_gesummv ; mov w0, #0x20 ; bl print_array

Fix: change `-Dstatic=` (which left the local body visible to gcc's
IPA) to `-Dstatic=__attribute__((noipa))`. This tags every kernel_*
body as IPA-opaque, forcing gcc to emit ABI-correct caller-saved-reg
reloads at every call site.

Verified MINI:
  atax     md5 GPU == md5 CPU  (BIT-EXACT)
  bicg     md5 GPU == md5 CPU  (BIT-EXACT)
  gesummv  md5 GPU == md5 CPU  (BIT-EXACT)
  mvt      runs cleanly (no more segfault, exit 0). Small numerical
           drift remains because the matcher fissioned mvt's
           accumulating x1/x2 init wrong (kernel overwrites with β=0
           instead of accumulating into them). Separate matcher bug.
  gemver   same story — small drift from dropped initial-value
           contribution. Separate matcher fix needed.

JETSON_RUNTIMES updated: atax/bicg/gesummv now PASS, mvt/gemver still
DIFF but with the residual diagnosis recorded. mvt loses its ABORT
status — kernel runs to completion now.

Memory note for future-me: any time a polybench harness uses
`-Dstatic=` to weaken a `static void kernel_*()` for symbol
substitution, ALSO upgrade it to `-Dstatic=__attribute__((noipa))`
or gcc -O3's IPA will silently bake in invalid assumptions about
caller-saved-reg preservation. The bug manifests as nonsense data
in stack-resident locals (like n, n_iterations, loop bounds) AFTER
the wrapper returns.
On Jetson Orin, the CPU and integrated GPU share the same physical DRAM
(LPDDR5). Our prior runtime did cudaMalloc + cudaMemcpyH2D + cuBLAS +
cudaMemcpyD2H + cudaFree for every shim call — which on Tegra means
copying within the same DRAM to itself before/after the actual compute.

Replaced that pattern with cudaHostRegister(host_ptr, bytes,
cudaHostRegisterMapped) + cudaHostGetDevicePointer + direct cuBLAS call.
Sets up the iGPU's page-table mapping for polybench's existing buffers,
no extra allocations, no data movement.

Tried bypassing cudaHostRegister entirely (just passing host pointers to
cuBLAS, trusting UVA on Tegra) — fails with illegal-memory-access. cuBLAS
needs the buffer registered or device-allocated even when the iGPU can
technically reach it. cudaHostRegister is the right call.

Aliased operands (e.g. syrk's A passed as both A and B) are handled by a
register_host_safe() helper that silently tolerates
cudaErrorHostMemoryAlreadyRegistered. Same for unregister.

Refactored shims:
  - polygeist_cublas_dgemm
  - polygeist_cublas_dgemv (+ dgemv_T)
  - polygeist_cublas_daxpy_unit
  - polygeist_cublas_dger_rank2

Skipped (no net win expected):
  - polygeist_cublas_daxpby — already host-side
  - polygeist_cublas_memset_zero_{1d,2d} — already host-side
  - polygeist_cublas_dscal_2d — already host-side
  - cuDNN conv2d shims — cuDNN setup/algo-select dominates, not H↔D

Re-ran all 12 Jetson kernels with the new runtime:

  Kernel          MINI       LARGE      EXTRALARGE   Δ vs prior
  gemm        29.5 ms   78.8 ms      408 ms     -69% / -47% / -16%
  2mm         30.4 ms   98.8 ms      471 ms     -67% / -41% / -16%
  3mm         30.6 ms  146.0 ms      789 ms     -68% / -33% / -12%
  syrk        29.7 ms  291.6 ms     1960 ms      +4% /  -4% /  -3%
  atax        35.8 ms  265.4 ms        -          0% / -29%
  bicg        36.4 ms  265.8 ms        -          0% / -26%
  gesummv     32.2 ms  263.0 ms        -          0% / -29%
  gemver      34.2 ms  449.9 ms        -          0% / -31%
  mvt         36.0 ms      -           -          0%

Pattern: MINI gemm-family sees ~3× speedup (almost all of the prior
~94 ms was H↔D); LARGE for bandwidth-bound gemv kernels gets ~25-30%
(the cuBLAS work is roughly bandwidth-limited, so eliminating one
DRAM round-trip helps). LARGE/XLARGE for compute-bound gemm sees
smaller relative gains because the cuBLAS dgemm time dominates.

scripts/correctness/polybench_cublas_jetson.sh: also link -lcudnn now
(the shim file includes cuDNN code, so link picks it up unconditionally).

Correctness re-verified bit-exact at MINI for atax/bicg/gesummv/syrk
via md5 of GPU dump against CPU dump.
cudaHostRegister has real cost on Jetson — page-table setup for the
mapped range is proportional to buffer size. For an 8000×8000 double
matrix (128K pages) it's measurable. Gemver does 4 shim calls on the
same A, so we were re-registering A four times per kernel run.

Replaced the per-call register/unregister with a persistent cache:
register on first sight, never unregister. A small flat array (cap=256)
keyed on host pointer caches the device pointer. The OS reclaims the
mappings at program exit.

Effect on LARGE (n=8000):
  gemver:  450 ms → 390 ms   (4 ops on A — biggest win)
  gesummv: 263 ms → 242 ms
  atax:    265 ms → 244 ms
  bicg:    266 ms → 245 ms

gemm/2mm/3mm/syrk barely move (each call has distinct buffers, no
amortization possible). MINI numbers also unchanged — fixed cuBLAS
handle + first-register costs dominate, the cache only helps after.

These gemv-style kernels are bandwidth-bound: each cublasDgemv on
n=8000 streams 512 MB of A → minimum ~3 ms at Jetson Orin LPDDR5 peak
(~204 GB/s). We measure ~120 ms per gemv → sustained ~4 GB/s, about 2%
of peak. The big gap is cuBLAS's row-major-via-OP_T emulation — non-
coalesced access. To go faster we'd need to either (a) transpose A to
column-major once and use OP_N, or (b) fuse the multiple gemvs into a
single kernel that streams A once. Both are matcher/lowering changes,
not runtime.

CPU LARGE numbers (Jetson ARM cores, plain -O3) for reference:
  atax 107 ms, bicg 294 ms, gesummv 293 ms, gemver 575 ms.

So gemver/gesummv beat the CPU at LARGE but only modestly. atax is
slower than CPU at LARGE — its inner loop is so trivially vectorizable
that the ARM cores' wider memory subsystem wins.
Added a "notes" column next to the speedup column in the per-suite
Jetson tables. Each (kernel, dataset) entry gains an optional
"notes" string in JETSON_RUNTIMES; the explorer renders it as a
small-text grey cell at the row tail.

Notes fall into a few buckets:

- "Setup-bound": MINI runs across all kernels. The 28-36 ms floor is
  cuBLAS handle init + first cudaHostRegister page-map for one of the
  larger buffers; the actual kernel work is microseconds.

- "Bandwidth-bound dgemv via OP_T": atax/bicg/gesummv LARGE. cuBLAS
  emulates row-major y=A·x by passing A as col-major-Aᵀ and applying
  OP_T. The OP_T kernel uses strided reads across A's rows, killing
  coalescing. Measured throughput ~2-5% of peak DRAM bandwidth (~204
  GB/s on Jetson Orin LPDDR5). CPU's wider memory subsystem +
  auto-vectorised contiguous-access loops keep pace.

- "Matcher fission bug": mvt / gemver. The matcher didn't fission the
  accumulating init step (kernel.launch overwrites x1/x2/w with β=0
  instead of += into the polybench-initialised values). Numerical
  output is off; wall-clock timing is real.

- conv2d: rerun on the current runtime (the conv2d shims weren't
  touched in the zero-copy refactor but the surrounding runtime got
  cheaper). New numbers: MINI 27 ms / LARGE 140 ms / EXTRALARGE 305 ms.
  3×3 stencil has AI≈1, so it's bandwidth-bound regardless of
  hardware; cuDNN can't reuse the filter across enough output
  elements to amortise descriptor setup.

- syrk: matched as cublasDgemm with B=A pointer alias. cuBLAS doesn't
  recognise the symmetry; runs full M*N*K work. A native cublasDsyrk
  matcher pattern would be ~2× faster (it only updates the lower
  triangle).

No runtime changes. Just metadata + a column.
Adds the darknet (pjreddie/darknet) third-party clone as a fifth
benchmark suite in the IR explorer. The "kernels" are individual .c
files in src/; the bake runs cgeist + raise + match on each.

Approach:
  bake_darknet_mlir.sh iterates over third_party/darknet/src/*.c,
  baking each through:
    cgeist --function='*' --no-inline ...
    polygeist-opt --raise-affine-to-linalg-pipeline --linalg-debufferize
    kernel_match_rewrite.py
  Files use --function='*' because darknet's compute is spread across
  many entry points (gemm_nn/nt/tn/tt all need to lift); --no-inline
  prevents the raise pass from collapsing init-into-kernel boilerplate
  the way it used to on polybenchGpu.

Results (46 .c files, ~25K LOC total):
  cgeist OK:                  28  (61%)
  raise OK:                   23  (50%)
  produced ≥1 linalg.generic: 18  (39%)
  produced ≥1 kernel.launch:   1  ( 2%)

The 1 file that matches: src/gemm.c (6 launches across gemm_nn / nt /
tn / tt / bin). The 17 raise-OK-but-no-match files are an actionable
list of missing matcher templates: pooling (avg/max), batchnorm, LRN,
residual-add, GRU/LSTM gates, transposed conv, locally-connected, dense
+ bias, softmax-with-control-flow, l2norm. The 18 cgeist-fails are
mostly framework code (parser, image, data, network) with no compute.

darknet's actual production hot path is gemm_nn (TA=TB=0). The matcher
hits it as @cublasDaxpy (the inner loop has the scalar-hoisted axpy
shape) but doesn't compose the outer two loops back up into gemm.
gemm_nt and gemm_tt use the conventional sum-accumulator form and do
match as @cublasDgemm_alpha_only. Fixing gemm_nn composition is a
high-value matcher follow-up — it would auto-cover every conv layer
darknet runs at inference time (since every conv goes through gemm_nn
via im2col).

New section in build_ce_viewer.py:
  - DARKNET_ROOT / DARKNET_MLIR_DIR path constants
  - DARKNET_KERNELS dict (45 .c files)
  - DARKNET_NOTES per-file with parallelism tag + characterisation
  - DARKNET_BLOCKERS per-file mapped to existing taxonomy
    (matcher-gap, cgeist-gap, debuf-bug, none)
  - find_kernel_c dispatch for kset="darknet"
  - build_index gains darknet_stats parameter
  - new section + nav link to "#darknet"

The third_party/darknet/ clone itself is NOT committed (it's a vendored
upstream, would bloat the repo to ~25K LOC for the framework + cfgs).
The bake script's PATH is hardcoded so a fresh clone reproduces the
results.
…end on Jetson Orin

Polybench-style C kernels in third_party/cnn-extracted/, each lifted through the
full Polygeist pipeline (cgeist → raise → debufferize → matcher → ABI lowering →
LLVM IR → aarch64 cross-compile → Jetson silicon).

Five extracted-darknet baseline kernels (matcher templates + lowering branches +
cuDNN/cuBLAS shims + harness + per-kernel HTML page in the IR explorer):

  conv2d_batched              → cudnnConvolutionFwd_batched           23.8x LARGE
  maxpool_batched             → cudnnMaxPoolFwd_batched                1.29x LARGE
  batchnorm_batched           → cudnnBatchNormalizationForwardInference  0.38x LARGE
  shortcut_batched            → cudnnAddTensor_batched                   0.08x LARGE
  conv_bn_relu_batched        → cudnnConvolutionBiasActivationForward
                                 (with host-side BN folding)            23.5x LARGE

Four fusion-optimization kernels (algebraic rewrites + faster cuBLAS/cublasLt/
cuDNN entry points):

  conv_bias_relu_add_batched  → cudnnConvolutionBiasActivationForward
                                 (α2*Z addend for ResNet skip)          23x LARGE
  gemm_bias_relu              → cublasLtMatmul EPILOGUE_RELU_BIAS      901x LARGE
  ata_gemm                    → cublasSsyrk (operand-alias discriminator
                                 detects AᵀA pattern; half the flops)  3393x LARGE
  conv1x1_batched             → cublasSgemmStridedBatched (4-par+1-red
                                 shape distinguishes K=1 from K×K)     105x LARGE

Cross-cutting infrastructure additions:

* Matcher: ~9 new CompositionEntry templates + AᵀA→syrk post-unify operand-alias
  discriminator in kernel_match_rewrite.py. Per-step span replacement preserves
  intervening polygeist.submap ops between matched generics.
* Lowering pass: resolveSubmapBase now chains through both polygeist.submap and
  polygeist.submapInverse (up to 16 hops). New pre-pass elides redundant
  memset_zero_{1D,2D} launches preceding any β=0 op (syrk). Dtype-suffixed
  memset dispatch (f32 alongside f64).
* Runtime: cublasLt linkage (libcublasLt.so.12); ensure_cublaslt() helper.
  Host-side BN-folding for fused conv+bn+relu (precompute scaled filter + bias).
  All cuDNN algo-selection loops use array-sized cudnnConvolutionFwdAlgoPerf_t
  buffers (avoiding the stack-smash that bit single-struct attempts).
* Build: scripts/correctness/extracted_darknet_jetson.sh handles all 9 kernels;
  bake_extracted_darknet_mlir.sh produces per-stage MLIR snapshots for the
  IR explorer; -lcublasLt added to link line.
* IR explorer: two new sections (extracted darknet, Fusion optimization) with
  Compiler Explorer deep-links + per-kernel raised/debuf/matched IR preview
  pages.

All four fusion optimizations are 100% bit-exact (or FP-noise within 1e-4 print
precision); LARGE speedups range 23x→3393x over the CPU 3-loop reference on the
Jetson Orin (Tegra Ampere, FP32, cuDNN 9.x, CUDA 12.6).
…nv2d + 4 image filters on Jetson Orin

  * New LowerKernelLaunchToPVA pass — owns the matcher's i8/i16
    @cudnnConvolution2D_9tap_* launches plus new
    @pvaBoxFilter_3x3_i{8,16}, @pvaGaussianFilter_3x3_i{8,16},
    @pvaBilateralFilter_3x3_i{8,16}, @pvaHistogramEqualization_i8
    symbols. Each routes to a polygeist_pva_* runtime shim. Disjoint
    symbol set from --lower-kernel-launch-to-cublas; the two passes
    run side by side; either order works.

  * Shared 9-tap conv lowering helper extracted out of
    LowerKernelLaunchToCuBLAS.cpp into KernelLaunchLoweringUtils.{h,cpp}
    so both backend passes call the same body. Added a parallel
    lowerImageFilter2Operand helper for the 2-memref filter launch
    shape (Box/Gaussian/Bilateral/HistogramEq).

  * cuBLAS pass: dropped i8/i16 from shimSymbolFor + the dispatch
    switch; PVA-claimed launches fall through with a `continue`
    instead of erroring out. Net diff is small in the cuBLAS pass
    file (the 3 helpers moved out are the bulk of the delta).

  * New PVA runtime shim runtime/polygeist_pva_rt.c with:
      - cudaSetDevice + nvcvAllocatorConstructPva + non-blocking
        stream init (idempotent, lazy, persistent for process lifetime)
      - make_pva_image_tensor_dtype: HWC tensor alloc through the PVA
        allocator with arbitrary NVCV dtype (needed because half the
        PVA ops are U8-only; we reinterpret i8 bytes as U8)
      - CupvaMemGetHostPointer-mediated host I/O (raw cudaMemcpy
        segfaults on cuPVA-allocated pages; the host-pointer mapping
        is mandatory)
      - One pva<Op>Create / pva<Op>Submit wrapper per op
      - (M-2)×(N-2) interior copy from PVA output back to caller B
        to honour the matcher's &B[1][1] pointer-shift convention
        (writing the full M×N overflows B by N+1 bytes)

  * Matching CPU reference stubs in polygeist_cublas_rt_cpu.c modelled
    to mirror PVA hardware semantics: centred kernel anchor, REPLICATE
    border, Q-format >>qbits shift, unsigned-kernel reinterpretation
    for Conv2d; rounded-mean (sum + 4) / 9 for BoxFilter; canonical
    [1,2,1;2,4,2;1,2,1] / 16 for Gaussian; textbook 256-bin CDF-LUT
    for HistogramEq. Bilateral has a pass-through stub (the
    non-linear hardware semantics aren't worth mirroring bit-exactly).

  * third_party/polybenchGpu-extracted/conv2d_i8.c — i8 variant of
    the 9-tap stencil (i16 already existed). Matcher fires on it via
    the existing dtype-suffix template + emits
    @cudnnConvolution2D_9tap_i8, which the new PVA pass claims.

  * Cross-compile script conv2d_cudnn_jetson_dtype.sh: i8 dtype
    branch added; PVA-library link line (-lpva_operator -lcvcuda
    -lnvcv_types -lcupva_host) plus direct DT_NEEDEDs for
    -lnvscibuf -lnvscisync via -Wl,--no-as-needed (deferred
    resolution segfaults during libcupva_host init constructors);
    step (5) now invokes both --lower-kernel-launch-to-cublas
    and --lower-kernel-launch-to-pva.

  * Four hand-authored kernel.launch test scaffolds in
    scripts/correctness/pva_{boxfilter,gaussian,bilateral,histeq}_jetson.sh.
    Matcher templates for these C-level patterns aren't written yet,
    so each script synthesises the kernel.launch MLIR directly and
    runs the rest of the pipeline normally — same harness, wrapper,
    ABI lowering, and link line.

  * IR explorer (scripts/correctness/build_ce_viewer.py): new "PVA
    backend" section at the bottom. Shows the 6 PVA-routed kernels
    with their op name, libpva_operator entry points, shim symbol,
    and Jetson PVA wall-clock at each size we benchmarked. No CPU
    comparison in this view (CPU stubs exist for separate per-op
    bit-exact validation).

  * CLAUDE.md: "point, don't copy" rule for gated-distribution NVIDIA
    SDKs. PVA Solutions / cuPVA SDK headers consumed via -I at build
    time; never copied into the Polygeist tree.

End-to-end silicon validation on Jetson Orin: bit-exact PVA-vs-CPU
diff for Conv2d i8/i16, BoxFilter, Gaussian, and HistogramEq at 256².
Bilateral runs cleanly; visual spot-check only (non-linear).
Conv2d at 10240×10240: PVA 216 ms vs CPU 499 ms (2.3× speedup for i8).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants