Cut compile time: de-inline setup helpers + add a small precompile workload#45

Open
lkdvos wants to merge 7 commits into main from ld-compile-time
lkdvos commented Jun 17, 2026

This PR removes some of the forced @inline annotations on sreshape and permutedims. Combined with the @inline annotations on the recursive _compute* helpers for strides and sizes, they meant that these methods had to be compiled into many of the TensorOperations kernels, once for each combination of T, N, ....
Removing the annotations lets the compiler decide when to inline, which seems to have quite a large impact on the actual compilation time of TensorOperations calls.
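
To make the change concrete, here is a minimal sketch of the pattern, assuming a recursive tuple helper similar in spirit to the `_compute*` functions. The names `strides_forced` and `strides_auto` are hypothetical and are not the StridedViews source; the two variants differ only in the annotation.

```julia
# Hypothetical stand-ins for recursive size/stride helpers; not the StridedViews code.

# Variant 1: inlining is forced, so the recursion is requested at every call site.
@inline strides_forced(::Tuple{}, s::Int) = ()
@inline function strides_forced(sz::Tuple, s::Int)
    return (s, strides_forced(Base.tail(sz), s * sz[1])...)
end

# Variant 2: no annotation, so the compiler's own heuristics decide whether to inline.
strides_auto(::Tuple{}, s::Int) = ()
function strides_auto(sz::Tuple, s::Int)
    return (s, strides_auto(Base.tail(sz), s * sz[1])...)
end

# Both variants compute the same column-major strides; only compilation behaviour differs.
@assert strides_forced((2, 3, 4), 1) == strides_auto((2, 3, 4), 1) == (1, 2, 6)
```

Since the results are identical, the choice between the two is purely a compile-time versus call-overhead trade-off, which is why it only matters for setup code that runs once per operation.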

On top of that, since these functions are no longer inlined, it now makes sense to add a precompilation workload as well, in an attempt to reduce both the TTFX and the precompile-time burden in TensorOperations.

From what I can measure, this reduces the TTFX by about 25% on a TensorOperations workload in which I exhaustively perform all binary contractions with N1, N2, N3 <= 3 (open, contracted, open legs).
On my machine, precompilation time is on the order of 2 seconds, so this hurts very little.
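
For readers who want to reproduce this kind of number, a rough sketch of a first-call timing is given below. The function `work` is a hypothetical placeholder for the contraction workload, not the benchmark used here, and a faithful TTFX measurement should be run in a fresh Julia process so that nothing is already compiled.

```julia
# Hypothetical placeholder workload; substitute the calls you actually care about.
work(x) = sum(abs2, x) / length(x)

x = rand(ComplexF64, 4, 4)

# The first call pays for inference and code generation of `work` for this argument type.
t_first = @elapsed work(x)

# The second call reuses the compiled code and reflects steady-state runtime.
t_second = @elapsed work(x)

println("first call:  ", round(t_first * 1e3; digits=3), " ms")
println("second call: ", round(t_second * 1e3; digits=3), " ms")
```

The gap between the two timings approximates the compilation latency that de-inlining and precompilation are meant to shrink.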

The runtime cost at the TensorOperations level is negligible, so I'd say to just merge and release this.


Precompile-time comparison: TensorOperations suite, main vs this PR

Cold-precompiling the TensorOperations precompile workload (enabled, fixed grid
precompile_contract_ndims=[3,2], precompile_add_ndims=3, precompile_trace_ndims=[3,2],
eltypes [Float64, ComplexF64]), back-to-back on Julia 1.12.6:

| precompile of… | StridedViews main | StridedViews (this PR) | Δ |
|---|---|---|---|
| TensorOperations (the workload suite) | 90.9 s | 68.0 s | −22.9 s (−25%) |
| Strided | 1.72 s | 1.90 s | +0.2 s (noise) |
| StridedViews itself | 0.59 s | 3.59 s | +3.0 s (this PR's @compile_workload) |
| whole environment (cold) | 93.3 s | 75.4 s | −17.9 s (−19%) |

The de-inlining stops TensorOperations' contraction specializations from re-inferring the
StridedView stride/permute helpers, so its precompile suite is ~25% cheaper. The added cost
lives in StridedViews' own precompile (+3.0 s, one-time per build, shared by all downstream),
for a net ~19% faster cold precompile of the whole environment.

lkdvos and others added 6 commits June 17, 2026 11:11
`permutedims`, `sreshape`, the SliceIndex `getindex`/`sview` view constructors,
and the `_computeviewsize`/`_computeviewstrides`/`_computeviewoffset` helpers are
all "once-per-operation" setup steps, not hot inner-loop code. Forcing `@inline`
on them duplicated their per-N size/stride/offset/permute computation into every
downstream specialization and re-inferred it per shape, bloating compile times.

Dropping `@inline` lets each compile once per signature and dedup across callers.
The hot indexing path is deliberately left inlined: scalar `getindex`/`setindex!`
and `_computeind` keep `@inline`, as does the trivial `_normalizeparent` accessor.

Measured (Julia 1.12.6):
- Downstream TensorOperations dynamic-ncon grid: TTFX 42.1s -> 31.6s (-25%) from
  de-inlining permutedims/sreshape, with no runtime regression (StridedBLAS vs
  BaseCopy results agree to 3e-16).
- StridedViews-local A/B vs origin/main: view construction 4.35ns -> 4.35ns and
  the scalar getindex hot loop 20.22us -> 20.24us, i.e. steady-state runtime
  unchanged for the additionally de-inlined view helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Warm the core `StridedView` specializations for the BLAS element types
(`Float32`, `Float64`, `ComplexF32`, `ComplexF64`) over ndims 1:4 plus the 2D
transpose/adjoint cases: construction, `permutedims`, `sreshape`, `sview`/slice
`getindex`, `conj`, `transpose`/`adjoint`, and `size`/`strides`/`offset`. These
are exactly the specializations downstream packages hit on their first call, so
caching them removes that first-call latency.

The workload is intentionally kept small (BLAS floats, ndims 1:4, identity/conj
plus the 2D wrappers) to keep StridedViews' own precompile bounded.
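
A workload of this shape could look roughly like the sketch below. It assumes PrecompileTools and StridedViews are installed, uses only `@setup_workload`/`@compile_workload` and the public `StridedView`, `sreshape` and `sview` entry points, and wraps everything in a hypothetical `WorkloadSketch` module with a hypothetical `warm` helper; it is not the `src/precompile.jl` added in this PR.

```julia
module WorkloadSketch  # hypothetical module, standing in for a package's top-level module

using PrecompileTools: @setup_workload, @compile_workload
using StridedViews: StridedView, sreshape, sview

# Exercise the once-per-operation setup calls for one element type T and rank N.
function warm(::Type{T}, ::Val{N}) where {T,N}
    A = zeros(T, ntuple(_ -> 2, N))               # small contiguous array, size 2 in each dim
    a = StridedView(A)                            # construction
    permutedims(a, reverse(ntuple(identity, N)))  # reverse the dimension order
    sreshape(a, (length(a),))                     # flatten the contiguous view
    sview(A, ntuple(_ -> Colon(), N)...)          # full-slice view
    conj(a)
    size(a), strides(a)
    return nothing
end

@setup_workload begin
    @compile_workload begin
        # BLAS element types over ndims 1:4, as described in the commit message.
        for T in (Float32, Float64, ComplexF32, ComplexF64), N in 1:4
            warm(T, Val(N))
        end
        # The 2D transpose/adjoint wrappers.
        for T in (Float32, Float64, ComplexF32, ComplexF64)
            m = StridedView(zeros(T, 2, 2))
            transpose(m)
            adjoint(m)
        end
    end
end

end # module
```

Keeping the element-type and rank loops this short is what bounds the package's own precompile time; each extra type or rank multiplies the number of cached specializations.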

Measured (Julia 1.12.6, cold compiled-cache depot):
- StridedViews cold precompile: ~0.53s -> ~2.29s (Pkg build line), i.e. ~+1.76s
  one-time, bounded.
- First-call latency of the exercised core ops in a fresh process: ~1.78s ->
  ~0.027s (~66x), the inference cost being moved into the cached precompile.

Bumps version to 0.5.2 and adds PrecompileTools to [deps]/[compat].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lkdvos lkdvos force-pushed the ld-compile-time branch from 34adf20 to 4c04cf9 Compare June 17, 2026 15:47
Comment thread src/stridedview.jl
     return getindex(StridedView(a), I...)
 end
-@inline function sview(a::AbstractArray, I::SliceIndex)
+function sview(a::AbstractArray, I::SliceIndex)

I indeed have no idea why any of the above @inline annotations were here; they didn't make any sense.

Jutho commented Jun 17, 2026

I'll approve after you've finished fighting JET

codecov Bot commented Jun 17, 2026

Codecov Report

❌ Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/stridedview.jl | 60.00% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/StridedViews.jl | 100.00% <ø> (ø) |
| src/precompile.jl | 100.00% <100.00%> (ø) |
| src/stridedview.jl | 28.20% <60.00%> (+14.74%) ⬆️ |

... and 1 file with indirect coverage changes
