Speed up runtime by reducing allocation churn in tile generation by Symmetricity · Pull Request #902 · systemed/tilemaker

Symmetricity · 2026-05-27T15:25:26Z

This PR is AI generated.

Summary

In a warmed Austria OpenMapTiles tile generation run, this PR reduced wall time
by about 6.4% and user CPU by about 9-10%. In heaptrack runs on a smaller
fixture, allocation calls fell by about 47% and temporary allocations by about
74%.

This PR reduces allocation churn and unnecessary geometry work in several hot
paths used while generating tiles.

The changes share one goal: keep existing output behavior while avoiding
repeated temporary allocations and unnecessary geometry copies, and skipping
cheap-to-detect no-op work before it reaches heavier geometry routines.

The largest measured effect is in tile generation with the OpenMapTiles
profile: fewer geometry/vector reallocations, fewer decoded sorted-node chunks,
and less work in clipping/indexing paths.

Implementation

Node and way lookup hot paths

Cache a small number of recent sorted-node chunks per worker instead of only
one chunk.
Avoid copying geometry while indexing objects.
Fill way geometry buffers directly instead of building temporary output and
copying it afterward.
Reserve OSMStore and way geometry buffers before filling them.

These changes target repeated lookups and geometry population during PBF-backed
tile generation.

Geometry construction and scaling

Move freshly built or corrected rings into their destination containers where
the source is no longer needed.
Pre-size populated multipolygons and closed-way polygon rings.
Reserve split-linestring and Visvalingam output storage.
Reuse projection/scaling output buffers for polygon area and scaled geometry
calculations.

These changes reduce short-lived std::vector allocation/reallocation in
geometry preparation paths.

Clipping and intersection paths

Reuse the scratch ring used by the fast polygon clipper.
Skip fast clipping for bounded linestrings that are already inside the clip
box.
Use line segments directly for line tile-intersection checks.
Reuse dissolve intersection output storage.
Move uncached multipolygons into the clipping path instead of copying them
when the current function owns the geometry.
Skip Sutherland-Hodgman edge passes when no point needs clipping against that
edge.

These changes keep the same clipping model, but reduce repeated temporary
storage and avoid unnecessary pass-through clipping work.

Performance

I compared current upstream master with this performance stack applied directly
on top of upstream master, without other unmerged PRs.

Branches/binaries:

upstream: b437d7c
performance stack: ba8db48

Runtime fixture:

fixture: Austria Geofabrik extract
profile: OpenMapTiles profile
output: PMTiles
store: no --store
threads: 8
warmup: PBF plus coastline/landcover sidecar files

Three alternating warmed runs, upstream first:

wall time:   -2.19s  (-6.42%)
user CPU:    -19.34s (-9.28%)
system CPU:  +2.09s  (+6.00%)
RSS:         -47.6 MiB (-1.07%)

Three alternating warmed runs, performance stack first:

wall time:   -2.17s  (-6.38%)
user CPU:    -20.05s (-9.75%)
system CPU:  +2.21s  (+5.93%)
RSS:         +47.4 MiB (+1.08%)

The wall-time and user-CPU improvement reproduced in both orders. RSS moved in
opposite directions depending on run order, so I would treat native RSS as
neutral rather than claiming a memory saving.

Perf counters from the warmed forward run:

instructions:      -10.63%
cycles:            -7.08%
task clock:        -7.38%
branches:          -10.16%
branch misses:     -3.62%
cache references:  -3.12%
cache misses:      -3.52%
L1 loads:          -8.66%
L1 load misses:    -5.02%
dTLB loads:        -3.60%
dTLB load misses:  -4.08%

Allocation profile:

fixture: Liechtenstein Geofabrik extract
other settings: same profile, output, and thread count as the runtime fixture

Heaptrack/Massif over three pairs:

allocation calls:       -3,492,919 (-46.72%)
temporary allocations:    -650,325 (-73.87%)
heaptrack runtime:          -2.49s (-36.23%)
heaptrack peak heap: unchanged at 2 GiB
Massif peak total:       +21.31 MB (+1.09%)
Massif stack peak:       +48 bytes (+0.05%)
leaked bytes: unchanged

Raw allocation counts:

metric                  upstream     performance stack   delta
allocation calls        7,475,754    3,982,835           -3,492,919
temporary allocations   880,362.7    230,037.3           -650,325.3

Possible Regressions

The changes are intended to preserve generated tile behavior. They mostly alter
temporary storage ownership, reserve sizes, and fast-path checks.

Potential risks:

The sorted-node cache keeps a few decoded chunks per worker instead of one,
which may retain a small amount of extra per-thread cache memory.
Some geometry buffers now reserve or reuse capacity more deliberately. This
reduces allocation churn, but retained capacity may slightly change heap
shape.
The clipping fast paths depend on existing bounds checks. Incorrect bounds
handling would be a correctness bug, so those changes are kept narrow and
only skip work when the existing geometry is already inside the tested box or
no points need clipping for a given edge.

Testing

Code checks:

git diff --check

Build:

cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build --parallel 8

CTest:

ctest --test-dir build --output-on-failure

Result:

No tests were found

SortedNodeStore currently keeps only the most recently decoded compressed chunk per worker. Austria profiling showed about 277M node lookups and 90M compressed chunk decodes, with a one-entry cache hit rate around 67%. Keep four recently decoded chunks in fixed per-thread arrays. The cache stays bounded and allocation-free in the lookup path while raising the measured Austria hit rate to about 85%, cutting decoded chunks to about 40M. Validation: test_sorted_node_store passes; Liechtenstein MBTiles match semantically against the PR-stack baseline; Austria timing improved by about 2-3% wall time and about 4% user CPU with flat heap/RSS in heaptrack and Massif checks. Co-authored-by: Codex <noreply@openai.com>
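
For reviewers who want the shape of this change without reading the diff, here is a minimal sketch of a four-entry, allocation-free chunk cache. It is not the code in this PR: `DecodedChunk`, `ChunkCache`, `workerCache`, the 256-value payload, and the round-robin replacement policy are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a decoded chunk; the real layout differs.
struct DecodedChunk {
    std::uint64_t id = 0;
    bool valid = false;
    std::array<std::int32_t, 256> values{};  // payload written by the decoder
};

// Fixed-size cache with round-robin replacement (an assumed policy).
// All storage lives inside the object, so lookups never touch the heap.
class ChunkCache {
public:
    static constexpr std::size_t kEntries = 4;

    // Returns the cached chunk for `id`, calling `decode(id, slot)` on a miss.
    template <typename DecodeFn>
    const DecodedChunk& get(std::uint64_t id, DecodeFn&& decode) {
        for (const DecodedChunk& e : entries_) {
            if (e.valid && e.id == id) {
                ++hits_;
                return e;
            }
        }
        ++misses_;
        DecodedChunk& slot = entries_[next_];
        next_ = (next_ + 1) % kEntries;  // overwrite the oldest-filled slot next time
        slot.id = id;
        slot.valid = true;
        decode(id, slot);
        return slot;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::array<DecodedChunk, kEntries> entries_{};
    std::size_t next_ = 0;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};

// One cache per worker thread, so no locking is needed.
inline ChunkCache& workerCache() {
    thread_local ChunkCache cache;
    return cache;
}
```

With a lookup pattern that alternates between two chunks, a one-entry cache would decode on every call, while this sketch decodes each chunk once.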

fast_clip constructs a result ring for each clip edge and for each polygon ring. Heaptrack showed this path producing hundreds of thousands of allocations on the Liechtenstein fixture after the sorted-node cache candidate reduced node decode cost. Keep one scratch ring for the Sutherland-Hodgman edge passes and reuse it across a polygon's outer and inner rings. This preserves the public fast_clip wrapper and generated tile semantics while removing repeated vector growth in the clipping path. Liechtenstein output matched semantically against the node-cache baseline. On Austria with --threads 8, wall time was effectively flat to slightly lower, system time decreased, and heaptrack allocation calls on Liechtenstein fell by about 758k. Massif peak total dropped by about 64 MB on that small fixture. Co-authored-by: Codex <noreply@openai.com>
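
The scratch-ring idea can be sketched roughly as follows. This is an illustration only, not the fast_clip code: it handles a single clip edge (x >= minX), treats rings as open point lists, and the names `Pt`, `Ring`, `clipLeft`, and `clipPolygonLeft` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };
using Ring = std::vector<Pt>;  // open ring: last point is not a repeat of the first

// One Sutherland-Hodgman pass against the half-plane x >= minX.
// `out` is cleared first, which keeps whatever capacity it already has.
inline void clipLeft(const Ring& in, double minX, Ring& out) {
    out.clear();
    const std::size_t n = in.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Pt& a = in[i];
        const Pt& b = in[(i + 1) % n];
        const bool aIn = a.x >= minX;
        const bool bIn = b.x >= minX;
        if (aIn) out.push_back(a);
        if (aIn != bIn) {  // the edge a->b crosses the clip line
            const double t = (minX - a.x) / (b.x - a.x);
            out.push_back({minX, a.y + t * (b.y - a.y)});
        }
    }
}

// Clips every ring of a polygon in place, sharing one scratch ring.
// After each swap the ring's old storage becomes the scratch for the next ring.
inline void clipPolygonLeft(std::vector<Ring>& rings, double minX, Ring& scratch) {
    for (Ring& ring : rings) {
        clipLeft(ring, minX, scratch);
        ring.swap(scratch);
    }
}
```

Because `swap` only exchanges buffers, the per-ring result vector is never constructed from scratch; allocation happens only when a ring needs more capacity than any buffer seen so far.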

Build scaled polygon rings directly in TileBbox instead of copying through intermediate vector and Ring objects. This keeps the existing scale/backtrack behavior but avoids polygon-by-value iteration and moves freshly scaled rings into the destination geometry. The Liechtenstein fixture is semantically unchanged, while heaptrack on the same fixture reports about 488k fewer allocation calls. Austria timing is neutral to slightly faster with no stable RSS regression. Co-authored-by: Codex <noreply@openai.com>

Reserve the custom simplification heap and output containers before appending points, rings, and polygons. Also read triangle-area points by const reference instead of copying them. The Liechtenstein fixture is semantically unchanged. On the Austria fixture this is timing-neutral to slightly faster, and heaptrack on Liechtenstein reports about 65k fewer allocation calls with no stable heap or RSS regression. Co-authored-by: Codex <noreply@openai.com>

Iterate multiline and multipolygon members by const reference when calculating tile coverage for addGeometryToIndex. The loops only read geometry before calling insertIntermediateTiles, so copying each Linestring or Polygon is unnecessary. The Liechtenstein fixture is semantically unchanged. On the Austria fixture the combined forward and reverse timing is slightly faster, with lower system time and no stable RSS regression. Co-authored-by: Codex <noreply@openai.com>

Avoid constructing a temporary two-point Linestring for each segment clipping check. Both line geometry build paths only need a single segment-vs-box intersection test, so use boost::geometry::model::segment<Point> directly. The Liechtenstein fixture is semantically unchanged. Heaptrack on Liechtenstein reports about 500k fewer allocation calls and about 500k fewer temporary allocations; Austria timing is effectively neutral with no stable RSS regression. Co-authored-by: Codex <noreply@openai.com>

llListPolygon and llListLinestring know the way-node range length before fillPoints appends converted coordinates. Reserving the local output containers avoids repeated vector growth without changing conversion, integrity handling, or geometry correction. Liechtenstein semantic comparison against the submitted-stack baseline reported changed_tiles 0. Austria timing/RSS was effectively neutral after forward and reverse alternating runs; heaptrack on Liechtenstein showed about 205k fewer allocation calls, with Massif peak flat within measurement noise. Co-authored-by: Codex <noreply@openai.com>
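
A small sketch of why reserving from a known range length matters. The helper below is hypothetical (it is not the fillPoints in tilemaker) and takes a `reserveFirst` flag purely so the two behaviours can be compared; it returns how many times the vector's capacity changed while appending.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

struct Pt { double x, y; };

// Appends one converted point per element of [first, last) and counts how
// often `out` had to grow. The "conversion" here is a placeholder.
template <typename It>
std::size_t fillPoints(It first, It last, bool reserveFirst, std::vector<Pt>& out) {
    if (reserveFirst) {
        out.reserve(out.size() + static_cast<std::size_t>(std::distance(first, last)));
    }
    std::size_t growths = 0;
    std::size_t cap = out.capacity();
    for (; first != last; ++first) {
        out.push_back({static_cast<double>(*first), 0.0});
        if (out.capacity() != cap) {  // a reallocation happened
            ++growths;
            cap = out.capacity();
        }
    }
    return growths;
}
```

With the reservation in place the loop performs no reallocations; without it, the vector regrows several times on the way to its final size, and each regrowth is an allocation plus a copy of the points appended so far.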

SortedWayStore::at decodes the full way-node list before building the returned LatpLon vector, and OsmMemTiles::populateLinestring receives that full vector before appending points. Reserve both local output containers to avoid repeated vector growth without changing decoding, node lookup, conversion, or cache behavior. Liechtenstein semantic comparison against the previous accepted stack reported changed_tiles 0. Austria forward and reverse timing was consistently faster; heaptrack on Liechtenstein showed about 234k fewer allocation calls with temporary allocations and Massif peak effectively flat. Co-authored-by: Codex <noreply@openai.com>

TileDataSource::populateMultiPolygon copies a complete mmap-backed multipolygon into a normal MultiPolygon. Resize the destination polygons and rings before assigning points, matching the explicit copy shape already used by storeMultiPolygon and avoiding repeated generic assignment growth. Liechtenstein semantic comparison against the previous accepted stack reported changed_tiles 0. Austria timing was effectively neutral; heaptrack on Liechtenstein showed about 116k fewer allocation calls with temporary allocations flat and no stable RSS or Massif peak regression. Co-authored-by: Codex <noreply@openai.com>

Closed-way polygon construction reads the cached linestring without mutating it, so avoid copying that linestring in the Lua Layer path. Reserve polygon outer-ring storage where the source size is known and move temporary polygons into their destination containers. Liechtenstein semantic comparison against the previous accepted stack reported changed_tiles 0. Austria forward and reverse timing/RSS were slightly positive, heaptrack on Liechtenstein showed about 256k fewer allocation calls, and perf showed lower instructions, cycles, and cache misses. Co-authored-by: Codex <noreply@openai.com>

Split way geometry builds a temporary linestring by appending source points before moving completed pieces into the output geometry. Reserve the temporary linestring storage from the known source size, and re-reserve the remaining possible size after moving out a completed split. Liechtenstein semantic comparison against the previous accepted stack reported changed_tiles 0. Austria timing/RSS was neutral and order-dependent, while heaptrack on Liechtenstein showed about 118k fewer allocation calls with no stable heap regression. Co-authored-by: Codex <noreply@openai.com>
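
The reserve and re-reserve pattern described here might look like the sketch below. It is an assumption-laden illustration: `splitLine` and its `keep` predicate are hypothetical, and the real splitting rule (bbox overlap) is replaced by an arbitrary per-point test.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Pt { double x, y; };
using Linestring = std::vector<Pt>;

// Splits `src` into runs of consecutive points for which keep(p) is true.
template <typename Keep>
std::vector<Linestring> splitLine(const Linestring& src, Keep&& keep) {
    std::vector<Linestring> out;
    Linestring current;
    current.reserve(src.size());  // upper bound: every point is kept
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (keep(src[i])) {
            current.push_back(src[i]);
            continue;
        }
        if (!current.empty()) {
            out.push_back(std::move(current));
            // A moved-from vector is valid but unspecified; clear() makes it
            // empty again, then reserve at most what can still arrive.
            current.clear();
            current.reserve(src.size() - i - 1);
        }
    }
    if (!current.empty()) out.push_back(std::move(current));
    return out;
}
```

One trade-off worth noting: each moved-out piece keeps the capacity that was reserved for it, so this favours fewer allocations over the tightest possible memory use.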

Low zoom tile collection only reads the materialized low zoom object vectors, but the helper accepted those vectors by value. That copied the full vector-of-vectors for each low zoom tile collection, working against the memory and thrash-reduction goal of the low zoom path added for large extracts. Pass the low zoom object lists by const reference instead. This keeps behavior unchanged while avoiding the unnecessary copy. Liechtenstein semantic comparison against the previous accepted housekeeping stack reported changed_tiles 0. Austria timing/RSS profiling was wall-time neutral and showed a stable ~80-85 MB RSS reduction across both run orders; perf counters showed no meaningful regression. Co-authored-by: Codex <noreply@openai.com>

Area() reprojected every polygon into a fresh DegPoint polygon before asking Boost for spherical area. Reuse the per-processing projected polygon storage and fill it directly instead, preserving the existing lon/latp to lon/lat conversion while avoiding repeated ring allocations. The candidate matched the accepted stack semantically on Liechtenstein. On warmed Austria runs it was runtime-neutral and reduced peak RSS by roughly 35-90 MiB, while Liechtenstein heaptrack showed about 110k fewer allocation calls. Co-authored-by: Codex <noreply@openai.com>

Polygon output scaled each multipolygon into fresh geometry storage before simplify/correct/write. Add destination-taking scale helpers and reuse a thread-local scaled multipolygon buffer in the writer so ring storage can be retained across objects on the same worker. The candidate matched the accepted stack semantically on Liechtenstein. On warmed Austria runs it was runtime-neutral with no stable RSS regression, and Liechtenstein heaptrack showed about 65k fewer allocation calls. The 64 MB Massif swing matched the mmap allocator chunk size and flipped with run order. Co-authored-by: Codex <noreply@openai.com>

Avoid materializing temporary vectors in the hot way geometry path. OsmMemTiles now fills a reusable thread-local way-node buffer, while WayStore implementations expose fill-into-buffer overloads that preserve the existing return-by-value API for other callers. SortedWayStore also reuses a per-thread decoded NodeID buffer when expanding encoded ways. This removes repeated short-lived vector allocations without changing the generated tile semantics checked by the Liechtenstein fixture. Co-authored-by: Codex <noreply@openai.com>
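
A rough sketch of the fill-into-buffer overload pattern, under the assumption that the store can simply copy from an in-memory list. `WayStoreSketch` and `totalPoints` are hypothetical names; the real WayStore implementations decode and look up nodes rather than copying.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct LatpLon { std::int32_t latp; std::int32_t lon; };

class WayStoreSketch {
public:
    void add(std::vector<LatpLon> way) { ways_.push_back(std::move(way)); }

    // New overload: writes into a caller-owned buffer and reuses its capacity.
    void at(std::size_t id, std::vector<LatpLon>& out) const {
        const std::vector<LatpLon>& src = ways_.at(id);
        out.clear();
        out.reserve(src.size());
        out.insert(out.end(), src.begin(), src.end());
    }

    // Existing return-by-value API, kept for other callers.
    std::vector<LatpLon> at(std::size_t id) const {
        std::vector<LatpLon> out;
        at(id, out);
        return out;
    }

private:
    std::vector<std::vector<LatpLon>> ways_;
};

// Hot-path caller: one buffer per thread, reused for every way.
inline std::size_t totalPoints(const WayStoreSketch& store, std::size_t wayCount) {
    thread_local std::vector<LatpLon> buffer;
    std::size_t total = 0;
    for (std::size_t i = 0; i < wayCount; ++i) {
        store.at(i, buffer);
        total += buffer.size();
    }
    return total;
}
```

Implementing the by-value overload on top of the buffer overload keeps one code path for the actual work, which is one way to preserve the old API without duplicating logic.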

The line geometry path splits ways into bbox-overlapping sections and then always runs the result through Boost intersection against the extended tile box. For sections whose retained points are already inside that extended box, the intersection is an identity operation but still allocates and walks geometry. Track whether any retained point falls outside the extended box while building the split linestring output. Return the split result directly when no clipping is needed, and keep the existing Boost intersection path for sections that still extend outside the box. Liechtenstein semantic output matched the previous stack, and profiling showed about 241k fewer heap allocation calls on the heaptrack fixture. Austria runtime was neutral/noisy, so this should be treated as allocation cleanup rather than a wall-time improvement. Co-authored-by: Codex <noreply@openai.com>
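
The fast path relies on the clip box being convex: if every retained point is inside it, every segment between them is too, so intersecting with the box returns the input unchanged. A minimal sketch of the tracking, with a caller-supplied `clip` callable standing in for the Boost intersection (`buildClipped`, `within`, and the types are hypothetical):

```cpp
#include <cassert>
#include <vector>

struct Pt { double x, y; };
struct Box { double minX, minY, maxX, maxY; };
using Linestring = std::vector<Pt>;

inline bool within(const Pt& p, const Box& b) {
    return p.x >= b.minX && p.x <= b.maxX && p.y >= b.minY && p.y <= b.maxY;
}

// Copies `src` into the output while noting whether any point leaves `box`.
// `clip(line, box)` is only invoked when at least one point is outside.
template <typename ClipFn>
Linestring buildClipped(const Linestring& src, const Box& box, ClipFn&& clip) {
    Linestring out;
    out.reserve(src.size());
    bool needsClip = false;
    for (const Pt& p : src) {
        out.push_back(p);
        if (!within(p, box)) needsClip = true;
    }
    if (!needsClip) return out;  // intersection would be the identity
    return clip(out, box);       // existing slow path
}
```

The check costs four comparisons per point during a loop that already visits every point, so the fast path adds very little when clipping does turn out to be needed.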

The geometry correction helper returns a single corrected ring in the common path where no self-intersections are found. Returning it through an initializer list copies the ring into the result vector, which creates avoidable point-vector allocations in the Layer() geometry correction path. Build the one-element result vector explicitly and move the corrected ring into it. This keeps the same behavior while avoiding the ring copy. Liechtenstein semantic output matched the previous housekeeping stack. Profiling showed about 41k fewer allocation calls on the heaptrack fixture. Austria runtime was neutral to slightly slower, so this is an allocation cleanup rather than a speed improvement. Co-authored-by: Codex <noreply@openai.com>
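
The underlying language rule is that the elements of a `std::initializer_list` are const, so a vector built from one must copy them even when the source was moved into the list. The sketch below demonstrates this with a copy-counting stand-in type; `Ring`, `g_ringCopies`, and both helper functions are illustrative, not the PR's code.

```cpp
#include <cassert>
#include <utility>
#include <vector>

static int g_ringCopies = 0;  // counts copy-constructions of Ring

struct Ring {
    std::vector<double> coords;
    Ring() = default;
    explicit Ring(std::vector<double> c) : coords(std::move(c)) {}
    Ring(const Ring& other) : coords(other.coords) { ++g_ringCopies; }
    Ring(Ring&&) noexcept = default;
    Ring& operator=(const Ring&) = default;
    Ring& operator=(Ring&&) noexcept = default;
};

// Braced return: the ring is moved into the initializer_list's backing array,
// then copied out of it into the vector.
inline std::vector<Ring> viaInitializerList(Ring r) {
    return {std::move(r)};
}

// Explicit build: the ring is moved straight into the vector.
inline std::vector<Ring> viaMove(Ring r) {
    std::vector<Ring> out;
    out.reserve(1);
    out.push_back(std::move(r));
    return out;
}
```

Calling `viaInitializerList` records one copy per call, while `viaMove` records none. For a ring, each avoided copy is an avoided point-vector allocation.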

The geometry correction helper receives polygons through an rvalue reference and every current caller passes std::move(...). Because the named parameter is an lvalue inside result_combine(), pushing it into the result vector copied the polygon instead of moving it. Forward the parameter into the result vector so rvalue callers keep move semantics. This preserves behavior while avoiding avoidable geometry copies in the correction path. Liechtenstein semantic output matched the previous housekeeping stack. Profiling showed about 24k fewer allocation calls on the heaptrack fixture and small Austria RSS/runtime improvements in both forward and reverse order. Co-authored-by: Codex <noreply@openai.com>
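
To illustrate the pitfall: a parameter declared as `Polygon&&` binds only to rvalues, but inside the function its name is an lvalue, so passing it on without a cast selects the copying overload. The sketch uses a copy-counting stand-in type and hypothetical function names. It uses `std::move` because the parameter here is a plain rvalue reference; in a template with a forwarding reference, `std::forward` would be the equivalent.

```cpp
#include <cassert>
#include <utility>
#include <vector>

static int g_polygonCopies = 0;  // counts copy-constructions of Polygon

struct Polygon {
    std::vector<double> coords;
    Polygon() = default;
    explicit Polygon(std::vector<double> c) : coords(std::move(c)) {}
    Polygon(const Polygon& other) : coords(other.coords) { ++g_polygonCopies; }
    Polygon(Polygon&&) noexcept = default;
    Polygon& operator=(const Polygon&) = default;
    Polygon& operator=(Polygon&&) noexcept = default;
};

// `p` is a named variable, hence an lvalue: push_back(const Polygon&) is chosen.
inline void combineCopying(std::vector<Polygon>& result, Polygon&& p) {
    result.push_back(p);
}

// Casting back to an rvalue selects push_back(Polygon&&).
inline void combineMoving(std::vector<Polygon>& result, Polygon&& p) {
    result.push_back(std::move(p));
}
```

The caller-side `std::move(...)` looks identical in both cases, which is why this kind of copy is easy to miss in review.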

dissolve_find_intersections() currently constructs a fresh vector for each segment pair tested by the rtree callback. Heap profiling showed this path still contributed a large temporary-allocation bucket during tile generation. Keep one output vector for the duration of the function, reserve the common two-point segment intersection capacity, and clear it before each Boost intersection call. This preserves the existing callback behavior while avoiding repeated vector construction and allocation churn. Co-authored-by: Codex <noreply@openai.com>
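
A sketch of the buffer-reuse pattern, with two loud assumptions: a nested loop over segment pairs stands in for the rtree callback, and a hand-written segment intersection stands in for the Boost call. `Seg`, `intersectSegments`, and `countIntersections` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };
struct Seg { Pt a, b; };

// Appends the crossing point of s1 and s2 to `out`, if there is exactly one.
// Parallel and collinear pairs are ignored in this simplified version.
inline void intersectSegments(const Seg& s1, const Seg& s2, std::vector<Pt>& out) {
    const double rx = s1.b.x - s1.a.x, ry = s1.b.y - s1.a.y;
    const double sx = s2.b.x - s2.a.x, sy = s2.b.y - s2.a.y;
    const double d = rx * sy - ry * sx;
    if (d == 0.0) return;
    const double qx = s2.a.x - s1.a.x, qy = s2.a.y - s1.a.y;
    const double t = (qx * sy - qy * sx) / d;
    const double u = (qx * ry - qy * rx) / d;
    if (t < 0.0 || t > 1.0 || u < 0.0 || u > 1.0) return;
    out.push_back({s1.a.x + t * rx, s1.a.y + t * ry});
}

inline std::size_t countIntersections(const std::vector<Seg>& segs) {
    std::vector<Pt> hits;  // one buffer for the whole function
    hits.reserve(2);       // two points cover the common overlapping-segment case
    std::size_t total = 0;
    for (std::size_t i = 0; i < segs.size(); ++i) {
        for (std::size_t j = i + 1; j < segs.size(); ++j) {
            hits.clear();  // keeps capacity, so no new allocation per pair
            intersectSegments(segs[i], segs[j], hits);
            total += hits.size();
        }
    }
    return total;
}
```

Since `clear()` leaves capacity untouched, the per-pair cost drops to resetting a size field instead of constructing and destroying a vector.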

Freshly populated multipolygons are only used to build the mutable clipping buffer, so move them into that buffer instead of copying them first. Cached clip entries still use the existing copy path because fast_clip mutates its input and cache entries must remain reusable. If an uncached fast-clip result needs the Boost intersection fallback, re-populate the original multipolygon before intersecting so the fallback keeps the existing source-geometry behavior. Semantic comparison against the accepted stack on the Liechtenstein fixture produced no changed tiles. Profiling showed a modest allocation-call reduction; wall time and RSS were effectively neutral, and the 64 MB Massif movement is treated as mmap-backed allocator noise rather than native RSS evidence. Co-authored-by: Codex <noreply@openai.com>

fast_clip clips each ring against all four box edges even when no point is outside a given edge. Checking the existing bit code before running an edge pass avoids copying the ring through scratch output for no-op sides while preserving the existing clipping, validity, and fallback behavior. Liechtenstein semantic comparison against the accepted housekeeping stack produced no changed tiles. Profiling showed fewer allocation calls and favorable CPU counters, with wall time and native RSS effectively neutral on the Austria fixture. Co-authored-by: Codex <noreply@openai.com>
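
Roughly, the per-edge skip works like this sketch. It is not the fast_clip implementation: it recomputes outcodes per edge rather than using an existing bit code, treats rings as open point lists, and all names (`clipRing`, `outcode`, the `k...` constants) are hypothetical. `clipRing` returns the number of edge passes it actually ran so the effect is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };
struct Box { double minX, minY, maxX, maxY; };
using Ring = std::vector<Pt>;  // open ring: no repeated closing point

constexpr unsigned kLeft = 1, kRight = 2, kBottom = 4, kTop = 8;

// Bit set describing which box edges the point lies outside of.
inline unsigned outcode(const Pt& p, const Box& b) {
    unsigned c = 0;
    if (p.x < b.minX) c |= kLeft;
    if (p.x > b.maxX) c |= kRight;
    if (p.y < b.minY) c |= kBottom;
    if (p.y > b.maxY) c |= kTop;
    return c;
}

inline Pt crossing(const Pt& a, const Pt& c, unsigned edge, const Box& b) {
    if (edge == kLeft || edge == kRight) {
        const double x = (edge == kLeft) ? b.minX : b.maxX;
        return {x, a.y + (x - a.x) / (c.x - a.x) * (c.y - a.y)};
    }
    const double y = (edge == kBottom) ? b.minY : b.maxY;
    return {a.x + (y - a.y) / (c.y - a.y) * (c.x - a.x), y};
}

// Sutherland-Hodgman clip that skips edges no point is outside of.
inline int clipRing(Ring& ring, const Box& box, Ring& scratch) {
    const unsigned edges[] = {kLeft, kRight, kBottom, kTop};
    int passes = 0;
    for (unsigned edge : edges) {
        unsigned any = 0;
        for (const Pt& p : ring) any |= outcode(p, box);
        if ((any & edge) == 0) continue;  // pass would just copy the ring
        scratch.clear();
        const std::size_t n = ring.size();
        for (std::size_t i = 0; i < n; ++i) {
            const Pt& a = ring[i];
            const Pt& c = ring[(i + 1) % n];
            const bool aIn = (outcode(a, box) & edge) == 0;
            const bool cIn = (outcode(c, box) & edge) == 0;
            if (aIn) scratch.push_back(a);
            if (aIn != cIn) scratch.push_back(crossing(a, c, edge, box));
        }
        ring.swap(scratch);
        ++passes;
    }
    return passes;
}
```

A ring that only pokes out of the right side of the box gets one pass instead of four, and a ring fully inside the box gets none.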

Symmetricity · 2026-05-27T15:27:35Z

I don't have a beefy machine to run this on to confirm the improvements spread to world generation - can someone test this to confirm the gains?

systemed · 2026-05-27T20:45:42Z

This looks great - thank you! I'm away for a few days, but when I'm back I can run this over the planet and do before/after timings.

Symmetricity and others added 21 commits May 25, 2026 21:10

Symmetricity changed the title ~~Reduce allocation churn in tile generation~~ Speed up runtime by reducing allocation churn in tile generation May 27, 2026

Symmetricity wants to merge 21 commits into systemed:master from Symmetricity:perf/reduce-allocation-churn