Speed up runtime by reducing allocation churn in tile generation #902
Open
Symmetricity wants to merge 21 commits into
SortedNodeStore currently keeps only the most recently decoded compressed chunk per worker. Austria profiling showed about 277M node lookups and 90M compressed chunk decodes, with a one-entry cache hit rate around 67%. Keep four recently decoded chunks in fixed per-thread arrays. The cache stays bounded and allocation-free in the lookup path while raising the measured Austria hit rate to about 85%, cutting decoded chunks to about 40M. Validation: test_sorted_node_store passes; Liechtenstein MBTiles match semantically against the PR-stack baseline; Austria timing improved by about 2-3% wall time and about 4% user CPU with flat heap/RSS in heaptrack and Massif checks. Co-authored-by: Codex <noreply@openai.com>
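A minimal sketch of a bounded four-slot per-thread cache as described in this commit message. The names `DecodedChunk`, `ChunkCache`, and `threadChunkCache` are hypothetical, and the round-robin replacement policy is an assumption; the commit message does not say which slot the real code evicts.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded chunk: the id it was decoded from plus its payload.
struct DecodedChunk {
    uint64_t chunkId = UINT64_MAX;  // sentinel meaning "slot is empty"
    std::vector<int32_t> values;    // capacity is retained when the slot is reused
};

struct ChunkCache {
    static constexpr std::size_t kSlots = 4;
    std::array<DecodedChunk, kSlots> slots;
    std::size_t next = 0;           // round-robin cursor (assumed policy)
    std::size_t hits = 0, misses = 0;

    // `decode(id, out)` is a caller-supplied function that fills `out`.
    template <typename Decode>
    const std::vector<int32_t>& get(uint64_t chunkId, Decode&& decode) {
        for (auto& s : slots) {
            if (s.chunkId == chunkId) { ++hits; return s.values; }
        }
        ++misses;
        DecodedChunk& victim = slots[next];
        next = (next + 1) % kSlots;
        victim.chunkId = chunkId;
        victim.values.clear();      // keeps capacity for the next decode
        decode(chunkId, victim.values);
        return victim.values;
    }
};

// One cache per worker thread, so lookups need no locking.
inline ChunkCache& threadChunkCache() {
    thread_local ChunkCache cache;
    return cache;
}
```

Because the slots live in a fixed array and each payload vector is cleared rather than destroyed, the cache stays bounded and a hit performs no allocation.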
fast_clip constructs a result ring for each clip edge and for each polygon ring. Heaptrack showed this path producing hundreds of thousands of allocations on the Liechtenstein fixture after the sorted-node cache candidate reduced node decode cost. Keep one scratch ring for the Sutherland-Hodgman edge passes and reuse it across a polygon's outer and inner rings. This preserves the public fast_clip wrapper and generated tile semantics while removing repeated vector growth in the clipping path. Liechtenstein output matched semantically against the node-cache baseline. On Austria with --threads 8, wall time was effectively flat to slightly lower, system time decreased, and heaptrack allocation calls on Liechtenstein fell by about 758k. Massif peak total dropped by about 64 MB on that small fixture. Co-authored-by: Codex <noreply@openai.com>
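The scratch-ring idea can be sketched as follows. This is an illustration, not the PR's `fast_clip`: `Pt`, `Ring`, `Box`, `clipEdge`, and `clipRing` are hypothetical names, and rings are assumed to be stored without a repeated closing point.

```cpp
#include <cstddef>
#include <vector>

struct Pt { double x, y; };
using Ring = std::vector<Pt>;
struct Box { double minX, minY, maxX, maxY; };

// One Sutherland-Hodgman pass against a single half-plane.
// The output is built in `scratch`, then swapped into `ring`.
template <typename Inside, typename Cross>
void clipEdge(Ring& ring, Ring& scratch, Inside inside, Cross cross) {
    scratch.clear();  // reuse capacity left over from earlier passes
    for (std::size_t i = 0; i < ring.size(); ++i) {
        const Pt& a = ring[i];
        const Pt& b = ring[(i + 1) % ring.size()];
        const bool aIn = inside(a), bIn = inside(b);
        if (aIn) scratch.push_back(a);
        if (aIn != bIn) scratch.push_back(cross(a, b));
    }
    ring.swap(scratch);
}

// Clip `ring` in place against all four box edges using one scratch ring.
void clipRing(Ring& ring, const Box& b, Ring& scratch) {
    auto atX = [](const Pt& p, const Pt& q, double x) {
        return Pt{x, p.y + (q.y - p.y) * (x - p.x) / (q.x - p.x)};
    };
    auto atY = [](const Pt& p, const Pt& q, double y) {
        return Pt{p.x + (q.x - p.x) * (y - p.y) / (q.y - p.y), y};
    };
    clipEdge(ring, scratch, [&](const Pt& p) { return p.x >= b.minX; },
             [&](const Pt& p, const Pt& q) { return atX(p, q, b.minX); });
    clipEdge(ring, scratch, [&](const Pt& p) { return p.x <= b.maxX; },
             [&](const Pt& p, const Pt& q) { return atX(p, q, b.maxX); });
    clipEdge(ring, scratch, [&](const Pt& p) { return p.y >= b.minY; },
             [&](const Pt& p, const Pt& q) { return atY(p, q, b.minY); });
    clipEdge(ring, scratch, [&](const Pt& p) { return p.y <= b.maxY; },
             [&](const Pt& p, const Pt& q) { return atY(p, q, b.maxY); });
}
```

A caller would pass the same `scratch` for a polygon's outer ring and each inner ring, so the two buffers trade storage instead of allocating a new result ring per edge.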
Build scaled polygon rings directly in TileBbox instead of copying through intermediate vector and Ring objects. This keeps the existing scale/backtrack behavior but avoids polygon-by-value iteration and moves freshly scaled rings into the destination geometry. The Liechtenstein fixture is semantically unchanged, while heaptrack on the same fixture reports about 488k fewer allocation calls. Austria timing is neutral to slightly faster with no stable RSS regression. Co-authored-by: Codex <noreply@openai.com>
Reserve the custom simplification heap and output containers before appending points, rings, and polygons. Also read triangle-area points by const reference instead of copying them. The Liechtenstein fixture is semantically unchanged. On the Austria fixture this is timing-neutral to slightly faster, and heaptrack on Liechtenstein reports about 65k fewer allocation calls with no stable heap or RSS regression. Co-authored-by: Codex <noreply@openai.com>
Iterate multiline and multipolygon members by const reference when calculating tile coverage for addGeometryToIndex. The loops only read geometry before calling insertIntermediateTiles, so copying each Linestring or Polygon is unnecessary. The Liechtenstein fixture is semantically unchanged. On the Austria fixture the combined forward and reverse timing is slightly faster, with lower system time and no stable RSS regression. Co-authored-by: Codex <noreply@openai.com>
Avoid constructing a temporary two-point Linestring for each segment clipping check. Both line geometry build paths only need a single segment-vs-box intersection test, so use boost::geometry::model::segment<Point> directly. The Liechtenstein fixture is semantically unchanged. Heaptrack on Liechtenstein reports about 500k fewer allocation calls and about 500k fewer temporary allocations; Austria timing is effectively neutral with no stable RSS regression. Co-authored-by: Codex <noreply@openai.com>
llListPolygon and llListLinestring know the way-node range length before fillPoints appends converted coordinates. Reserving the local output containers avoids repeated vector growth without changing conversion, integrity handling, or geometry correction. Liechtenstein semantic comparison against the submitted-stack baseline reported changed_tiles 0. Austria timing/RSS was effectively neutral after forward and reverse alternating runs; heaptrack on Liechtenstein showed about 205k fewer allocation calls, with Massif peak flat within measurement noise. Co-authored-by: Codex <noreply@openai.com>
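The reserve-before-fill pattern, sketched under assumptions: `convertRange` is a hypothetical stand-in for the fill step, and the `1e7` fixed-point scale is assumed for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

struct LatpLon { int32_t latp; int32_t lon; };
struct Point { double x; double y; };

// Convert a known-length range, reserving once so push_back never regrows.
template <typename It>
std::vector<Point> convertRange(It first, It last) {
    std::vector<Point> out;
    out.reserve(static_cast<std::size_t>(std::distance(first, last)));
    for (; first != last; ++first) {
        // Assumed fixed-point scale; the real conversion lives elsewhere.
        out.push_back(Point{first->lon / 1e7, first->latp / 1e7});
    }
    return out;
}
```

The conversion itself is unchanged; only the number of vector growth steps drops, from roughly log2(n) reallocations to one allocation.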
SortedWayStore::at decodes the full way-node list before building the returned LatpLon vector, and OsmMemTiles::populateLinestring receives that full vector before appending points. Reserve both local output containers to avoid repeated vector growth without changing decoding, node lookup, conversion, or cache behavior. Liechtenstein semantic comparison against the previous accepted stack reported changed_tiles 0. Austria forward and reverse timing was consistently faster; heaptrack on Liechtenstein showed about 234k fewer allocation calls with temporary allocations and Massif peak effectively flat. Co-authored-by: Codex <noreply@openai.com>
TileDataSource::populateMultiPolygon copies a complete mmap-backed multipolygon into a normal MultiPolygon. Resize the destination polygons and rings before assigning points, matching the explicit copy shape already used by storeMultiPolygon and avoiding repeated generic assignment growth. Liechtenstein semantic comparison against the previous accepted stack reported changed_tiles 0. Austria timing was effectively neutral; heaptrack on Liechtenstein showed about 116k fewer allocation calls with temporary allocations flat and no stable RSS or Massif peak regression. Co-authored-by: Codex <noreply@openai.com>
Closed-way polygon construction reads the cached linestring without mutating it, so avoid copying that linestring in the Lua Layer path. Reserve polygon outer-ring storage where the source size is known and move temporary polygons into their destination containers. Liechtenstein semantic comparison against the previous accepted stack reported changed_tiles 0. Austria forward and reverse timing/RSS were slightly positive, heaptrack on Liechtenstein showed about 256k fewer allocation calls, and perf showed lower instructions, cycles, and cache misses. Co-authored-by: Codex <noreply@openai.com>
Split way geometry builds a temporary linestring by appending source points before moving completed pieces into the output geometry. Reserve the temporary linestring storage from the known source size, and re-reserve the remaining possible size after moving out a completed split. Liechtenstein semantic comparison against the previous accepted stack reported changed_tiles 0. Austria timing/RSS was neutral and order-dependent, while heaptrack on Liechtenstein showed about 118k fewer allocation calls with no stable heap regression. Co-authored-by: Codex <noreply@openai.com>
Low zoom tile collection only reads the materialized low zoom object vectors, but the helper accepted those vectors by value. That copied the full vector-of-vectors for each low zoom tile collection, working against the memory and thrash-reduction goal of the low zoom path added for large extracts. Pass the low zoom object lists by const reference instead. This keeps behavior unchanged while avoiding the unnecessary copy. Liechtenstein semantic comparison against the previous accepted housekeeping stack reported changed_tiles 0. Austria timing/RSS profiling was wall-time neutral and showed a stable ~80-85 MB RSS reduction across both run orders; perf counters showed no meaningful regression. Co-authored-by: Codex <noreply@openai.com>
Area() reprojected every polygon into a fresh DegPoint polygon before asking Boost for spherical area. Reuse the per-processing projected polygon storage and fill it directly instead, preserving the existing lon/latp to lon/lat conversion while avoiding repeated ring allocations. The candidate matched the accepted stack semantically on Liechtenstein. On warmed Austria runs it was runtime-neutral and reduced peak RSS by roughly 35-90 MiB, while Liechtenstein heaptrack showed about 110k fewer allocation calls. Co-authored-by: Codex <noreply@openai.com>
Polygon output scaled each multipolygon into fresh geometry storage before simplify/correct/write. Add destination-taking scale helpers and reuse a thread-local scaled multipolygon buffer in the writer so ring storage can be retained across objects on the same worker. The candidate matched the accepted stack semantically on Liechtenstein. On warmed Austria runs it was runtime-neutral with no stable RSS regression, and Liechtenstein heaptrack showed about 65k fewer allocation calls. The 64 MB Massif swing matched the mmap allocator chunk size and flipped with run order. Co-authored-by: Codex <noreply@openai.com>
Avoid materializing temporary vectors in the hot way geometry path. OsmMemTiles now fills a reusable thread-local way-node buffer, while WayStore implementations expose fill-into-buffer overloads that preserve the existing return-by-value API for other callers. SortedWayStore also reuses a per-thread decoded NodeID buffer when expanding encoded ways. This removes repeated short-lived vector allocations without changing the generated tile semantics checked by the Liechtenstein fixture. Co-authored-by: Codex <noreply@openai.com>
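A sketch of the fill-into-buffer overload described here, keeping the return-by-value API for other callers. `WayStoreSketch` and `threadWayBuffer` are hypothetical names; the real `WayStore` interface is not shown in this PR page.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using NodeID = uint64_t;

class WayStoreSketch {
public:
    void insert(std::vector<NodeID> nodes) { ways_.push_back(std::move(nodes)); }

    // New overload: writes into caller-owned storage, so the caller's
    // buffer capacity can be reused across lookups.
    void at(std::size_t wayIndex, std::vector<NodeID>& out) const {
        const std::vector<NodeID>& src = ways_.at(wayIndex);
        out.assign(src.begin(), src.end());
    }

    // Existing return-by-value API, implemented on top of the overload.
    std::vector<NodeID> at(std::size_t wayIndex) const {
        std::vector<NodeID> out;
        at(wayIndex, out);
        return out;
    }

private:
    std::vector<std::vector<NodeID>> ways_;
};

// One reusable buffer per worker thread for the hot path.
inline std::vector<NodeID>& threadWayBuffer() {
    thread_local std::vector<NodeID> buffer;
    return buffer;
}
```

Hot-path callers would use `store.at(i, threadWayBuffer())`, while existing callers that want an owned vector keep calling `store.at(i)`.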
The line geometry path splits ways into bbox-overlapping sections and then always runs the result through Boost intersection against the extended tile box. For sections whose retained points are already inside that extended box, the intersection is an identity operation but still allocates and walks geometry. Track whether any retained point falls outside the extended box while building the split linestring output. Return the split result directly when no clipping is needed, and keep the existing Boost intersection path for sections that still extend outside the box. Liechtenstein semantic output matched the previous stack, and profiling showed about 241k fewer heap allocation calls on the heaptrack fixture. Austria runtime was neutral/noisy, so this should be treated as allocation cleanup rather than a wall-time improvement. Co-authored-by: Codex <noreply@openai.com>
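The identity-skip check relies on the box being convex: if both endpoints of every segment lie inside the extended box, the whole linestring does, so intersecting it with the box returns it unchanged. A simplified sketch, with hypothetical names (`SplitResult`, `collect`) and without the bbox-overlap splitting the real code performs:

```cpp
#include <vector>

struct Pt { double x, y; };
struct Box { double minX, minY, maxX, maxY; };

inline bool within(const Pt& p, const Box& b) {
    return p.x >= b.minX && p.x <= b.maxX && p.y >= b.minY && p.y <= b.maxY;
}

struct SplitResult {
    std::vector<Pt> points;
    bool needsClip = false;  // true once any retained point leaves the box
};

// Copy the retained points and record whether clipping is still required.
SplitResult collect(const std::vector<Pt>& src, const Box& extended) {
    SplitResult r;
    r.points.reserve(src.size());
    for (const Pt& p : src) {
        r.points.push_back(p);
        if (!within(p, extended)) r.needsClip = true;
    }
    return r;
}
```

A caller would return `r.points` directly when `needsClip` is false and fall back to the existing intersection path otherwise.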
The geometry correction helper returns a single corrected ring in the common path where no self-intersections are found. Returning it through an initializer list copies the ring into the result vector, which creates avoidable point-vector allocations in the Layer() geometry correction path. Build the one-element result vector explicitly and move the corrected ring into it. This keeps the same behavior while avoiding the ring copy. Liechtenstein semantic output matched the previous housekeeping stack. Profiling showed about 41k fewer allocation calls on the heaptrack fixture. Austria runtime was neutral to slightly slower, so this is an allocation cleanup rather than a speed improvement. Co-authored-by: Codex <noreply@openai.com>
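The copy arises because the elements of a `std::initializer_list` are const, so a vector constructed from one must copy them even when the list element was itself move-constructed. The sketch below demonstrates this with a hypothetical copy-counting type; `CountingRing`, `viaInitializerList`, and `viaExplicitMove` are illustrative names, not the PR's code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

static int ringCopies = 0;  // counts copy constructions of CountingRing

struct CountingRing {
    std::vector<double> pts;
    explicit CountingRing(std::size_t n) : pts(n) {}
    CountingRing(const CountingRing& o) : pts(o.pts) { ++ringCopies; }
    CountingRing(CountingRing&&) noexcept = default;
    CountingRing& operator=(const CountingRing&) = default;
    CountingRing& operator=(CountingRing&&) noexcept = default;
};

// The braces build a const backing array; the vector then copies from it.
std::vector<CountingRing> viaInitializerList(CountingRing r) {
    return {std::move(r)};
}

// Building the vector explicitly lets the ring be moved all the way in.
std::vector<CountingRing> viaExplicitMove(CountingRing r) {
    std::vector<CountingRing> out;
    out.reserve(1);
    out.push_back(std::move(r));
    return out;
}
```

With a ring of many points, the single copy in the first version is a full point-vector allocation, which is the cost this commit removes.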
The geometry correction helper receives polygons through an rvalue reference and every current caller passes std::move(...). Because the named parameter is an lvalue inside result_combine(), pushing it into the result vector copied the polygon instead of moving it. Forward the parameter into the result vector so rvalue callers keep move semantics. This preserves behavior while eliminating unnecessary geometry copies in the correction path. Liechtenstein semantic output matched the previous housekeeping stack. Profiling showed about 24k fewer allocation calls on the heaptrack fixture and small Austria RSS/runtime improvements in both forward and reverse order. Co-authored-by: Codex <noreply@openai.com>
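A small demonstration of the language rule behind this commit: a named rvalue-reference parameter is an lvalue expression, so it must be cast back with `std::move` (or `std::forward<T>` in a template with a forwarding reference) to be moved from. `CountingPolygon`, `combineCopying`, and `combineMoving` are hypothetical names; the real `result_combine()` signature is not shown here.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

static int polygonCopies = 0;  // counts copy constructions

struct CountingPolygon {
    std::vector<double> pts;
    explicit CountingPolygon(std::size_t n) : pts(n) {}
    CountingPolygon(const CountingPolygon& o) : pts(o.pts) { ++polygonCopies; }
    CountingPolygon(CountingPolygon&&) noexcept = default;
};

// `p` has a name, so `p` is an lvalue here and push_back copies it.
void combineCopying(std::vector<CountingPolygon>& out, CountingPolygon&& p) {
    out.push_back(p);
}

// Casting back to an rvalue selects push_back(T&&) and moves the storage.
void combineMoving(std::vector<CountingPolygon>& out, CountingPolygon&& p) {
    out.push_back(std::move(p));
}
```

Both functions accept the same callers; only the second one honors the caller's `std::move`.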
dissolve_find_intersections() currently constructs a fresh vector for each segment pair tested by the rtree callback. Heap profiling showed this path still contributed a large temporary-allocation bucket during tile generation. Keep one output vector for the duration of the function, reserve the common two-point segment intersection capacity, and clear it before each Boost intersection call. This preserves the existing callback behavior while avoiding repeated vector construction and allocation churn. Co-authored-by: Codex <noreply@openai.com>
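A sketch of the reuse pattern: one output vector lives for the whole function and is cleared before each intersection call. The pairwise loop stands in for the rtree callback, and `intersectSegments` is a hypothetical replacement for the Boost intersection that ignores parallel and collinear segments.

```cpp
#include <cstddef>
#include <vector>

struct Pt { double x, y; };
struct Seg { Pt a, b; };

// Append the crossing point of s and t to `out`, if there is one.
void intersectSegments(const Seg& s, const Seg& t, std::vector<Pt>& out) {
    const double rx = s.b.x - s.a.x, ry = s.b.y - s.a.y;
    const double qx = t.b.x - t.a.x, qy = t.b.y - t.a.y;
    const double denom = rx * qy - ry * qx;
    if (denom == 0.0) return;  // parallel or collinear: ignored in this sketch
    const double dx = t.a.x - s.a.x, dy = t.a.y - s.a.y;
    const double u = (dx * qy - dy * qx) / denom;  // position along s
    const double v = (dx * ry - dy * rx) / denom;  // position along t
    if (u < 0 || u > 1 || v < 0 || v > 1) return;
    out.push_back(Pt{s.a.x + u * rx, s.a.y + u * ry});
}

// Test every pair while reusing a single output vector.
std::size_t countIntersections(const std::vector<Seg>& segs) {
    std::vector<Pt> hits;
    hits.reserve(2);  // common case per the commit message: up to two points
    std::size_t total = 0;
    for (std::size_t i = 0; i < segs.size(); ++i) {
        for (std::size_t j = i + 1; j < segs.size(); ++j) {
            hits.clear();  // keeps capacity; no allocation per pair
            intersectSegments(segs[i], segs[j], hits);
            total += hits.size();
        }
    }
    return total;
}
```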
Freshly populated multipolygons are only used to build the mutable clipping buffer, so move them into that buffer instead of copying them first. Cached clip entries still use the existing copy path because fast_clip mutates its input and cache entries must remain reusable. If an uncached fast-clip result needs the Boost intersection fallback, re-populate the original multipolygon before intersecting so the fallback keeps the existing source-geometry behavior. Semantic comparison against the accepted stack on the Liechtenstein fixture produced no changed tiles. Profiling showed a modest allocation-call reduction; wall time and RSS were effectively neutral, and the 64 MB Massif movement is treated as mmap-backed allocator noise rather than native RSS evidence. Co-authored-by: Codex <noreply@openai.com>
fast_clip clips each ring against all four box edges even when no point is outside a given edge. Checking the existing bit code before running an edge pass avoids copying the ring through scratch output for no-op sides while preserving the existing clipping, validity, and fallback behavior. Liechtenstein semantic comparison against the accepted housekeeping stack produced no changed tiles. Profiling showed fewer allocation calls and favorable CPU counters, with wall time and native RSS effectively neutral on the Austria fixture. Co-authored-by: Codex <noreply@openai.com>
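The edge-skip check can be sketched with Cohen-Sutherland style outcodes: OR the per-point codes over a ring, then run an edge pass only if that edge's bit is set. The names and bit layout below are assumptions; the PR reuses an existing bit code whose layout is not shown on this page.

```cpp
#include <cstdint>
#include <vector>

struct Pt { double x, y; };
struct Box { double minX, minY, maxX, maxY; };

enum : uint8_t { kLeft = 1, kRight = 2, kBottom = 4, kTop = 8 };

// Bit set describing which box edges the point lies outside of.
inline uint8_t outcode(const Pt& p, const Box& b) {
    uint8_t c = 0;
    if (p.x < b.minX) c |= kLeft;
    if (p.x > b.maxX) c |= kRight;
    if (p.y < b.minY) c |= kBottom;
    if (p.y > b.maxY) c |= kTop;
    return c;
}

// Union of outcodes over a ring: a clear bit means that edge pass is a no-op.
inline uint8_t ringOutcode(const std::vector<Pt>& ring, const Box& b) {
    uint8_t all = 0;
    for (const Pt& p : ring) all |= outcode(p, b);
    return all;
}

// Number of edge passes that actually need to run for this ring.
inline int edgePassesNeeded(uint8_t code) {
    int n = 0;
    for (uint8_t bit : {kLeft, kRight, kBottom, kTop}) {
        if (code & bit) ++n;
    }
    return n;
}
```

A clipper would guard each pass with a test such as `if (code & kLeft)`, leaving rings that never cross an edge untouched by that pass.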
Contributor
Author
I don't have a beefy machine to run this on to confirm the improvements spread to world generation - can someone test this to confirm the gains?
Owner
This looks great - thank you! I'm away for a few days, but when I'm back I can run this over the planet and do before/after timings.
This PR is AI generated.
Summary
In a warmed Austria OpenMapTiles tile generation run, this reduced wall time by
about 6.4% and user CPU by about 9-10%. In heaptrack runs on a smaller fixture,
allocation calls fell by about 47% and temporary allocations by about 74%.
This PR reduces allocation churn and unnecessary geometry work in several hot
paths used while generating tiles.
The changes are grouped around the same goal: keep existing output behavior,
but avoid repeated temporary allocations and unnecessary geometry copies, and
skip cheap-to-detect no-op work before it reaches heavier geometry routines.
The largest measured effect is in tile generation with the OpenMapTiles
profile: fewer geometry/vector reallocations, fewer decoded sorted-node chunks,
and less work in clipping/indexing paths.
Implementation
Node and way lookup hot paths
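- Keep four recently decoded sorted-node chunks per thread instead of only
  one chunk.
- Fill reusable per-thread way-node buffers directly instead of building a
  temporary vector and copying it afterward.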
These changes target repeated lookups and geometry population during PBF-backed
tile generation.
Geometry construction and scaling
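- Build scaled polygon rings directly in the destination geometry, and move
  temporary rings and polygons into their destination containers once
  the source is no longer needed.
- Reserve output storage where the source size is known, and read geometry by
  const reference in triangle-area and tile-coverage
  calculations.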
These changes reduce short-lived std::vector allocation/reallocation in
geometry preparation paths.
Clipping and intersection paths
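- Return split linestrings directly, without the Boost intersection, when
  every retained point is already inside the extended tile
  box.
- Move freshly populated multipolygons into the clipping buffer instead of
  copying them
  when the current function owns the geometry.
- Skip a fast_clip edge pass when no point in the ring lies outside that
  edge.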
These changes keep the same clipping model, but reduce repeated temporary
storage and avoid unnecessary pass-through clipping work.
Performance
I compared current upstream master with this performance stack applied directly
on top of upstream master, without other unmerged PRs.
Branches/binaries:
Runtime fixture:
Three alternating warmed runs, upstream first:
Three alternating warmed runs, performance stack first:
The wall-time and user-CPU improvement reproduced in both orders. RSS moved in
opposite directions depending on run order, so I would treat native RSS as
neutral rather than claiming a memory saving.
Perf counters from the warmed forward run:
Allocation profile:
Heaptrack/Massif over three pairs:
Raw allocation counts:
Possible Regressions
The changes are intended to preserve generated tile behavior. They mostly alter
temporary storage ownership, reserve sizes, and fast-path checks.
Potential risks:
- SortedNodeStore now caches four decoded chunks per thread instead of one,
  which may retain a small amount of extra per-thread cache memory.
- Reusing thread-local scratch buffers
  reduces allocation churn, but retained capacity may slightly change heap
  shape.
- Incorrect clipping fast-path
  handling would be a correctness bug, so those changes are kept narrow and
  only skip work when the existing geometry is already inside the tested box or
  no points need clipping for a given edge.
Testing
Code checks:
Build:
cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build --parallel 8
CTest:
Result: