_deform_mesh: bump _mesh_version + clear runner nav caches by lmoresi · Pull Request #191 · underworldcode/underworld3

lmoresi · 2026-05-15T22:36:35Z

Summary

Two cache-invalidation gaps in Mesh._deform_mesh, both triggered by direct _deform_mesh(coords) calls (e.g. every free-surface RK stage in the convection benchmark). Such calls bypass the mesh.X.coords NDArray callback that normally performs this hygiene:

1. _mesh_version not bumped. PR #182 (Consolidate and unify cached spatial indexing (KDTree)) introduced version-keyed kdtree navigation caches (_BaseMeshVariable._get_kdtree, Mesh._get_domain_kdtree) gated on _mesh_version. The mesh.X.coords callback bumps it; direct _deform_mesh() did not, so navigation kdtrees stayed frozen on the undeformed mesh. PR #188 (Invalidate evaluate/DMInterp/topology caches on Mesh._deform_mesh) added _topology_version invalidation here but missed _mesh_version.
2. Runner coord-identity nav caches not cleared. A runner's restore_points_to_domain caches a kdtree keyed on id(mesh.X.coords). _deform_mesh replaces self._coords, but CPython can reuse the id of a freed object, so a fresh array can collide with the old id() and the staleness check returns a false negative. Explicitly clearing _restore_kdt / _restore_coords_id defeats the id()-reuse hazard.

Brings _deform_mesh into line with the cache hygiene mesh.adapt() and _legacy_access already perform.
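
The first gap can be illustrated with a minimal, self-contained sketch. ToyMesh, _get_nav_points, nearest and the bump_version flag are hypothetical names invented for this illustration; they are not the underworld3 API, and a brute-force search stands in for the kdtree. The only idea borrowed from the description above is that the navigation cache rebuilds when, and only when, _mesh_version changes.

```python
class ToyMesh:
    """Hypothetical 1D stand-in for a mesh (not the underworld3 Mesh class)."""

    def __init__(self, coords):
        self._coords = list(coords)
        self._mesh_version = 0
        self._nav_cache = None  # (version at build time, cached node positions)
        self.rebuilds = 0

    def _get_nav_points(self):
        # Rebuild only when the cached version differs from the current one.
        cached = self._nav_cache
        if cached is None or cached[0] != self._mesh_version:
            self.rebuilds += 1
            self._nav_cache = (self._mesh_version, list(self._coords))
        return self._nav_cache[1]

    def nearest(self, x):
        # Brute-force nearest node, standing in for a kdtree query.
        pts = self._get_nav_points()
        return min(range(len(pts)), key=lambda i: abs(pts[i] - x))

    def deform(self, new_coords, bump_version):
        # Replace the coordinates; optionally bump the version counter.
        self._coords = list(new_coords)
        if bump_version:
            self._mesh_version += 1


# Without the bump, the lookup keeps answering from the undeformed positions.
stale = ToyMesh([0.0, 1.0, 2.0])
stale.nearest(0.9)
stale.deform([0.0, 5.0, 1.0], bump_version=False)
stale_hit = stale.nearest(0.9)

# With the bump, the cache rebuilds and the lookup sees the moved nodes.
fresh = ToyMesh([0.0, 1.0, 2.0])
fresh.nearest(0.9)
fresh.deform([0.0, 5.0, 1.0], bump_version=True)
fresh_hit = fresh.nearest(0.9)

print(stale_hit, fresh_hit)
```

In this toy the stale mesh still reports node 1 as nearest to 0.9 although node 1 has moved to 5.0, while the version-bumped mesh reports node 2.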

Test plan

_mesh_version increments on a direct _deform_mesh() call (was frozen at 0 before the fix)
Existing mesh/smoother test suites pass
Parallel smoke (np=2) of a deforming-mesh case

Note: this is an independent correctness fix. It does not by itself resolve the separate free-surface convection feedback regression under investigation (verified — both fixes present in a clean build still reproduce the damped regime).

Underworld development team with AI support from Claude Code

Two cache-invalidation gaps in Mesh._deform_mesh, both exposed by direct _deform_mesh(coords) calls (e.g. every free-surface RK stage in the convection benchmark), which bypass the mesh.X.coords NDArray callback that normally performs this hygiene:

1. _mesh_version was not incremented. PR #182 introduced version-keyed kdtree navigation caches (_BaseMeshVariable._get_kdtree, Mesh._get_domain_kdtree) that gate their rebuild on _mesh_version. The mesh.X.coords callback bumps it; direct _deform_mesh() did not. Result: navigation kdtrees stay frozen on the undeformed mesh, so spatial lookups return pre-deform DOFs after the geometry has moved. PR #188 added _topology_version invalidation here but missed _mesh_version.

2. User-installed coord-identity nav caches were not cleared. A runner's restore_points_to_domain typically caches a kdtree keyed on id(mesh.X.coords). _deform_mesh replaces self._coords with a new object, but CPython reuses freed ids, so a fresh coords array can collide with the old id() and the staleness check false-negatives. Explicitly clearing _restore_kdt / _restore_coords_id defeats the id()-reuse hazard.

Verified: _mesh_version now increments on a direct _deform_mesh() call (was frozen at 0 before).

Matches the cache hygiene mesh.adapt() and _legacy_access already perform; brings _deform_mesh into line.

Independent correctness fix; does not by itself resolve the separate free-surface convection feedback regression under investigation.

Underworld development team with AI support from Claude Code

Copilot

Pull request overview

This PR closes two cache invalidation gaps in Mesh._deform_mesh() that occur when _deform_mesh(coords) is called directly (bypassing the mesh.X.coords NDArray callback), ensuring version-gated navigation caches and runner-installed navigation helpers don’t remain stale after mesh deformation.

Changes:

Increment self._mesh_version inside _deform_mesh() so version-keyed KDTree navigation caches rebuild on geometry updates.
Clear runner/user-installed navigation cache attributes (_restore_kdt, _restore_coords_id) to avoid stale reuse due to CPython id() reuse.
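
The id()-reuse hazard behind the second change can be sketched as follows. ToyRunner, restore_index and clear_nav_cache are hypothetical names for this illustration; only the attribute names _restore_kdt and _restore_coords_id come from the PR text, and a tuple stands in for the kdtree. Because whether CPython actually recycles an id in a given run is not deterministic, the collision is simulated by overwriting the stored id rather than by relying on the allocator.

```python
class ToyRunner:
    """Hypothetical runner-side helper; not the underworld3 implementation."""

    def __init__(self):
        self._restore_kdt = None
        self._restore_coords_id = None
        self.builds = 0

    def restore_index(self, coords):
        # Staleness check keyed on object identity, as described in the PR.
        if self._restore_kdt is None or self._restore_coords_id != id(coords):
            self.builds += 1
            self._restore_kdt = tuple(coords)  # stand-in for a kdtree build
            self._restore_coords_id = id(coords)
        return self._restore_kdt

    def clear_nav_cache(self):
        # Explicit invalidation: does not depend on id() being unique over time.
        self._restore_kdt = None
        self._restore_coords_id = None


runner = ToyRunner()
old_coords = [0.0, 1.0]
runner.restore_index(old_coords)

new_coords = [0.0, 2.0]
# Simulated collision (assumption): pretend the freed id of the old array
# was handed to the new array, so the identity check sees "no change".
runner._restore_coords_id = id(new_coords)
stale_index = runner.restore_index(new_coords)

# Clearing the cache attributes forces a rebuild regardless of id().
runner.clear_nav_cache()
fresh_index = runner.restore_index(new_coords)

print(stale_index, fresh_index)
```

The identity check alone returns the index built from the old coordinates; only the explicit clear guarantees a rebuild.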


+            # Bump the geometry-version counter so version-keyed
+            # kdtree navigation caches rebuild against the new DOF
+            # positions: _BaseMeshVariable._get_kdtree and
+            # Mesh._get_domain_kdtree both gate their rebuild on
+            # `_mesh_version`. PR #182 introduced those version-keyed
+            # caches; the mesh.X.coords callback path bumps
+            # _mesh_version, but direct _deform_mesh() calls (every
+            # free-surface RK stage) bypass that callback. Without
+            # this bump the navigation kdtrees stay frozen on the
+            # undeformed mesh — back-advected SL samples land at the
+            # wrong DOFs, corrupting the temperature field. PR #188
+            # added _topology_version invalidation but missed this.
+            self._mesh_version += 1
+            # Also nuke any *user-installed* navigation caches that


test_lambdify_caching and test_rbf_false_not_slow asserted one-off wall-clock timings (time2 <= time1*2, elapsed < 0.01s). These are inherently flaky on shared CI runners and test_lambdify_caching has been failing CI on unrelated PRs (0.0107s vs 0.0049s runner jitter).

Replaced with deterministic, timing-free behaviour tests:

- test_lambdify_cache_hit: exercises get_cached_lambdified directly -- identical request returns the SAME function object and adds no entry (true cache hit); a distinct expression is cached separately. This is the cache mechanism's actual contract.
- test_rbf_modes_consistent: rbf=True and rbf=False must agree for a pure-sympy expression (meaningful, timing-free; replaces the "rbf=False not slow" wall-clock check).

Note: writing the cache-hit test surfaced that the high-level uw.function.evaluate() path does NOT hit the lambdify cache on repeated identical calls -- _expr_hash(srepr) differs every call, so the cache grows by one entry per call and never returns a hit. The old loose timing tolerance was masking this. Not fixed here (evaluate/lambdify is a performance-critical hot path needing separate benchmarking); see PR description.

Full module: 19 passed.

Underworld development team with AI support from Claude Code

_expr_hash used sympy.srepr(expr), which embeds the volatile global dummy_index of any sympy.Dummy. The evaluate() coordinate-substitution path mints a fresh Dummy per call, so an otherwise-identical expression hashed differently every call and the lambdify cache never matched -- _lambdify_cache grew one entry per call (1,2,3,4,...) and sympy.lambdify recompiled every time on a hot path.

Fix: canonicalise Dummy -> name-stable Symbol in _expr_hash before srepr. This changes only the cache *key*; the real sympy.lambdify() call still uses the original expr/symbols, so numerics are unchanged. The cache key separately carries the symbol-name tuple, so name-keying is safe and deterministic.

Verified: repeated identical evaluate() now holds cache size flat ([1,1,1,...] vs [1,2,3,...] before). test_0720 module 19 passed; test_0501_integrals 9 passed / 3 pre-existing xfail (unrelated CellWiseIntegral #172/#174).

test_evaluate_cache_stable_across_calls added as the #194 regression guard (aggregate cache behaviour + result-consistency, no wall-clock).

Underworld development team with AI support from Claude Code
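
A minimal sketch of the canonicalisation idea, using only public SymPy calls (Dummy, Symbol, srepr, atoms, xreplace). The helper name canonical_key and the dummy name "N_x" are hypothetical and chosen for illustration; the actual _expr_hash in underworld3 may differ in detail. The sketch also assumes that no Dummy shares a name with an ordinary Symbol in the same expression, since the replacement is by name.

```python
import sympy


def canonical_key(expr):
    """Cache key that ignores the dummy_index of any sympy.Dummy in expr."""
    # Swap each Dummy for a plain Symbol of the same name, for keying only;
    # the original expression is left untouched for the real computation.
    mapping = {d: sympy.Symbol(d.name) for d in expr.atoms(sympy.Dummy)}
    return sympy.srepr(expr.xreplace(mapping))


x = sympy.Symbol("x")

# Four "identical" requests, each minting a fresh Dummy as evaluate() did.
exprs = [x + sympy.Dummy("N_x") for _ in range(4)]

raw_keys = {sympy.srepr(e) for e in exprs}        # one key per call
stable_keys = {canonical_key(e) for e in exprs}   # a single shared key

print(len(raw_keys), len(stable_keys))
```

Keying on the raw srepr yields a distinct key for every request, which matches the reported cache growth of one entry per call; the canonical key collapses them to one.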

lmoresi · 2026-06-18T10:38:55Z

Diagnosis — bundles a clean cache fix with a version-counter design issue

The red CI here is not flaky — it's a real semantic collision. The 3 failures are all in the snapshot suite (test_0007_snapshot_inmemory, test_0008_snapshot_realsolver):

SnapshotInvalidatedError: _mesh_version moved from 0 to 1 since snapshot.
mesh.adapt() rebuild on restore is scheduled for v1.2; v1 refuses rather than corrupt the DOF arrays

The snapshot system treats any _mesh_version bump as a DOF-invalidating topology change and refuses to restore. This PR's self._mesh_version += 1 in _deform_mesh fires on node movement (ALE deform), which keeps the same topology — so the snapshot should still be restorable (coords are part of the payload). _mesh_version is overloaded across two different consumers:

kdtree navigation caches want a geometry version (bump on node movement)
the snapshot system wants a topology version (bump on adapt/rebuild); PR #188 (Invalidate evaluate/DMInterp/topology caches on Mesh._deform_mesh) already added _topology_version

Recommended resolution: give node-movement its own counter (e.g. have the kdtree nav caches gate on a _geometry_version that _deform_mesh bumps, and leave the snapshot-watched _mesh_version/_topology_version for true topology changes). That also cleanly provides the version hook #216's projector-cache fix needs.
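
The proposed split can be sketched with a toy model. Everything here is hypothetical (ToyMesh, take_snapshot, restore, SnapshotInvalidated); it only illustrates the suggested contract that node movement bumps _geometry_version while the snapshot check watches a counter that changes on topology rebuilds alone.

```python
class SnapshotInvalidated(Exception):
    """Toy analogue of the snapshot refusal quoted above."""


class ToyMesh:
    """Hypothetical mesh with separate geometry and topology counters."""

    def __init__(self, coords):
        self.coords = list(coords)
        self._mesh_version = 0      # topology changes only (snapshot-watched)
        self._geometry_version = 0  # any node movement (nav-cache-watched)

    def deform(self, new_coords):
        # Same topology, moved nodes: only the geometry counter changes.
        self.coords = list(new_coords)
        self._geometry_version += 1

    def adapt(self, new_coords):
        # Rebuild with a different node set: both counters change.
        self.coords = list(new_coords)
        self._mesh_version += 1
        self._geometry_version += 1


def take_snapshot(mesh):
    return {"mesh_version": mesh._mesh_version, "coords": list(mesh.coords)}


def restore(mesh, snap):
    # Refuse only when the topology counter has moved since the snapshot.
    if mesh._mesh_version != snap["mesh_version"]:
        raise SnapshotInvalidated("topology changed since snapshot")
    mesh.coords = list(snap["coords"])
    mesh._geometry_version += 1  # restoring coords is itself node movement


mesh = ToyMesh([0.0, 1.0])
snap = take_snapshot(mesh)

mesh.deform([0.0, 1.5])
restore(mesh, snap)              # allowed: topology is unchanged
restored_coords = list(mesh.coords)

mesh.adapt([0.0, 0.5, 1.0])
try:
    restore(mesh, snap)
    refused = False
except SnapshotInvalidated:
    refused = True

print(restored_coords, refused)
```

With this separation a deform no longer trips the snapshot check, while navigation caches keyed on _geometry_version would still see every coordinate change, including the restore itself.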

Splitting out the clean parts now

This PR also carries two independent, uncontroversial changes that don't touch _mesh_version:

the _expr_hash sympy.Dummy canonicalisation (fixes #194, "uw.function.evaluate() never hits the lambdify cache (fresh sympy.Dummy per call → +1 cache entry every call)": evaluate() minted a fresh Dummy per call so the lambdify cache never hit; a real correctness/perf bug)
removal of the flaky timing-based test_lambdify_caching

I'm extracting those into a separate small PR so they land for the release. Leaving this PR open for the geometry-vs-topology version split.

Underworld development team with AI support from Claude Code

lmoresi · 2026-06-18T10:39:41Z

Correction — the clean parts already landed independently

Scratch the extraction plan: development already has both clean changes via commit ca5c3ef ("Fix lambdify cache never hitting on evaluate() — closes #194"). The _expr_hash Dummy canonicalisation and the flaky test_lambdify_caching removal are present on current development.

So this PR's only remaining unique content is the _mesh_version += 1 + _restore_kdt/_restore_coords_id clear in _deform_mesh — i.e. exactly the change that collides with the snapshot invalidation semantics (the 3 red tests).

Re-scoping this PR to the version-counter design only. It needs the geometry-vs-topology split before it can land:

a _geometry_version (bumped by _deform_mesh on node movement) that the kdtree nav caches gate on, and
leave the snapshot-watched _mesh_version / _topology_version for true topology changes.

That also gives #216's projector-cache fix a clean staleness hook. I'd suggest a fresh branch off current development for that rather than rebasing this stale one. Leaving open as the tracking PR for the design.

Underworld development team with AI support from Claude Code

Copilot AI review requested due to automatic review settings May 15, 2026 22:36

Copilot started reviewing on behalf of lmoresi May 15, 2026 22:37

Copilot AI reviewed May 15, 2026

lmoresi added 2 commits May 25, 2026 12:16

lmoresi wants to merge 3 commits into development from bugfix/deform-cache-invalidation
