_deform_mesh: bump _mesh_version + clear runner nav caches#191
Conversation
Two cache-invalidation gaps in Mesh._deform_mesh, both exposed by direct _deform_mesh(coords) calls (e.g. every free-surface RK stage in the convection benchmark), which bypass the mesh.X.coords NDArray callback that normally performs this hygiene: 1. _mesh_version was not incremented. PR #182 introduced version-keyed kdtree navigation caches (_BaseMeshVariable._get_kdtree, Mesh._get_domain_kdtree) that gate their rebuild on _mesh_version. The mesh.X.coords callback bumps it; direct _deform_mesh() did not. Result: navigation kdtrees stay frozen on the undeformed mesh, so spatial lookups return pre-deform DOFs after the geometry has moved. PR #188 added _topology_version invalidation here but missed _mesh_version. 2. User-installed coord-identity nav caches were not cleared. A runner's restore_points_to_domain typically caches a kdtree keyed on id(mesh.X.coords). _deform_mesh replaces self._coords with a new object, but CPython reuses freed ids, so a fresh coords array can collide with the old id() and the staleness check false-negatives. Explicitly clearing _restore_kdt / _restore_coords_id defeats the id()-reuse hazard. Verified: _mesh_version now increments on a direct _deform_mesh() call (was frozen at 0 before). Matches the cache hygiene mesh.adapt() and _legacy_access already perform; brings _deform_mesh into line. Independent correctness fix; does not by itself resolve the separate free-surface convection feedback regression under investigation. Underworld development team with AI support from Claude Code
There was a problem hiding this comment.
Pull request overview
This PR closes two cache invalidation gaps in Mesh._deform_mesh() that occur when _deform_mesh(coords) is called directly (bypassing the mesh.X.coords NDArray callback), ensuring version-gated navigation caches and runner-installed navigation helpers don’t remain stale after mesh deformation.
Changes:
- Increment
self._mesh_versioninside_deform_mesh()so version-keyed KDTree navigation caches rebuild on geometry updates. - Clear runner/user-installed navigation cache attributes (
_restore_kdt,_restore_coords_id) to avoid stale reuse due to CPythonid()reuse.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Bump the geometry-version counter so version-keyed | ||
| # kdtree navigation caches rebuild against the new DOF | ||
| # positions: _BaseMeshVariable._get_kdtree and | ||
| # Mesh._get_domain_kdtree both gate their rebuild on | ||
| # `_mesh_version`. PR #182 introduced those version-keyed | ||
| # caches; the mesh.X.coords callback path bumps | ||
| # _mesh_version, but direct _deform_mesh() calls (every | ||
| # free-surface RK stage) bypass that callback. Without | ||
| # this bump the navigation kdtrees stay frozen on the | ||
| # undeformed mesh — back-advected SL samples land at the | ||
| # wrong DOFs, corrupting the temperature field. PR #188 | ||
| # added _topology_version invalidation but missed this. | ||
| self._mesh_version += 1 | ||
| # Also nuke any *user-installed* navigation caches that |
test_lambdify_caching and test_rbf_false_not_slow asserted one-off wall-clock timings (time2 <= time1*2, elapsed < 0.01s). These are inherently flaky on shared CI runners and test_lambdify_caching has been failing CI on unrelated PRs (0.0107s vs 0.0049s runner jitter). Replaced with deterministic, timing-free behaviour tests: - test_lambdify_cache_hit: exercises get_cached_lambdified directly -- identical request returns the SAME function object and adds no entry (true cache hit); a distinct expression is cached separately. This is the cache mechanism's actual contract. - test_rbf_modes_consistent: rbf=True and rbf=False must agree for a pure-sympy expression (meaningful, timing-free; replaces the "rbf=False not slow" wall-clock check). Note: writing the cache-hit test surfaced that the high-level uw.function.evaluate() path does NOT hit the lambdify cache on repeated identical calls -- _expr_hash(srepr) differs every call, so the cache grows by one entry per call and never returns a hit. The old loose timing tolerance was masking this. Not fixed here (evaluate/lambdify is a performance-critical hot path needing separate benchmarking); see PR description. Full module: 19 passed. Underworld development team with AI support from Claude Code
_expr_hash used sympy.srepr(expr), which embeds the volatile global dummy_index of any sympy.Dummy. The evaluate() coordinate-substitution path mints a fresh Dummy per call, so an otherwise-identical expression hashed differently every call and the lambdify cache never matched -- _lambdify_cache grew one entry per call (1,2,3,4,...) and sympy.lambdify recompiled every time on a hot path. Fix: canonicalise Dummy -> name-stable Symbol in _expr_hash before srepr. This changes only the cache *key*; the real sympy.lambdify() call still uses the original expr/symbols, so numerics are unchanged. The cache key separately carries the symbol-name tuple, so name-keying is safe and deterministic. Verified: repeated identical evaluate() now holds cache size flat ([1,1,1,...] vs [1,2,3,...] before). test_0720 module 19 passed; test_0501_integrals 9 passed / 3 pre-existing xfail (unrelated CellWiseIntegral #172/#174). test_evaluate_cache_stable_across_calls added as the #194 regression guard (aggregate cache behaviour + result-consistency, no wall-clock). Underworld development team with AI support from Claude Code
Diagnosis — bundles a clean cache fix with a version-counter design issueThe red CI here is not flaky — it's a real semantic collision. The 3 failures are all in the snapshot suite ( The snapshot system treats any
Recommended resolution: give node-movement its own counter (e.g. have the kdtree nav caches gate on a Splitting out the clean parts nowThis PR also carries two independent, uncontroversial changes that don't touch
I'm extracting those into a separate small PR so they land for the release. Leaving this PR open for the geometry-vs-topology version split. Underworld development team with AI support from Claude Code |
Correction — the clean parts already landed independentlyScratch the extraction plan: So this PR's only remaining unique content is the Re-scoping this PR to the version-counter design only. It needs the geometry-vs-topology split before it can land:
That also gives #216's projector-cache fix a clean staleness hook. I'd suggest a fresh branch off current Underworld development team with AI support from Claude Code |
Summary
Two cache-invalidation gaps in
Mesh._deform_mesh, both triggered by direct_deform_mesh(coords)calls (which bypass themesh.X.coordsNDArray callback that normally performs this hygiene — e.g. every free-surface RK stage in the convection benchmark):_mesh_versionnot bumped. PR Consolidate and unify cached spatial indexing (KDTree) #182 introduced version-keyed kdtree navigation caches (_BaseMeshVariable._get_kdtree,Mesh._get_domain_kdtree) gated on_mesh_version. Themesh.X.coordscallback bumps it; direct_deform_mesh()did not — so navigation kdtrees stayed frozen on the undeformed mesh. PR Invalidate evaluate/DMInterp/topology caches on Mesh._deform_mesh #188 added_topology_versioninvalidation here but missed_mesh_version.restore_points_to_domaincaches a kdtree keyed onid(mesh.X.coords)._deform_meshreplacesself._coords, but CPython reuses freed ids → a fresh array can collide with the old id() and the staleness check false-negatives. Explicitly clearing_restore_kdt/_restore_coords_iddefeats the id()-reuse hazard.Brings
_deform_meshinto line with the cache hygienemesh.adapt()and_legacy_accessalready perform.Test plan
_mesh_versionincrements on a direct_deform_mesh()call (was frozen at 0 before the fix)Note: this is an independent correctness fix. It does not by itself resolve the separate free-surface convection feedback regression under investigation (verified — both fixes present in a clean build still reproduce the damped regime).
Underworld development team with AI support from Claude Code