Skip to content

Refactor mesh pimpl#88

Open
Luochenghuang wants to merge 6 commits into
NanoComp:masterfrom
Luochenghuang:refactor-mesh-pimpl
Open

Refactor mesh pimpl#88
Luochenghuang wants to merge 6 commits into
NanoComp:masterfrom
Luochenghuang:refactor-mesh-pimpl

Conversation

@Luochenghuang
Copy link
Copy Markdown
Contributor

@Luochenghuang Luochenghuang commented May 30, 2026

Addresses a latent bug in the mesh PR (#74) wake, as well as an old bug (#82), both surfaced by the new GitHub Actions CI:

1. PIMPL refactor for the mesh struct. PR #74 hand-extended utils/ctlgeom-types.h with C-only fields (mesh_bvh_node, face_normals, bvh, is_closed, etc.) that don't have Scheme analogs in geom.scm. Any build with Guile detected regenerated the file from geom.scm and lost the internals, producing unknown type name 'mesh_bvh_node'. This commit applies PIMPL: the public mesh struct in ctlgeom-types.h now contains only Scheme-visible fields (vertices, face_indices as a vector3_list, and void *internal), while all C-only state moves into a new mesh_internal struct in utils/mesh-internal.h, accessed via a static inline mesh_priv(m) cast helper. ctlgeom-types.h goes back to being a regular auto-generated artifact derived from geom.scm. init_mesh allocates the cache and unpacks face_indices; mesh_copy eagerly rebuilds the BVH on copies (eliminating the previous shallow-pointer-share footgun); mesh_destroy frees the cache via mesh_internal_free. About 105 mechanical m->X to mesh_priv(m)->X rewrites in geom.c, 6 in test-mesh.c.

2. Eliminate slist_unique VLA in intersect_line_with_prism. this is an existing dead-store bug: the prism dedup loop wrote deduped values into a stack VLA and then did slist = slist_unique;, which is a no-op since slist is a pass-by-value pointer parameter. The symptom is that a thin spurious "inside" sliver of width ~1e-3*|s| plus a dropped real crossing on edge-grazing rays. test-prism didn't catch it at OMP=1 because the deterministic rand() sequence didn't statistically hit the 1e-3 window often enough, but parallel runs (OMP=8) shuffle the libc rand() order and trip it routinely. Fix: dedup in place with a write head iv and read head nv. This eliminates the VLA, eliminates the dead store, and the behavior is bit-identical to the dedup's original intent (slist[nv-1] still reads the original sorted value because writes go to slist[iv++] with iv <= nv).

Closes #86, closes #82.

Luochenghuang and others added 3 commits May 29, 2026 23:22
…-driven public struct

PR NanoComp#74 (mesh geometry import) hand-extended utils/ctlgeom-types.h with C-only internal fields (mesh_bvh_node, face_normals, face_areas, bvh, bvh_face_ids, is_closed, centroid, lengthscale, num_bvh_nodes, and a flat int* face_indices) because none of those had natural Scheme representations in geom.scm. That worked when Guile wasn't detected at configure time, but any from-git build with Guile installed (including the new GitHub Actions CI) regenerated ctlgeom-types.h from geom.scm, clobbered the hand-extensions, and failed to compile with 'unknown type name mesh_bvh_node'.

This commit applies the PIMPL (pointer-to-implementation) idiom to clean up that situation properly. The public mesh struct in ctlgeom-types.h now contains only Scheme-visible fields — vertices, face_indices (as a vector3_list, each vector3 packing one triangle's three int indices), and a void* internal pointer. All C-only state moves into mesh_internal in a new private header utils/mesh-internal.h, accessed via a static-inline mesh_priv(m) cast helper. ctlgeom-types.h goes back to being a regular auto-generated artifact derived from geom.scm; no Makefile.am workaround needed.

Lifecycle:
  - make_mesh / make_mesh_with_center: populate public fields, then call init_mesh which calloc()s mesh_internal, unpacks face_indices into a flat int[], computes face normals/areas, checks watertight closure, builds the BVH, and stores the pointer in m->internal.
  - reinit_mesh: fast-path returns when m->internal != NULL; otherwise calls init_mesh. Same single-threaded-init contract as before.
  - mesh_destroy (auto-generated): frees vertices.items + face_indices.items, then calls mesh_internal_free (defined in geom.c) to free the cache.
  - mesh_copy (auto-generated): deep-copies the public fields, sets internal=NULL, then calls mesh_init_internal (defined in geom.c) to eagerly rebuild the BVH on the copy. Eliminates the previous shallow-pointer-share footgun where two mesh copies pointed at the same BVH allocation.

geom.c gets ~105 mechanical m->X → mesh_priv(m)->X rewrites for the 10 internal fields (the cast inlines to nothing at runtime). test-mesh.c gets the same rewrite for 6 sites where tests poke directly at m->is_closed.

Both auto-generated files (ctlgeom-types.h and geom-ctl-io.c) now match what gen-ctl-io would emit from the simplified geom.scm, with one hand-tweak in geom-ctl-io.c: mesh_destroy and mesh_copy call into the externally-defined mesh_internal_free / mesh_init_internal so they can manage the opaque cache. This is the minimum manual surface required for any PIMPL design.

Verified: configure + make + make check all clean. test-prism passes (0/10000 point failures, 0/1000 segment failures, 0/65 prism helper failures). test-mesh passes 0 failures across all 19 cases including test_copy_destroy (which exercises mesh_copy + point queries on the copy + destroy of both original and copy).
The duplicate-removal step previously wrote deduped values into a stack VLA slist_unique[num_vertices+2] and then did 'slist = slist_unique;' before returning the new count. That assignment is a dead store: slist is a pass-by-value pointer parameter, so reassigning it only rebinds the local variable. The caller's buffer is never touched and slist_unique is destroyed when the function returns, so the caller reads the first num_unique_elements entries of the sorted-but-undeduped buffer instead of the actual deduped values.

In intersect_line_segment_with_prism this corrupts the inside/outside parity walk: a near-duplicate that the dedup loop intended to drop stays in the buffer (producing a phantom thin 'inside' sliver of width ~1e-3*|s|), and a real far-end crossing is silently dropped because the caller stops at the smaller returned count. The bug triggers when two s-values fall within 1e-3*|s| of each other — edge-grazing rays, rays through a shared vertex, near-tangent rays. test-prism's random-ray distribution didn't statistically hit the window often enough at OMP=1 to flag it, but parallel runs (e.g. OMP=8) shuffle the libc rand() sequence and hit the window often enough that the strict 1e-6 segment-length check fails.

The fix eliminates the VLA entirely. The dedup is done in place with a write head 'iv' and read head 'nv': because writes go to slist[iv++] with iv <= nv, position nv-1 is either untouched or was previously written with its own value as a no-op, so reading slist[nv-1] still yields the original sorted value and the comparison semantics are bit-identical to the original intent. Net effect: removes the dead store, removes a potentially huge stack allocation, and the dedup now actually applies to the caller's buffer.

Verified with --enable-openmp: test-prism passes at both OMP_NUM_THREADS=1 (0/10000 points failed, 0/1000 segments failed, 0/65 prism helper failures) and OMP_NUM_THREADS=8 across multiple trials. test-mesh remains clean at OMP=8 as well.
Comment thread utils/geom.c Outdated
static void material_type_copy(void **src, void **dest) { *dest = *src; }
#endif
#include "ctlgeom.h"
#include "mesh-internal.h"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this will break something like MPB whose build relies on copying the geom.c file over.

At some point we need to refactor MPB so that it can just link to libctlgeom, at which point we can split geom.c into multiple files, but for now it needs to be one giant file. 😢

The previous PIMPL commit placed the private mesh_bvh_node / mesh_internal definitions and the mesh_priv() accessor in a separate header utils/mesh-internal.h, included by geom.c. That breaks MPB, meep, and the libctl examples/ directory itself: all of those build by 'cp -f utils/geom.c their_tree/geom.c' and compile the copy in their own build context, without the companion header.

Until those downstreams are taught to link against libctlgeom directly (instead of copying geom.c), geom.c must remain self-contained. This commit inlines the private definitions directly at the top of geom.c (right after the existing includes) and removes the #include 'mesh-internal.h' line. The mesh-internal.h file is deleted.

To preserve test-mesh's ability to verify the closure check without exposing mesh_priv outside geom.c, the is_closed field is promoted to the public mesh struct via geom.scm. It is conceptually a derived public property anyway (users may want to query 'did init_mesh detect this mesh as watertight?'), and the auto-generated mesh_copy / mesh_equal handle it as a plain boolean value with no special code. ctlgeom-types.h and geom-ctl-io.c are updated to match. test-mesh.c drops the #include 'mesh-internal.h' line and reverts mesh_priv(m)->is_closed back to m->is_closed.

Build clean. test-prism and test-mesh both pass at OMP_NUM_THREADS=1 and 8: test-mesh 0 failures (including 0 mismatches in test_cube_vs_block and test_cube_segments_vs_block); test-prism 0/10000 point failures, 0/1000 segment failures, 0/65 prism helper failures.
@Luochenghuang
Copy link
Copy Markdown
Contributor Author

Inlined the private mesh internals into geom.c and dropped mesh-internal.h, so the file stays self-contained for cp-based downstreams. Promoted is_closed to the public mesh struct via geom.scm so test-mesh still has closure-check access.

Comment thread utils/geom-ctl-io.c
}

/* Defined in geom.c; eagerly rebuilds the opaque mesh_internal cache. */
extern void mesh_init_internal(mesh *m);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this prototype appear in the file when it is re-generated?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. It wouldn't appear. gen-ctl-io has no knowledge of these PIMPL hooks. The follow-up commit explicitly removes the regen rule, so the file is now hand maintained.

The PIMPL refactor added two hand-tweaks to utils/geom-ctl-io.c that gen-ctl-io would not emit: extern declarations for mesh_init_internal and mesh_internal_free, plus calls to them inside the auto-generated mesh_copy and mesh_destroy bodies. These hooks complete the opaque mesh_internal lifecycle (eagerly rebuild the cache on mesh_copy so the copy has its own state; free the cache on mesh_destroy).

But geom-ctl-io.c was still classified as a BUILT_SOURCE with an active regen rule. On any build that detects Guile (WITH_GUILE=true, true on the new GitHub Actions CI), the regen rule sed-rewrites geom-ctl-io.c from ctl-io.c and silently overwrites the hand-tweaks. Compilation still succeeds, but the runtime contract breaks: mesh_destroy stops freeing the internal cache (silent leak per mesh) and mesh_copy reverts to a shallow pointer copy (two meshes sharing one mesh_internal allocation, with a dangling-pointer footgun on first destroy). PR NanoComp#74 didn't expose this because legacy Guile detection in configure.ac fails on modern distros so WITH_GUILE was effectively false; the new CI is the first build where Guile is actually detected.

This commit removes geom-ctl-io.c from BUILT_SOURCES, deletes its regen rule, adds it to EXTRA_DIST so it ships in tarballs, and drops it from clean-local so 'make clean' doesn't wipe it. ctlgeom-types.h stays as-is because the PIMPL refactor doesn't hand-edit it (the regenerated content matches the committed content byte-for-byte).

Verified: 'make -n geom-ctl-io.c' reports 'Nothing to be done' (regen rule successfully gone). test-mesh and test-prism both pass cleanly at OMP_NUM_THREADS=1 and 8.
PR NanoComp#74 committed both ctlgeom-types.h and geom-ctl-io.c, but the versions it shipped were generated from a pre-PR-NanoComp#75 geom.scm that still had the prism workspace field. After NanoComp#75 removed workspace from geom.scm and from the C code in geom.c, the committed copies became stale. Without-Guile builds silently kept compiling because the two stale files were consistent with each other (prism_copy/destroy/equal operated on a workspace field that nothing else touched). With-Guile builds papered over it by regenerating ctlgeom-types.h to match the post-NanoComp#75 geom.scm, which silently differed from the committed copy.

The previous commit (treat geom-ctl-io.c as checked-in) exposed the inconsistency by removing the regeneration safety net: with WITH_GUILE=true, ctlgeom-types.h gets regenerated fresh from geom.scm (no workspace, no number_list typedef), while the committed geom-ctl-io.c still tried to access prism->workspace, producing 'has no member named workspace' compile errors in prism_copy / prism_equal / prism_destroy.

This commit aligns both committed files with current geom.scm:
  - ctlgeom-types.h: replaced with the gen-ctl-io output. Drops the unused number_list typedef and the workspace field from prism struct. Same mesh struct as before (vertices, face_indices, is_closed, internal).
  - geom-ctl-io.c: removed the stale workspace handling from prism_copy / prism_equal / prism_destroy. Kept the PIMPL hand-tweaks (extern mesh_init_internal / mesh_internal_free + their calls in mesh_copy / mesh_destroy) since the file is no longer auto-regenerated.

Verified locally in both build modes:
  - WITHOUT Guile (WITH_GUILE=false, no regen): make + make check + OMP=1/8 clean.
  - WITH Guile (WITH_GUILE=true, ctlgeom-types.h regenerates and matches committed): make + make install + examples/ build (which exercises the cp utils/geom.c -> examples/geom.c MPB-style copy path) + make check + OMP=1/8 clean.
@stevengj
Copy link
Copy Markdown
Collaborator

stevengj commented Jun 4, 2026

I would like to keep the files autogenerated (and removed from git) rather than checking them and doing manual tweaking of the code-generation output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

compilation failre with new mesh feature intersect_line_with_prism: slist dedup never writes back

2 participants