[nanvix] E: Phase 4 — build lxml C extensions as .so (unbundled)#11

Open
esaurez wants to merge 1 commit into feat/phase3-tier3-external-shared from feat/phase4-tier4-lxml-shared

@esaurez esaurez commented Jun 3, 2026

Summary

Phase 4 of the .a → .so migration (roadmap). Switches the lxml extension chain from being statically bundled into python.elf to being loaded at runtime via dlopen of liblxml_etree.so, with a proper DT_NEEDED chain pulling in libxslt.so, libexslt.so, and libxml2.so transitively.

Architecture

_lxml_etree.cpython-312.so          ~10 KB shim   (dlopen + dlsym PyInit_etree)
  └─> liblxml_etree.so              1.7 MB        (lxml Cython output)
        ├─> libxslt.so              296 KB
        │     └─> libxml2.so        1.5 MB        (libz embedded)
        └─> libexslt.so             92 KB
              ├─> libxslt.so        (shared with above)
              └─> libxml2.so        (shared with above)

The shim modules (Modules/lxml_etree_builtin.c, Modules/lxml_elementpath_builtin.c) are tiny C wrappers that:

  1. dlopen(RTLD_NOW | RTLD_GLOBAL) the underlying .so from /lib/python3.12/lib-dynload/.
  2. dlsym PyInit_etree / PyInit__elementpath.
  3. Forward the call. Leak the handle intentionally (process-lifetime); dlclose only on the failure path.

The Setup.local entries (*shared* block) consume only the shim .c files — no -L/-l flags, no --whole-archive, no MODLIBS piggyback. All resolution happens at dlopen time against python.elf's .dynsym (which already exports the full POSIX/libc/libm runtime via PR #1).
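
As a rough sketch of what "shim .c files only" means for the generated Setup.local, the snippet below renders a `*shared*` block with bare module entries. The `SHIM_MODULES` table and `render_shared_block` helper are hypothetical, and the `_lxml_elementpath` line is assumed by analogy with the `_lxml_etree` one; the actual template in .nanvix/lxml.py may differ.

```python
# Hypothetical module table: (extension module name, shim source file).
SHIM_MODULES = [
    ("_lxml_etree", "lxml_etree_builtin.c"),
    ("_lxml_elementpath", "lxml_elementpath_builtin.c"),  # assumed by analogy
]


def render_shared_block(modules):
    """Return Setup.local text that builds each shim as a shared module."""
    lines = ["*shared*"]
    for name, source in modules:
        # Bare "<module> <source>" entries: no -L/-l, no --whole-archive.
        lines.append(f"{name} {source}")
    return "\n".join(lines) + "\n"


print(render_shared_block(SHIM_MODULES))
```

Keeping link flags out of these entries is what defers all symbol resolution to dlopen time.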

Size impact

| Artifact | Before (Phase-4 static) | After (Phase-4 dlopen) |
| --- | --- | --- |
| python.elf | 12.26 MB | 9.98 MB |
| _lxml_etree.cpython-312.so | 3.78 MB (vendored copy) | 10 KB shim |
| liblxml_etree.so | (none) | 1.7 MB (shared, on-disk) |

~2.3 MB reclaimed from the binary; the lxml code now ships once on disk rather than being duplicated per-process.
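
The "~2.3 MB" figure follows directly from the two python.elf sizes in the table above; a quick check, using only those two numbers:

```python
# python.elf sizes in MB, taken from the size table in this PR.
PYTHON_ELF_BEFORE_MB = 12.26
PYTHON_ELF_AFTER_MB = 9.98

reclaimed_mb = round(PYTHON_ELF_BEFORE_MB - PYTHON_ELF_AFTER_MB, 2)
reclaimed_pct = round(100 * reclaimed_mb / PYTHON_ELF_BEFORE_MB, 1)

print(f"reclaimed: {reclaimed_mb} MB ({reclaimed_pct}% of the old binary)")
```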

What changed

| File | Change |
| --- | --- |
| Modules/lxml_etree_builtin.c | Rewrote as dlopen-shim (was --whole-archive-embedded module). Leaks handle on success, dlclose on dlsym failure. |
| Modules/lxml_elementpath_builtin.c | Same pattern for _elementpath. |
| .nanvix/lxml.py | _SETUP_LOCAL_TEMPLATE rewritten to bare `_lxml_etree lxml_etree_builtin.c` (no link flags). generate_setup_local keeps its signature for source-compat with build.py. |
| .nanvix/docker.py | _generate_setup_local_cmd mirrors lxml.py; the _nanvix line no longer carries MODLIBS-piggyback flags. |
| .nanvix/package.py | Stages liblxml_etree.so, liblxml_elementpath.so, libxslt.so, libexslt.so, libxml2.so into the release tarball's lib/python3.12/lib-dynload/. Hard-fails with FileNotFoundError if any are missing. |
| .nanvix/test.py | Same staging + hard-fail for the test ramfs. |
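
A minimal sketch of the stage-or-fail behaviour described for package.py and test.py, assuming a simple copy from a build directory into lib-dynload. The function name `stage_shared_libs` and its arguments are hypothetical; only the list of libraries and the FileNotFoundError behaviour come from this PR.

```python
import shutil
from pathlib import Path

# Shared objects that this PR says must be staged together.
REQUIRED_LIBS = [
    "liblxml_etree.so",
    "liblxml_elementpath.so",
    "libxslt.so",
    "libexslt.so",
    "libxml2.so",
]


def stage_shared_libs(src_dir, dest_dir, names=REQUIRED_LIBS):
    """Copy every required .so into dest_dir, or raise FileNotFoundError."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    # Check everything first so a failure never leaves a half-staged tree.
    missing = [n for n in names if not (src_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing shared libraries: {', '.join(missing)}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    for name in names:
        shutil.copy2(src_dir / name, dest_dir / name)
    return [dest_dir / n for n in names]
```

Failing before any copy happens keeps a broken build from producing a tarball or ramfs that looks complete but cannot import lxml.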

Validation

End-to-end test on nanvix-dev:

STEP_1:python_started (3, 12, 3)
STEP_2:about to import _lxml_etree (chain: libxslt+libexslt+libxml2)
STEP_3:_lxml_etree imported /lib/python3.12/lib-dynload/_lxml_etree.cpython-312.so
STEP_4:about to import lxml.etree
STEP_5:lxml.etree imported /lib/python3.12/lib-dynload/_lxml_etree.cpython-312.so
STEP_6:parsed XML, root= root child= lxml-ok
LXML_CHAIN_PASS
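
The probe script that produced this output is not included in the PR. A hedged sketch of a similar import probe is shown below; `probe_import` and the `LXML_CHAIN_FAIL` marker are hypothetical, and the sketch assumes that a failed dlopen inside the shim surfaces as ImportError.

```python
import importlib


def probe_import(module_name):
    """Try to import module_name and report where it was loaded from."""
    print(f"about to import {module_name}")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Assumption: a failed dlopen in the shim surfaces as ImportError.
        print(f"FAIL {module_name}: {exc}")
        return False
    origin = getattr(module, "__file__", "<built-in>")
    print(f"{module_name} imported {origin}")
    return True


if __name__ == "__main__":
    # Stops at the first failing import, since all() short-circuits.
    ok = all(probe_import(name) for name in ("_lxml_etree", "lxml.etree"))
    print("LXML_CHAIN_PASS" if ok else "LXML_CHAIN_FAIL")
```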

The 3-deep DT_NEEDED chain plus the diamond at libexslt.so → {libxslt, libxml2} and liblxml_etree.so → {libxslt, libexslt, libxml2} resolves cleanly thanks to the loader fixes in esaurez/nanvix#27 and esaurez/nanvix#28.
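
To make the diamond concrete, the sketch below walks the DT_NEEDED edges described in this PR and produces a dependencies-first order in which every library appears exactly once. It is an illustrative depth-first traversal, not the Nanvix loader's actual algorithm; `NEEDED` and `load_order` are hypothetical names.

```python
# DT_NEEDED edges as described in this PR.
NEEDED = {
    "liblxml_etree.so": ["libxslt.so", "libexslt.so", "libxml2.so"],
    "libexslt.so": ["libxslt.so", "libxml2.so"],
    "libxslt.so": ["libxml2.so"],
    "libxml2.so": [],
}


def load_order(root, needed):
    """Return libraries in dependencies-first order, each exactly once."""
    order, seen = [], set()

    def visit(lib):
        if lib in seen:  # diamond: already reached via another parent
            return
        seen.add(lib)
        for dep in needed.get(lib, []):
            visit(dep)
        order.append(lib)  # appended only after all of its dependencies

    visit(root)
    return order


print(load_order("liblxml_etree.so", NEEDED))
```

The `seen` set is what handles the diamond: libxslt.so and libxml2.so are reachable through several parents but are only visited once.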

Independent regrtest run: 122 / 185 PASS, 0 regressions vs the pre-rework baseline. The remaining 63 failures (44 ERROR + 19 TIMEOUT) are all pre-existing Nanvix issues independent of this PR (see nanvix-todo/cpython-libregrtest-compile-time-import.md and the unittest-teardown hang).

Code review

A code-review pass found three issues — all fixed in the commit pushed here:

  1. (Critical) .nanvix/lxml.py's template still carried the old MODLIBS-piggyback -L/-l flags, which would silently break local (non-docker) builds.
  2. (Medium) dlopen handle was leaked on the dlsym-failure path; now dlclose'd.
  3. (Medium) test.py / package.py previously printed WARNING and continued when required .so files were missing, producing silently broken artifacts. Now they raise FileNotFoundError.

Runtime dependencies (must ship together)

Build-time dependencies (in this repo)

Sequenced rollout

  1. Merge nanvix#27 ([syscall] E: Run dlopen ctors/dtors and DT_RUNPATH) and nanvix#28 ([syscall] B: Fix diamond DT_NEEDED handling) → cut a nanvix release.
  2. Merge libxml2#1 ([build] E: Build libxml2.so alongside libxml2.a) → cut release.
  3. Bump the pin in libxslt#1 ([build] E: Build libxslt.so and libexslt.so) to the new libxml2 → merge → cut release.
  4. Bump the pins in lxml#1 ([build] E: Build lxml C extensions as .so) to the new libxml2 + libxslt → merge → cut release.
  5. Bump this PR's pin to the new lxml release → merge.

Until step 5, this repo's CI can still produce a working python.elf against the existing static-only release tarballs (the shim modules degrade to dlopen-fails-at-import, which is a self-contained failure rather than a build break).

@esaurez esaurez force-pushed the feat/phase3-tier3-external-shared branch from 1ccd71d to 3c60ab2 Compare June 3, 2026 22:34
@esaurez esaurez force-pushed the feat/phase4-tier4-lxml-shared branch from 7c7e39f to 453b1e3 Compare June 3, 2026 22:34
@esaurez esaurez force-pushed the feat/phase3-tier3-external-shared branch from 3c60ab2 to 81de428 Compare June 3, 2026 22:40
@esaurez esaurez force-pushed the feat/phase4-tier4-lxml-shared branch 2 times, most recently from 19e231b to 399a6d6 Compare June 3, 2026 23:20
@esaurez esaurez changed the title [nanvix] E: Phase 4 — build lxml C extensions as .so [nanvix] E: Phase 4 — build lxml C extensions as .so (unbundled) Jun 3, 2026
esaurez pushed a commit to esaurez/lxml that referenced this pull request Jun 4, 2026
Produce position-independent liblxml_etree.so and
liblxml_elementpath.so alongside the existing static archives,
wired as a real DT_NEEDED chain on top of esaurez/libxml2 +
esaurez/libxslt:

  liblxml_etree.so       -> NEEDED libxslt.so, libexslt.so, libxml2.so
  liblxml_elementpath.so -> (pure-Cython, no native deps)

Only the cython-generated lxml.etree.c is embedded in
liblxml_etree.so; libxslt, libxml2, and libz live in their own
.so files and are pulled in transitively by the Nanvix dynamic
loader at dlopen time. This exercises the DT_NEEDED chain support
shipped in esaurez/nanvix#27 in a real-world setting and
eliminates the multi-megabyte per-module duplication that a
self-contained build would cause.

Concretely:

* `-fPIC` is added to the per-source compile commands, so the
  same .o files are usable for both .a and .so.
* Two new SHAREDLIB targets link via `-shared -fPIC -nostdlib
  -Wl,--whole-archive <own>.a -Wl,--no-whole-archive [-lxslt
  -lexslt -lxml2]`, setting DT_SONAME=liblxml_etree.so /
  DT_SONAME=liblxml_elementpath.so.
* `.nanvix/z.py` `output_files` and the Makefile's `package` /
  `verify-package` targets ship both the static and shared
  variants.

Sizes (stripped, DT_NEEDED chain vs the discarded self-contained
prototype):

  liblxml_etree.so       1.7 MB (was 3.5 MB)
  liblxml_elementpath.so 157 KB (was 153 KB; pure-Cython, no deps)

Runtime dependencies:

* esaurez/nanvix#27 — `.init_array` invocation + DT_NEEDED chain
  walking in the user-space loader.
* esaurez/libxml2#1 + esaurez/libxslt#1 — libxml2.so, libxslt.so,
  and libexslt.so must be present in the buildroot. This implies
  a sequenced rollout: merge libxml2#1 -> release -> bump libxslt's
  pin -> merge libxslt#1 -> release -> bump this PR's pins ->
  merge this PR.

End-to-end validation (DT_NEEDED chain resolved by the Nanvix
loader: liblxml_etree.so -> libxslt.so -> libxml2.so) will land
in a follow-up against esaurez/cpython#11. CPython's Phase 4 will
switch from the MODLIBS-piggyback workaround to a clean dlopen
of liblxml_etree.so, letting python.elf shrink by ~3 MB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 4 of the .a -> .so migration (see
nanvix-todo/cpython-static-to-shared-migration.md section 8).
Promotes the 2 lxml C extension modules (_lxml_etree,
_lxml_elementpath) from statically linked into python.elf to
dlopen-loaded shared objects, and unbundles their underlying C
archives (liblxml_etree, liblxml_elementpath, libxslt, libexslt,
libxml2) into python.elf so each .so stays a thin shim.

Modules/Setup.local changes (.nanvix/docker.py):

- _lxml_etree and _lxml_elementpath move from *static* to *shared*
  with no per-module -L/-l flags (no bundling).
- The lxml C archives are attached to the _nanvix static-module
  line via `-Wl,--whole-archive liblxml_etree.a liblxml_elementpath.a
  libxslt.a libexslt.a libxml2.a -Wl,--no-whole-archive`. CPython's
  Setup processor places these in MODLIBS, so they flow into the
  python.elf link command but not into autoconf conftest links
  (which lack libpython3.12.a and therefore cannot resolve the
  Python C API symbols those archives reference).
- python.elf re-exports the lxml symbols via --export-dynamic, and
  the .so shims resolve PyInit_etree / PyInit_elementpath via the
  main exe's .dynsym at dlopen time.

Both shim files (lxml_etree_builtin.c, lxml_elementpath_builtin.c)
are unchanged. The lxml/etree.py Python-level re-export shim
continues to work because Python's import system finds the .so
files in lib/python3.12/lib-dynload/ at import time.

Size impact (vs Phase 3 baseline, stripped):

- _lxml_etree.cpython-312.so: ~3 MB -> 8.3 KB (-3 MB)
- _lxml_elementpath.cpython-312.so: ~3 MB -> 8.3 KB (-3 MB)
- python.elf: 8.48 MB -> 12.26 MB (+3.78 MB)

Net: small overall growth (the shared lxml code now lives in
python.elf, but each .so is now ~8 KB instead of ~3 MB, matching
the canonical upstream CPython model).

Validation:

- z build PASS (configure conftest unaffected by the MODLIBS
  trick — verified by inspecting config.log)
- z test PASS (lxml smoke + 160/160 regrtest modules)
- lxml import via dlopen confirmed by phase test probe output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>