[nanvix] E: Phase 1C — build 8 Tier-1 text-codec modules as .so#13

Open
esaurez wants to merge 1 commit into feat/phase1b-tier1-mathmem-shared from feat/phase1c-tier1-codecs-shared
@esaurez esaurez commented Jun 3, 2026

Summary

Phase 1C of the .a → .so migration. Promotes the 8 Tier-1 "text codec" stdlib extension modules from being statically linked into python.elf to dlopen-loaded shared objects: unicodedata, _multibytecodec, _codecs_cn, _codecs_hk, _codecs_iso2022, _codecs_jp, _codecs_kr, _codecs_tw.

Note: This PR replaces the original #8, which auto-closed when its base branch (feat/phase1b-drop-libm-from-math-so) was deleted. That branch was removed when the drop-libm work was folded into PR #6, following esaurez's review preference for consolidating the .so move and the lib-resolution changes into single PRs.

Size impact

  • python.elf: 19.18 MB → 17.48 MB (−1.70 MB, biggest single-phase reduction at the time).
  • unicodedata.so is 1193 KB (Unicode database tables).
  • Other 7 modules total ~1.0 MB.
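
As a quick cross-check of these totals, the per-module sizes quoted in the commit message's validation notes can be summed directly. This is only a sketch: it assumes the "K" figures are KiB and that 1 MB here means 1024 KiB, and the variable names are illustrative.

```python
# Per-module .so sizes in KiB, as quoted in the validation notes.
SIZES_KIB = {
    "unicodedata": 1193,
    "_codecs_jp": 262,
    "_codecs_hk": 168,
    "_codecs_cn": 155,
    "_codecs_kr": 145,
    "_multibytecodec": 147,
    "_codecs_tw": 115,
    "_codecs_iso2022": 76,
}

# Everything except the Unicode database module.
others_kib = sum(size for name, size in SIZES_KIB.items() if name != "unicodedata")
total_kib = sum(SIZES_KIB.values())

# python.elf shrink between Phase 1B and Phase 1C, in MB.
elf_delta_mb = round(19.18 - 17.48, 2)

print(f"other 7 modules: {others_kib} KiB (~{others_kib / 1024:.1f} MB)")
print(f"all 8 modules:   {total_kib} KiB (~{total_kib / 1024:.1f} MB)")
print(f"python.elf delta: {elf_delta_mb} MB")
```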

Validation

All of the following pass: full regrtest (160/160), lxml, HTTP smoke, and the Phase 1A/1B/1C import probes.

Prerequisites

Stacked on Phase 1B (esaurez/cpython#6), which now includes the libm unbundling formerly in PR #7. No other prerequisites beyond what Phase 1 already requires.

Phase 1C of the .a -> .so migration (see
nanvix-todo/cpython-static-to-shared-migration.md section 5).
Builds on Phase 1B (#6, #7) by promoting the remaining 8 Tier-1
"text codec" stdlib extension modules from statically linked into
python.elf to dlopen-loaded shared objects under
lib/python3.12/lib-dynload/.

Modules moved to *shared* in Modules/Setup.local generation
(.nanvix/docker.py):

- unicodedata: Unicode database lookups (the largest of the eight, with
  about 1.2 MB of Unicode data tables).
- _multibytecodec: shared CJK codec infrastructure.
- _codecs_cn / _codecs_hk / _codecs_iso2022 / _codecs_jp /
  _codecs_kr / _codecs_tw: per-region CJK codec tables.
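
For illustration, a `*shared*` section for these eight modules could be rendered roughly as follows. This is a hedged sketch, not the actual code in .nanvix/docker.py: the helper name `render_shared_section` and the `PHASE1C_MODULES` table are hypothetical, and the source paths are assumed to mirror the commented-out entries in upstream CPython's Modules/Setup.

```python
# Hypothetical table: module name -> source path relative to Modules/.
# Paths are assumed to match upstream CPython's Modules/Setup entries.
PHASE1C_MODULES = {
    "unicodedata": "unicodedata.c",
    "_multibytecodec": "cjkcodecs/multibytecodec.c",
    "_codecs_cn": "cjkcodecs/_codecs_cn.c",
    "_codecs_hk": "cjkcodecs/_codecs_hk.c",
    "_codecs_iso2022": "cjkcodecs/_codecs_iso2022.c",
    "_codecs_jp": "cjkcodecs/_codecs_jp.c",
    "_codecs_kr": "cjkcodecs/_codecs_kr.c",
    "_codecs_tw": "cjkcodecs/_codecs_tw.c",
}


def render_shared_section(modules):
    """Return Setup-style text: a *shared* marker, then one line per module."""
    lines = ["*shared*"]
    for name, source in modules.items():
        lines.append(f"{name} {source}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_shared_section(PHASE1C_MODULES), end="")
```

In a Setup file, every module listed after the `*shared*` marker is built as a shared object rather than linked into the interpreter binary.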

None of the eight reference external libraries; they are pure C
with embedded data tables. They link against the same -lc /
runtime symbols that the rest of the Phase 1 modules use.

Test coverage (.nanvix/test.py):

- New phase1c_snippet imports each module, asserts it is NOT in
  sys.builtin_module_names, exercises one trivial API call to
  confirm dlopen + PyInit_<name> succeeded (unicodedata.lookup,
  _multibytecodec.__create_codec, _codecs_<region>.getcodec), and
  prints the resolved __file__ path. Phase 1A/1B probes retained.
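
A probe along these lines might look like the sketch below. It is not the actual phase1c_snippet from .nanvix/test.py: the `probe` function, the `require_shared` flag, and the choice of one codec name per region module are assumptions made for illustration. For `_multibytecodec` the sketch only checks that `__create_codec` is callable, since calling it directly requires a codec capsule.

```python
import importlib
import sys

# One representative codec name per region module (assumed choice).
REGION_CODECS = {
    "_codecs_cn": "gb2312",
    "_codecs_hk": "big5hkscs",
    "_codecs_iso2022": "iso2022_jp",
    "_codecs_jp": "shift_jis",
    "_codecs_kr": "euc_kr",
    "_codecs_tw": "big5",
}


def probe(name, require_shared=False):
    """Import a module, exercise one trivial API, and report where it came from."""
    mod = importlib.import_module(name)
    is_builtin = name in sys.builtin_module_names
    if require_shared:
        # On the target build, the module must not be statically linked.
        assert not is_builtin, f"{name} is still statically linked"
    if name == "unicodedata":
        assert mod.lookup("LATIN SMALL LETTER A") == "a"
    elif name == "_multibytecodec":
        assert callable(getattr(mod, "__create_codec"))
    else:
        assert mod.getcodec(REGION_CODECS[name]) is not None
    # Builtin modules have no __file__, so fall back to None.
    return {"name": name, "builtin": is_builtin, "file": getattr(mod, "__file__", None)}


if __name__ == "__main__":
    for module_name in ["unicodedata", "_multibytecodec", *REGION_CODECS]:
        print(probe(module_name))
```

On the target build, passing `require_shared=True` would enforce the "not in sys.builtin_module_names" check; leaving it off lets the same probe run on an interpreter where some of these modules are still built in.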

Validation on local toolchain (phase0-llfix):

- All 8 new .so files produced and installed under lib-dynload/
  (unicodedata 1193K, _codecs_jp 262K, _codecs_hk 168K,
  _codecs_cn 155K, _codecs_kr 145K, _multibytecodec 147K,
  _codecs_tw 115K, _codecs_iso2022 76K — total ~2.2 MB across
  the eight files).
- nm python.elf no longer shows PyInit_<name> for any of the 8.
- python.elf size: 19.18 MB (Phase 1B) -> 17.48 MB (Phase 1C),
  -1.70 MB. Biggest single-phase reduction so far because the
  CJK codec tables and the Unicode database are large.
- Hello + Phase 1A + Phase 1B + Phase 1C import probes + lxml +
  HTTP smoke + full regrtest 160/160 PASS in standalone mode.
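
The nm check above can be automated with a small filter over the symbol listing. The following is a sketch under stated assumptions: the helper name `leftover_pyinit_symbols` is hypothetical, and the sample addresses are made up purely to show the expected input shape (one symbol per line, with the name as the last field).

```python
PHASE1C_MODULES = [
    "unicodedata", "_multibytecodec", "_codecs_cn", "_codecs_hk",
    "_codecs_iso2022", "_codecs_jp", "_codecs_kr", "_codecs_tw",
]


def leftover_pyinit_symbols(nm_output, modules):
    """Return the PyInit_<name> symbols for `modules` still present in nm output."""
    wanted = {f"PyInit_{name}" for name in modules}
    found = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # The symbol name is the last whitespace-separated field on each line.
        if fields and fields[-1] in wanted:
            found.add(fields[-1])
    return sorted(found)


if __name__ == "__main__":
    # Made-up sample listing; real input would be the text printed by nm.
    sample = (
        "0000000000401000 T PyInit_math\n"
        "0000000000402000 T PyInit_unicodedata\n"
        "                 U malloc\n"
    )
    print(leftover_pyinit_symbols(sample, PHASE1C_MODULES))
```

An empty result for the real `nm python.elf` listing would correspond to the state described in the validation notes.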

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>