
newgen#168

Closed
webern wants to merge 45 commits into base-newgen from newgen

Conversation


@webern webern commented Jun 11, 2026

Owner

No description provided.

webern and others added 25 commits June 8, 2026 12:05
First stages of the gen pipeline (python3 -m gen):

- gen/xsd: ElementTree parser into a model of the MusicXML XSD
  subset, plus a structural analysis report ("analyze" command).
- gen/ir: lowers the XSD model to a resolved, dependency-ordered
  intermediate representation ("ir" command). Names and hoists
  anonymous types, collapses restriction chains, resolves refs,
  and drops unreferenced types.
- gen/tests: unittest suite asserting the acyclic-graph and
  no-name-collision invariants and IR referential integrity
  across all MusicXML versions in docs (make test-gen).
- gen/README.md: architecture, IR glossary, and XSD analysis.
- AGENTS.md: reconcile the now-stale generator section.
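The "dependency-ordered" and "acyclic-graph" properties described above can be illustrated with a minimal sketch built on Python's standard `graphlib`. The `deps` mapping and the type names are hypothetical and are not taken from gen/ir; the real lowering may order types differently.

```python
from graphlib import CycleError, TopologicalSorter


def dependency_order(deps: dict[str, set[str]]) -> list[str]:
    """Return type names so that every type follows the types it references.

    `deps` maps a type name to the names it depends on (hypothetical shape).
    Raises ValueError if the reference graph contains a cycle.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"type graph is cyclic: {exc.args[1]}") from exc


order = dependency_order({
    "note": {"pitch", "duration"},
    "pitch": {"step"},
    "step": set(),
    "duration": set(),
})
```

An emitter walking `order` would see `step` before `pitch` and `pitch` before `note`, which is the property the test suite is described as asserting.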
Collapse the IR's preserved named structure on demand instead of
forcing every emitter to re-derive it. Resolver (gen/ir/resolve.py)
expands attribute groups, splices model-group refs into content, and
walks the base chain, exposing attributes/all_attributes/content/
elements. build now computes deps through it, removing the duplicated
group walk.

Also make UnionMember hold a Ref so every type reference in the IR has
one shape, and add `ir --resolve` to dump the collapsed view.
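A rough sketch of what "walks the base chain" and `all_attributes` might mean, assuming a simplified type record with a single optional base. The `TypeDef` class and the sample type names are hypothetical; the real Resolver also expands attribute groups and model-group refs, which this omits.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TypeDef:
    name: str
    base: Optional[str] = None
    attributes: list[str] = field(default_factory=list)


def all_attributes(types: dict[str, TypeDef], name: str) -> list[str]:
    """Collect attributes from the outermost base down to `name`, base first."""
    chain = []
    current: Optional[str] = name
    while current is not None:
        chain.append(types[current])
        current = types[current].base
    merged: list[str] = []
    for type_def in reversed(chain):
        for attr in type_def.attributes:
            if attr not in merged:  # keep the first (base-most) declaration
                merged.append(attr)
    return merged


types = {
    "base-text": TypeDef("base-text", attributes=["font-size", "color"]),
    "styled-text": TypeDef("styled-text", base="base-text", attributes=["justify"]),
}
```

With this in place an emitter can ask for the merged view (`all_attributes(types, "styled-text")`) or the type's own list (`types["styled-text"].attributes`) without re-deriving either.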
Fold the standard instrument-sound identifiers into the IR as a
sound-id enum unioned with an open string (gen/ir/sounds.py): the
XSD types instrument-sound as xs:string and lists the values only
in the sounds.xml companion, not the schema. Opt-in per target.

Vendor each MusicXML version's XSD and sounds.xml under docs/ with
matching git-commit hash suffixes (3.0 5fd8eb3, 3.1 8bbe8e5,
beside the existing 4.0 ed15c23 and 4.1 0d56097).

Each target pins its schema and sounds policy via [input] xsd and
[sounds] xml: C++ is 4.0 with sounds, C is 3.1 with sounds, Go is
3.1 without -- the C/Go pair differ only by the fold, keeping the
generator honest about extensibility.
Design doc for the layer between the IR and the templates: the
per-target projection that templates consume. The IR stays a pure,
config-free function of the schema; the Galley is where config.toml
meets it, so templates can stay dumb.

Covers: the name and rejected alternatives; one-rich-layer-with-two-
field-groups vs two passes, decided by a JSON Schema emitter contrast;
the materialized, dumpable data shape built on the Resolver; automatic
name-convention expansion with a precise tokenizer and a worked table
(default-x, brass.alphorn, the empty value, midi/id); the wire string
preserved separately; the two-tier rename/override system with scoped
addressing, TOML schema, precedence, and IR validation; post-projection
collision detection as a CI gate; the transformation catalog
(representation, cardinality, primitives, derived types, default/fixed
to variant, structure, files, docs, ordering); and a JSON Schema
walkthrough proving the layer is neutral, not C++-shaped.

Each metadata object handed to a template (one per emitted type) is a
Plate; the collection projected for a target is the Plates. Rename the
design doc, rework its name-and-rationale section around the music
engraving plate metaphor, rename the sketched dataclasses (EnumPlate,
NumberPlate, StringPlate, UnionPlate, ComplexPlate, PlateRef, Plates)
and the planned CLI to 'gen plates', and add the projection stage to
the gen/README.md pipeline. The former layer name is removed entirely.

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV

Add the projection stage to the generator pipeline description, define
plate (per-type, template-facing metadata object) and the Plates (the
per-target collection), point to docs/ai/design/plates.md from the
architecture section, the status note, and the repository layout tree.

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV

Implement docs/ai/design/plates.md as gen/plates/: the per-target projection
that sits between the IR and the templates. One plate per emitted type,
internally split into a neutral core (wire names, resolved structure, facets,
docs) and a target binding (casings, sanitized identifiers, type mappings,
strategy tags, file assignment).

- gen/plates/names.py: tokenizer (separators, case transitions, digits ride),
  the convention registry (pascal/camel/snake/kebab/screaming), acronym set,
  and the reserved-word/validity sanitizer.
- gen/plates/model.py: the plate dataclasses; every plate, member, and
  variant carries both its Name bundle and its final sanitized identifier.
- gen/plates/build.py: the projection. Rename validation (fail loud on stale
  keys), member flattening over the Resolver (effective cardinality computed
  from the resolved particle tree), default/fixed literals resolved to enum
  variant identifiers, per-type file assignment with an include graph.
- gen/plates/check.py: post-projection collision detection (type, variant,
  member, and case-insensitive file-stem scopes).
- gen/plates/languages.py: per-language seeds (type maps, reserved words,
  doc styles) that config overrides.
- gen/config.py: parse the new [target]/[naming]/[reserved]/[types]/[layout]/
  [docs]/[rename.*] sections, including the two-tier rename scheme, scoped
  addressing, and the [naming] extends shared base.
- gen/naming.base.toml: first real shared rename -- barline carries both
  elements and attributes named segno/coda, which any field casing collapses;
  the attributes become segno-sound/coda-sound for every target.
- gen/ir/build.py: canonicalize xs:ID/xs:IDREF to token so the IR primitive
  set stays the documented eight.
- CLI: python3 -m gen plates --config C [--type N] [--check]; --check is a CI
  gate like analyze.
- gen/tests/test_plates.py: the design's worked conversion table, override
  tiers and precedence, fail-loud validation, collision detection, and
  spot-checked projections of all three shipped target configs.

Design deltas recorded in plates.md section 11 (MemberRepr dropped, named
convention fields, grouped partition deferred, group renames reserved).
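The tokenizer and convention registry can be pictured with the following sketch. It is an approximation written from the description above (separators, case transitions, digits riding with their neighbours), not the code in gen/plates/names.py; it leaves out the acronym set and the reserved-word sanitizer, and the exact outputs of the project's worked table may differ.

```python
import re


def tokenize(wire: str) -> list[str]:
    """Split a wire name into lowercase tokens.

    Any non-alphanumeric run is a separator. Inside a run, a break is inserted
    before an uppercase letter that follows a lowercase letter or a digit.
    Digits never start a token on their own, so they ride with their neighbours.
    """
    tokens: list[str] = []
    for part in re.split(r"[^A-Za-z0-9]+", wire):
        spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", part)
        tokens.extend(t.lower() for t in spaced.split())
    return tokens


CONVENTIONS = {
    "pascal": lambda ts: "".join(t.capitalize() for t in ts),
    "camel": lambda ts: "".join(t if i == 0 else t.capitalize() for i, t in enumerate(ts)),
    "snake": lambda ts: "_".join(ts),
    "kebab": lambda ts: "-".join(ts),
    "screaming": lambda ts: "_".join(t.upper() for t in ts),
}


def convert(wire: str, convention: str) -> str:
    """Render a wire name in one of the registered conventions."""
    return CONVENTIONS[convention](tokenize(wire))
```

For example, `convert("default-x", "pascal")` gives `DefaultX` and `convert("brass.alphorn", "snake")` gives `brass_alphorn`. The wire string itself is never altered, which matches the design's point that it is preserved separately.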
…eview)

Two independent reviews of 49a7c84 (one correctness-focused, one
architectural). The accepted findings, by theme:

Correctness (the critical find):
- Resolver.flat_elements now owns the effective-cardinality view, with a
  co-occurrence-aware duplicate merge: occurrences of one element name in
  different branches of one choice are exclusive (optional), anything else
  can co-occur in a single instance and must be a vector. The old rule
  demoted every duplicate to optional, which mis-projected metronome's
  beat-unit (legally twice in one instance; the corpus exercises this).
  The flattening and the base-chain merge (all_flat_elements, base_chain)
  moved from the projection into gen/ir/resolve.py where schema reasoning
  lives once; never-occurring particles (max=0) are skipped.

Naming and the collision gate:
- Variant.ident is now the FINAL emitted constant. Constant scoping is a
  language fact (gen/plates/languages.py): bare for C++ enum class,
  composed for Go/C flat namespaces (NoteTypeValue1024th,
  MX_NOTE_TYPE_VALUE_1024TH); the projection composes, sanitizes, and the
  collision gate checks the namespace the target actually has (type idents
  and constants mutually, for composed scopes). Templates print idents
  verbatim, restoring 'templates do no naming'.
- gen/names.py is now a leaf vocabulary module (tokenizer, conventions,
  Name, sanitizer) below both config and plates, removing the latent
  config->plates import cycle and the TYPE_CHECKING workaround. Acronym
  sets are case-normalized.

Config tightening:
- Unknown top-level sections rejected (a [renames] typo used to silently
  drop every rename); [input]/[output]/[sounds] keys checked too.
- Removed dead surface: [layout] include-style, [reserved] policy,
  [naming] empty-value-word. The cpp decimal=Decimal mapping moved from
  language seeds to gen/cpp/config.toml ([types]).
- extends hardened: no chained bases, naming/rename sections only, and a
  base/target scope-vs-entry shape disagreement is an error instead of a
  silent wholesale replacement. String-list keys reject bare strings;
  [types] keys must name real IR primitives; empty rename entries and
  non-table rename kinds fail loud.

Plates model:
- all_members is always built for derived plates so the gate covers the
  merged chain under either strategy; ComplexPlate.member(wire, kind)
  joins content occurrences to fields; the content tree's IR particle
  types are declared part of the neutral contract; Plates.type_map dropped
  (PlateRef.ident and target_type are the published spellings);
  UnionPlateMember.name added so primitive members have a field name;
  complete C++ reserved-word list; PlateRef primitive-wire docstring.
- build_for_config in gen/plates is the one pipeline shared by CLI and
  tests.

Documented as review-round deltas in plates.md section 11, including the
always-on gate (a plain plates dump fails loud too; --check is the quiet
CI entry). New tests: a particle-tree cardinality suite (exclusive vs
co-occurring duplicates, wrappers, never-occurring particles), the real
metronome assertion across schemas, composed-scope constants and their
flat-namespace collisions, element-rename application, case-insensitive
stem collisions, derived all_members under inherit, and the config
rejection paths.
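One way to reason about the co-occurrence-aware merge is to compute, per element name, the range of occurrences a single instance can contain: a sequence adds its children's ranges, a choice takes the minimum and maximum across branches. The sketch below does that over a hypothetical particle model (`Element`, `Group`) and a simplified stand-in for metronome's content. It illustrates the idea and is not the rule implemented in gen/ir/resolve.py.

```python
import math
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    min: int = 1
    max: float = 1  # math.inf for unbounded


@dataclass
class Group:
    kind: str  # "sequence" or "choice"
    children: list
    min: int = 1
    max: float = 1


def occurrence_ranges(p) -> dict[str, tuple[float, float]]:
    """Map each element name to its (min, max) count within one instance of `p`."""
    if p.max == 0:
        return {}  # never-occurring particles are skipped
    if isinstance(p, Element):
        return {p.name: (p.min, p.max)}
    per_child = [occurrence_ranges(c) for c in p.children]
    out = {}
    for name in sorted({n for ranges in per_child for n in ranges}):
        los = [r.get(name, (0, 0))[0] for r in per_child]
        his = [r.get(name, (0, 0))[1] for r in per_child]
        if p.kind == "sequence":
            lo, hi = sum(los), sum(his)  # children co-occur
        else:
            lo, hi = min(los), max(his)  # branches are exclusive
        out[name] = (lo * p.min, hi * p.max)
    return out


def classify(rng: tuple[float, float]) -> str:
    lo, hi = rng
    if hi > 1:
        return "vector"
    return "optional" if lo == 0 else "required"


# Simplified stand-in for metronome: beat-unit, then per-minute OR a second beat-unit.
metronome = Group("sequence", [
    Element("beat-unit"),
    Element("beat-unit-dot", 0, math.inf),
    Group("choice", [
        Element("per-minute"),
        Group("sequence", [Element("beat-unit"), Element("beat-unit-dot", 0, math.inf)]),
    ]),
])
ranges = occurrence_ranges(metronome)
```

Here `beat-unit` comes out as a vector because the two occurrences can appear in the same instance, while `per-minute` stays optional because it lives in only one branch of the choice.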
The fourth pipeline stage: python3 -m gen <config.toml> now projects the
target's Plates and renders them through a per-language backend.

- gen/emit/__init__.py: backend dispatch. A backend is a module exposing
  render(plates) -> {relative path: content} -- the dumb-renderer contract:
  identifiers, casings, type mappings, file stems, and structure all arrive
  resolved on the plates; a backend owns only its language's grammar and
  support files. Backends register by language name; no backend ships yet,
  so emitting any target fails loud with the known-backend list.
- gen/emit/writer.py: deterministic output. Files are written only when
  their content changed (stable mtimes for build systems); files the
  generator wrote previously but that left the manifest are pruned, but
  only files carrying the generated-code marker are ever deleted --
  unmarked files in the output directory are reported and left alone.
  Backends must mark every file (checked), paths are validated against
  escapes, and the marker satisfies Go's generated-code convention.
- gen/__main__.py: wire the emit command; fold the duplicated
  load-lower-patch pipeline onto gen.plates.build_for_config and one
  _lower helper (a review finding: the sounds fold existed in two places).
- gen/tests/test_emit.py: writer contract (idempotence, marker-gated
  pruning, foreign-file safety, unsafe paths) and dispatch failure modes.
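The writer contract (write only on change, prune only marked files, refuse escaping paths) can be sketched as below. The marker text and the function name are hypothetical, and this version finds prune candidates by scanning the output directory instead of consulting a record of previously written files, which is a simplification of what the commit describes.

```python
from pathlib import Path

MARKER = "Code generated by gen; DO NOT EDIT."  # hypothetical marker text


def write_outputs(out_dir: Path, files: dict[str, str]) -> dict[str, list[str]]:
    """Write `files` (relative path -> content) under `out_dir` and report what changed."""
    report: dict[str, list[str]] = {"written": [], "pruned": [], "foreign": []}
    root = out_dir.resolve()
    wanted = set()
    for rel, content in files.items():
        target = (root / rel).resolve()
        if not target.is_relative_to(root):
            raise ValueError(f"path escapes output directory: {rel}")
        if MARKER not in content:
            raise ValueError(f"backend produced an unmarked file: {rel}")
        wanted.add(target)
        if target.exists() and target.read_text(encoding="utf-8") == content:
            continue  # unchanged: leave the file and its mtime alone
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        report["written"].append(rel)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path in wanted:
            continue
        rel = path.relative_to(root).as_posix()
        if MARKER in path.read_text(encoding="utf-8", errors="replace"):
            path.unlink()  # stale generated file: safe to delete
            report["pruned"].append(rel)
        else:
            report["foreign"].append(rel)  # not ours: report and leave alone
    return report
```

Running it twice with the same input should report nothing written the second time, which is the idempotence property the tests are described as covering.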
The Go backend (gen/emit/go/) renders the four value shapes -- enum-class,
numeric-wrapper, string-wrapper, tagged-variant -- one template per shape,
plus the static runtime support file. Every generated type exposes the same
surface so the complex-type templates can call them uniformly:

    TryParse<T>(s) (T, bool)   strict membership
    Parse<T>(s) T              lenient (corpus fixup policies)
    (T) String() string        the wire spelling

Leniency policies, in one place (runtime.go) and per-type clamps generated
from the plates' bounds: unknown enum literal -> first variant; unparseable
number -> 0; decimal-looking integers truncate; every number clamps into its
declared range, with primitive-implied lower bounds (positive_integer >= 1,
non_negative_integer >= 0) and exclusive bounds clamping to the nearest
representable value (decimal: +/- 1e-6, matching the corpus duration fixup).
Unions try members strictly in schema order; literal members become
payload-free kinds; an unmatched input is absorbed by the first member's
lenient parse.

Backends own only Go grammar: identifiers (including the composed enum
constants), casings, stems, and structure arrive final on the plates.
Rendered output is piped through gofmt once (the formatter owns tabwriter
alignment, per Go codegen convention; required on PATH, fail loud), and the
emit is byte-idempotent.

gen/test/go/mx/ now holds the 132 generated files (131 value types +
runtime.go), committed per project convention. They compile (go build) and
vet clean. gen/test/go/corert/values_smoke_test.go exercises the generated
types end to end: wire -> typed -> wire across the policies above (run:
go test -run TestValueSmoke ./corert/).
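The lenient integer policy above (unparseable becomes 0, decimal-looking input truncates, and the result clamps into the declared range) can be sketched in Python as follows. The function name, the whitespace trimming, and the upper bound used in the example are assumptions made for illustration; the generated Go code is the authority on the exact behaviour.

```python
import re
from typing import Optional

_INTEGER = re.compile(r"[+-]?\d+")
_DECIMAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def parse_int_lenient(s: str, lo: Optional[int] = None, hi: Optional[int] = None) -> int:
    """Lenient integer parse: never fails, always lands inside [lo, hi]."""
    text = s.strip()  # assumption: surrounding whitespace is ignored
    if _INTEGER.fullmatch(text):
        value = int(text)
    elif _DECIMAL.fullmatch(text):
        value = int(float(text))  # decimal-looking integers truncate toward zero
    else:
        value = 0  # unparseable number -> 0
    if lo is not None and value < lo:
        value = lo
    if hi is not None and value > hi:
        value = hi
    return value


# A type whose minimum is 1 (the upper bound of 3 is illustrative).
results = [parse_int_lenient(s, lo=1, hi=3) for s in ["", "test", "0", "2.7", "7"]]
```

Every input produces a value inside the range, so an empty string, a non-number, and `0` all end up at the minimum.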
The C backend (gen/emit/c/) renders the four value shapes as header/impl
pairs (the documented one-FileId-to-two-files mapping for C), mirroring the
Go backend's surface in C idiom:

    bool mx_<t>_try_parse(const char *s, MxT *out)   strict membership
    MxT  mx_<t>_parse(const char *s)                 lenient (fixup policies)
    to_string: enums return static storage; numbers and unions return
    malloc'd strings; string types ARE char* values (parse strdups).

Leniency policies match Go exactly (one policy, two spellings): unknown
enum literal -> first variant; unparseable number -> 0 with decimal-looking
integers truncating; clamps from the plates' bounds plus primitive-implied
lower bounds; exclusive decimal bounds clamp +/- 1e-6. mx_format_decimal
prints the shortest no-exponent spelling (8.5, 0.000001). Unions are tagged
structs whose kinds cover ref members and literals; the open instrument-
sound union (sound-id enum | open string) falls through to a strdup'd
string member, with mx_<t>_free owning the string-bearing kinds.
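To make "shortest no-exponent spelling" concrete, here is a Python sketch of the same formatting idea for finite values. It is not a port of `mx_format_decimal`; stripping trailing zeros (so 8.0 prints as `8`) and mapping zero to `0` are assumptions about what "shortest" means here.

```python
from decimal import Decimal


def format_decimal(x: float) -> str:
    """Format a finite float without an exponent, using its shortest round-trip digits."""
    if x == 0:
        return "0"  # covers 0.0 and -0.0
    # repr() yields the shortest digits that round-trip; Decimal expands any exponent.
    text = format(Decimal(repr(x)), "f")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


samples = [format_decimal(v) for v in (8.5, 0.000001, 100.0, -2.5)]
```

The two values quoted in the commit message, 8.5 and 0.000001, come out unchanged, with no `1e-06` spelling for the small one.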

Per-type headers include their dependencies' headers via the plates'
include graph, which flushed out a real bug: union/member deps were
filtered by stem lookup, so the open string PRIMITIVE member of
instrument-sound matched the complex TYPE named 'string' and fabricated an
include of a not-yet-emitted header. _type_deps now excludes refs by
category, with a regression test pinning instrument-sound's includes to
exactly mx_sound_id.

The runtime pair (mx_runtime.h/c) carries the shared parse/format helpers;
its symbol and file prefixes come from the target config (TargetInfo gains
file_prefix). The backend also emits sources.cmake, the explicit build
manifest; gen/test/c/CMakeLists.txt now builds the generated model as the
mx-c static library and links it into corert-c, plus a values-smoke
executable mirroring the Go TestValueSmoke (all checks pass; zero compile
warnings). gen/test/c/mx/ holds the 269 generated files, committed per
project convention.

The Go backend now renders all four complex shapes and the document entry
points; the corert harness drives the generated model instead of the stub.
Every eligible corpus file round-trips: ~777 pass, 52 skip (see gating).

Representation (the Go spelling of the plate facts, chosen for round-trip
fidelity):
- Attributes are presence-tracked pointer fields, required or not: the
  contract is 'write back exactly what was parsed', and corpus files do
  omit required attributes.
- A composite stores children as ONE ordered list (Children []XChild, a
  struct of typed pointers where exactly one is non-nil). Interleaved
  choice content (measure's music-data, note's grace/cue branches,
  metronome's repeated beat-unit) round-trips in document order for free,
  which per-member vectors cannot do. No kind discriminator: harmony has a
  child element literally named 'kind', so a synthetic field would collide.
- Parsing is strict about NAMES (unknown attribute/element -> error: the
  version gate keeps newer documents out, so an unknown name is a generator
  gap, not data) and lenient about VALUES (the typed Parse* policies).
- A derived type embeds its base; Go field promotion gives the flat view
  and one merged parse/serialize pass.
- Document/FromXDoc/ToXDoc are generated from the plates' roots; the
  root's xmlns declarations are preserved through the model (a few corpus
  files declare xmlns:xlink).

Version gating: Go targets MusicXML 3.1; documents whose root declares a
newer version are skipped (reported, not failed) -- MusicXML is backward
compatible, so older documents parse; newer ones may use types the model
cannot represent.
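The gate can be sketched in a few lines of Python with the standard ElementTree parser. The function names are hypothetical, and treating a missing `version` attribute as 1.0 is an assumption about the schema default rather than something stated here.

```python
import xml.etree.ElementTree as ET


def version_tuple(text: str) -> tuple[int, ...]:
    """Turn "3.1" into (3, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def gate(document: str, supported: str = "3.1") -> str:
    """Return "skip" when the root declares a newer version than the target, else "run"."""
    root = ET.fromstring(document)
    declared = root.get("version", "1.0")  # assumption: absent attribute means 1.0
    return "skip" if version_tuple(declared) > version_tuple(supported) else "run"


decisions = [
    gate('<score-partwise version="3.0"/>'),
    gate('<score-partwise version="3.1"/>'),
    gate('<score-partwise version="4.0"/>'),
]
```

A skipped document is counted and reported separately from a failure, so the pass/fail totals only reflect documents the model is expected to represent.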

Harness changes (gen/test/go/corert/):
- stub package deleted; the generated mx package is the implementation.
- Normalization strips whitespace-only character data everywhere (MusicXML
  has no mixed content; pretty-printing indentation is not content, and an
  empty <measure> holds only its own indentation). Applied to expected and
  actual alike, so the comparison stays symmetric.
- The loader transcodes UTF-16 (BOM-detected) and ISO-8859-1 documents to
  UTF-8: pugixml and libxml2 auto-detect these; Go's encoding/xml does not.
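The whitespace normalization is simple to picture in Python's ElementTree, where inter-element whitespace appears as `text` and `tail` strings. This is a sketch of the idea, not the Go harness code, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_whitespace_only(root: ET.Element) -> ET.Element:
    """Drop character data that consists only of whitespace, everywhere in the tree."""
    for element in root.iter():
        if element.text is not None and not element.text.strip():
            element.text = None
        if element.tail is not None and not element.tail.strip():
            element.tail = None
    return root


pretty = ET.fromstring("<measure>\n  <note>\n    <step>C</step>\n  </note>\n</measure>")
compact = ET.fromstring("<measure><note><step>C</step></note></measure>")
same = ET.tostring(strip_whitespace_only(pretty)) == ET.tostring(strip_whitespace_only(compact))
```

Because the same function runs over both the expected and the actual tree, indentation differences cannot cause a mismatch while real text such as `C` is left untouched.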

Corpus adjustments, each encoding a documented fact:
- data/synthetic/extend.3.0.xml and elision.3.0.xml used 3.0-only
  attributes (extend lost its font attributes in 3.1; elision lost its
  text-decoration attributes): MusicXML itself broke backward compatibility
  there, so no current target schema (3.1 or 4.0) can represent those
  attributes. The synthetic files now exercise the attribute set valid
  across 3.0/3.1/4.0.
- data/lysuite/ly75a fixup updated to the uniform clamp policy ('', 'test',
  and '0' all clamp to accordion-middle's minimum 1); the old expectations
  encoded the legacy implementation's inconsistency (unparseable values
  escaped the clamp). The policy is now documented in data/README.md.
…re review)

Two independent reviews of the emit stage and its Go/C value backends. The
accepted findings, by theme:

The clamp policy is now data on the plates (the load-bearing change):
- NumberPlate carries family (decimal|integer) and resolved ClampStep rules
  (facets merged with primitive-implied lower bounds, tightest wins,
  exclusive bounds clamping past by 1 or 1e-6), computed once in
  gen/plates/build.clamp_steps and unit-tested there (tie-breaks, exclusive
  max, implied minimums). Both backends' hand-mirrored _clamp_steps copies
  are deleted; 'one policy, two spellings' is now structural, not a comment.
- The policy hole the duplication hid is closed: a primitive numeric union
  member (positive-integer-or-empty's integer) now carries and applies the
  implied clamp, so unions enforce the same leniency as named number types.
- The int/float clamp mode comes from the IR base, not from string-matching
  the spelled target type (a [types] override no longer flips clamp mode).
- TryParse's contract is pinned: lexically strict, then clamps; generated
  doc comments only claim clamping when clamp steps exist.
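A sketch of how declared facets and primitive-implied minimums might be merged into one effective range, with the tightest lower bound winning and exclusive bounds stepping past by 1 or 1e-6. The `Bound` class and the function names are hypothetical and do not mirror `ClampStep` or `gen/plates/build.clamp_steps`.

```python
from dataclasses import dataclass
from typing import Optional

# Primitive-implied lower bounds, as listed in the leniency notes.
IMPLIED_MIN = {"positive_integer": 1, "non_negative_integer": 0}
# How far an exclusive bound is stepped past, per numeric family.
STEP = {"integer": 1, "decimal": 1e-6}


@dataclass(frozen=True)
class Bound:
    value: float
    exclusive: bool = False


def effective(bound: Bound, family: str, is_lower: bool) -> float:
    """Turn an exclusive bound into the nearest value allowed inside it."""
    if not bound.exclusive:
        return bound.value
    return bound.value + STEP[family] if is_lower else bound.value - STEP[family]


def clamp_range(family: str, primitive: str,
                lower: Optional[Bound] = None,
                upper: Optional[Bound] = None) -> tuple[Optional[float], Optional[float]]:
    """Merge declared facets with primitive-implied minimums; the tightest lower bound wins."""
    lows = []
    if lower is not None:
        lows.append(effective(lower, family, True))
    if primitive in IMPLIED_MIN:
        lows.append(IMPLIED_MIN[primitive])
    lo = max(lows) if lows else None
    hi = effective(upper, family, False) if upper is not None else None
    return lo, hi
```

For instance, `clamp_range("integer", "positive_integer", Bound(0), Bound(3))` yields `(1, 3)`: the declared minimum of 0 loses to the implied minimum of 1. Computing this once and storing it as data is what lets both backends apply the same range without each re-deriving it.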

Union discriminators are projected, not template-composed:
- UnionPlateMember.tag is a Variant scoped and renameable like an enum
  value; literal variants double as their own tags; the flat-namespace
  collision gate now covers every constant the backends emit. The 'Kind'
  infix is gone from generated constants (FontSizeDecimal,
  MX_INSTRUMENT_SOUND_SOUND_ID).
- An open string union member must be last (it matches anything): both
  backends fail loud instead of emitting unreachable members.

C runtime hardening:
- mx_format_decimal: sized buffer + snprintf return check (no truncated
  digit strings for extreme magnitudes); mx_strdup aborts on OOM instead of
  memcpy through NULL; mx_try_parse_int rejects ERANGE (aligning strictness
  with Go's ParseInt); generated parse entry points are NULL-safe (NULL
  means ""); include guards use _H_INCLUDED so they stay out of the
  constant namespace the gate certifies; dead variant_const helper removed.
- Go formatDecimal canonicalizes negative zero to "0" (parity with the C
  runtime and the corert normalizer).

Robustness around the edges:
- Backends reject schema types that project onto their reserved support
  stems (runtime/document/sources) instead of silently overwriting them.
- The writer overwrites an unreadable file at a manifest path instead of
  crashing; the gofmt scratch dir handles subdirectories.
- The emit CLI reports config errors, missing files, and a missing gofmt
  through the error path rather than a traceback.

Smoke tests extended in both languages: strict-rejection paths, negative
formatting, negative-zero canonicalization, and the union implied-min
clamp. Both targets regenerate; the Go corert suite stays green end to end;
documented as the second review round in plates.md section 11.

The C backend now renders all four complex shapes and the document entry
points; the corert harness drives the generated model instead of the stub.
Both secondary targets now round-trip the corpus: 776 pass, 0 fail, 52
version-skipped, in C and Go alike.

The C spelling of the same representation Go uses: presence-tracked
attributes (bool has_x + value), children as ONE ordered array of structs
whose typed pointers discriminate by non-NULL, strict about names, lenient
about values. C has no inheritance, so derived types flatten the plates'
all_members view into self-contained structs. gen/emit/c/api.py is the
value calling convention -- the single place that knows how generated C
parses, prints, stores, and frees each plate kind (enum to_string static,
number/union malloc'd, string values own themselves) -- consumed at every
attribute, text body, and leaf child instead of inline ownership reasoning.

Parse errors flow through a runtime message channel (mx_error_set/mx_error)
so parse functions return NULL with context instead of threading buffers.
Serialization returns the created node (parent NULL -> free node) so the
document root and nested elements share one code path; root namespace
declarations are preserved through MxDocument (libxml2 keeps them in nsDef,
so attribute loops never see them).

Harness changes (gen/test/c/):
- stub.h/stub.c deleted; roundtrip.c drives mx_document_* and gates
  documents declaring MusicXML > 3.1 (counted as skipped).
- normalize.c strips whitespace-only text nodes everywhere (mirroring Go)
  and sorts attributes by QUALIFIED name.
- compare.c compares each element's DIRECT text only (xmlNodeGetContent's
  subtree concatenation re-compared every leaf at every ancestor, so one
  numerically-equivalent reformat failed all its ancestors) and compares
  attributes by qualified name with entity-resolved values (a parsed
  xlink:href is (ns, href); a serialized one is the literal name).
- mx-c carries the libxml2 include path; corert-c links the model.

Corpus: lysuite/ly33d_Spanners_OctaveShifts.xml is marked .invalid -- it
begins with stray bytes before the XML declaration, so it is not
well-formed XML; strict parsers (libxml2) are entitled to reject it (Go's
etree merely happens to tolerate leading garbage).
…record

Accepted findings from the architecture review of the complex-type milestones:

The decision record (the review's top insistence): plates.md section 11
gains the round-3 entry -- the ordered-children representation and why the
original per-member-field sketch cannot round-trip MusicXML (interleaved
music-data, metronome's repeated beat-unit), the no-discriminator rationale
(harmony's <kind>), presence-tracked required attributes, the
strict-names/lenient-structure/lenient-values contract (the generated
packages are order-faithful typed DOMs, not validating bindings; content
and cardinality stay on the plates for the C++ backend and the JSON Schema
forcing function), and an explicit instruction that C++ should use a real
sum type rather than copying this encoding. Sections 8.1/8.7 now point at
it instead of contradicting it. AGENTS.md's normalization-pipeline section
documents the whitespace stripping, qualified-name attribute handling,
direct-text comparison, and encoding transcoding the C++ harness will need.

Single-sourced facts:
- Plates.schema_version (parsed from the source stem) is emitted into each
  runtime (SupportedMusicXMLVersion, MX_SUPPORTED_MUSICXML_VERSION) and the
  corert harnesses read it: retargeting a schema cannot leave a stale
  version gate. The hand-kept 3.1 constants are gone.
- The duplicated shape queries moved beside the data: attribute_members /
  element_members / value_member, ComplexPlate.members_view() (the
  strategy-resolved member list), and Plates.children_owner() (base-chain
  walking was schema reasoning inside a template) live in gen/plates/model;
  both backends consume them, and a third backend will too.

Identifier guards: the few names the backends still compose (per-type Child
structs, Children/has_/children_count fields, the document support types)
are now guarded at render time -- a schema name landing on one fails loud
with a rename suggestion instead of surfacing as a compile error in
generated code. Serializing a child with zero or multiple fields set is
documented as undefined on the Child types.

Also: status docs refreshed (both corert suites green; generated models
committed; C++ backend the remaining gap). Verified after regeneration:
83 unit tests, gofmt-clean Go build/vet, zero-warning C build, both smoke
binaries, and both corert suites green (776/0/52). Valgrind over the entire
C corert run: 52.9M allocations, zero leaks, zero errors.
…delity

The complex-type code review's accepted findings, and the real defect the
first of them exposed:

Harness soundness (the review's top three):
- The C comparison now checks namespace declarations: libxml2 keeps
  xmlns/xmlns:foo in nsDef, never in the attribute list, so the model's
  namespace preservation was previously unverifiable. Turning the check on
  immediately caught a real round-trip defect: serialization built the tree
  DETACHED (the document was attached last), so when libxml2 resolved the
  reserved xml: prefix for xml:lang/xml:space with no document context it
  fabricated an xmlns:xml declaration on the carrying element -- 35 corpus
  files diverged from their inputs invisibly. The document template now
  serializes under a scratch parent attached to the document, where the
  implicit xml namespace resolves without inventing declarations.
- The Go harness sorts and compares attributes by QUALIFIED name
  (FullKey): a defect dropping a prefix (xlink:href -> href) can no longer
  pass on the local name. The C side's qualified-name buffers abort on
  truncation rather than letting two truncated names compare equal.
- Numeric-equivalence scope and the version-pinning rewrite were flagged;
  both are the documented corert design (AGENTS.md), shared with the C++
  reference harness, and are deliberately unchanged.

Template fixes:
- String-plate children are stored unboxed: CValue carries an explicit
  is_pointer_value flag (raw string primitives AND char* typedefs like
  MxMode) instead of sniffing the spelled type, removing a needless
  allocation per string-valued child.
- The Go backend fails loud on the one derivation shape its inherit
  template cannot render (element members spread across the base chain --
  no MusicXML schema has one; the C flatten path is immune).
- Serialize paths abort on libxml2 allocation failure, matching the
  runtime's OOM policy; the redundant Go string-primitive special cases
  collapsed into the shared parse expression (byte-identical output).

Verified end to end after regeneration: 83 unit tests, go vet + Go corert
green, zero-warning C build, both smoke binaries, C corert 776/0/52, and
valgrind across the full C corert run: 0 leaks, 0 errors.
Comment thread AGENTS.md Outdated
sibling marker.
2. For each file:
a. Load the XML into a DOM.
b. Set the root `version` attribute to `"3.0"`.

Owner Author

This seems wrong. It should be 3.0, 3.1, or 4.0.

Comment thread AGENTS.md Outdated

1. Set XML declaration: `<?xml version="1.0" encoding="UTF-8" standalone="no"?>`.
2. Set DOCTYPE based on root element name (`score-timewise` vs `score-partwise`).
3. Set root `version` attribute to `"3.0"`.

Owner Author

Again, there are multiple versions in play in the repo, and we will move this to 4.0.

Comment thread docs/sounds-4.1-0d56097.xml Outdated
<!--
MusicXML sounds file

Version 4.1 Draft

Owner Author

MusicXML 4.1 is not released and I believe the sounds version tracks 1:1 with the MusicXML version. Thus I believe this should be deleted.

Comment thread data/lysuite/ly33d_Spanners_OctaveShifts.xml.invalid Outdated
Comment thread data/lysuite/ly75a_AccordionRegistrations.fixup.xml
Comment thread data/synthetic/elision.3.0.xml
Comment thread data/synthetic/extend.3.0.xml
Comment thread gen/cpp/config.toml

[target]
language = "cpp"
namespace = "mx::core"

Owner Author

Is this a "hardcoded" property, i.e. is target.namespace part of the prescribed structure of the TOML file? If so, it seems a bit language-specific. Or is it dynamic, in the sense that I could say target.foo="bar"?

Comment thread gen/emit/go/__init__.py Outdated
@@ -0,0 +1,115 @@
"""Go backend: render the Plates into the Go test target package.

Owner Author

Oh no. This is very bad and not what I wanted at all.

It should not be required to write a different python program for Go than for C. The whole point of the exercise was to make sure we did not have to write bespoke logic/code for specific targets. What is this doing and why is it needed?

We need to analyze what's being done in these bespoke Go and C backends and re-design so that these backends do not exist.

Comment thread gen/plates/languages.py Outdated
# one flat namespace and carry the type's name (Go package-level constants,
# C's single global namespace). This is a language fact, not configuration;
# the composition itself happens in the projection so Variant.ident is final.
VARIANT_SCOPES: dict[str, str] = {

Owner Author

No. I do not want the generator to know about languages or have a list of languages that it supports.

claude added 4 commits June 11, 2026 05:43
… redesign

New cardinal rule (from the PR #168 review): the generator is language
agnostic; adding a new language target must not require edits to the
generator's Python files. The current implementation violates it -- roughly
2,500 lines of Python ARE Go and C (gen/emit/go/, gen/emit/c/, the language
tables in gen/plates/languages.py, the BACKENDS registry, and prescribed
config keys like [target] namespace that exist only because specific
languages want them).

The design doc specifies the redesign:

- A target becomes a PACK: config.toml + a templates/ directory; the
  generator cannot tell which language it is emitting (no language registry,
  no language name anywhere in Python).
- Config splits into the prescribed projection contract (every key definable
  without naming a language: conventions, renames, symbol-prefix,
  variant-scope, [types], [reserved]) and a freeform [vars] table passed
  verbatim to templates -- answering the review question about
  target.foo="bar" directly. languages.py's tables become required config.
- The PRESS (gen/press/): a deliberately minimal, stdlib-only, mustache-class
  template engine -- variables, sections, inverted sections, recursive
  partials, loop metadata, one quoted-literal modifier, and constitutionally
  nothing else (no expressions, filters, or string functions). Its poverty is
  load-bearing: if a template cannot express something, the plates must carry
  it. Dispatch is by manifest (one template file per shape) and by mechanical
  discriminant expansion in the context builder, never by in-template logic.
- A render MANIFEST in config declares template -> output-pattern rows,
  absorbing file layout generically: C's header/impl pairs are two rows,
  partitioning is implicit, [layout] dies, support files are templates, and
  the gofmt pass becomes a generic [render] format hook.
- Small neutral plate additions so templates need no lookups: PlateRef gains
  the referenced type's name bundle and kind; plates gain deps for
  include/import composition; doc text arrives pre-wrapped.
- Migration in six phases, each green, with a hard byte-parity gate
  (regenerate, git diff --exit-code over committed output) before each
  Python backend is deleted -- ending with the JSON Schema target added as a
  pure pack plus a CI assertion that the change touched no Python.
- Rejected alternatives recorded: per-target Python plugins (satisfy the
  letter, not the spirit), Jinja2 (expressiveness lets backends reconstitute
  inside templates), AST emitters, and keeping languages.py 'as data'.

AGENTS.md now states the cardinal rule up front and flags the current code
as violating it, pointing here.
…nguage)

Answers the review question 'why hand-roll instead of using a library?'
properly -- the original draft argued only against Jinja2 and never weighed
the real alternative, an existing Mustache implementation.

The load-bearing commitment is now stated as Mustache-the-LANGUAGE: the
press implements the published spec's interpolation/sections/inverted/
partials core with exactly three documented deviations (missing keys are
render errors with template:line -- the spec's silent empty output is
disqualifying for a generator; no HTML escaping; no lambdas), and is tested
against the official Mustache spec suite, buying the spec authors' edge-case
coverage (especially the whitespace rules) without their code.

The engine extensions the draft had invented (@first/@last loop metadata,
the :q quote modifier) move out of the engine into the context builder as
injected fields (is_first/is_last/index0, wire_q companions), so template
syntax stays pure Mustache and the engine is swappable behind it.

Section 9 now weighs chevron/pystache as the close call it is (spec-mandated
silent missing keys, HTML escaping, weak diagnostics, unmaintained state, vs
the repo's no-Python-deps precedent) and pre-commits the reversal trigger:
if the press exceeds ~600 lines or cannot pass the spec suite in phase 1,
vendor chevron and patch strictness/escaping/diagnostics -- zero template
changes either way.
…lates'

A target is a directory containing config.toml and templates/ -- the term
'target' already exists throughout the codebase and needs no companion
word. The template collection is simply the target's templates.

Phase 1 of the generator-agnosticism redesign (commit series: engine ->
data motions -> context/manifest -> C port -> Go port -> proof target).

gen/press/engine.py implements the Mustache core -- interpolation (incl.
dotted names and implicit iterators), sections, inverted sections, partials
with call-site indentation, comments, set-delimiters, and the spec's
standalone-line whitespace rules -- with the three documented deviations
for code generation: missing keys are render errors carrying template:line
(present-but-None renders empty and is falsey; only absence errors), no
HTML escaping ({{{x}}}/{{&x}} are accepted synonyms), and no lambdas (a
callable in the context is an error).
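
The three deviations can be illustrated with a toy interpolation-only renderer. This is a sketch under stated assumptions, not gen/press/engine.py: it handles only `{{name}}` tags with dotted names, and `RenderError`, `render`, and the error wording are hypothetical.

```python
import re


class RenderError(Exception):
    """Raised with a template:line location (hypothetical name)."""


_TAG = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def _lookup(name, context, template, line):
    value = context
    for part in name.split("."):
        if value is None:
            # Assumption: resolving through a present-but-None value is
            # falsey rather than an error.
            return None
        if not isinstance(value, dict) or part not in value:
            raise RenderError(f"{template}:{line}: missing key '{name}'")
        value = value[part]
    if callable(value):
        raise RenderError(f"{template}:{line}: lambdas are not supported ('{name}')")
    return value


def render(text, context, template="<string>"):
    out = []
    for lineno, line in enumerate(text.splitlines(keepends=True), start=1):
        def substitute(match, lineno=lineno):
            value = _lookup(match.group(1), context, template, lineno)
            # None renders empty; nothing is HTML-escaped.
            return "" if value is None else str(value)
        out.append(_TAG.sub(substitute, line))
    return "".join(out)
```

Absence is the only error: a key that is present but `None` renders as the empty string, while a key that is missing stops the render with a location.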

Conformance is tested, not asserted: the five core modules of the official
Mustache spec test suite (mustache/spec, MIT) are vendored under
gen/tests/mustache_spec/ and all 122 cases pass with zero skips -- the
engine's strict/escape constructor knobs let the suite run under the
spec's own semantics while the production pipeline uses the defaults.
Deviation and robustness tests (error locations, recursion depth limit,
callable partial loaders, context-stack fallthrough) cover the rest.

Also sweeps the last 'pack' stragglers out of the design doc.
claude added 2 commits June 11, 2026 06:14
…es need

Phase 2 of the generator-agnosticism redesign: pure data motion plus the
neutral plate additions templates require, with the legacy backends still
in place and both corert suites green.

The generator loses its per-language tables: gen/plates/languages.py is
DELETED. Each target's config.toml now carries the whole projection input
as data -- the full [types] primitive->spelling map, the full [reserved]
words list, and [target] variant-scope (bare|composed). The projection
takes no defaults from anywhere but config; a target omitting [types]
gets primitive passthrough, which is what a neutral target wants.

New config surface per the design:
- [vars]: freeform string key-values passed verbatim to templates and
  never interpreted by the generator -- where anything that cannot be
  defined without naming a language belongs.
- [reserved] members / type-suffixes: names a target's TEMPLATES
  synthesize (Go's Children field; Child/Kind type compositions), now
  enforced by the collision gate (gen/plates/check.py) instead of by
  per-backend Python guards.
- [docs] wrap is the wrapped doc TEXT width, excluding comment syntax
  (default 97; a 3-character prefix lands at the 100-column house style);
  [docs] style and the DocStyle machinery die -- comment syntax is
  template content.

Plate additions so templates never compute or look anything up:
- PlateRef carries the referenced type's name bundle and kind
  (enum/number/string/union/complex, or family-qualified
  primitive-decimal/-integer/-string), denormalized at projection.
- Every plate carries deps (the unique non-primitive references, sorted),
  the data include/import lines are composed from.
- Every plate carries doc_lines (greedy-wrapped at [docs] wrap); the
  backends now consume them, proving byte-equivalence of the wrapping.
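
The width arithmetic behind `[docs] wrap` can be sketched as follows. `textwrap` is used here as a stand-in for the generator's own greedy wrapper, which this page does not show; `doc_lines` and `as_comment` are hypothetical helper names.

```python
import textwrap


def doc_lines(text, wrap=97):
    """Greedy-wrap doc TEXT to `wrap` columns, excluding comment syntax."""
    return textwrap.wrap(" ".join(text.split()), width=wrap)


def as_comment(lines, prefix="// "):
    """What a template would do: prepend its own comment syntax."""
    return [prefix + line for line in lines]
```

With the default width of 97 and a 3-character prefix such as `// `, no rendered comment line exceeds 100 columns, and the generator never needs to know what the prefix is.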

The only generated-output change is deliberate and visible: string-plate
pattern notes are their own comment line instead of being re-wrapped into
the doc prose (the old flow even split a regex across lines). 18 files;
everything else regenerates byte-identical. go vet + Go corert green,
zero-warning C build, values-smoke, C corert 776/0/52.
…it path

Phase 3 of the generator-agnosticism redesign. The press can now render a
whole target from config + templates, with the legacy backends untouched
and still serving the unported targets (the [render] section's presence
selects the pipeline; the transitional dispatch dies with the last port).

gen/press/context.py -- plates to plain dicts, with three mechanical
enrichments and zero decisions: discriminant expansion (every closed
enumerated field gets ALL its boolean companions, so strict mode never
trips on a legitimate branch), quoted companions (<field>_q, the JSON
escape repertoire valid verbatim across the C-family languages), and loop
metadata (is_first/is_last/index0; bare-string list items are lifted to
{value, value_q}). Plus the pre-split member views (attributes/elements/
value, own and merged), flattened Name casings ({{name.snake}}), a
self-reference for inner scopes, and the generated-file banner text.
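
A rough sketch of two of the mechanical enrichments, assuming plain dicts and lists as the context shape. The function names and the `<field>_is_<option>` companion spelling are hypothetical; only `is_first`, `is_last`, `index0`, `value`, and `value_q` are named in the commit message, and `json.dumps` stands in for the quoting step.

```python
import json


def with_loop_metadata(items):
    """Attach is_first/is_last/index0 and lift bare strings to dicts."""
    out = []
    last = len(items) - 1
    for index, item in enumerate(items):
        if isinstance(item, str):
            # Bare strings become {value, value_q}; value_q is the
            # JSON-escaped, quoted spelling.
            item = {"value": item, "value_q": json.dumps(item)}
        else:
            item = dict(item)
        item.update(is_first=index == 0, is_last=index == last, index0=index)
        out.append(item)
    return out


def expand_discriminant(field, value, vocabulary):
    """Emit ALL boolean companions for a closed enumerated field."""
    if value not in vocabulary:
        raise ValueError(f"{value!r} is not in the {field} vocabulary")
    return {
        f"{field}_is_{option.replace('-', '_')}": option == value
        for option in vocabulary
    }
```

Because every companion is present (true or false), a strict engine never sees a missing key when a template branches on a shape the current plate does not have.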

gen/press/render.py -- manifest expansion and rendering: [[render.type]]
rows render every plate whose strategy matches into an output pattern
composed from the plate's casings ({snake}.go, mx_{snake}.h -- C's
header/impl pairs are just two rows); [[render.once]] rows render against
the whole target with the complete output list in context (outputs,
outputs_by_ext for build manifests). Fail-loud checks: unknown strategies,
uncovered strategies, case-insensitive output collisions, unknown
placeholders, missing generated-file markers. The optional [render] format
command (gofmt and friends) runs over a scratch directory before the
writer's idempotence diff -- target data, generically executed.
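
Output-pattern expansion with two of the fail-loud checks might look like the sketch below. It assumes each plate exposes its casings as a flat dict; `expand_outputs` and `check_collisions` are hypothetical names, not the functions in gen/press/render.py.

```python
import string


def expand_outputs(pattern, plates):
    """Expand a pattern such as 'mx_{snake}.h' once per plate."""
    fields = {name for _, name, _, _ in string.Formatter().parse(pattern) if name}
    outputs = []
    for plate in plates:
        unknown = sorted(fields - set(plate))
        if unknown:
            raise ValueError(f"unknown placeholder(s) {unknown} in {pattern!r}")
        outputs.append(pattern.format(**plate))
    return outputs


def check_collisions(outputs):
    """Reject outputs that differ only by case (or not at all)."""
    seen = {}
    for path in outputs:
        key = path.casefold()
        if key in seen:
            raise ValueError(f"output collision: {seen[key]!r} vs {path!r}")
        seen[key] = path
```

A header/impl pair is then just the same plates expanded through two patterns, for example `mx_{snake}.h` and `mx_{snake}.c`.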

The writer moved to gen/press/writer.py (a shim keeps the legacy backends
importing); config grows the [render] section; the CLI gains
"python3 -m gen render --config C --type N" for template debugging.
14 new tests cover the context enrichments end to end against real
template files, every manifest failure mode, and the format hook.
claude and others added 14 commits June 11, 2026 06:39
Phase 4 of the generator-agnosticism redesign: the C backend no longer
exists. Everything C-shaped lives in gen/test/c/templates/ (fourteen
Mustache templates: one per value shape as header/impl pairs, one complex
pair covering all five complex strategies, the runtime and document
support files, and sources.cmake composed from the manifest's own output
list) and in the [render] manifest in the target's config.toml. The
mx_/MX_ spellings, ownership idioms, include lines, and libxml2 grammar
are template text; identifiers, casings, clamp steps, union cases, member
views, and dependency lists all arrive as plate data.

The parity gate: 674 of 677 files regenerate byte-identical through the
press. The three deviations are the OLD output's warts, now fixed --
mx_encoding.h/mx_key.h carried a double space from the legacy backend's
child-field spacing bug ('MxYyyyMmDd  encoding_date'), and
mx_score_instrument.c spelled a pointer as '&(*ch->x)'. Verified beyond
bytes: zero-warning build, values-smoke, C corert 776/0/52, and valgrind
across the full suite (0 leaks, 0 errors).

Supporting changes, all neutral:
- The flattened union view in the context builder (per-case
  tag_ident with loop metadata at the granularity the kind enum actually
  has) and UnionPlate.open_ended; the open-string-member ordering guard
  moved from backend Python into the plates' collision gate, where union
  parse semantics (not language) put it.
- Discriminant expansion uses earlier-field-wins so PlateRef's category
  and kind vocabularies (which share 'value'/'complex') stay consistent.
- Dotted resolution through a present-but-None value is falsey rather
  than a strict-mode error, matching the engine's None deviation.
- has_<list> companions and the 'family' vocabulary in the context.

The legacy BACKENDS registry now knows only 'go'; it dies with the Go
port next.

Phase 5 of the generator-agnosticism redesign: the last per-language
backend is gone. The cardinal rule now holds structurally and is enforced
by a test.

The Go target is eight Mustache templates (gen/test/go/templates/) plus
its [render] manifest: one per value shape, one complex template covering
value-class/composite-class/flag/attrs-class, a separate inherit template
(base embedding with field promotion is a Go idiom, so it is Go template
text), the document entry points, and the runtime. gofmt runs as the
manifest's generic format hook, exactly as before. The port's parity gate
came out exact: all 336 files regenerate byte-identical through the press
(the one template-shaped wrinkle: Go composite literals after an
interpolation form three braces, which Mustache reads as a triple-stache;
the templates write a space that gofmt then removes).

DELETED from the generator, never to return:
- gen/emit/ entirely (the Go backend, the BACKENDS registry, the writer
  shim -- the writer lives in gen/press/writer.py).
- [target] language: nothing selects on it; the generator cannot tell
  which language it is emitting.
- [target] namespace and prefix as prescribed keys: namespace was only
  ever language-flavored (the Go package name is the Go templates' own
  text; cpp's moved to [vars]); prefix survives as symbol-prefix, the
  projection-contract key the collision gate depends on.
- [layout] entirely, with plate.file, FileSpec, Plates.files, the
  file-stem collision check, and [naming] file-convention: output paths
  are the manifest's output patterns, and the press's expansion check
  covers their collisions.

gen/tests/test_agnosticism.py pins the rule: the generator's Python is a
closed set of packages (xsd, ir, names, config, plates, press, tests),
no module may be named after a language, and a target's templates/
directory may contain no Python. Both corert suites green; both targets
regenerate byte-idempotent; plates --check works for the templateless
C++ target.
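
Two of the three pinned facts could be checked along these lines. This is a hedged sketch, not the contents of gen/tests/test_agnosticism.py: it assumes packages are identified by an `__init__.py`, and both helper names are hypothetical.

```python
from pathlib import Path

# The closed set of packages named in the commit message.
ALLOWED_PACKAGES = {"xsd", "ir", "names", "config", "plates", "press", "tests"}


def unexpected_packages(gen_root):
    """Return package directories under gen_root outside the closed set."""
    return sorted(
        child.name
        for child in Path(gen_root).iterdir()
        if (child / "__init__.py").is_file() and child.name not in ALLOWED_PACKAGES
    )


def python_in_templates(target_dir):
    """Return any .py files found under a target's templates/ directory."""
    templates = Path(target_dir) / "templates"
    return sorted(p.relative_to(templates).as_posix() for p in templates.rglob("*.py"))
```

A test would assert that both functions return empty lists for the real tree.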
Phase 6 of the generator-agnosticism redesign: the neutral target that has
been the plates design's forcing function since day one (plates.md section
9) now exists, and adding it touched ZERO generator Python. The target is
gen/schema/config.toml plus one template: a JSON Schema (draft 2020-12)
rendering of the MusicXML 4.0 spec (with the sounds fold), 373 $defs.

It consumes only the neutral core, exactly as the design promised: $defs
keys and properties are wire names (kebab forms, never casings); enums are
wire literals (space-separated and empty values verbatim); number facets
become minimum/maximum/exclusiveMinimum; patterns pass through; unions are
anyOf with the open-enum (instrument-sound = sound-id ref | open string)
falling out with no special case; docs become descriptions. No [types]
map, no [reserved] words, no symbol prefix -- the target binding is inert.
The generated-file marker rides in the schema's own "$comment" (JSON has
no comments; the writer's prune gate is satisfied without one).
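
The mappings listed above can be sketched as plain dict construction. The real target does this in a Mustache template, so the Python below only illustrates the resulting JSON Schema shapes; the helper names and the sample bounds are hypothetical.

```python
def ref(wire_name):
    """Reference another definition by its wire (kebab) name."""
    return {"$ref": f"#/$defs/{wire_name}"}


def number_schema(family, minimum=None, maximum=None, exclusive_minimum=None):
    """Number facets become minimum/maximum/exclusiveMinimum."""
    schema = {"type": "integer" if family == "integer" else "number"}
    if minimum is not None:
        schema["minimum"] = minimum
    if maximum is not None:
        schema["maximum"] = maximum
    if exclusive_minimum is not None:
        schema["exclusiveMinimum"] = exclusive_minimum
    return schema


def enum_schema(literals):
    """Enums are wire literals, kept verbatim (spaces and empty included)."""
    return {"type": "string", "enum": list(literals)}


def union_schema(members):
    """Unions are anyOf over their members."""
    return {"anyOf": list(members)}
```

The open enum needs no special case: `union_schema([ref("sound-id"), {"type": "string"}])` is the whole definition of `instrument-sound`.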

Representation note, recorded in the config: complex types are modeled
over the merged flat member view; choice exclusivity and sequence nesting
are not encoded. The resolved content tree is on the plates whenever a
template revision wants oneOf nesting -- that will be template work, which
is the point.

make gen now runs the renderable targets (go/c/schema); gen-cpp stays
defined for when the C++ templates exist. gen/tests/test_schema.py renders
the target through the ordinary pipeline and pins the neutral facts.

Sweep the docs to match the implemented generator-agnosticism design:

- AGENTS.md: state that the cardinal rule now HOLDS and is enforced
  structurally by gen/tests/test_agnosticism.py; update the repo layout
  (plates/, press/, schema/, per-target templates/); rewrite the
  generator-architecture paragraph around the press render pipeline; add
  the render command; refresh the status section.
- gen/README.md: rewrite pipeline step 4 around gen/press, update the
  layout block (emit/ is gone), document the render debugging command.
- generator-agnosticism.md: status -> implemented; append section 11
  with implementation notes (parity outcomes, the schema proof target,
  context-builder additions, deleted transitional config keys).
- plates.md: supersession note pointing emit-stage concepts at
  generator-agnosticism.md.

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
The file began with stray bytes ("Octave.cc") before the XML declaration,
so it carried an .invalid sibling marker and every harness skipped it.
Delete the errant prefix and the marker; the file is otherwise well-formed
MusicXML and now round-trips in both the Go and C corert suites
(777 passed, 0 failed, 52 version-gated skips).

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
…butes

Answer the review question on the record: the attributes removed from
elision.3.0.xml (underline, overline, line-through, rotation,
letter-spacing, xml:lang, dir) and extend.3.0.xml (the font group) are
valid in MusicXML 3.0 ONLY. 3.1 retyped elision from text-font-color to
the new elision type (font + color + smufl) and narrowed extend from
print-style to position + color; 4.0 kept the narrowed definitions.
MusicXML broke its own backward compatibility here, so no 3.1+ model can
represent the attributes and a 4.0 copy keeping them would be invalid --
there is nothing to copy forward. Record the finding in data/README.md
beside the corpus conventions.

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
No target's config.toml references it and MusicXML 4.1 is unreleased; the
4.1 XSD stays vendored for schema diffing, but an unreferenced sounds
companion is dead weight.

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
The 3.0 root-version pin is a constant in each harness's normalize module
(musicXMLVersion in Go, MUSICXML_VERSION in C), not a property of the
corpus or the architecture. Describe it that way once and have the flow
and normalization steps refer to "the harness baseline" instead of
repeating the literal; likewise the opening line now says the generator
reads whichever XSD a target's config pins rather than naming 4.0.
Also refresh the suite count (777 after the ly33d fix).

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
The XSD constrains several string types with pattern facets (color,
comma-separated-text, ending-number, time-only, yyyy-mm-dd, the SMuFL
glyph names) and one with minLength (measure-text); until now every
target documented them as "not enforced".

Plates (language-neutral): StringPlate keeps the raw facets and gains
`pattern`, the facets re-spelled as ONE anchored regex in a portable
dialect -- XSD's implicit whole-value anchoring made explicit, same-step
facets OR-joined (XSD semantics), \i/\c name-class escapes expanded to
explicit ASCII classes, ^/$ XSD-literals escaped, and anything without a
portable spelling (class subtraction, \C/\I, \p) failing loud. The
translator covers every pattern in the 3.1 and 4.0 schemas (asserted by
test).
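
A simplified sketch of the translation rules listed above. It is not the plates' translator: the ASCII stand-ins for `\i` and `\c`, the function names, and the decision to reject name-class escapes inside `[...]` are assumptions made to keep the example short.

```python
import re

NAME_START = "[A-Za-z_:]"      # assumed ASCII stand-in for XSD \i
NAME_CHAR = "[A-Za-z0-9._:-]"  # assumed ASCII stand-in for XSD \c


def translate_facet(xsd):
    """Re-spell one XSD pattern facet in a portable regex dialect."""
    out = []
    in_class = False
    i = 0
    while i < len(xsd):
        ch = xsd[i]
        if ch == "\\":
            if i + 1 >= len(xsd):
                raise ValueError("dangling backslash")
            esc = xsd[i + 1]
            if esc in "CIpP":
                raise ValueError(f"no portable spelling for \\{esc}")
            if esc in "ic":
                if in_class:
                    raise ValueError("name-class escape inside [...] is not handled here")
                out.append(NAME_START if esc == "i" else NAME_CHAR)
            else:
                out.append("\\" + esc)
            i += 2
            continue
        if in_class and xsd.startswith("-[", i):
            raise ValueError("class subtraction has no portable spelling")
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        elif ch in "^$" and not in_class:
            out.append("\\")  # ^ and $ are literals in XSD patterns
        out.append(ch)
        i += 1
    return "".join(out)


def anchored_pattern(facets):
    """OR-join same-step facets and make whole-value anchoring explicit."""
    return "^(?:" + "|".join(translate_facet(f) for f in facets) + ")$"
```

For instance, two hypothetical same-step facets `#[\dA-F]{6}` and `#[\dA-F]{8}` become the single pattern `^(?:#[\dA-F]{6}|#[\dA-F]{8})$`.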

Go target (templates only): a type with a pattern compiles it and
TryParseX reports false on a mismatch; minLength likewise (rune count).
The lenient ParseX the deserializer uses keeps the value verbatim:
unlike a numeric bound there is no canonical replacement for a failed
pattern, and round-trip fidelity wins -- the policy is recorded in
data/README.md beside the numeric leniency rules. C deliberately leaves
its "Pattern (not enforced)" comment: enforcement there is template
work whenever wanted, no generator change required.

The other restrictions were audited and were already enforced: numeric
bounds clamp (including primitive-implied minimums and exclusive-bound
epsilon), unknown enum literals fall back to the first variant, and
union primitive members clamp like named number types. Both corert
suites stay green (777 passed, 0 failed, 52 skipped); the values smoke
test now asserts pattern acceptance/rejection and lenient passthrough.

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
Prove the manifest is architected for type-level extensibility: custom
code for one element or attribute must be a config-and-template change,
never a generator change.

The generic (language-free) mechanism: a [[render.type]] row now selects
either by `strategies` (the shape-driven stock case) or by `types` --
exact wire names. Type rows override strategy rows: a plate named by any
type row is rendered only by its type rows, so a bespoke type never
falls through to the stock template. Fail-loud checks: exactly one
selector per row, and a `types` name no plate carries is a stale
manifest entry.
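
The selection rule can be sketched as below, assuming each manifest row is a dict with a `template` key plus exactly one of `strategies` or `types`, and that plates are given as a wire-name to strategy mapping. The function name, the data shapes, and the template file names in the usage note are hypothetical.

```python
def select_templates(rows, plates):
    """Map each plate's wire name to the templates that render it."""
    for row in rows:
        # Exactly one selector per row.
        if ("strategies" in row) == ("types" in row):
            raise ValueError(f"row {row.get('template')!r} needs exactly one selector")

    claimed = set()
    for row in rows:
        for name in row.get("types", []):
            if name not in plates:
                raise ValueError(f"stale manifest entry: no plate named {name!r}")
            claimed.add(name)

    selected = {}
    for name, strategy in plates.items():
        if name in claimed:
            # Type rows override strategy rows for the plates they name.
            selected[name] = [r["template"] for r in rows if name in r.get("types", [])]
        else:
            selected[name] = [r["template"] for r in rows if strategy in r.get("strategies", [])]
    return selected
```

With a stock row for the `string` strategy and a type row naming `yyyy-mm-dd`, that plate is rendered only by the bespoke template while every other string plate still gets the stock one.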

The proof in the Go target: yyyy-mm-dd is claimed by a `types` row and
rendered from its own template. The wire API (TryParse/Parse/String)
matches the stock string template so the rest of the model composes
unchanged, storage stays the raw wire string (round-trip fidelity, no
harness pre/post-processing needed), and the bespoke part is typed
date-component accessors: Yyyy(), Mm(), Dd() -> int, with the model's
usual number leniency (unparseable -> 0) and BCE years handled. The
values smoke test pins components, wire fidelity, and the lenient path;
both corert suites stay green (777 passed, 0 failed, 52 skipped).

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
The counterpart proof to the Go target's bespoke yyyy-mm-dd, this time
with ZERO generator edits: the whole change is two config rows and two
templates -- exactly what the cardinal rule promises for custom handling
of a single type.

comma-separated-text keeps the stock wire API (the char* typedef and
_parse, so every consumer struct composes unchanged and round trips are
untouched) and adds an items accessor: a malloc'd NULL-terminated array
of malloc'd char* items, split on the pattern's ", ?" separator (one
optional space after each comma is consumed), with a matching free
function. NULL/empty values yield zero items.
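
The split semantics, shown in Python for brevity rather than as the C accessor itself: a hypothetical `items` function that consumes a comma and at most one following space.

```python
import re


def items(value):
    """Split comma-separated-text on ', ?'; None or empty yields no items."""
    if not value:
        return []
    return re.split(r", ?", value)
```

Only one space after each comma is consumed, so a second space stays part of the next item, and the stored wire string is never modified.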

values-smoke covers the split (both separator spellings), wire-spelling
fidelity, and the empty case; valgrind reports all heap blocks freed.
Both corert suites stay green (777 passed, 0 failed, 52 skipped).

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
webern added a commit that referenced this pull request Jun 14, 2026
Initially I tried using AI to reverse-engineer the pseudo-hand-rolled
original codegen. It worked but resulted in 12k lines of unredeemable
Python garbage with no room to maneuver from there. So, I started over
and developed a `gen/` program from the ground up using AI. I tried to
keep it agnostic as to language target, so I think targeting other
languages and use-cases is a real possibility now. Early on, I used Go
and C as language targets to force the AI to think about extensibility.
Those targets exist under `gen/test`, but are intended more as `gen/`
program regression tests than for actual MusicXML use.

For the replaced `mx::core` code, I prioritized compile time and better
use of C++ features like `variant` and `optional`. AI wanted to drop the
`ezxml` abstraction and I guess it was time to let it go, so `pugixml`
is promoted to `mx::core` interaction.

A `test-core-dev` target was used to allow the AI to innovate on
`mx::core` without worrying about `mx::impl` and `mx::api` (in fact, I
deleted those layers during code-gen). Then I replaced those layers and
burned tokens to preserve the `mx::impl` algorithms targeting the new
set of `mx::core` classes.

## References

- Closes #157
- Closes #158
- Progresses #58
- Closes PR #167
- Closes PR #168

## Follow-ups:
- surface more features in the `mx/api` layer. top priority is probably
SMuFL
- better packaging and distribution
@webern

webern commented Jun 14, 2026

Owner Author

Superseded by #169

@webern webern closed this Jun 14, 2026