newgen#168
First stages of the gen pipeline (python3 -m gen):
- gen/xsd: ElementTree parser into a model of the MusicXML XSD
subset, plus a structural analysis report ("analyze" command).
- gen/ir: lowers the XSD model to a resolved, dependency-ordered
intermediate representation ("ir" command). Names and hoists
anonymous types, collapses restriction chains, resolves refs,
and drops unreferenced types.
- gen/tests: unittest suite asserting the acyclic-graph and
no-name-collision invariants and IR referential integrity
across all MusicXML versions in docs (make test-gen).
- gen/README.md: architecture, IR glossary, and XSD analysis.
- AGENTS.md: reconcile the now-stale generator section.
Collapse the IR's preserved named structure on demand instead of forcing every emitter to re-derive it. Resolver (gen/ir/resolve.py) expands attribute groups, splices model-group refs into content, and walks the base chain, exposing attributes/all_attributes/content/elements. build now computes deps through it, removing the duplicated group walk. Also make UnionMember hold a Ref so every type reference in the IR has one shape, and add `ir --resolve` to dump the collapsed view.
Fold the standard instrument-sound identifiers into the IR as a sound-id enum unioned with an open string (gen/ir/sounds.py): the XSD types instrument-sound as xs:string and lists the values only in the sounds.xml companion, not the schema. Opt-in per target. Vendor each MusicXML version's XSD and sounds.xml under docs/ with matching git-commit hash suffixes (3.0 5fd8eb3, 3.1 8bbe8e5, beside the existing 4.0 ed15c23 and 4.1 0d56097). Each target pins its schema and sounds policy via [input] xsd and [sounds] xml: C++ is 4.0 with sounds, C is 3.1 with sounds, Go is 3.1 without -- the C/Go pair differ only by the fold, keeping the generator honest about extensibility.
Design doc for the layer between the IR and the templates: the per-target projection that templates consume. The IR stays a pure, config-free function of the schema; the Galley is where config.toml meets it, so templates can stay dumb. Covers: the name and rejected alternatives; one-rich-layer-with-two-field-groups vs two passes, decided by a JSON Schema emitter contrast; the materialized, dumpable data shape built on the Resolver; automatic name-convention expansion with a precise tokenizer and a worked table (default-x, brass.alphorn, the empty value, midi/id); the wire string preserved separately; the two-tier rename/override system with scoped addressing, TOML schema, precedence, and IR validation; post-projection collision detection as a CI gate; the transformation catalog (representation, cardinality, primitives, derived types, default/fixed to variant, structure, files, docs, ordering); and a JSON Schema walkthrough proving the layer is neutral, not C++-shaped.
Each metadata object handed to a template (one per emitted type) is a Plate; the collection projected for a target is the Plates. Rename the design doc, rework its name-and-rationale section around the music engraving plate metaphor, rename the sketched dataclasses (EnumPlate, NumberPlate, StringPlate, UnionPlate, ComplexPlate, PlateRef, Plates) and the planned CLI to 'gen plates', and add the projection stage to the gen/README.md pipeline. The former layer name is removed entirely. https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
Add the projection stage to the generator pipeline description, define plate (per-type, template-facing metadata object) and the Plates (the per-target collection), point to docs/ai/design/plates.md from the architecture section, the status note, and the repository layout tree. https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
Implement docs/ai/design/plates.md as gen/plates/: the per-target projection that sits between the IR and the templates. One plate per emitted type, internally split into a neutral core (wire names, resolved structure, facets, docs) and a target binding (casings, sanitized identifiers, type mappings, strategy tags, file assignment).
- gen/plates/names.py: tokenizer (separators, case transitions, digits ride), the convention registry (pascal/camel/snake/kebab/screaming), acronym set, and the reserved-word/validity sanitizer.
- gen/plates/model.py: the plate dataclasses; every plate, member, and variant carries both its Name bundle and its final sanitized identifier.
- gen/plates/build.py: the projection. Rename validation (fail loud on stale keys), member flattening over the Resolver (effective cardinality computed from the resolved particle tree), default/fixed literals resolved to enum variant identifiers, per-type file assignment with an include graph.
- gen/plates/check.py: post-projection collision detection (type, variant, member, and case-insensitive file-stem scopes).
- gen/plates/languages.py: per-language seeds (type maps, reserved words, doc styles) that config overrides.
- gen/config.py: parse the new [target]/[naming]/[reserved]/[types]/[layout]/[docs]/[rename.*] sections, including the two-tier rename scheme, scoped addressing, and the [naming] extends shared base.
- gen/naming.base.toml: first real shared rename -- barline carries both elements and attributes named segno/coda, which any field casing collapses; the attributes become segno-sound/coda-sound for every target.
- gen/ir/build.py: canonicalize xs:ID/xs:IDREF to token so the IR primitive set stays the documented eight.
- CLI: python3 -m gen plates --config C [--type N] [--check]; --check is a CI gate like analyze.
- gen/tests/test_plates.py: the design's worked conversion table, override tiers and precedence, fail-loud validation, collision detection, and spot-checked projections of all three shipped target configs.
Design deltas recorded in plates.md section 11 (MemberRepr dropped, named convention fields, grouped partition deferred, group renames reserved).
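The tokenizer-plus-conventions idea described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the gen/plates/names.py implementation: the function names `tokenize` and `convert` are hypothetical, the splitting rule is reduced to separators plus lower-to-upper case transitions (digits stay attached to their token), and the acronym set, the empty-value handling, and the reserved-word sanitizer are all omitted.

```python
import re

# Hypothetical, simplified rules: anything non-alphanumeric separates tokens,
# and a lower-case letter or digit followed by an upper-case letter starts a
# new token. Digits ride with the token they are attached to.
_SEPARATORS = re.compile(r"[^A-Za-z0-9]+")
_CASE_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def tokenize(wire: str) -> list[str]:
    """Split a wire name into lower-case tokens."""
    tokens: list[str] = []
    for chunk in _SEPARATORS.split(wire):
        if chunk:
            tokens.extend(t.lower() for t in _CASE_BOUNDARY.split(chunk))
    return tokens


def convert(wire: str, convention: str) -> str:
    """Spell a wire name in one of the five named conventions."""
    tokens = tokenize(wire)
    if not tokens:
        return ""
    if convention == "pascal":
        return "".join(t.capitalize() for t in tokens)
    if convention == "camel":
        return tokens[0] + "".join(t.capitalize() for t in tokens[1:])
    if convention == "snake":
        return "_".join(tokens)
    if convention == "kebab":
        return "-".join(tokens)
    if convention == "screaming":
        return "_".join(tokens).upper()
    raise ValueError(f"unknown convention: {convention}")
```

Under these assumptions `convert("default-x", "pascal")` yields `DefaultX` and `convert("brass.alphorn", "screaming")` yields `BRASS_ALPHORN`; the wire string itself is left untouched and would be carried separately.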
…eview) Two independent reviews of 49a7c84 (one correctness-focused, one architectural). The accepted findings, by theme:
Correctness (the critical find):
- Resolver.flat_elements now owns the effective-cardinality view, with a co-occurrence-aware duplicate merge: occurrences of one element name in different branches of one choice are exclusive (optional), anything else can co-occur in a single instance and must be a vector. The old rule demoted every duplicate to optional, which mis-projected metronome's beat-unit (legally twice in one instance; the corpus exercises this). The flattening and the base-chain merge (all_flat_elements, base_chain) moved from the projection into gen/ir/resolve.py where schema reasoning lives once; never-occurring particles (max=0) are skipped.
Naming and the collision gate:
- Variant.ident is now the FINAL emitted constant. Constant scoping is a language fact (gen/plates/languages.py): bare for C++ enum class, composed for Go/C flat namespaces (NoteTypeValue1024th, MX_NOTE_TYPE_VALUE_1024TH); the projection composes, sanitizes, and the collision gate checks the namespace the target actually has (type idents and constants mutually, for composed scopes). Templates print idents verbatim, restoring 'templates do no naming'.
- gen/names.py is now a leaf vocabulary module (tokenizer, conventions, Name, sanitizer) below both config and plates, removing the latent config->plates import cycle and the TYPE_CHECKING workaround. Acronym sets are case-normalized.
Config tightening:
- Unknown top-level sections rejected (a [renames] typo used to silently drop every rename); [input]/[output]/[sounds] keys checked too.
- Removed dead surface: [layout] include-style, [reserved] policy, [naming] empty-value-word. The cpp decimal=Decimal mapping moved from language seeds to gen/cpp/config.toml ([types]).
- extends hardened: no chained bases, naming/rename sections only, and a base/target scope-vs-entry shape disagreement is an error instead of a silent wholesale replacement. String-list keys reject bare strings; [types] keys must name real IR primitives; empty rename entries and non-table rename kinds fail loud.
Plates model:
- all_members is always built for derived plates so the gate covers the merged chain under either strategy; ComplexPlate.member(wire, kind) joins content occurrences to fields; the content tree's IR particle types are declared part of the neutral contract; Plates.type_map dropped (PlateRef.ident and target_type are the published spellings); UnionPlateMember.name added so primitive members have a field name; complete C++ reserved-word list; PlateRef primitive-wire docstring.
- build_for_config in gen/plates is the one pipeline shared by CLI and tests.
Documented as review-round deltas in plates.md section 11, including the always-on gate (a plain plates dump fails loud too; --check is the quiet CI entry).
New tests: a particle-tree cardinality suite (exclusive vs co-occurring duplicates, wrappers, never-occurring particles), the real metronome assertion across schemas, composed-scope constants and their flat-namespace collisions, element-rename application, case-insensitive stem collisions, derived all_members under inherit, and the config rejection paths.
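The co-occurrence reasoning behind the cardinality fix can be illustrated with a small min/max computation over a particle tree. This is a sketch under stated assumptions, not Resolver.flat_elements: the `Element` and `Group` classes and the `bounds`/`cardinality` helpers are hypothetical, and the rule is simplified to "a choice takes the widest range of any one branch, a sequence sums its children". The example tree is metronome-like for illustration only and is not the real schema content.

```python
from dataclasses import dataclass

UNBOUNDED = float("inf")


@dataclass
class Element:
    name: str
    min: int = 1
    max: float = 1


@dataclass
class Group:
    kind: str       # "sequence" or "choice"
    children: list
    min: int = 1
    max: float = 1


def bounds(node, name):
    """Return (min, max) occurrences of `name` within one instance of `node`."""
    if node.max == 0:
        return (0, 0)               # never-occurring particles are skipped
    if isinstance(node, Element):
        return (node.min, node.max) if node.name == name else (0, 0)
    per_child = [bounds(child, name) for child in node.children]
    if not per_child:
        return (0, 0)
    if node.kind == "choice":       # branches are mutually exclusive
        lo = min(b[0] for b in per_child)
        hi = max(b[1] for b in per_child)
    else:                           # sequence members co-occur
        lo = sum(b[0] for b in per_child)
        hi = sum(b[1] for b in per_child)
    return (lo * node.min, hi * node.max if hi else 0)


def cardinality(root, name):
    """Classify a member as never / optional / required / vector."""
    lo, hi = bounds(root, name)
    if hi == 0:
        return "never"
    if hi > 1:
        return "vector"
    return "required" if lo >= 1 else "optional"


# Illustrative shape: one beat-unit, then either per-minute or a second beat-unit.
metronome_like = Group("sequence", [
    Element("beat-unit"),
    Group("choice", [Element("per-minute"), Element("beat-unit")]),
])
```

Here `beat-unit` can occur twice in one instance, so it classifies as a vector, while `per-minute` sits in only one branch of the choice and stays optional. A rule that demoted every duplicate to optional would lose the second `beat-unit`.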
The fourth pipeline stage: python3 -m gen <config.toml> now projects the
target's Plates and renders them through a per-language backend.
- gen/emit/__init__.py: backend dispatch. A backend is a module exposing
render(plates) -> {relative path: content} -- the dumb-renderer contract:
identifiers, casings, type mappings, file stems, and structure all arrive
resolved on the plates; a backend owns only its language's grammar and
support files. Backends register by language name; no backend ships yet,
so emitting any target fails loud with the known-backend list.
- gen/emit/writer.py: deterministic output. Files are written only when
their content changed (stable mtimes for build systems); files the
generator wrote previously but that left the manifest are pruned, but
only files carrying the generated-code marker are ever deleted --
unmarked files in the output directory are reported and left alone.
Backends must mark every file (checked), paths are validated against
escapes, and the marker satisfies Go's generated-code convention.
- gen/__main__.py: wire the emit command; fold the duplicated
load-lower-patch pipeline onto gen.plates.build_for_config and one
_lower helper (a review finding: the sounds fold existed in two places).
- gen/tests/test_emit.py: writer contract (idempotence, marker-gated
pruning, foreign-file safety, unsafe paths) and dispatch failure modes.
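The writer contract above (write only on change, prune only marked files, leave foreign files alone, reject escaping paths) can be sketched like this. It is a minimal illustration, not gen/emit/writer.py: the function name `sync_outputs`, the report shape, and the marker text are assumptions, and the sketch prunes any marked file missing from the current render instead of consulting a persisted manifest of previously written files.

```python
from pathlib import Path

# Hypothetical marker text; the real marker is whatever the generator stamps.
MARKER = "Code generated by gen. DO NOT EDIT."


def sync_outputs(out_dir: Path, files: dict[str, str]) -> dict[str, list[str]]:
    """Write rendered files deterministically and prune stale generated ones."""
    report = {"written": [], "unchanged": [], "pruned": [], "foreign": []}
    out_dir.mkdir(parents=True, exist_ok=True)
    root = out_dir.resolve()

    # Validate everything before touching the disk.
    for rel, content in files.items():
        if not (root / rel).resolve().is_relative_to(root):
            raise ValueError(f"unsafe output path: {rel}")
        if MARKER not in content:
            raise ValueError(f"rendered file is not marked: {rel}")

    # Write only when content changed, so mtimes stay stable.
    for rel, content in sorted(files.items()):
        target = root / rel
        if target.is_file() and target.read_text(encoding="utf-8") == content:
            report["unchanged"].append(rel)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        report["written"].append(rel)

    # Prune marked files that are no longer rendered; report unmarked ones.
    keep = {(root / rel).resolve() for rel in files}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.resolve() in keep:
            continue
        rel = path.relative_to(root).as_posix()
        if MARKER in path.read_text(encoding="utf-8", errors="replace"):
            path.unlink()
            report["pruned"].append(rel)
        else:
            report["foreign"].append(rel)
    return report
```

Running the function twice with the same input reports every file as unchanged on the second pass, which is the idempotence property the tests assert.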
The Go backend (gen/emit/go/) renders the four value shapes -- enum-class,
numeric-wrapper, string-wrapper, tagged-variant -- one template per shape,
plus the static runtime support file. Every generated type exposes the same
surface so the complex-type templates can call them uniformly:
TryParse<T>(s) (T, bool)    strict membership
Parse<T>(s) T               lenient (corpus fixup policies)
(T) String() string         the wire spelling
Leniency policies, in one place (runtime.go) and per-type clamps generated
from the plates' bounds: unknown enum literal -> first variant; unparseable
number -> 0; decimal-looking integers truncate; every number clamps into its
declared range, with primitive-implied lower bounds (positive_integer >= 1,
non_negative_integer >= 0) and exclusive bounds clamping to the nearest
representable value (decimal: +/- 1e-6, matching the corpus duration fixup).
Unions try members strictly in schema order; literal members become
payload-free kinds; an unmatched input is absorbed by the first member's
lenient parse.
Backends own only Go grammar: identifiers (including the composed enum
constants), casings, stems, and structure arrive final on the plates.
Rendered output is piped through gofmt once (the formatter owns tabwriter
alignment, per Go codegen convention; required on PATH, fail loud), and the
emit is byte-idempotent.
gen/test/go/mx/ now holds the 132 generated files (131 value types +
runtime.go), committed per project convention. They compile (go build) and
vet clean. gen/test/go/corert/values_smoke_test.go exercises the generated
types end to end: wire -> typed -> wire across the policies above (run:
go test -run TestValueSmoke ./corert/).
The C backend (gen/emit/c/) renders the four value shapes as header/impl
pairs (the documented one-FileId-to-two-files mapping for C), mirroring the
Go backend's surface in C idiom:
bool mx_<t>_try_parse(const char *s, MxT *out)    strict membership
MxT mx_<t>_parse(const char *s)                   lenient (fixup policies)
to_string: enums return static storage; numbers and unions return
malloc'd strings; string types ARE char* values (parse strdups).
Leniency policies match Go exactly (one policy, two spellings): unknown
enum literal -> first variant; unparseable number -> 0 with decimal-looking
integers truncating; clamps from the plates' bounds plus primitive-implied
lower bounds; exclusive decimal bounds clamp +/- 1e-6. mx_format_decimal
prints the shortest no-exponent spelling (8.5, 0.000001). Unions are tagged
structs whose kinds cover ref members and literals; the open instrument-
sound union (sound-id enum | open string) falls through to a strdup'd
string member, with mx_<t>_free owning the string-bearing kinds.
Per-type headers include their dependencies' headers via the plates'
include graph, which flushed out a real bug: union/member deps were
filtered by stem lookup, so the open string PRIMITIVE member of
instrument-sound matched the complex TYPE named 'string' and fabricated an
include of a not-yet-emitted header. _type_deps now excludes refs by
category, with a regression test pinning instrument-sound's includes to
exactly mx_sound_id.
The runtime pair (mx_runtime.h/c) carries the shared parse/format helpers;
its symbol and file prefixes come from the target config (TargetInfo gains
file_prefix). The backend also emits sources.cmake, the explicit build
manifest; gen/test/c/CMakeLists.txt now builds the generated model as the
mx-c static library and links it into corert-c, plus a values-smoke
executable mirroring the Go TestValueSmoke (all checks pass; zero compile
warnings). gen/test/c/mx/ holds the 269 generated files, committed per
project convention.
The Go backend now renders all four complex shapes and the document entry
points; the corert harness drives the generated model instead of the stub.
Every eligible corpus file round-trips: ~777 pass, 52 skip (see gating).
Representation (the Go spelling of the plate facts, chosen for round-trip
fidelity):
- Attributes are presence-tracked pointer fields, required or not: the
contract is 'write back exactly what was parsed', and corpus files do
omit required attributes.
- A composite stores children as ONE ordered list (Children []XChild, a
struct of typed pointers where exactly one is non-nil). Interleaved
choice content (measure's music-data, note's grace/cue branches,
metronome's repeated beat-unit) round-trips in document order for free,
which per-member vectors cannot do. No kind discriminator: harmony has a
child element literally named 'kind', so a synthetic field would collide.
- Parsing is strict about NAMES (unknown attribute/element -> error: the
version gate keeps newer documents out, so an unknown name is a generator
gap, not data) and lenient about VALUES (the typed Parse* policies).
- A derived type embeds its base; Go field promotion gives the flat view
and one merged parse/serialize pass.
- Document/FromXDoc/ToXDoc are generated from the plates' roots; the
root's xmlns declarations are preserved through the model (a few corpus
files declare xmlns:xlink).
Version gating: Go targets MusicXML 3.1; documents whose root declares a
newer version are skipped (reported, not failed) -- MusicXML is backward
compatible, so older documents parse; newer ones may use types the model
cannot represent.
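The version gate can be sketched as a comparison of the root's declared version against the target's schema version. This is an illustration only: the helper names are hypothetical, the harnesses themselves are written in Go and C, and treating a missing `version` attribute as 1.0 is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

SUPPORTED_VERSION = (3, 1)  # the Go target's schema version, per the text above


def declared_version(xml_text: str) -> tuple[int, ...]:
    """Read the root element's version attribute as a comparable tuple."""
    root = ET.fromstring(xml_text)
    raw = root.get("version", "1.0")  # assumption: absent means 1.0
    return tuple(int(part) for part in raw.split("."))


def should_skip(xml_text: str) -> bool:
    """Skip (report, do not fail) documents newer than the target schema."""
    return declared_version(xml_text) > SUPPORTED_VERSION
```

Older documents fall through to the parser because the format is backward compatible; only newer ones are set aside.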
Harness changes (gen/test/go/corert/):
- stub package deleted; the generated mx package is the implementation.
- Normalization strips whitespace-only character data everywhere (MusicXML
has no mixed content; pretty-printing indentation is not content, and an
empty <measure> holds only its own indentation). Applied to expected and
actual alike, so the comparison stays symmetric.
- The loader transcodes UTF-16 (BOM-detected) and ISO-8859-1 documents to
UTF-8: pugixml and libxml2 auto-detect these; Go's encoding/xml does not.
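The whitespace normalization can be sketched with the standard library's ElementTree. This is a hypothetical Python rendering of the idea, not the Go harness code; the function name is an assumption. It drops character data that is whitespace-only and leaves any text with real content untouched.

```python
import xml.etree.ElementTree as ET


def strip_whitespace_only(elem: ET.Element) -> None:
    """Remove whitespace-only text and tail nodes throughout the subtree."""
    if elem.text is not None and not elem.text.strip():
        elem.text = None
    for child in elem:
        strip_whitespace_only(child)
        if child.tail is not None and not child.tail.strip():
            child.tail = None


pretty = ET.fromstring('<part>\n  <measure number="1">\n  </measure>\n</part>')
compact = ET.fromstring('<part><measure number="1"/></part>')
for tree in (pretty, compact):
    strip_whitespace_only(tree)
```

After normalization the pretty-printed and compact documents serialize identically, and applying the same function to both expected and actual keeps the comparison symmetric.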
Corpus adjustments, each encoding a documented fact:
- data/synthetic/extend.3.0.xml and elision.3.0.xml used 3.0-only
attributes (extend lost its font attributes in 3.1; elision lost its
text-decoration attributes): MusicXML itself broke backward compatibility
there, so no current target schema (3.1 or 4.0) can represent those
attributes. The synthetic files now exercise the attribute set valid
across 3.0/3.1/4.0.
- data/lysuite/ly75a fixup updated to the uniform clamp policy ('', 'test',
and '0' all clamp to accordion-middle's minimum 1); the old expectations
encoded the legacy implementation's inconsistency (unparseable values
escaped the clamp). The policy is now documented in data/README.md.
…re review) Two independent reviews of the emit stage and its Go/C value backends. The accepted findings, by theme:
The clamp policy is now data on the plates (the load-bearing change):
- NumberPlate carries family (decimal|integer) and resolved ClampStep rules (facets merged with primitive-implied lower bounds, tightest wins, exclusive bounds clamping past by 1 or 1e-6), computed once in gen/plates/build.clamp_steps and unit-tested there (tie-breaks, exclusive max, implied minimums). Both backends' hand-mirrored _clamp_steps copies are deleted; 'one policy, two spellings' is now structural, not a comment.
- The policy hole the duplication hid is closed: a primitive numeric union member (positive-integer-or-empty's integer) now carries and applies the implied clamp, so unions enforce the same leniency as named number types.
- The int/float clamp mode comes from the IR base, not from string-matching the spelled target type (a [types] override no longer flips clamp mode).
- TryParse's contract is pinned: lexically strict, then clamps; generated doc comments only claim clamping when clamp steps exist.
Union discriminators are projected, not template-composed:
- UnionPlateMember.tag is a Variant scoped and renameable like an enum value; literal variants double as their own tags; the flat-namespace collision gate now covers every constant the backends emit. The 'Kind' infix is gone from generated constants (FontSizeDecimal, MX_INSTRUMENT_SOUND_SOUND_ID).
- An open string union member must be last (it matches anything): both backends fail loud instead of emitting unreachable members.
C runtime hardening:
- mx_format_decimal: sized buffer + snprintf return check (no truncated digit strings for extreme magnitudes); mx_strdup aborts on OOM instead of memcpy through NULL; mx_try_parse_int rejects ERANGE (aligning strictness with Go's ParseInt); generated parse entry points are NULL-safe (NULL means ""); include guards use _H_INCLUDED so they stay out of the constant namespace the gate certifies; dead variant_const helper removed.
- Go formatDecimal canonicalizes negative zero to "0" (parity with the C runtime and the corert normalizer).
Robustness around the edges:
- Backends reject schema types that project onto their reserved support stems (runtime/document/sources) instead of silently overwriting them.
- The writer overwrites an unreadable file at a manifest path instead of crashing; the gofmt scratch dir handles subdirectories.
- The emit CLI reports config errors, missing files, and a missing gofmt through the error path rather than a traceback.
Smoke tests extended in both languages: strict-rejection paths, negative formatting, negative-zero canonicalization, and the union implied-min clamp.
Both targets regenerate; the Go corert suite stays green end to end; documented as the second review round in plates.md section 11.
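The clamp policy described above (facets merged with primitive-implied lower bounds, tightest wins, exclusive bounds stepping past by 1 or 1e-6, then a lenient parse that clamps) can be sketched as below. This is an illustration under assumptions, not gen/plates/build.clamp_steps: the function names and the flat `(lower, upper)` result are hypothetical stand-ins for the ClampStep rules, and Python's `float()` grammar only approximates what the generated parsers accept.

```python
import math

# Primitive-implied lower bounds named in the text.
IMPLIED_MIN = {"positive_integer": 1, "non_negative_integer": 0}


def resolve_bounds(primitive, family, *, min_inclusive=None, min_exclusive=None,
                   max_inclusive=None, max_exclusive=None):
    """Merge facets and implied minimums into one (lower, upper) pair."""
    step = 1 if family == "integer" else 1e-6
    lowers = [
        IMPLIED_MIN.get(primitive),
        min_inclusive,
        None if min_exclusive is None else min_exclusive + step,
    ]
    uppers = [
        max_inclusive,
        None if max_exclusive is None else max_exclusive - step,
    ]
    lowers = [b for b in lowers if b is not None]
    uppers = [b for b in uppers if b is not None]
    # Tightest wins: the highest lower bound and the lowest upper bound.
    return (max(lowers) if lowers else None, min(uppers) if uppers else None)


def lenient_parse(text, family, lower, upper):
    """Unparseable input becomes 0, integers truncate, then the value clamps."""
    try:
        value = float(text.strip())
    except ValueError:
        value = 0.0
    if not math.isfinite(value):
        value = 0.0
    if family == "integer":
        value = int(value)          # decimal-looking integers truncate
    if lower is not None and value < lower:
        value = lower
    if upper is not None and value > upper:
        value = upper
    return value
```

With an illustrative integer range of 1 to 3, the inputs `''`, `'test'`, and `'0'` all land on 1, which is the uniform behavior the ly75a fixup encodes for accordion-middle's minimum.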
The C backend now renders all four complex shapes and the document entry points; the corert harness drives the generated model instead of the stub. Both secondary targets now round-trip the corpus: 776 pass, 0 fail, 52 version-skipped, in C and Go alike.
The C spelling of the same representation Go uses: presence-tracked attributes (bool has_x + value), children as ONE ordered array of structs whose typed pointers discriminate by non-NULL, strict about names, lenient about values. C has no inheritance, so derived types flatten the plates' all_members view into self-contained structs.
gen/emit/c/api.py is the value calling convention -- the single place that knows how generated C parses, prints, stores, and frees each plate kind (enum to_string static, number/union malloc'd, string values own themselves) -- consumed at every attribute, text body, and leaf child instead of inline ownership reasoning.
Parse errors flow through a runtime message channel (mx_error_set/mx_error) so parse functions return NULL with context instead of threading buffers. Serialization returns the created node (parent NULL -> free node) so the document root and nested elements share one code path; root namespace declarations are preserved through MxDocument (libxml2 keeps them in nsDef, so attribute loops never see them).
Harness changes (gen/test/c/):
- stub.h/stub.c deleted; roundtrip.c drives mx_document_* and gates documents declaring MusicXML > 3.1 (counted as skipped).
- normalize.c strips whitespace-only text nodes everywhere (mirroring Go) and sorts attributes by QUALIFIED name.
- compare.c compares each element's DIRECT text only (xmlNodeGetContent's subtree concatenation re-compared every leaf at every ancestor, so one numerically-equivalent reformat failed all its ancestors) and compares attributes by qualified name with entity-resolved values (a parsed xlink:href is (ns, href); a serialized one is the literal name).
- mx-c carries the libxml2 include path; corert-c links the model.
Corpus: lysuite/ly33d_Spanners_OctaveShifts.xml is marked .invalid -- it begins with stray bytes before the XML declaration, so it is not well-formed XML; strict parsers (libxml2) are entitled to reject it (Go's etree merely happens to tolerate leading garbage).
…record The architecture review of the complex-type milestones accepted findings:
The decision record (the review's top insistence): plates.md section 11 gains the round-3 entry -- the ordered-children representation and why the original per-member-field sketch cannot round-trip MusicXML (interleaved music-data, metronome's repeated beat-unit), the no-discriminator rationale (harmony's <kind>), presence-tracked required attributes, the strict-names/lenient-structure/lenient-values contract (the generated packages are order-faithful typed DOMs, not validating bindings; content and cardinality stay on the plates for the C++ backend and the JSON Schema forcing function), and an explicit instruction that C++ should use a real sum type rather than copying this encoding. Sections 8.1/8.7 now point at it instead of contradicting it. AGENTS.md's normalization-pipeline section documents the whitespace stripping, qualified-name attribute handling, direct-text comparison, and encoding transcoding the C++ harness will need.
Single-sourced facts:
- Plates.schema_version (parsed from the source stem) is emitted into each runtime (SupportedMusicXMLVersion, MX_SUPPORTED_MUSICXML_VERSION) and the corert harnesses read it: retargeting a schema cannot leave a stale version gate. The hand-kept 3.1 constants are gone.
- The duplicated shape queries moved beside the data: attribute_members / element_members / value_member, ComplexPlate.members_view() (the strategy-resolved member list), and Plates.children_owner() (base-chain walking was schema reasoning inside a template) live in gen/plates/model; both backends consume them, and a third backend will too.
Identifier guards: the few names the backends still compose (per-type Child structs, Children/has_/children_count fields, the document support types) are now guarded at render time -- a schema name landing on one fails loud with a rename suggestion instead of surfacing as a compile error in generated code.
Serializing a child with zero or multiple fields set is documented as undefined on the Child types.
Also: status docs refreshed (both corert suites green; generated models committed; C++ backend the remaining gap).
Verified after regeneration: 83 unit tests, gofmt-clean Go build/vet, zero-warning C build, both smoke binaries, and both corert suites green (776/0/52). Valgrind over the entire C corert run: 52.9M allocations, zero leaks, zero errors.
…delity The complex-type code review's accepted findings, and the real defect the first of them exposed:
Harness soundness (the review's top three):
- The C comparison now checks namespace declarations: libxml2 keeps xmlns/xmlns:foo in nsDef, never in the attribute list, so the model's namespace preservation was previously unverifiable. Turning the check on immediately caught a real round-trip defect: serialization built the tree DETACHED (the document was attached last), so when libxml2 resolved the reserved xml: prefix for xml:lang/xml:space with no document context it fabricated an xmlns:xml declaration on the carrying element -- 35 corpus files diverged from their inputs invisibly. The document template now serializes under a scratch parent attached to the document, where the implicit xml namespace resolves without inventing declarations.
- The Go harness sorts and compares attributes by QUALIFIED name (FullKey): a defect dropping a prefix (xlink:href -> href) can no longer pass on the local name. The C side's qualified-name buffers abort on truncation rather than letting two truncated names compare equal.
- Numeric-equivalence scope and the version-pinning rewrite were flagged; both are the documented corert design (AGENTS.md), shared with the C++ reference harness, and are deliberately unchanged.
Template fixes:
- String-plate children are stored unboxed: CValue carries an explicit is_pointer_value flag (raw string primitives AND char* typedefs like MxMode) instead of sniffing the spelled type, removing a needless allocation per string-valued child.
- The Go backend fails loud on the one derivation shape its inherit template cannot render (element members spread across the base chain -- no MusicXML schema has one; the C flatten path is immune).
- Serialize paths abort on libxml2 allocation failure, matching the runtime's OOM policy; the redundant Go string-primitive special cases collapsed into the shared parse expression (byte-identical output).
Verified end to end after regeneration: 83 unit tests, go vet + Go corert green, zero-warning C build, both smoke binaries, C corert 776/0/52, and valgrind across the full C corert run: 0 leaks, 0 errors.
| sibling marker.
| 2. For each file:
| a. Load the XML into a DOM.
| b. Set the root `version` attribute to `"3.0"`.
Seems wrong. Seems like this should be 3.0, 3.1 or 4.0
| 1. Set XML declaration: `<?xml version="1.0" encoding="UTF-8" standalone="no"?>`.
| 2. Set DOCTYPE based on root element name (`score-timewise` vs `score-partwise`).
| 3. Set root `version` attribute to `"3.0"`.
Again, there are multiple versions in play in the repo and we will move this to 4.0
| <!--
| MusicXML sounds file
|
| Version 4.1 Draft
MusicXML 4.1 is not released and I believe the sounds version tracks 1:1 with the MusicXML version. Thus I believe this should be deleted.
| [target]
| language = "cpp"
| namespace = "mx::core"
Is this a "hardcoded" property, e.g. target.namespace is part of the prescribed structure for the toml file? If so, it seems a bit language-specific. Or, is it dynamic in the sense that I could say target.foo="bar"?
| @@ -0,0 +1,115 @@
| """Go backend: render the Plates into the Go test target package.
Oh no. This is very bad and not what I wanted at all.
It should not be required to write a different python program for Go than for C. The whole point of the exercise was to make sure we did not have to write bespoke logic/code for specific targets. What is this doing and why is it needed?
We need to analyze what's being done in these bespoke Go and C backends and re-design so that these backends do not exist.
| # one flat namespace and carry the type's name (Go package-level constants,
| # C's single global namespace). This is a language fact, not configuration;
| # the composition itself happens in the projection so Variant.ident is final.
| VARIANT_SCOPES: dict[str, str] = {
No. I do not want the generator to know about languages or have a list of languages that it supports.
… redesign New cardinal rule (from the PR #168 review): the generator is language agnostic; adding a new language target must not require edits to the generator's Python files. The current implementation violates it -- roughly 2,500 lines of Python ARE Go and C (gen/emit/go/, gen/emit/c/, the language tables in gen/plates/languages.py, the BACKENDS registry, and prescribed config keys like [target] namespace that exist only because specific languages want them).
The design doc specifies the redesign:
- A target becomes a PACK: config.toml + a templates/ directory; the generator cannot tell which language it is emitting (no language registry, no language name anywhere in Python).
- Config splits into the prescribed projection contract (every key definable without naming a language: conventions, renames, symbol-prefix, variant-scope, [types], [reserved]) and a freeform [vars] table passed verbatim to templates -- answering the review question about target.foo="bar" directly. languages.py's tables become required config.
- The PRESS (gen/press/): a deliberately minimal, stdlib-only, mustache-class template engine -- variables, sections, inverted sections, recursive partials, loop metadata, one quoted-literal modifier, and constitutionally nothing else (no expressions, filters, or string functions). Its poverty is load-bearing: if a template cannot express something, the plates must carry it. Dispatch is by manifest (one template file per shape) and by mechanical discriminant expansion in the context builder, never by in-template logic.
- A render MANIFEST in config declares template -> output-pattern rows, absorbing file layout generically: C's header/impl pairs are two rows, partitioning is implicit, [layout] dies, support files are templates, and the gofmt pass becomes a generic [render] format hook.
- Small neutral plate additions so templates need no lookups: PlateRef gains the referenced type's name bundle and kind; plates gain deps for include/import composition; doc text arrives pre-wrapped.
- Migration in six phases, each green, with a hard byte-parity gate (regenerate, git diff --exit-code over committed output) before each Python backend is deleted -- ending with the JSON Schema target added as a pure pack plus a CI assertion that the change touched no Python.
- Rejected alternatives recorded: per-target Python plugins (satisfy the letter, not the spirit), Jinja2 (expressiveness lets backends reconstitute inside templates), AST emitters, and keeping languages.py 'as data'.
AGENTS.md now states the cardinal rule up front and flags the current code as violating it, pointing here.
…nguage) Answers the review question 'why hand-roll instead of using a library?' properly -- the original draft argued only against Jinja2 and never weighed the real alternative, an existing Mustache implementation. The load-bearing commitment is now stated as Mustache-the-LANGUAGE: the press implements the published spec's interpolation/sections/inverted/partials core with exactly three documented deviations (missing keys are render errors with template:line -- the spec's silent empty output is disqualifying for a generator; no HTML escaping; no lambdas), and is tested against the official Mustache spec suite, buying the spec authors' edge-case coverage (especially the whitespace rules) without their code. The engine extensions the draft had invented (@first/@last loop metadata, the :q quote modifier) move out of the engine into the context builder as injected fields (is_first/is_last/index0, wire_q companions), so template syntax stays pure Mustache and the engine is swappable behind it. Section 9 now weighs chevron/pystache as the close call it is (spec-mandated silent missing keys, HTML escaping, weak diagnostics, unmaintained state, vs the repo's no-Python-deps precedent) and pre-commits the reversal trigger: if the press exceeds ~600 lines or cannot pass the spec suite in phase 1, vendor chevron and patch strictness/escaping/diagnostics -- zero template changes either way.
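Moving loop metadata and quoting out of the engine and into the context builder can be sketched as plain data injection. The field names is_first/is_last/index0 and wire_q come from the text above; the helper names are hypothetical, and producing the quoted literal with `json.dumps` is an assumption of this sketch, since the actual quoting rule is not stated here.

```python
import json


def with_loop_fields(items: list[dict]) -> list[dict]:
    """Inject the loop metadata a template would otherwise have to compute."""
    last = len(items) - 1
    return [
        {**item, "index0": i, "is_first": i == 0, "is_last": i == last}
        for i, item in enumerate(items)
    ]


def with_wire_q(item: dict) -> dict:
    """Add a pre-quoted companion for the wire string (assumed JSON-style quoting)."""
    return {**item, "wire_q": json.dumps(item["wire"])}


variants = with_loop_fields([
    with_wire_q({"wire": "default-x"}),
    with_wire_q({"wire": "brass.alphorn"}),
])
```

A template can then write a separator inside an inverted section on `is_last` using only standard Mustache syntax, which is what keeps the engine swappable.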
…lates' A target is a directory containing config.toml and templates/ -- the term 'target' already exists throughout the codebase and needs no companion word. The template collection is simply the target's templates.
Phase 1 of the generator-agnosticism redesign (commit series: engine ->
data motions -> context/manifest -> C port -> Go port -> proof target).
gen/press/engine.py implements the Mustache core -- interpolation (incl.
dotted names and implicit iterators), sections, inverted sections, partials
with call-site indentation, comments, set-delimiters, and the spec's
standalone-line whitespace rules -- with the three documented deviations
for code generation: missing keys are render errors carrying template:line
(present-but-None renders empty and is falsey; only absence errors), no
HTML escaping ({{{x}}}/{{&x}} are accepted synonyms), and no lambdas (a
callable in the context is an error).
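The strict-lookup deviation is easiest to see in miniature. The following is a minimal sketch, not the code in gen/press/engine.py: `RenderError`, `lookup`, and `interpolate` are hypothetical names, and only single (non-dotted) names are handled. It shows absence raising with a template:line location, present-but-None rendering empty, values passing through unescaped, and callables being rejected.

```python
class RenderError(Exception):
    """Raised when a template cannot be rendered strictly (hypothetical name)."""


def lookup(stack, name, template, line):
    """Resolve a plain name against a context stack, innermost frame first."""
    for frame in reversed(stack):
        if isinstance(frame, dict) and name in frame:
            value = frame[name]
            if callable(value):
                # Deviation: no lambdas.
                raise RenderError(f"{template}:{line}: callable value for {name!r}")
            return value
    # Deviation: absence is an error, never silent empty output.
    raise RenderError(f"{template}:{line}: missing key {name!r}")


def interpolate(stack, name, template, line):
    """Render one interpolation tag; None renders empty, nothing is HTML-escaped."""
    value = lookup(stack, name, template, line)
    return "" if value is None else str(value)
```

Keeping the None case distinct from absence is what lets a template branch on an optional field without tripping strict mode.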
Conformance is tested, not asserted: the five core modules of the official
Mustache spec test suite (mustache/spec, MIT) are vendored under
gen/tests/mustache_spec/ and all 122 cases pass with zero skips -- the
engine's strict/escape constructor knobs let the suite run under the
spec's own semantics while the production pipeline uses the defaults.
Deviation and robustness tests (error locations, recursion depth limit,
callable partial loaders, context-stack fallthrough) cover the rest.
Also sweeps the last 'pack' stragglers out of the design doc.
…es need

Phase 2 of the generator-agnosticism redesign: pure data motion plus the neutral plate additions templates require, with the legacy backends still in place and both corert suites green.

The generator loses its per-language tables: gen/plates/languages.py is DELETED. Each target's config.toml now carries the whole projection input as data -- the full [types] primitive->spelling map, the full [reserved] words list, and [target] variant-scope (bare|composed). The projection takes no defaults from anywhere but config; a target omitting [types] gets primitive passthrough, which is what a neutral target wants.

New config surface per the design:
- [vars]: freeform string key-values passed verbatim to templates and never interpreted by the generator -- where anything that cannot be defined without naming a language belongs.
- [reserved] members / type-suffixes: names a target's TEMPLATES synthesize (Go's Children field; Child/Kind type compositions), now enforced by the collision gate (gen/plates/check.py) instead of by per-backend Python guards.
- [docs] wrap is the wrapped doc TEXT width, excluding comment syntax (default 97; a 3-character prefix lands at the 100-column house style); [docs] style and the DocStyle machinery die -- comment syntax is template content.

Plate additions so templates never compute or look anything up:
- PlateRef carries the referenced type's name bundle and kind (enum/number/string/union/complex, or family-qualified primitive-decimal/-integer/-string), denormalized at projection.
- Every plate carries deps (the unique non-primitive references, sorted), the data from which include/import lines are composed.
- Every plate carries doc_lines (greedy-wrapped at [docs] wrap); the backends now consume them, proving byte-equivalence of the wrapping.

The only generated-output change is deliberate and visible: string-plate pattern notes are their own comment line instead of being re-wrapped into the doc prose (the old flow even split a regex across lines). 18 files; everything else regenerates byte-identical. go vet + Go corert green, zero-warning C build, values-smoke, C corert 776/0/52.
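The [docs] wrap arithmetic can be sketched as follows. This is an illustration only: the standard library's `textwrap` stands in for whatever wrapper the generator actually uses, and `doc_lines` here is a hypothetical helper, not the plate field's real implementation. It shows the wrap width applying to the doc text alone, with the comment prefix added afterwards by the template.

```python
import textwrap


def doc_lines(text, wrap=97):
    """Greedy-wrap doc text to `wrap` columns, comment syntax excluded."""
    normalized = " ".join(text.split())  # collapse newlines and runs of spaces
    # Long tokens are kept whole (assumption), so a single overlong token
    # may exceed the width.
    return textwrap.wrap(normalized, width=wrap, break_long_words=False)


text = "The pitch type represents pitch in terms of step, alter, and octave. " * 4
lines = doc_lines(text)
commented = ["// " + line for line in lines]  # the prefix is template text
```

With a 3-character prefix such as `// `, text wrapped at 97 never exceeds the 100-column house style.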
…it path
Phase 3 of the generator-agnosticism redesign. The press can now render a
whole target from config + templates, with the legacy backends untouched
and still serving the unported targets (the [render] section's presence
selects the pipeline; the transitional dispatch dies with the last port).
gen/press/context.py -- plates to plain dicts, with three mechanical
enrichments and zero decisions: discriminant expansion (every closed
enumerated field gets ALL its boolean companions, so strict mode never
trips on a legitimate branch), quoted companions (<field>_q, the JSON
escape repertoire valid verbatim across the C-family languages), and loop
metadata (is_first/is_last/index0; bare-string list items are lifted to
{value, value_q}). Plus the pre-split member views (attributes/elements/
value, own and merged), flattened Name casings ({{name.snake}}), a
self-reference for inner scopes, and the generated-file banner text.
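As a rough illustration of two of the enrichments (loop metadata and quoted companions), here is a minimal sketch. The helper names are hypothetical, and `json.dumps` is used as a stand-in for the "JSON escape repertoire" on the assumption that it approximates what the context builder emits.

```python
import json


def quoted(value):
    """Return the value as a double-quoted literal using JSON escapes."""
    return json.dumps(value)


def enrich_list(items):
    """Attach is_first/is_last/index0 to each item; lift bare strings to dicts."""
    last = len(items) - 1
    enriched = []
    for index, item in enumerate(items):
        if isinstance(item, str):
            row = {"value": item, "value_q": quoted(item)}
        else:
            row = dict(item)  # copy so the source plate data is not mutated
        row.update(is_first=(index == 0), is_last=(index == last), index0=index)
        enriched.append(row)
    return enriched
```

A template can then write separators with `{{^is_last}}, {{/is_last}}` in plain Mustache, with no engine extension involved.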
gen/press/render.py -- manifest expansion and rendering: [[render.type]]
rows render every plate whose strategy matches into an output pattern
composed from the plate's casings ({snake}.go, mx_{snake}.h -- C's
header/impl pairs are just two rows); [[render.once]] rows render against
the whole target with the complete output list in context (outputs,
outputs_by_ext for build manifests). Fail-loud checks: unknown strategies,
uncovered strategies, case-insensitive output collisions, unknown
placeholders, missing generated-file markers. The optional [render] format
command (gofmt and friends) runs over a scratch directory before the
writer's idempotence diff -- target data, generically executed.
The writer moved to gen/press/writer.py (a shim keeps the legacy backends
importing); config grows the [render] section; the CLI gains
"python3 -m gen render --config C --type N" for template debugging.
14 new tests cover the context enrichments end to end against real
template files, every manifest failure mode, and the format hook.
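Two of the fail-loud manifest checks (unknown placeholders and case-insensitive output collisions) can be sketched like this. The function names and error wording are hypothetical; only the `{snake}`-style output patterns come from the commit message.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def expand(pattern, casings):
    """Fill an output pattern such as 'mx_{snake}.h' from a plate's casings."""
    def substitute(match):
        key = match.group(1)
        if key not in casings:
            raise ValueError(f"unknown placeholder {{{key}}} in {pattern!r}")
        return casings[key]
    return PLACEHOLDER.sub(substitute, pattern)


def check_collisions(outputs):
    """Reject outputs that differ only by case (unsafe on some filesystems)."""
    seen = {}
    for path in outputs:
        key = path.casefold()
        if key in seen:
            raise ValueError(f"output collision: {seen[key]!r} vs {path!r}")
        seen[key] = path
```

For example, `expand("mx_{snake}.h", {"snake": "score_part"})` yields `mx_score_part.h`, and a header/impl pair is simply two patterns expanded from the same casings.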
Phase 4 of the generator-agnosticism redesign: the C backend no longer
exists. Everything C-shaped lives in gen/test/c/templates/ (fourteen
Mustache templates: one per value shape as header/impl pairs, one complex
pair covering all five complex strategies, the runtime and document
support files, and sources.cmake composed from the manifest's own output
list) and in the [render] manifest in the target's config.toml. The
mx_/MX_ spellings, ownership idioms, include lines, and libxml2 grammar
are template text; identifiers, casings, clamp steps, union cases, member
views, and dependency lists all arrive as plate data.
The parity gate: 674 of 677 files regenerate byte-identical through the
press. The three deviations are the OLD output's warts, now fixed --
mx_encoding.h/mx_key.h carried a double space from the legacy backend's
child-field spacing bug ('MxYyyyMmDd encoding_date'), and
mx_score_instrument.c spelled a pointer as '&(*ch->x)'. Verified beyond
bytes: zero-warning build, values-smoke, C corert 776/0/52, and valgrind
across the full suite (0 leaks, 0 errors).
Supporting changes, all neutral:
- The flattened union view in the context builder (per-case
tag_ident with loop metadata at the granularity the kind enum actually
has) and UnionPlate.open_ended; the open-string-member ordering guard
moved from backend Python into the plates' collision gate, where union
parse semantics (not language) put it.
- Discriminant expansion uses earlier-field-wins so PlateRef's category
and kind vocabularies (which share 'value'/'complex') stay consistent.
- Dotted resolution through a present-but-None value is falsey rather
than a strict-mode error, matching the engine's None deviation.
- has_<list> companions and the 'family' vocabulary in the context.
The legacy BACKENDS registry now knows only 'go'; it dies with the Go
port next.
Phase 5 of the generator-agnosticism redesign: the last per-language backend is gone. The cardinal rule now holds structurally and is enforced by a test.

The Go target is eight Mustache templates (gen/test/go/templates/) plus its [render] manifest: one per value shape, one complex template covering value-class/composite-class/flag/attrs-class, a separate inherit template (base embedding with field promotion is a Go idiom, so it is Go template text), the document entry points, and the runtime. gofmt runs as the manifest's generic format hook, exactly as before.

The port's parity gate came out exact: all 336 files regenerate byte-identical through the press (the one template-shaped wrinkle: Go composite literals after an interpolation form three braces, which Mustache reads as a triple-stache; the templates write a space that gofmt then removes).

DELETED from the generator, never to return:
- gen/emit/ entirely (the Go backend, the BACKENDS registry, the writer shim -- the writer lives in gen/press/writer.py).
- [target] language: nothing selects on it; the generator cannot tell which language it is emitting.
- [target] namespace and prefix as prescribed keys: namespace was only ever language-flavored (the Go package name is the Go templates' own text; cpp's moved to [vars]); prefix survives as symbol-prefix, the projection-contract key the collision gate depends on.
- [layout] entirely, with plate.file, FileSpec, Plates.files, the file-stem collision check, and [naming] file-convention: output paths are the manifest's output patterns, and the press's expansion check covers their collisions.

gen/tests/test_agnosticism.py pins the rule: the generator's Python is a closed set of packages (xsd, ir, names, config, plates, press, tests), no module may be named after a language, and a target's templates/ directory may contain no Python.

Both corert suites green; both targets regenerate byte-idempotent; plates --check works for the templateless C++ target.
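The three structural assertions might look roughly like the sketch below. This is not the content of gen/tests/test_agnosticism.py: the function name is hypothetical, and `LANGUAGE_NAMES` in particular is an assumed list, since the commit does not say which names the real test rejects.

```python
from pathlib import Path

ALLOWED_PACKAGES = {"xsd", "ir", "names", "config", "plates", "press", "tests"}
LANGUAGE_NAMES = {"c", "cpp", "go", "java", "rust", "swift"}  # assumed list


def agnosticism_violations(gen_root, template_dirs):
    """Return human-readable violations of the three structural rules."""
    gen_root = Path(gen_root)
    problems = []
    # Rule 1: the generator's packages are a closed set.
    for child in sorted(gen_root.iterdir()):
        is_package = child.is_dir() and (child / "__init__.py").is_file()
        if is_package and child.name not in ALLOWED_PACKAGES:
            problems.append(f"unexpected package: {child.name}")
    # Rule 2: no generator module is named after a language.
    for name in sorted(ALLOWED_PACKAGES):
        package = gen_root / name
        if package.is_dir():
            for module in sorted(package.rglob("*.py")):
                if module.stem.lower() in LANGUAGE_NAMES:
                    problems.append(f"language-named module: {module.name}")
    # Rule 3: a target's templates/ directory contains no Python.
    for template_dir in template_dirs:
        for script in sorted(Path(template_dir).rglob("*.py")):
            problems.append(f"python in templates: {script.name}")
    return problems
```

A unittest wrapper would assert that the returned list is empty for the real tree.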
Phase 6 of the generator-agnosticism redesign: the neutral target that has been the plates design's forcing function since day one (plates.md section 9) now exists, and adding it touched ZERO generator Python.

The target is gen/schema/config.toml plus one template: a JSON Schema (draft 2020-12) rendering of the MusicXML 4.0 spec (with the sounds fold), 373 $defs. It consumes only the neutral core, exactly as the design promised: $defs keys and properties are wire names (kebab forms, never casings); enums are wire literals (space-separated and empty values verbatim); number facets become minimum/maximum/exclusiveMinimum; patterns pass through; unions are anyOf with the open-enum (instrument-sound = sound-id ref | open string) falling out with no special case; docs become descriptions. No [types] map, no [reserved] words, no symbol prefix -- the target binding is inert. The generated-file marker rides in the schema's own "$comment" (JSON has no comments; the writer's prune gate is satisfied without one).

Representation note, recorded in the config: complex types are modeled over the merged flat member view; choice exclusivity and sequence nesting are not encoded. The resolved content tree is on the plates whenever a template revision wants oneOf nesting -- that will be template work, which is the point.

make gen now runs the renderable targets (go/c/schema); gen-cpp stays defined for when the C++ templates exist. gen/tests/test_schema.py renders the target through the ordinary pipeline and pins the neutral facts.
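The facet and union mappings described above produce output of roughly the following shape. In the target this is done by the Mustache template, not Python; the sketch below only illustrates the resulting JSON Schema fragments, and the function names and the 0..100 bounds are illustrative assumptions.

```python
def number_schema(family, min_inclusive=None, max_inclusive=None, min_exclusive=None):
    """Map XSD numeric facets onto JSON Schema (draft 2020-12) keywords."""
    schema = {"type": "integer" if family == "integer" else "number"}
    if min_inclusive is not None:
        schema["minimum"] = min_inclusive
    if max_inclusive is not None:
        schema["maximum"] = max_inclusive
    if min_exclusive is not None:
        schema["exclusiveMinimum"] = min_exclusive  # numeric in 2020-12
    return schema


def union_schema(members):
    """A union is anyOf over its members, in declaration order."""
    return {"anyOf": list(members)}


# The open enum needs no special case: a $ref member plus an open string.
instrument_sound = union_schema([{"$ref": "#/$defs/sound-id"}, {"type": "string"}])
bounded = number_schema("decimal", min_inclusive=0, max_inclusive=100)
```

Because `$defs` keys are wire names, the `$ref` uses the kebab form `sound-id` directly.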
Sweep the docs to match the implemented generator-agnosticism design:
- AGENTS.md: state that the cardinal rule now HOLDS and is enforced structurally by gen/tests/test_agnosticism.py; update the repo layout (plates/, press/, schema/, per-target templates/); rewrite the generator-architecture paragraph around the press render pipeline; add the render command; refresh the status section.
- gen/README.md: rewrite pipeline step 4 around gen/press, update the layout block (emit/ is gone), document the render debugging command.
- generator-agnosticism.md: status -> implemented; append section 11 with implementation notes (parity outcomes, the schema proof target, context-builder additions, deleted transitional config keys).
- plates.md: supersession note pointing emit-stage concepts at generator-agnosticism.md.

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
The file began with stray bytes ("Octave.cc") before the XML declaration,
so it carried an .invalid sibling marker and every harness skipped it.
Delete the errant prefix and the marker; the file is otherwise well-formed
MusicXML and now round-trips in both the Go and C corert suites
(777 passed, 0 failed, 52 version-gated skips).
https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
…butes Answer the review question on the record: the attributes removed from elision.3.0.xml (underline, overline, line-through, rotation, letter-spacing, xml:lang, dir) and extend.3.0.xml (the font group) are valid in MusicXML 3.0 ONLY. 3.1 retyped elision from text-font-color to the new elision type (font + color + smufl) and narrowed extend from print-style to position + color; 4.0 kept the narrowed definitions. MusicXML broke its own backward compatibility here, so no 3.1+ model can represent the attributes and a 4.0 copy keeping them would be invalid -- there is nothing to copy forward. Record the finding in data/README.md beside the corpus conventions. https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
No target's config.toml references it and MusicXML 4.1 is unreleased; the 4.1 XSD stays vendored for schema diffing, but an unreferenced sounds companion is dead weight. https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
The 3.0 root-version pin is a constant in each harness's normalize module (musicXMLVersion in Go, MUSICXML_VERSION in C), not a property of the corpus or the architecture. Describe it that way once and have the flow and normalization steps refer to "the harness baseline" instead of repeating the literal; likewise the opening line now says the generator reads whichever XSD a target's config pins rather than naming 4.0. Also refresh the suite count (777 after the ly33d fix). https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
The XSD constrains several string types with pattern facets (color, comma-separated-text, ending-number, time-only, yyyy-mm-dd, the SMuFL glyph names) and one with minLength (measure-text); until now every target documented them as "not enforced".

Plates (language-neutral): StringPlate keeps the raw facets and gains `pattern`, the facets re-spelled as ONE anchored regex in a portable dialect -- XSD's implicit whole-value anchoring made explicit, same-step facets OR-joined (XSD semantics), \i/\c name-class escapes expanded to explicit ASCII classes, ^/$ XSD-literals escaped, and anything without a portable spelling (class subtraction, \C/\I, \p) failing loud. The translator covers every pattern in the 3.1 and 4.0 schemas (asserted by test).

Go target (templates only): a type with a pattern compiles it and TryParseX reports false on a mismatch; minLength likewise (rune count). The lenient ParseX the deserializer uses keeps the value verbatim: unlike a numeric bound there is no canonical replacement for a failed pattern, and round-trip fidelity wins -- the policy is recorded in data/README.md beside the numeric leniency rules.

C deliberately leaves its "Pattern (not enforced)" comment: enforcement there is template work whenever wanted, no generator change required.

The other restrictions were audited and were already enforced: numeric bounds clamp (including primitive-implied minimums and exclusive-bound epsilon), unknown enum literals fall back to the first variant, and union primitive members clamp like named number types.

Both corert suites stay green (777 passed, 0 failed, 52 skipped); the values smoke test now asserts pattern acceptance/rejection and lenient passthrough.

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
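The translation rules listed above can be sketched as a small scanner. This is a simplified illustration, not the translator in the plates: the function names and the exact ASCII classes chosen for \i and \c are assumptions, and a \i or \c inside a character class is rejected here rather than handled.

```python
import re

NAME_START = "[A-Za-z_:]"      # assumed ASCII stand-in for XSD \i
NAME_CHAR = "[A-Za-z0-9._:-]"  # assumed ASCII stand-in for XSD \c


def translate_facet(xsd):
    """Re-spell one XSD pattern facet in a portable regex dialect."""
    out, i, in_class = [], 0, False
    while i < len(xsd):
        ch = xsd[i]
        if ch == "\\":
            if i + 1 >= len(xsd):
                raise ValueError(f"dangling backslash in {xsd!r}")
            esc = xsd[i + 1]
            if esc in "CIpP":
                raise ValueError(f"no portable spelling for \\{esc} in {xsd!r}")
            if esc in "ic":
                if in_class:
                    raise ValueError(f"\\{esc} inside a class is not handled here")
                out.append(NAME_START if esc == "i" else NAME_CHAR)
            else:
                out.append(ch + esc)  # other escapes pass through unchanged
            i += 2
            continue
        if ch == "[":
            if in_class:
                raise ValueError(f"class subtraction is not portable: {xsd!r}")
            in_class = True
        elif ch == "]":
            in_class = False
        elif ch in "^$" and not in_class:
            out.append("\\" + ch)  # literals in XSD, anchors elsewhere
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


def portable_pattern(facets):
    """OR-join same-step facets and make whole-value anchoring explicit."""
    return "^(?:" + "|".join(translate_facet(f) for f in facets) + ")$"
```

For a color-like facet such as `#[\dA-F]{6}([\dA-F][\dA-F])?` the result accepts `#40800080` and rejects both `red` and any value with extra leading text, which is the anchoring XSD applies implicitly.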
Prove the manifest is architected for type-level extensibility: custom code for one element or attribute must be a config-and-template change, never a generator change.

The generic (language-free) mechanism: a [[render.type]] row now selects either by `strategies` (the shape-driven stock case) or by `types` -- exact wire names. Type rows override strategy rows: a plate named by any type row is rendered only by its type rows, so a bespoke type never falls through to the stock template. Fail-loud checks: exactly one selector per row, and a `types` name no plate carries is a stale manifest entry.

The proof in the Go target: yyyy-mm-dd is claimed by a `types` row and rendered from its own template. The wire API (TryParse/Parse/String) matches the stock string template so the rest of the model composes unchanged, storage stays the raw wire string (round-trip fidelity, no harness pre/post-processing needed), and the bespoke part is typed date-component accessors: Yyyy(), Mm(), Dd() -> int, with the model's usual number leniency (unparseable -> 0) and BCE years handled.

The values smoke test pins components, wire fidelity, and the lenient path; both corert suites stay green (777 passed, 0 failed, 52 skipped).

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
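The row-selection rule can be sketched as below. The dict shapes (`wire`, `strategy`, `template`) and the function name are hypothetical simplifications of the manifest rows and plates; only the `strategies`/`types` selectors and the override and fail-loud rules come from the commit message.

```python
def plan_rows(rows, plates):
    """Map each plate's wire name to the templates that render it."""
    known = {plate["wire"] for plate in plates}
    for row in rows:
        # Fail loud: exactly one selector per row.
        if ("strategies" in row) == ("types" in row):
            raise ValueError(f"row needs exactly one selector: {row!r}")
        # Fail loud: a types name no plate carries is a stale entry.
        for name in row.get("types", ()):
            if name not in known:
                raise ValueError(f"stale manifest entry: {name!r}")
    claimed = {name for row in rows for name in row.get("types", ())}
    plan = {}
    for plate in plates:
        wire = plate["wire"]
        if wire in claimed:
            # Type rows override: never fall through to the stock template.
            chosen = [r for r in rows if wire in r.get("types", ())]
        else:
            chosen = [r for r in rows if plate["strategy"] in r.get("strategies", ())]
        plan[wire] = [r["template"] for r in chosen]
    return plan
```

With one stock `strategies` row for strings and one `types` row naming yyyy-mm-dd, every string plate still renders from the stock template except yyyy-mm-dd, which renders only from its own.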
The counterpart proof to the Go target's bespoke yyyy-mm-dd, this time with ZERO generator edits: the whole change is two config rows and two templates -- exactly what the cardinal rule promises for custom handling of a single type.

comma-separated-text keeps the stock wire API (the char* typedef and _parse, so every consumer struct composes unchanged and round trips are untouched) and adds an items accessor: a malloc'd NULL-terminated array of malloc'd char* items, split on the pattern's ", ?" separator (one optional space after each comma is consumed), with a matching free function. NULL/empty values yield zero items.

values-smoke covers the split (both separator spellings), wire-spelling fidelity, and the empty case; valgrind reports all heap blocks freed. Both corert suites stay green (777 passed, 0 failed, 52 skipped).

https://claude.ai/code/session_01XUoGfETVUYbSedoEPx3mAV
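The split rule itself is small enough to state as code. The accessor in the target is C template text; the sketch below only models its semantics in Python, with a hypothetical function name.

```python
import re

SEPARATOR = re.compile(r", ?")  # a comma, then at most one consumed space


def split_items(value):
    """Split a comma-separated-text wire value into its items."""
    if not value:  # None and "" both yield zero items
        return []
    return SEPARATOR.split(value)
```

Only one space is consumed after each comma, so a second space stays part of the following item; the stored wire string is untouched either way.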
Initially I tried using AI to reverse-engineer the pseudo-hand-rolled original codegen. It worked but resulted in 12k lines of unredeemable Python garbage with no room to maneuver from there. So, I started over and developed a `gen/` program from the ground up using AI.

I tried to keep it agnostic as to language target, so I think targeting other languages and use-cases is a real possibility now. Early on, I used Go and C as language targets to force the AI to think about extensibility. Those targets exist under `gen/test`, but are intended more as `gen/` program regression tests than for actual MusicXML use.

For the replaced `mx::core` code, I prioritized compile time and better use of C++ features like `variant` and `option`. AI wanted to drop the `ezxml` abstraction and I guess it was time to let it go, so `pugixml` is promoted to `mx::core` interaction. A `test-core-dev` target was used to allow the AI to innovate on `mx::core` without worrying about `mx::impl` and `mx::api` (in fact, I deleted those layers during code-gen). Then I replaced those layers and burned tokens to preserve the `mx::impl` algorithms targeting the new set of `mx::core` classes.

## References

- Closes #157
- Closes #158
- Progresses #58
- Closes PR #167
- Closes PR #168

## Follow-ups:

- surface more features in the `mx/api` layer. top priority is probably SMuFL
- better packaging and distribution
Superseded by #169