diff --git a/.claude/skills/arch/SKILL.md b/.claude/skills/arch/SKILL.md new file mode 100644 index 000000000..e652f6600 --- /dev/null +++ b/.claude/skills/arch/SKILL.md @@ -0,0 +1,144 @@ +--- +name: architect +description: Prime the model with senior architect principles during design or coding. Invoke when the user says "pay attention to the architecture", "think like an architect", or when working on design, module layout, or API contracts. +--- +# Architect Mode + +When this skill is invoked, shift into the mindset of a senior software architect. Hold these +principles as active constraints while writing code, suggesting approaches, or discussing design. +Flag violations when they appear, propose alternatives, and steer decisions toward good structure. +Be direct and opinionated. + +## mx Project Architecture + +The codegen program at `gen/` should be designed in such a way that novel use-cases and edge cases +in the desired code do not require edits to the Python program. The plates layer and configuration layer +should provide enough flexibility to generate just about any code from the MusicXML XSD spec. + +### C++ Goals + +The `mx/core` layer should... + +Cardinal requirements: +- Spec correctness: it should be impossible to use the C++ code to construct a document that is not + valid against the MusicXML 4.0 spec. +- Modern, safe C++. Use C++20. No bare pointers, no bare free or malloc; use smart pointers. Follow Scott Meyers' + Effective Modern C++. + +C++ code optimization priorities +1. Compile time +2. Runtime memory usage +3. Runtime speed +4. Binary size + +## Core Principles (in priority order) + +### 1. Domain Boundaries & Separation of Concerns + +Keep domain boundaries clear, concerns properly separated, and API contracts well-defined. When a +design muddles domains together or leaks responsibilities across boundaries, that is the primary +concern to raise before proceeding. + +### 2. 
Simple, Deep Abstractions with Information Hiding + +Information should be hidden inside simple, deep abstractions. Narrow interfaces with rich internals +are best. Flag leaking internals and wide, shallow interfaces. Prefer introducing abstractions +early, since good abstractions reduce future blast radius. + +### 3. Minimize Blast Radius of Change + +A single change of behavior should not affect many files all over the codebase. Make sure future +behavioral changes will be confined to the module that owns the behavior. Reject approaches that +spread change widely. Good modularity keeps most changes local. + +### 4. Clarity Over Cleverness + +Code must not be hard to understand or exhibit surprising behavior. Naming must reduce cognitive +burden. It should be obvious what something does from its name. Follow local conventions. Push back +hard on anything confusing or unexpected. + +## Architectural Positions + +### Monolith vs Services + +- Monoliths are fine and preferred for smaller teams. +- Tipping point is `~20+` engineers in one repository. + +### Dependency Injection + +- DI should serve separation of concerns, information hiding, and simplicity. +- DI used solely to support unit testing is not a great use if it increases cognitive load. +- Use DI where it genuinely simplifies the design. + +### Testing Strategy + +- Unit tests are table stakes but tell you almost nothing about whether the system works. +- End-to-end tests are the real payoff: test the system the way a customer uses it. +- Design for end-to-end testability as a first-class concern. + +### Event-Driven & Async Patterns + +- Queues, pub/sub, and event sourcing are a necessary evil, acceptable only when solving a real + architectural need, never because they are fashionable. +- If a simpler procedural approach works, prefer it. + +### Error Handling + +- Prefer explicit error handling (Rust-style Result types) where the ecosystem supports it. 
+- In languages where exceptions are idiomatic (e.g., Java), grudgingly accept them pragmatically. + +### Performance + +- Correctness and good design come first. +- Performance matters on hot paths but only after correctness is assured. +- A good design can be optimized later. + +### Configuration & Feature Flags + +- Configuration is behavioral surface area. Never expose it unless you must. +- Unnecessary configuration paints you into a corner when customers depend on it. +- Feature flags are a necessary evil for migrations — separate from config, not customer-facing. + +### API Contracts & Code Generation + +- Generate from a single spec whenever possible (OpenAPI, protobuf, XSD, Smithy, etc.). +- A single source of truth for API surfaces is critical. + +### Data Ownership + +- Greenfield: each service owns its data, or merge the systems. +- Legacy: be pragmatic about existing databases. + +### Backwards Compatibility + +- Breaking changes must come with clear migration paths. +- Customers must not be painfully impacted. + +### Shared Libraries + +- Consistent library use across a codebase is preferred for consistency and binary size. + +### Composition vs Inheritance + +- Lean toward composition, but inheritance has excellent use cases. + +### Observability + +- Leave the door open if it doesn't harm the design. Don't compromise design quality for it. + +## When to Flag for Splitting a Module + +- Excessive size +- Interface scope growing too wide +- Internals leaking out +- High code churn (many unrelated changes hitting the same module) + +## Behavior in This Mode + +- Apply these principles as a continuous lens, not a one-shot review. +- When writing or suggesting code, favor the architecturally sound path without being asked. +- When a decision point arises, name the tradeoff and state a recommendation. +- Frame concerns as: "This would [violate principle / increase blast radius / leak internals] + because [reason]. Consider [alternative]." 
+- If the current direction is already good, say so and proceed — don't invent problems. +- Always ask: "Will this keep changes local and the system understandable as it grows?" diff --git a/.claude/skills/grill-me/SKILL.md b/.claude/skills/grill-me/SKILL.md new file mode 100644 index 000000000..e3d9233bf --- /dev/null +++ b/.claude/skills/grill-me/SKILL.md @@ -0,0 +1,26 @@ +--- +name: grill-me +description: > + Interview the user relentlessly about a plan or design until reaching + shared understanding, resolving each branch of the decision tree. Use + when user wants to stress-test a plan, get grilled on their design, or + mentions "grill me". +argument-hint: "" +disable-model-invocation: false +user-invocable: true +--- + +Never use the `AskUserQuestion` tool. Never render a numbered option picker. Ask every question as +plain text in the chat, then stop and wait for the answer. + +Interview me relentlessly about every aspect of this plan until +we reach a shared understanding. Walk down each branch of the design +tree resolving dependencies between decisions one by one. + +If a question can be answered by exploring the codebase, explore +the codebase instead. + +For each question, provide your recommended answer. + +Use the /questions skill to avoid sending more than one question at +once. diff --git a/.claude/skills/questions/SKILL.md b/.claude/skills/questions/SKILL.md new file mode 100644 index 000000000..31bd459eb --- /dev/null +++ b/.claude/skills/questions/SKILL.md @@ -0,0 +1,52 @@ +--- +name: questions +description: > + Ask the user clarifying questions one at a time to refine a plan or + task. Invoke with `/questions` or automatically when more information + is needed. +--- +# /questions + +## Non-negotiable: plain chat only + +Never use the `AskUserQuestion` tool. Never render a numbered option picker. Ask every question as +plain text in the chat, then stop and wait. 
+ +## The core rule: one question per turn + +Ask exactly **one question**, then stop and wait for the answer. Do not bundle multiple questions +into one turn — not as a numbered list, not as "and also," not as a parenthetical follow-up. The +user answers one question at a time; batching forces them to scroll back and juggle context, and +answers get lost. + +This rule holds even when several questions feel related or obvious. One turn, one question. + +**Wrong:** + +> A few things to clarify: +> +> 1. What's the target platform? +> 2. Should it support offline mode? +> 3. What's the expected user count? + +**Right:** + +> What's the target platform? + +*(wait for answer, then next turn:)* + +> Got it. Does it need to work offline? + +## Usage + +- `/questions` +- `/questions ` — e.g., `/questions about the design of the flubber async module` + +## Flow + +1. Ask one question, grounded in existing context and the optional prompt. If there's no context, + open with "What would you like to work on?" +2. Wait for the answer. Use it to shape the next question. +3. Repeat until the user says to stop. +4. When they stop, produce a plan summarizing their answers. If it's unclear what they want done + with the plan, ask — one question. diff --git a/.gitignore b/.gitignore index d7418a892..a6de4d80b 100644 --- a/.gitignore +++ b/.gitignore @@ -13,3 +13,7 @@ package-lock.json # Python bytecode cache __pycache__/ *.pyc + +# Gen test build artifacts +gen/test/c/build/ +gen/test/go/build/ diff --git a/AGENTS.md b/AGENTS.md index 5ec4097c2..fa031e18f 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,4 +1,235 @@ # mx -We are implementing a generator from the MusicXML XSD specification. +A code generator that reads a MusicXML XSD specification (each target's `config.toml` pins which +version) and emits typed document serialization/deserialization libraries in multiple languages. 
C++ is the primary target; Go and C +are secondary targets that keep the generator architecture honest about extensibility. +**CARDINAL RULE: the generator is language agnostic.** Adding a new language target must not +require edits to the generator's Python files; all language knowledge lives in the target's own +directory as `config.toml` data and Mustache templates. The rule HOLDS and is enforced +structurally by `gen/tests/test_agnosticism.py` (the generator's Python is a closed set; no +module is named after a language; targets contain no Python). Design and decision record: +`docs/ai/design/generator-agnosticism.md`. The proof: the JSON Schema target (`gen/schema/`) +was added without touching a single `.py` file. + +Ignore git history prior to `b01288`. Reading anything before that commit will only confuse you and +degrade your performance. + +## Repository layout + +``` +mx/ + AGENTS.md <- you are here + Makefile <- top-level build driver (native + Docker-gated targets) + Dockerfile <- mx-sdk toolchain image (Ubuntu 24.04, GCC 14, Go, libxml2, Python 3) + CMakeLists.txt <- C++ project: ezxml library + corert test harness + data/ <- MusicXML test corpus (~1,347 files, see data/README.md) + docs/ai/design/ <- design docs (plates.md: the Plates; generator-agnosticism.md: + the cardinal rule and the targets/press redesign) + src/private/ <- C++ source + mx/ezxml/ <- vendored pugixml-backed XML layer + mx/core/ <- generated C++ typed model + mx/utility/ <- generated C++ utilities + mxtest/corert/ <- C++ core roundtrip test harness (Catch2, dynamic registration) + mxtest/import/ <- normalization helpers (sort attrs, strip decimal zeros) + mxtest/file/ <- PathRoot.h (CMake-generated, gitignored) + cpul/ <- vendored Catch2 test runner + gen/ <- code generator system (see gen/README.md) + __main__.py <- CLI: analyze | ir | plates | render | + README.md <- architecture, IR glossary, XSD analysis + xsd/ <- XSD parser + structural analysis + ir/ <- resolved 
intermediate representation (IR) + plates/ <- the per-target projection (casings, idents, policy data) + press/ <- the Mustache engine, context builder, manifest renderer, writer + cpp/config.toml <- C++ target configuration (no templates yet) + schema/ <- JSON Schema target: config.toml + templates/ + out/ (committed) + test/go/ <- Go corert test target + config.toml <- Go target configuration incl. the [render] manifest + templates/ <- the Go templates (all Go knowledge lives here + config.toml) + go.mod, go.sum <- Go module (etree dependency, vendored) + vendor/ <- vendored Go deps + corert/ <- test package (discover, fixer, normalize, roundtrip, test) + mx/ <- the GENERATED Go model (committed; do not edit) + test/c/ <- C corert test target + config.toml <- C target configuration incl. the [render] manifest + templates/ <- the C templates (all C knowledge lives here + config.toml) + CMakeLists.txt <- CMake project using libxml2 + src/ <- C source (main, discover, fixer, normalize, compare, roundtrip) + mx/ <- the GENERATED C model (committed; do not edit) +``` + +## Build system + +### Docker (mx-sdk) + +All Docker-gated targets auto-build the `mx-sdk` image on first use. The workspace is bind-mounted +at `/workspace`. A named Docker volume `mx-build` persists CMake/ccache state across runs. + +The `MX_RUNNING_IN_DOCKER` env var switches the Makefile between in-container (direct tool +invocation) and outside-container (docker run wrapper) behavior. + +### Makefile targets + +Run `make help` for the full, current target list (native C++, generator, Go/C test targets, +housekeeping). Docker-gated targets auto-build and run via mx-sdk. + +## The corert (core roundtrip) test + +The corert test is the primary correctness gate. It exercises the generated parser by round-tripping +every eligible XML file in `data/` through the typed model and comparing the output to a normalized +form of the input. 
Both sides of a comparison get their root `version` attribute pinned to one +baseline so the attribute itself never produces a mismatch; the baseline is a constant in each +harness's normalize module (`musicXMLVersion` in `gen/test/go/corert/normalize.go`, +`MUSICXML_VERSION` in `gen/test/c/src/normalize.c` -- currently `3.0`), a harness choice, not a +property of the corpus or the architecture. + +### Flow (same in all three languages) + +1. **Discover** eligible `.xml`/`.musicxml` files under `data/`, excluding directories `expected`, + `testOutput`, `generalxml`, `smufl`, and files matching `*.fixup.xml` or having a `.invalid` + sibling marker. +2. For each file: + a. Load the XML into a DOM. + b. Set the root `version` attribute to the harness baseline (see above). + c. **Parse** into the typed model via `fromXDoc` (the generated code). + d. **Serialize** back to XML via `toXDoc`. + e. **Normalize** the actual output (see Normalization pipeline below). + f. Load a fresh expected document from disk, apply the same normalization. + g. Apply **fixups** from `.fixup.xml` sidecars to the expected document. + h. **Compare** the two DOMs depth-first: element names, text content, attributes (with numeric + equivalence for ints/floats). +3. Report pass/fail per file. + +### Current state + +The Go and C suites are GREEN: 777 files pass, 0 fail, and 52 skip (they declare MusicXML 4.0; +those targets generate from the 3.1 schema, and while MusicXML is backward compatible, a newer +document may use types an older model cannot represent -- the harnesses gate on the root's +declared version). The C++ target still has no generated `mx/core`, so corert C++ does not +compile yet; that is the remaining expected gap. + +### Data directory conventions + +- `data/README.md` - documents marker file conventions. +- `*.xml.invalid` - sibling marker meaning the file is not valid MusicXML; skip it. +- `*.fixup.xml` - sidecar describing value substitutions for the expected document. 
Used when mx + clamps out-of-bounds values on import (e.g. MIDI channel 0 -> 1). Format: + ```xml + + + element + midi-channel + 0 + 1 + + + ``` +- `data/testOutput/corert/` - debug output directory for failure diffs (gitignored via build/). + +### Normalization pipeline + +Applied to both expected and actual documents before comparison: + +1. Set XML declaration: ``. +2. Set DOCTYPE based on root element name (`score-timewise` vs `score-partwise`). +3. Set the root `version` attribute to the harness baseline version. +4. Strip whitespace-only text nodes from every element (pretty-printing indentation is not + content; MusicXML has no mixed content, and the rule is applied to both sides, so it stays + symmetric). +5. Strip trailing zeros from decimal fields (the list lives in `DecimalFields.h`). +6. Sort attributes alphabetically by QUALIFIED name (`xlink:href`, not `href`; must be last). + +Comparison details that took debugging to get right (the C++ harness will need the same when its +generated core lands): compare each element's DIRECT text only, never the subtree concatenation +(a numerically-equivalent leaf reformat would otherwise fail at every ancestor); compare +attributes by qualified name with entity-resolved values (a parsed `xlink:href` is (ns, href) +while a serialized one may be the literal name); the Go loader transcodes UTF-16 and ISO-8859-1 +to UTF-8 (libxml2 and pugixml auto-detect these; Go's encoding/xml does not). Documents whose +root declares a version newer than the target's generated `SupportedMusicXMLVersion` / +`MX_SUPPORTED_MUSICXML_VERSION` constant are skipped, not failed. + +### Numeric equivalence + +Text comparisons use numeric equivalence: if both strings parse as integers (or floats), compare +their values instead of their string representations. Float comparison uses epsilon `< 0.00000001`. 
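
The numeric-equivalence rule above can be sketched in a few lines of Python. This is an illustration only, not the harness code: the real comparisons live in the Go, C, and C++ harnesses, and the names `text_equivalent`, `_as_int`, and `_as_float`, the regular expressions, and the whitespace stripping are all assumptions made for the sketch. Only the epsilon value comes from the text.

```python
import re

# Float tolerance quoted in the section above.
EPSILON = 0.00000001

# Hypothetical lexical forms; the real harnesses may accept more or fewer.
_INT_RE = re.compile(r"[+-]?\d+")
_FLOAT_RE = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def _as_int(text):
    """Return the integer value of text, or None if it is not an integer."""
    s = text.strip()
    return int(s) if _INT_RE.fullmatch(s) else None


def _as_float(text):
    """Return the float value of text, or None if it is not a decimal."""
    s = text.strip()
    return float(s) if _FLOAT_RE.fullmatch(s) else None


def text_equivalent(expected, actual):
    """Compare two text values, numerically when both sides are numbers."""
    if expected == actual:
        return True
    exp_int, act_int = _as_int(expected), _as_int(actual)
    if exp_int is not None and act_int is not None:
        return exp_int == act_int
    exp_float, act_float = _as_float(expected), _as_float(actual)
    if exp_float is not None and act_float is not None:
        return abs(exp_float - act_float) < EPSILON
    # At least one side is not numeric and the strings differ.
    return False
```

Under these assumptions `text_equivalent("1.50", "1.5")` and `text_equivalent("01", "1")` hold, while two different non-numeric strings never compare equal. Trying the integer interpretation first keeps large integers exact instead of routing them through floating point.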
+ +## Generator architecture + +The generator (`gen/`) is a Python program structured as a pipeline: parse the MusicXML XSD into a +model (`gen/xsd/`), lower that into a resolved intermediate representation (`gen/ir/`), project the +IR onto a target as the Plates (`gen/plates/`), then render the target's own Mustache templates +through the press (`gen/press/`) per its `[render]` manifest. The generator has no concept of any +language; a target IS a directory of config and templates. The IR data model preserves the schema's named structure +(model groups, attribute groups, inheritance edges); `gen/ir/resolve.py` collapses it on demand into +the flattened view an emitter consumes (attribute groups expanded, group refs spliced into content), +so that splicing-and-deduping reasoning lives once rather than once per language. See `gen/README.md` +for the architecture, IR glossary, the resolution layer, and a structural analysis of the schema. + +Vocabulary: a **plate** is the per-type metadata object handed to a template -- one per emitted +type, carrying the target's identifier casings, type mappings, emit strategy, and file assignment. +The **Plates** is the full collection projected for one target; it is where config.toml meets the +IR, so templates stay dumb renderers. Specified in `docs/ai/design/plates.md`. + +Commands: +- `python3 -m gen analyze [xsd]` - print a structural analysis of the XSD. +- `python3 -m gen ir [--type NAME] [--resolve] [--config C] [xsd]` - lower the XSD to the IR and + print it as JSON; `--resolve` prints the collapsed (group-spliced, attribute-flattened) view of + complex types; `--config` applies a target config's companion patches (the sounds.xml fold) first. +- `python3 -m gen plates --config C [--type NAME] [--check]` - project the IR onto the target the + config describes and print the Plates as JSON; `--check` validates renames and detects identifier + collisions (a CI gate, like analyze). 
+- `python3 -m gen render --config C --type NAME` - render one type through the target's templates + to stdout (template debugging). +- `python3 -m gen ` - emit the target: project the plates, render the [render] + manifest's templates, run the optional format hook, write with marker-gated pruning. + +Each target has a `config.toml` specifying the MusicXML XSD it generates from (`[input] xsd`), the +output directory (`[output] dir`, relative to the config file), an optional `[sounds] xml` companion +file (see below), and, eventually, language-specific settings. Each path is relative to the config +file. The three targets deliberately span the matrix: C++ is 4.0 with sounds, C is 3.1 with sounds, +and Go is 3.1 without sounds (the C/Go pair differ only by the companion fold). + +### Companion data + +`instrument-sound` is `xs:string` in the XSD; the standard sound identifiers live only in the +separately versioned `sounds.xml` (vendored as `docs/sounds-.xml`). When a target's +`config.toml` sets `[sounds] xml`, `gen/ir/sounds.py` folds them into the IR as a `sound-id` enum +unioned with an open string (element `instrument-sound` retyped from `string` to that union). This is +the only place the IR depends on an input beyond the XSD; it is opt-in per target, so the base IR +stays a pure function of the schema. + +**Status.** The parse, IR, analysis, Plates, and press stages exist; the generator is fully +language agnostic. The Go, C, and JSON Schema targets render from their own templates and both +corert suites are green. Generated output is committed (`gen/test/go/mx/`, `gen/test/c/mx/`, +`gen/schema/out/`). The C++ target has config but no templates yet; writing them is template +work, not Python work. + +## Language targets + +### C++ (primary, `gen/cpp/`) + +MusicXML 4.0 with the sounds companion. The existing codebase. Generated code lands in +`src/private/mx/core/`. 
The ezxml layer (`src/private/mx/ezxml/`) provides the XML DOM that the +generated code builds on. + +### Go (test target, `gen/test/go/`) + +MusicXML 3.1 *without* the sounds companion. Uses `github.com/beevik/etree` (vendored) for DOM-style +XML. The generated code lives in `gen/test/go/mx/` (committed). Test runner uses Go's `testing` +package with subtests; documents declaring a newer MusicXML than 3.1 are skipped. + +### C (test target, `gen/test/c/`) + +MusicXML 3.1 *with* the sounds companion -- same schema as Go, so the two outputs differ only by the +fold. Uses libxml2 (apt package in Docker). The generated code lives in `gen/test/c/mx/` +(committed), built as the `mx-c` static library plus a `values-smoke` test binary. Test runner is a +simple `main()` that prints pass/fail/skip per file and a summary. + +## Key files to understand + +- `src/private/mxtest/corert/CoreRoundtripImpl.cpp` - the C++ roundtrip implementation (reference + for Go and C ports) +- `src/private/mxtest/corert/Fixer.cpp` - C++ fixup logic (reference for ports) +- `src/private/mxtest/import/DecimalFields.h` - list of decimal fields and zero-stripping logic +- `gen/test/go/corert/roundtrip.go` - Go roundtrip implementation +- `gen/test/c/src/roundtrip.c` - C roundtrip implementation diff --git a/Dockerfile b/Dockerfile index a1060a348..760e1aea4 100644 --- a/Dockerfile +++ b/Dockerfile @@ -14,6 +14,9 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ ccache \ python3 \ python3-venv \ + golang-go \ + libxml2-dev \ + pkg-config \ && rm -rf /var/lib/apt/lists/* # Unversioned name so the Makefile invokes the formatter without the suffix. diff --git a/Makefile b/Makefile index c97c3f71b..dd1d83d27 100644 --- a/Makefile +++ b/Makefile @@ -7,9 +7,9 @@ # layer and the core roundtrip (corert) test harness. corert does not compile # until the new generator emits src/private/mx/core -- expected for now. # -# Native targets (ezxml/core-dev/test-core-dev) drive CMake directly. 
The fmt -# and check gates run inside the pinned `mx-sdk` Docker toolchain, built once -# and bind-mounting the workspace. Requires CMake >= 3.13. +# Native targets (ezxml/core-dev/test-core-dev) drive CMake directly. Docker- +# gated targets (fmt, check, gen, build-go, test-go, build-c, test-c) run +# inside the pinned `mx-sdk` toolchain. Requires CMake >= 3.13. # ============================================================================ CMAKE ?= cmake @@ -44,21 +44,43 @@ FIND_CPP := find src \ -type f \( -name '*.cpp' -o -name '*.h' -o -name '*.hpp' \) -print .DEFAULT_GOAL := help -.PHONY: help sdk fmt check ezxml core-dev test-core-dev \ +.PHONY: help sdk fmt check ezxml core-dev test-core-dev test-gen \ + gen gen-cpp gen-go gen-c gen-schema \ + build-go build-c test-go test-c \ clean clean-docker check-docker docker-volume help: @echo 'mx targets (clean-slate spike -- see AGENTS.md):' + @echo '' + @echo ' C++ (native):' @echo ' make ezxml Build the embedded ezxml XML layer.' @echo ' make core-dev Build the corert binary (fails until mx/core is regenerated).' @echo " make test-core-dev Run the core roundtrip suite. Filter: ARGS='[core-roundtrip] lysuite/*'" + @echo ' make test-gen Run the generator (parser + IR) Python tests.' + @echo '' + @echo ' Generator (via mx-sdk):' + @echo ' make gen Run the generator for all renderable targets (go/c/schema).' + @echo ' make gen-cpp Run the generator for the C++ target (no templates yet).' + @echo ' make gen-go Run the generator for the Go target.' + @echo ' make gen-c Run the generator for the C target.' + @echo ' make gen-schema Run the generator for the JSON Schema target.' + @echo '' + @echo ' Go test target (via mx-sdk):' + @echo ' make build-go Build Go corert tests.' + @echo ' make test-go Run Go corert tests.' + @echo '' + @echo ' C test target (via mx-sdk):' + @echo ' make build-c Build C corert test binary.' + @echo ' make test-c Run C corert tests.' 
+ @echo '' + @echo ' Housekeeping:' @echo ' make fmt Format C++ under src/ via mx-sdk.' @echo ' make check fmt-check via mx-sdk.' @echo ' make sdk Build the mx-sdk Docker toolchain image.' - @echo ' make clean Remove the build/ tree.' + @echo ' make clean Remove the build/ tree and gen build artifacts.' @echo ' make clean-docker Remove the sdk image and build volume.' -# --- Native builds ---------------------------------------------------------- +# --- Native C++ builds (no Docker) ------------------------------------------ ezxml: $(CMAKE) -S . -B $(BUILD_ROOT)/dev -DCMAKE_BUILD_TYPE=$(BUILD_TYPE) @@ -68,15 +90,18 @@ core-dev: $(CMAKE) -S . -B $(BUILD_ROOT)/core-dev -DCMAKE_BUILD_TYPE=$(BUILD_TYPE) -DMX_CORE_DEV=on $(CMAKE) --build $(BUILD_ROOT)/core-dev --parallel $(JOBS) -# --allow-running-no-tests keeps the run green once corert builds but before any -# data/ case is wired up. (Until mx/core exists, core-dev above fails first.) test-core-dev: core-dev $(BUILD_ROOT)/core-dev/mxtest-core-dev --allow-running-no-tests $(ARGS) +test-gen: + python3 -m unittest discover -s gen/tests -t . $(ARGS) + # --- Housekeeping ----------------------------------------------------------- clean: rm -rf $(BUILD_ROOT) + rm -rf gen/test/c/build + rm -rf gen/test/go/build clean-docker: -rm -f $(DOCKER_STAMP) @@ -89,9 +114,11 @@ check-docker: { echo "Docker not found. Install it to use the mx-sdk gates:"; \ echo " https://docs.docker.com/get-docker/"; exit 1; } +# === Docker-gated targets =================================================== + ifdef MX_RUNNING_IN_DOCKER -# ===== Inside the container: run the pinned tools directly ================== +# ----- Inside the container: run tools directly ----------------------------- fmt: @$(FIND_CPP) | xargs -r clang-format -i @@ -101,9 +128,38 @@ check: @$(FIND_CPP) | xargs -r clang-format --dry-run --Werror @echo "fmt-check passed." 
+gen: gen-go gen-c gen-schema + +gen-cpp: + python3 -m gen gen/cpp/config.toml + +gen-go: + python3 -m gen gen/test/go/config.toml + +gen-c: + python3 -m gen gen/test/c/config.toml + +gen-schema: + python3 -m gen gen/schema/config.toml + +build-go: + cd gen/test/go && MX_REPO_ROOT=/workspace go test -c -o build/corert-test ./corert/ + +test-go: + cd gen/test/go && MX_REPO_ROOT=/workspace go test -count=1 -v ./corert/ $(ARGS) + +build-c: + $(CMAKE) -S gen/test/c -B gen/test/c/build \ + -DCMAKE_BUILD_TYPE=$(BUILD_TYPE) \ + -DMX_REPO_ROOT=/workspace + $(CMAKE) --build gen/test/c/build --parallel $(JOBS) + +test-c: build-c + gen/test/c/build/corert-c + else -# ===== Outside the container: build the image once, then docker run ======== +# ----- Outside the container: build image, then docker run ------------------ $(DOCKER_STAMP): Dockerfile | check-docker @mkdir -p $(BUILD_ROOT) @@ -122,4 +178,31 @@ fmt: $(DOCKER_STAMP) docker-volume check: $(DOCKER_STAMP) docker-volume $(DOCKER_RUN) make check +gen: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make gen + +gen-cpp: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make gen-cpp + +gen-go: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make gen-go + +gen-c: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make gen-c + +gen-schema: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make gen-schema + +build-go: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make build-go + +test-go: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make test-go + +build-c: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make build-c + +test-c: $(DOCKER_STAMP) docker-volume + $(DOCKER_RUN) make test-c + endif diff --git a/data/README.md b/data/README.md index 9a9336b22..ebd629057 100644 --- a/data/README.md +++ b/data/README.md @@ -32,3 +32,36 @@ test by altering values it finds after loading the test file. 
``` + +## `synthetic/` version suffixes and 3.0-only attributes + +Files under `synthetic/` are named `..xml`: the version is the schema +the file's constructs first appear in (the root `version` attribute is pinned to `3.0` like the +rest of the corpus; harnesses gate on it only when it declares something newer than their schema). + +Two synthetic files exercise types whose attribute sets MusicXML itself narrowed after 3.0 -- +a place where the spec broke its own backward compatibility: + +- `elision.3.0.xml`: in 3.0, `elision` has type `text-font-color`, which carries `underline`, + `overline`, `line-through`, `rotation`, `letter-spacing`, `xml:lang`, and `dir`. In 3.1 and 4.0 + the element was retyped to the new `elision` type (font + color + `smufl` only), so those + attributes are invalid there and the file does not use them: no generated model 3.1+ can + represent them, and a 4.0 copy keeping them would simply be invalid MusicXML. +- `extend.3.0.xml`: in 3.0, `extend` carries the full `print-style` group (position + font + + color). 3.1 and 4.0 narrowed it to position + color, dropping the font attributes, with the + same consequence. 
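
The version gate mentioned above (a harness skips a document only when its root declares something newer than the harness's schema) can be sketched as follows. This is a hedged illustration in Python, not the harness code; `parse_version`, `is_newer`, and `should_skip` are hypothetical names, and treating a missing `version` attribute as `1.0` is an assumption based on the schema default.

```python
def parse_version(text):
    """Turn a dotted version such as '3.1' into a tuple of ints: (3, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_newer(declared, supported):
    """True when the declared version is strictly newer than the supported one."""
    d, s = parse_version(declared), parse_version(supported)
    # Pad the shorter tuple with zeros so '3.1' and '3.1.0' compare equal.
    width = max(len(d), len(s))
    d += (0,) * (width - len(d))
    s += (0,) * (width - len(s))
    return d > s


def should_skip(declared, supported):
    """Skip (do not fail) documents the generated model may not represent."""
    # Assumption: an absent version attribute is treated as '1.0'.
    return is_newer(declared if declared is not None else "1.0", supported)
```

With a harness generated from the 3.1 schema, `should_skip("4.0", "3.1")` is true and `should_skip("3.0", "3.1")` is false, which matches the intent that only newer documents are gated. Comparing integer tuples rather than strings avoids ordering mistakes such as `"10.0" < "3.1"`.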
+ +## `.fixup.xml` leniency policy + +The sidecars encode one uniform leniency policy, shared by every generated target: + +- an unknown enum literal falls back to the enum's first variant (`display-step` `=` -> `A`); +- an unparseable number becomes 0, with decimal-looking integers truncating toward zero; +- every number then clamps into its declared range, including the primitive-implied lower bounds + (`xs:positiveInteger` >= 1, `xs:nonNegativeInteger` >= 0), so `midi-channel` `0` -> `1` and + `accordion-middle` `` -> `1`; +- an exclusive decimal bound clamps to the bound +/- 1e-6 (`duration` `0` -> `0.000001`); +- string facets (pattern, length) are enforced by the strict parse only (`TryParseX` reports + false): the lenient parse the deserializer uses keeps the value verbatim, because unlike a + numeric bound there is no canonical replacement for a failed pattern, and round-trip fidelity + wins. No fixup sidecar therefore ever encodes a string substitution. diff --git a/data/lysuite/ly33d_Spanners_OctaveShifts.xml b/data/lysuite/ly33d_Spanners_OctaveShifts.xml old mode 100755 new mode 100644 index 0aadc9737..f872d463b --- a/data/lysuite/ly33d_Spanners_OctaveShifts.xml +++ b/data/lysuite/ly33d_Spanners_OctaveShifts.xml @@ -1,4 +1,4 @@ -Octave.cc + diff --git a/data/lysuite/ly75a_AccordionRegistrations.fixup.xml b/data/lysuite/ly75a_AccordionRegistrations.fixup.xml index c9ff2640c..d28f21f4d 100644 --- a/data/lysuite/ly75a_AccordionRegistrations.fixup.xml +++ b/data/lysuite/ly75a_AccordionRegistrations.fixup.xml @@ -4,13 +4,13 @@ element accordion-middle - 0 + 1 element accordion-middle test - 0 + 1 element @@ -18,4 +18,10 @@ 5 3 + + element + accordion-middle + 0 + 1 + diff --git a/data/synthetic/elision.3.0.xml b/data/synthetic/elision.3.0.xml index 6d323cef1..25d8148af 100644 --- a/data/synthetic/elision.3.0.xml +++ b/data/synthetic/elision.3.0.xml @@ -27,7 +27,7 @@ 1 x - x + x x x x diff --git a/data/synthetic/extend.3.0.xml b/data/synthetic/extend.3.0.xml 
index 9738dc798..f4316c719 100644 --- a/data/synthetic/extend.3.0.xml +++ b/data/synthetic/extend.3.0.xml @@ -29,7 +29,7 @@ x x x - + x x diff --git a/docs/ai/design/generator-agnosticism.md b/docs/ai/design/generator-agnosticism.md new file mode 100644 index 000000000..22ac1743e --- /dev/null +++ b/docs/ai/design/generator-agnosticism.md @@ -0,0 +1,396 @@ +# Generator agnosticism: removing language knowledge from the generator + +Status: implemented (see section 11 for the deltas between this design and the code). This +document specifies the redesign that removed all target-language knowledge from the generator's +Python code. It supersedes the emit-stage portions of [`plates.md`](plates.md) (the Plates layer +itself -- sections 1-8 there -- survives intact; what changed is what a "template" is and where +language facts live). + +## 1. The cardinal rule + +**The generator is language agnostic. Adding a new language target must not require edits to the +generator's Python files.** + +The rule has a letter and a spirit. The letter: `git diff --name-only` for the change that adds a +new target touches no `*.py` under `gen/` (outside the new target's own directory). The spirit: +the Python pipeline must be a closed machine -- schema in, files out -- that is *incapable* of +expressing a language-specific decision, so that language knowledge has nowhere to live except in +the target's own directory. Go and C were built first precisely to force this generality; the +current implementation failed the test, and the C++ target would have failed it a third time. + +A corollary worth stating: under this rule, the generator has no concept of "Go" or "C" at all. +There is no language registry, no language name in config, no per-language defaults. A target is a +directory of data and templates; the generator cannot tell which language it is emitting. + +## 2. 
What violates the rule today + +| Where | What it encodes | Lines | +|---|---|---| +| `gen/emit/go/` (6 modules) | Go grammar end to end: declarations, parse/serialize bodies, the runtime source as a Python string, import lists, string-literal quoting, the gofmt subprocess | ~970 | +| `gen/emit/c/` (7 modules) | C grammar end to end: header/impl frames, include guards, the calling-convention module (ownership rules per plate kind), the runtime source, memory management | ~1,360 | +| `gen/plates/languages.py` | Per-language type maps, reserved-word lists, doc-comment styles, variant scopes -- data tables keyed by language name | ~125 | +| `gen/emit/__init__.py` | `BACKENDS`: a language-name -> Python-module registry | -- | +| `gen/config.py` | `[target] language` (selects the above), and prescribed keys (`namespace`, `prefix`) that exist only because specific languages need them | -- | + +Roughly 2,500 lines of Python that are *about* Go and C. Everything else in the pipeline -- +the XSD parser, the IR and Resolver, the naming machinery (`gen/names.py`), the Plates projection, +the collision gate, the writer -- is already neutral, and the review rounds that pushed decisions +"into the plates" (final identifiers, clamp policy, union tags, effective cardinality) made the +*data* model genuinely language-free. The failure is confined to one question the plates design +left unanswered: **what is a template?** The design said "templates are dumb renderers" and the +implementation answered "a Python module per language." Every subsequent fix improved where +*decisions* live but left the *renderers* as per-language programs. + +## 3. The redesign in one paragraph + +A target becomes self-describing: a directory containing `config.toml` and a `templates/` +directory, and nothing else the generator needs. Everything `languages.py` held becomes required config data. 
+Everything `gen/emit//` held becomes template files in a deliberately minimal, logic-less +template language, rendered by one generic engine (the **press**, completing the plates metaphor: +plates carry every decision; the press inks and prints them). A render **manifest** in the config +declares which template renders which plate shapes into which output paths, so file layout -- +including C's header/impl pairs -- is target data, not Python. The generator's Python is then a +closed set: parse, lower, project, render, write. The proof obligation is concrete: port Go and C +to targets with byte-identical generated output, delete `gen/emit/go`, `gen/emit/c`, and +`languages.py`, then add a third target (the JSON Schema emitter that has been this project's +forcing function since the plates design) as a pure target directory, and let CI assert that +change touched no Python. + +``` +gen/test/c/ <- a target: config + templates, nothing else the generator needs + config.toml <- inputs, projection settings, vars, render manifest + templates/ + enum.h.tmpl enum.c.tmpl <- one template per shape (the original design principle, + composite.h.tmpl ... now literally one FILE per shape) + runtime.h.tmpl ... <- support files: a template with no tags is a static file + member-parse.tmpl ... <- partials shared by this target's templates + src/ <- the hand-written corert harness (target code, not generator) + mx/ <- generated output (committed) +``` + +## 4. Config: the projection contract vs. freeform vars + +The PR review asked the right question about `[target] namespace`: is the config schema +prescribed, and is that not itself language-specific? The answer is a split, with a litmus test. + +**Prescribed keys are the projection contract**: every key the generator itself consumes must be +definable in projection terms, without reference to any language. 
These survive, because the +plates' work -- casing, renaming, sanitizing, collision-gating, strategy selection -- is real and +neutral: + +- `[naming]` conventions, acronyms, `[rename.*]`, `[reserved] words` -- already neutral. +- `[target] symbol-prefix` (today `prefix`): "prepended to every type identifier and composed + constant before sanitization." Neutral semantics; it must stay in the projection (not become a + template variable) because the collision gate certifies the *final* identifiers -- moving + composition into templates is exactly the regression the first review round fixed. +- `[target] variant-scope = "bare" | "composed"`: how constants are scoped, today seeded per + language in Python. Becomes explicit config. +- `[target] inheritance = true | false`: selects the derived strategy. Already neutral. +- `[types]`: the primitive -> spelling map, today defaulted per language. Becomes **required** for + any target whose templates emit typed code (a target that omits it gets primitive names passed + through, which is what a neutral target wants). +- `[reserved] words`: today extends per-language defaults. Becomes the **whole** list; targets own + their keyword lists. Two small additions let targets protect their template-synthesized names + generically: `members = [...]` (member identifiers the target's templates reserve, e.g. Go's + `Children`) and `type-suffixes = [...]` (compositions like `Child` that the templates append to + type identifiers), both fed to the existing collision gate. +- `[docs] wrap`: the plates pre-wrap doc text into lines (`doc_lines`); comment *syntax* moves + into template text, so `[docs] style` and the `DocStyle` machinery are deleted. + +**Everything else is freeform.** A `[vars]` table of string key-values passes through to templates +verbatim (`{{target.vars.namespace}}`, `{{target.vars.package}}`, `{{target.vars.anything}}`). 
+`namespace` stops being generator schema; it becomes a variable that the Go target's templates +happen to consume as a package name and the C++ target's as a namespace. `target.foo = "bar"` is +exactly as legal as either. The litmus test for any future key: *if you cannot define it without +naming a language, it is a var, not a key.* + +**Deleted outright**: `[target] language` (nothing selects on it anymore), `[layout]` entirely +(partition, file-prefix, file-convention -- subsumed by the manifest, section 6), and +`languages.py` with all its tables. + +## 5. The press: a Mustache engine with three documented deviations + +`gen/press/` renders template files against a context built from the plates. **The template +language is Mustache** -- the interpolation/sections/inverted-sections/partials core of the +published spec, nothing invented -- so the load-bearing commitment is to a frozen, logic-less +*language*, and the engine behind it is swappable (section 9 records why we implement it ourselves +and the trigger for reversing that). Mustache's poverty is the feature: **if a template cannot +express something, the plates must carry it** -- which keeps decisions in the projection, where +they are dumpable, diffable, and collision-gated. + +The implemented subset, derived by walking every construct the current Python backends emit: + +- **Variables**: `{{ident}}`, dotted paths `{{name.snake}}`, `{{type_ref.ident}}`, + `{{target.vars.prefix}}`. +- **Sections**: `{{#members}}...{{/members}}` iterates lists (the cursor becomes the context) and + gates on truthiness for scalars/objects; `{{^x}}...{{/x}}` inverts. +- **Partials**: `{{> member-parse}}`, resolved within the target's `templates/` directory, with + spec-conformant call-site indentation (essential for readable generated code) and recursion + permitted with a depth limit (a schema-shaped target walking `content` trees needs it). 
+- **Whitespace discipline**: the spec's standalone-line rules, so templates can be indented + readably without leaking blank lines. + +Three deliberate deviations from spec semantics, each because code generation is not HTML: + +1. **Missing keys are a render error**, with `template:line` in the message. The spec mandates + silent empty output -- the worst possible failure mode for a generator (a typo'd `{{indent}}` + emits nothing, and a target with no compiler behind it, like JSON Schema, never finds out). This + project's ethos is fail-loud; the engine follows it. +2. **No HTML escaping**: `{{x}}` interpolates verbatim (the spec's `{{{x}}}` everywhere would be + noise; there is no HTML here to protect). +3. **No lambdas** (the spec's one escape hatch into logic). Closed. + +Conformance to everything else is *tested*: the press runs against the official Mustache spec test +suite (the published YAML cases for interpolation, sections, inverted sections, and partials), +asserting agreement everywhere except the three deviations above -- the spec authors' edge-case +coverage, especially the fiddly whitespace rules, without their code. + +Everything the engine does **not** do -- expressions, comparisons, arithmetic, filters, string +manipulation, casing, assignment -- stays not done; that is Mustache's constitution, not ours to +amend. In particular there is no equality test; dispatch happens three other ways: + +1. **By manifest**: each template entry declares which plate strategies it renders (section 6), + so per-shape dispatch never appears inside a template -- restoring "one template per shape" as + one *file* per shape. +2. 
**By discriminant expansion**: the context builder (neutral, mechanical) expands every closed + enumerated field into boolean companions -- `kind: "enum"` yields `is_enum`; `cardinality: + "vector"` yields `is_vector`; `category: "primitive"` yields `is_primitive` -- and exposes the + member list pre-split (`attributes`, `elements`, `value`) using the filters the plates already + define. Templates branch with plain sections: `{{#type_ref.is_complex}}...{{/type_ref.is_complex}}`. +3. **By injected context, not engine extensions**: loop metadata arrives as fields the context + builder adds to every list item (`is_first`, `is_last`, `index0`) -- this is what expresses + `if`/`else if` chains and separator joins -- and every wire-string leaf gets a quoted companion + (`wire` -> `wire_q`): a double-quoted, backslash-escaped literal using the JSON repertoire with + non-ASCII as `\uXXXX`, a subset valid verbatim in C, C++, Go, Java, JavaScript, and Rust. + Keeping both OUT of the engine keeps the template syntax pure Mustache (so the engine stays + swappable) and keeps the one acknowledged compromise -- that quoted-literal escaping encodes a + language *family* -- in the neutral context layer, where a future non-C-family target would + extend it (section 10). + +Two small, neutral additions to the plates feed this (the only model changes the redesign needs): + +- `PlateRef` gains the referenced type's `name` bundle and `kind` (plate kind, or the primitive's + family). Today the Python backends look these up via `plates.plate(wire)` to compose calls like + `mx_{{snake}}_parse(...)` and to choose ownership idioms; a logic-less template cannot perform + lookups, so the materialized tree denormalizes them (it is materialized precisely so templates + get random access without computation). 
+- Each plate gains `deps`: its dependency references with name bundles, replacing + `FileSpec.includes` so include/import lines become template text composed from data + (`{{#deps}}#include "mx_{{name.snake}}.h"{{/deps}}`). + +Worked example -- today's `gen/emit/c/complexes.py` attribute loop, as template text: + +``` + for (xmlAttrPtr a = el->properties; a; a = a->next) { + {{> attr-name}} +{{#attributes}} + {{#is_first}}if{{/is_first}}{{^is_first}}} else if{{/is_first}} (strcmp(aname, {{name.wire_q}}) == 0) { + m->has_{{ident}} = true; + m->{{ident}} = {{> attr-parse-expr}}; +{{/attributes}} + {{#attributes}}} else {{{/attributes}}{{^attributes}}{{{/attributes}} + {{target.vars.fn_prefix}}error_set("unknown attribute \"%s\" on <%s>", aname, (const char *)el->name); + ... +``` + +Everything language-shaped (C's `strcmp`, `->`, `has_` prefixes, the error idiom) is target content; +everything decided (idents, wire names, which members are attributes) is plate data. The press +contributes iteration and the `is_first` chain mechanics, nothing more. + +## 6. The render manifest: file layout as target data + +`config.toml` declares what gets rendered where. Two entry kinds: + +```toml +[render] +dir = "templates" + +# Per-type entries: rendered once per plate whose strategy matches. +[[render.type]] +strategies = ["enum-class"] +template = "enum.h.tmpl" +output = "mx_{snake}.h" # casing placeholders from the plate's Name + +[[render.type]] +strategies = ["enum-class"] +template = "enum.c.tmpl" +output = "mx_{snake}.c" + +[[render.type]] +strategies = ["composite-class", "value-class", "flag", "attrs-class", "flatten"] +template = "complex.h.tmpl" # or one entry per strategy; the target chooses its granularity +output = "mx_{snake}.h" + +# Once entries: rendered once per target, against the whole Plates context. 
+[[render.once]] +template = "runtime.c.tmpl" +output = "mx_runtime.c" + +[[render.once]] +template = "sources.cmake.tmpl" # receives `outputs`: every path the manifest produced +output = "sources.cmake" +``` + +This mechanism absorbs, generically, several things that were Python: + +- **C's header/impl pairs**: two entries per strategy. The "one FileId, two files" wart in + `plates.md` dissolves -- file multiplicity is just manifest rows. +- **Partitioning**: per-type entries *are* `per-type` partition; a target with only `once` entries + *is* `single` partition (the JSON Schema target: one entry, one template, one output). `[layout]` + dies. +- **File naming**: `output` patterns with casing placeholders (`{snake}`, `{pascal}`, ...) replace + `file-prefix`/`file-convention` and plate file stems. The generator expands every pattern for + every matching plate and runs the existing case-insensitive uniqueness check over the full + expansion -- the file-collision gate survives, now over real paths, including collisions between + type outputs and `once` outputs (which retires the backend "reserved stem" guards). +- **Support files**: the runtime sources stop being Python string constants and become templates + (mostly static text; `{{plates.schema_version}}` and `{{target.vars.fn_prefix}}` are the only + tags the current runtimes need). The completeness check every manifest gets for free: every + plate must be matched by at least one entry, or none if the target declares it renders only a + subset (a `strategies = []` is an error; an explicitly empty manifest is one too). +- **Formatting**: the gofmt pass becomes an optional, generic post-render hook -- + `[render] format = ["gofmt", "-w", "{dir}"]` -- run against the scratch render directory before + the writer's write-if-changed diff, preserving idempotence. The command is target data; the + generator knows only "run this, fail loud if it fails or is absent." 
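The expansion and collision gate described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the press's actual code: `Plate`, `expand_manifest`, and the dictionary shapes are hypothetical stand-ins, and the completeness check and format hook are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plate:
    """Hypothetical stand-in for a plate: a strategy plus pre-cased names."""
    strategy: str
    names: dict  # e.g. {"snake": "up_down", "pascal": "UpDown"}


def expand_manifest(type_entries, once_entries, plates):
    """Expand every output pattern, then reject case-insensitive path collisions."""
    outputs = []  # (path, template) pairs in manifest order
    for entry in type_entries:
        if not entry["strategies"]:
            raise ValueError(f"{entry['template']}: empty strategies list")
        for plate in plates:
            if plate.strategy in entry["strategies"]:
                # "{snake}"-style placeholders are filled from the plate's names.
                path = entry["output"].format(**plate.names)
                outputs.append((path, entry["template"]))
    for entry in once_entries:
        outputs.append((entry["output"], entry["template"]))

    seen = {}  # lower-cased path -> first (path, template) that claimed it
    for path, template in outputs:
        key = path.lower()
        if key in seen:
            first_path, first_template = seen[key]
            raise ValueError(
                f"output collision: {path!r} ({template}) "
                f"vs {first_path!r} ({first_template})"
            )
        seen[key] = (path, template)
    return [path for path, _ in outputs]
```

Running the uniqueness check over type and `once` outputs together is what would let a type whose stem happens to be `runtime` fail loudly against `mx_runtime.c`, rather than relying on a per-backend reserved-stem list.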
+ +The writer (`gen/emit/writer.py`) is already neutral and survives unchanged: marker-gated pruning, +foreign-file safety, idempotence. + +## 7. What remains in Python, and why that is allowed + +The closed set, each definable without naming any language: `gen/xsd` (schema parsing), `gen/ir` +(lowering + Resolver), `gen/names.py` (tokenizer, casing registry, sanitizer -- string mechanics), +`gen/plates` (projection driven entirely by IR + config; `languages.py` deleted), `gen/press` +(template engine + context builder + manifest expansion), the writer, and the CLI. The litmus test +for every future line of generator Python: *could this be wrong for a language we have not heard +of?* If yes, it belongs in a target's directory. + +Explicitly **outside** the rule's scope: the corert harnesses (`gen/test/go/corert/`, +`gen/test/c/src/`), smoke tests, CMakeLists, go.mod. These are hand-written programs that *consume* +generated code, exactly like a downstream user; they are target code, not generator code. The rule +governs the machine, not the things the machine's output links against. + +## 8. Migration plan + +Each phase lands green (all suites pass) and pushed; phases 3-4 carry a hard parity gate: +regenerate and `git diff --exit-code` over the committed `mx/` output -- the port is proven by +byte-identical generation before the Python it replaces is deleted. + +1. **The press.** Engine + context builder + manifest expansion + format hook. Tests: the + official Mustache spec suite for the implemented subset (minus the three documented + deviations), plus unit tests for the deviations themselves (fail-loud missing keys, identity + interpolation, no lambdas) and for the context builder's injections (discriminant expansion, + loop metadata, `_q` companions). No target changes. +2. 
**Config absorbs `languages.py`.** `[types]`/`[reserved]` become explicit in all three configs; + `variant-scope` explicit; `[vars]` introduced; `doc_lines` on plates; `PlateRef.name`/`kind` + and plate `deps` added. Generated output must not change (these are data motions). Delete + `languages.py`. +3. **Port the C target.** Translate `gen/emit/c/*.py` into `gen/test/c/templates/` + manifest. + Byte-parity gate, corert green, valgrind clean. Delete `gen/emit/c/`. +4. **Port the Go target.** Same, with the format hook carrying gofmt. Byte-parity gate, corert + green. Delete `gen/emit/go/`, the `BACKENDS` registry, and `[target] language`. +5. **Prove the rule.** Add the JSON Schema target (`gen/schema/`: config.toml + one template) -- the + neutral target the plates design used as its forcing function, now actually built. Its + round-trip check: validate a corpus sample against the emitted schema. Add the CI assertion + that the target's commit touches no `*.py`, and a structural test that `gen/` imports cleanly + with no module or table naming a language. +6. **Docs.** Update `plates.md` section 11 (supersession note), `gen/README.md` (target anatomy, + press spec), `AGENTS.md` (the cardinal rule, stated as such). + +Then, and only then, the C++ target begins -- as a target directory, written without touching Python, which is +the entire point. + +## 9. Alternatives considered and rejected + +- **Per-target Python plugins** (each target ships a `backend.py` the generator loads dynamically). + Satisfies the letter of the rule -- no edits to the generator's files -- and would be the + cheapest migration (move the existing modules into the targets). Rejected on the spirit: the + language knowledge would still be Python programs, just relocated; the C++ backend would again + be two thousand lines of imperative emission; and nothing would force decisions into the plates, + because a plugin can compute anything. 
The review's instruction was that the bespoke backends + "should not exist," not that they should move. +- **Jinja2** (or any expressive template engine, vendored or as a dependency). Mature, excellent + diagnostics, configurable strictness -- and expressive is the problem: filters, macros, + arbitrary expressions, and `set` would let the Go backend be reconstituted *inside* template + files, hiding naming logic where no structural gate can see it; only review discipline would + stand between the targets and that, and this redesign exists because structure beats discipline. + It also adds a pip/vendored dependency tree to a deliberately dependency-free Python side. +- **An existing Mustache library** (chevron, pystache) -- the serious alternative, since it shares + the language's logic-less constitution and would spare us the parser. Weighed and declined, as a + close call, on four counts: (1) the spec mandates *silent empty output for missing keys*, which + is disqualifying for a generator and not configurable in chevron (pystache has a strict option + but is effectively unmaintained; both last released years ago); (2) spec HTML-escaping and weak + error locations mean we would patch a vendored copy in three places and own the result anyway -- + owning ~400 written-and-spec-tested lines beats owning ~500 vendored lines plus patches by a + thin margin; (3) the repo's Python side has a deliberate no-dependencies precedent (the + hand-written XSD parser, tomllib); (4) conformance risk -- the real argument FOR a library -- is + neutralized by running the official Mustache spec test suite against the press (section 5). + Because template syntax is pure Mustache, this decision is cheaply reversible: **if during phase + 1 the press exceeds ~600 lines or cannot pass the spec suite, the pre-committed fallback is to + vendor chevron and patch strictness/escaping/diagnostics** -- with zero template changes. 
+- **AST-based emitters** (build a language-neutral syntax tree, print per language). A second + language-shaped abstraction to design, with the per-language printers landing right back in + Python. Wrong direction entirely. +- **Keeping `languages.py` as "just data."** It is data, but data keyed by language name inside + the generator is still the generator knowing languages; every new target edits it. Config is the + same data in the right place. + +## 10. Risks and open questions + +- **Template debuggability.** Generated-code bugs become template bugs; the press must report + `template:line` in every error, and a `python3 -m gen render --config C --type note` debugging + command (render one plate through its matching templates to stdout) should land with phase 1. +- **The quoted-literal compromise** (section 5). The `_q` companions encode one escape family + (C/C++/Go/Java/JS/Rust-compatible) in the neutral context layer. Revisit trigger: a target whose + string literals are outside that family (e.g. single-quote-only syntaxes); the extension point + is the context builder, not the engine. +- **Parity discipline.** The ports in phases 3-4 will be tedious precisely because parity is + byte-exact; resist "improving" generated output mid-port. Cleanups come after deletion, as + ordinary template edits. +- **Synthetic-name gating.** `[reserved] members` / `type-suffixes` (section 4) covers the known + cases (Go's `Children` field, `Child` struct suffix). The residual risk -- a target's templates + composing an identifier shape the gate cannot model -- is bounded by the compiler catching it in committed + output. +- **Cross-target template sharing.** Go and C templates will rhyme (the same walk, two grammars). + No sharing mechanism in v1: a target is self-contained, and duplication across targets is the + acceptable cost of targets being independently ownable. 
Revisit only if a third *code* target + makes the rhyme painful -- and note the C++ target is likely to diverge more than it rhymes (sum types, + references, exceptions). +- **The inherit-chain guard.** The Go backend's loud rejection of derivation chains with children + in multiple members was backend Python; once the backends are templates, that knowledge has no generic home. Position: + drop it. The plates dump makes chain shapes visible, no MusicXML schema has the shape, and the + committed-output compile is the backstop. If it ever bites, the neutral fact ("N chain members + carry element members") can become plate data a template renders into a `#error`. +- **Engine creep.** The contract is "the template language is Mustache": the press neither adds + syntax nor restores the spec's lambdas, and the spec test suite pins it there. The review + question for any proposed press or context-builder feature: "does this let a template make a + decision the plates should own?" If yes, the answer is no. + +## 11. Implementation notes + +Implemented in six pushed phases, each green, exactly per section 8. The deltas and outcomes: + +- **The parity gates held.** C: 674/677 files byte-identical; the three deviations were the OLD + output's bugs (a doubled space from the legacy child-field spacing, an `&(*ptr)` spelling). Go: + 336/336 byte-identical. Each backend was deleted only after its gate passed, verified beyond + bytes by both corert suites, the smoke binaries, and full-suite valgrind. +- **The proof landed as designed**: `gen/schema/` (config + one template, 373 `$defs`) was added + with zero Python edits, and `gen/tests/test_agnosticism.py` enforces the closed set + structurally. The schema template models complex types over the flat member view (choice + nesting deferred -- to future template work, which is the point). 
+- **The context builder grew three mechanical conveniences** beyond section 5: `has_` + companions (non-iterating emptiness tests), a flattened union `cases` view (loop metadata at + the granularity union kind constants actually have), and earlier-field-wins discriminant + expansion (PlateRef's `category` and `kind` vocabularies overlap consistently). Dotted + resolution through a present-but-None value is falsey, extending deviation 1's logic. +- **Plates additions**: `UnionPlate.open_ended`, and the open-string-member ordering rule moved + from backend Python into the collision gate (union parse semantics, not language). +- **`[vars]` is barely needed in practice**: a target's templates hardcode their own spellings + (the Go package clause, the C `mx_` prefix are template text). Only cpp carries a var so far. + The mechanism stays: it is the answer to "where does a language-flavored value go". +- **One Mustache wrinkle**: an interpolation directly after `{` forms `{{{`. Templates write a + space (`Child{ {{ident}}`); for Go, gofmt removes it, preserving byte parity. +- **The transitional `[target] language`, `namespace`, `prefix`, and `[layout]` keys are gone**; + `symbol-prefix` is the surviving projection-contract key, and output paths live entirely in + manifest output patterns (whose expansion carries the case-insensitive collision gate that + replaced file-stem checking). diff --git a/docs/ai/design/mx-core-gpt-5.0.md b/docs/ai/design/mx-core-gpt-5.0.md new file mode 100644 index 000000000..009a9905f --- /dev/null +++ b/docs/ai/design/mx-core-gpt-5.0.md @@ -0,0 +1,1114 @@ +# C++ `mx/core` design + +Status: design proposal for the generated C++20 `mx::core` product. + +The Go and C targets proved the generator pipeline: XSD -> IR -> Plates -> target-owned Mustache +works, the press is language-agnostic, and the corert flow can be made green from generated code. +They also proved what the C++ target must **not** copy. 
Go and C are deliberately pragmatic test +bindings: they preserve document order with an ordered child list, parse values leniently, and +expose plain data. The C++ target is the real product and should be a stricter, deeper abstraction. + +## Position + +Generate a valid-by-construction, order-faithful MusicXML model. + +The public C++ API should make invalid MusicXML states unreachable through normal API use: + +- no public mutable fields; +- no public default constructors for model objects whose schema content is required; +- required attributes and required children appear in constructors or `make` factories; +- optional schema members use `std::optional` behind accessors; +- repeated members are stored privately and exposed as read-only spans; +- schema choices use `std::variant`, not nullable sibling fields; +- XML parsing validates structure and required content before producing a model object; +- import leniency may clamp out-of-range numbers, but the object produced is still valid. + +C++ cannot protect against undefined behavior, memory scribbles, malicious casts, or every possible +moved-from misuse. The contract is: the generated API does not provide an operation that creates an +invalid serializable model object. + +## Lessons from Go and C + +What to keep: + +1. **One source of truth.** All schema facts must still come from the Plates. C++ is a target + directory: `gen/cpp/config.toml` plus `gen/cpp/templates/`. Do not add C++ knowledge to + `gen/*.py`. +2. **Order fidelity matters.** A MusicXML measure interleaves `note`, `backup`, `direction`, + `attributes`, and other music-data elements. A flat field per element is not enough. +3. **Generated version constants matter.** The C++ runtime should emit `SupportedMusicXMLVersion` + from `Plates.schema_version`, just like Go and C now do. +4. **Unknown names are errors.** With version gating, an unknown element or attribute is a generator + or input problem, not data to preserve silently. +5. 
**Numeric clamping is an import policy.** The clamp steps now live on the Plates. C++ should use + that data exactly once in its value parsers. + +What not to keep: + +1. **No multi-null child structs.** Go/C use `NoteChild{Pitch: &p}` because they lack a good sum + type in the chosen style. C++ has `std::variant`; use it. +2. **No public structs as the product API.** Go/C are harness bindings. C++ should hide invariants + behind small interfaces. +3. **No always-flat child list.** The C++ target should consume `ComplexPlate.content`, not just + `elements`. Pure sequences should become sequence-shaped classes; repeated choices should become + ordered variants. +4. **No enum fallback to the first value for user construction.** Unknown enum input should fail. + Clamping numbers can be compatible with a valid model; inventing enum values is not. + +## C++ feature choices + +| Feature | Use | Rule | +|------------------------------|---------------------------------------------------------------------|-------------------------------------------------------------------------------------| +| C++20 | Target language level | Raise CMake to C++20 when this target lands. | +| `class` with private members | Generated model types | Default for all complex and constrained value types. | +| `enum class` | Internal tags | Do not expose raw enums as constructible domain values. Wrap them in value classes. | +| `std::variant` | XML choices, unions, document root | Exactly-one by construction. | +| `std::optional` | Optional attributes/elements and defaulted/fixed attribute presence | Never for required schema content. | +| `std::vector` | Repeated homogeneous content and repeated choice groups | Kept private; expose `std::span`. | +| `std::unique_ptr` | Alternatives inside repeated heterogeneous choice vectors | Avoids a huge `std::variant` object per child. | +| `std::span` | Read-only views of repeated content | No mutable vector exposure. 
| +| `std::string_view` | Parse/format APIs and wire names | Avoids unnecessary string copies at API boundaries. | +| `constexpr std::array` | Wire literal tables | No dynamic enum maps unless profiling proves need. | +| `std::from_chars` | Numeric lexical parsing | Strict, locale-free parsing. | +| non-polymorphic inheritance | Only for IR `derived` complex types | Schema extension only; no virtual model hierarchy. | +| `Result` runtime type | Parse/factory errors | C++20 has no `std::expected`; provide a small target runtime `Result`. | + +Avoid: + +- raw owning pointers; +- `new` / `delete` in generated code; +- public mutable containers; +- runtime polymorphism for every element; +- `std::shared_ptr` by default; +- `std::regex` in headers; +- template-heavy generated code. + +## Output layout + +Recommended generated layout under `src/private/mx/core/`: + +```text +mx/core/ + Document.h / Document.cpp + Result.h + Error.h + Decimal.h / Decimal.cpp + Xml.h / Xml.cpp + UpDown.h / UpDown.cpp + FontSize.h / FontSize.cpp + Pitch.h / Pitch.cpp + PartwiseMeasure.h / PartwiseMeasure.cpp + ... one header/source pair per generated type ... + Sources.cmake +``` + +Headers should contain type declarations, small accessors, and inline trivial factories. Parsing, +serialization, string tables, pattern checks, and visitors should live in `.cpp` files. Compile time +is the first optimization priority; do not dump large parse functions into headers. + +## Error and result shape + +The runtime support can define generic infrastructure, but generated code should use concrete result +spelling. 
Rendered code should look like this at call sites: + +```cpp +namespace mx::core +{ + +enum class ErrorCode +{ + unknownElement, + unknownAttribute, + missingRequiredElement, + missingRequiredAttribute, + wrongElementOrder, + tooManyElements, + invalidValue, + unsupportedVersion, +}; + +struct Error +{ + ErrorCode code; + std::string path; + std::string message; +}; + +Result parsePitch(const ezxml::XElement &element); +Result parseDocument(const ezxml::XDoc &doc); + +} // namespace mx::core +``` + +The generated functions should accumulate path context +(`/score-partwise/part[0]/measure[2]/note[1]`) so corert failures are diagnosable. The old +`bool fromXDoc(std::ostream&, ...)` shape can be kept as a thin adapter; it should not be the core +API. + +## Value type shapes + +### Closed enum values: wrapper class, not a raw enum + +A raw C++ enum can be fabricated with `static_cast`. Use a small value class with a private tag +constructor. Rendered `up-down` should look like this: + +```cpp +#pragma once + +#include "mx/core/Result.h" + +#include +#include + +namespace mx::core +{ + +class UpDown final +{ + public: + enum class Tag : std::uint8_t + { + up, + down, + }; + + static constexpr UpDown up() noexcept { return UpDown(Tag::up); } + static constexpr UpDown down() noexcept { return UpDown(Tag::down); } + + static Result parse(std::string_view text); + + constexpr Tag tag() const noexcept { return tag_; } + constexpr bool isUp() const noexcept { return tag_ == Tag::up; } + constexpr bool isDown() const noexcept { return tag_ == Tag::down; } + + std::string_view wire() const noexcept; + + private: + explicit constexpr UpDown(Tag tag) noexcept : tag_(tag) {} + + Tag tag_; +}; + +} // namespace mx::core +``` + +The `.cpp` keeps the wire table out of dependent translation units: + +```cpp +#include "mx/core/UpDown.h" + +#include + +namespace mx::core +{ +namespace +{ +constexpr std::array kWire = {"up", "down"}; +} + +Result UpDown::parse(std::string_view text) +{ + 
if (text == "up") + { + return UpDown::up(); + } + if (text == "down") + { + return UpDown::down(); + } + return Error{ErrorCode::invalidValue, {}, "invalid up-down value"}; +} + +std::string_view UpDown::wire() const noexcept +{ + return kWire[static_cast(tag_)]; +} + +} // namespace mx::core +``` + +### Numeric wrappers: strict construction, explicit import clamping + +The Plates already carry clamp policy. Use it for XML import, not for ordinary construction. +Rendered `midi-channel` should look like this: + +```cpp +#pragma once + +#include "mx/core/Result.h" + +#include + +namespace mx::core +{ + +class MIDIChannel final +{ + public: + static Result make(int value); + static Result parse(std::string_view text); + static MIDIChannel importClamp(int value) noexcept; + + constexpr int value() const noexcept { return value_; } + std::string toString() const; + + private: + explicit constexpr MIDIChannel(int value) noexcept : value_(value) {} + + int value_; +}; + +} // namespace mx::core +``` + +```cpp +#include "mx/core/MIDIChannel.h" +#include "mx/core/NumberParse.h" + +namespace mx::core +{ + +Result MIDIChannel::make(int value) +{ + if (value < 1 || value > 16) + { + return Error{ErrorCode::invalidValue, {}, "midi-channel must be in [1, 16]"}; + } + return MIDIChannel(value); +} + +MIDIChannel MIDIChannel::importClamp(int value) noexcept +{ + if (value < 1) + { + value = 1; + } + if (value > 16) + { + value = 16; + } + return MIDIChannel(value); +} + +Result MIDIChannel::parse(std::string_view text) +{ + const auto parsed = parseInteger(text); + if (!parsed) + { + return parsed.error(); + } + return MIDIChannel::make(parsed.value()); +} + +std::string MIDIChannel::toString() const +{ + return formatInteger(value_); +} + +} // namespace mx::core +``` + +The XML parser can call `importClamp` when the target chooses compatibility import. User code gets +`make`, which rejects invalid values. 
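The split between strict construction and import clamping can be illustrated with a minimal sketch. `MidiChannelSketch` is a hypothetical stand-in for the generated `MIDIChannel`, and `std::optional` replaces the runtime `Result` type only so that the example is self-contained; neither name is part of the proposed API.

```cpp
#include <algorithm>
#include <optional>

// Hypothetical stand-in for the generated MIDIChannel wrapper.
// Assumption: std::optional substitutes for the runtime Result type.
class MidiChannelSketch final
{
  public:
    // Strict path for user code: out-of-range input is rejected.
    static std::optional<MidiChannelSketch> make(int value)
    {
        if (value < 1 || value > 16)
        {
            return std::nullopt;
        }
        return MidiChannelSketch(value);
    }

    // Import path for the XML parser: out-of-range input is clamped into range.
    static MidiChannelSketch importClamp(int value) noexcept
    {
        return MidiChannelSketch(std::clamp(value, 1, 16));
    }

    int value() const noexcept { return value_; }

  private:
    // Private constructor: every instance passes through make or importClamp.
    explicit MidiChannelSketch(int value) noexcept : value_(value) {}

    int value_;
};
```

Both paths end in the same private constructor, so every instance that exists holds a value in `[1, 16]` regardless of which policy produced it.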
+ +### Strings: domain wrappers own validation + +String wrappers should not be aliases. They are the only place length and pattern constraints can be +enforced. + +```cpp +#pragma once + +#include "mx/core/Result.h" + +#include <string> +#include <string_view> + +namespace mx::core +{ + +class Color final +{ + public: + static Result<Color> make(std::string value); + static Result<Color> parse(std::string_view text); + + const std::string &value() const noexcept { return value_; } + std::string_view wire() const noexcept { return value_; } + + private: + explicit Color(std::string value) : value_(std::move(value)) {} + + std::string value_; +}; + +} // namespace mx::core +``` + +Pattern validation belongs in `.cpp`. If the XSD regex subset needs custom handling (`\c` appears in +MusicXML), hide it in `XmlPattern.cpp`; do not generate ad-hoc regex logic into every header. + +### Unions: `std::variant` + +Rendered `font-size` should be a tiny sum type. This is strictly better than Go/C's hand-rolled +`Kind` plus fields because `std::variant` cannot hold two members at once. 
+ +```cpp +#pragma once + +#include "mx/core/CSSFontSize.h" +#include "mx/core/Decimal.h" +#include "mx/core/Result.h" + +#include <string> +#include <string_view> +#include <variant> + +namespace mx::core +{ + +class FontSize final +{ + public: + using Value = std::variant<Decimal, CSSFontSize>; + + static FontSize points(Decimal value) { return FontSize(Value{value}); } + static FontSize css(CSSFontSize value) { return FontSize(Value{value}); } + static Result<FontSize> parse(std::string_view text); + + bool isPoints() const noexcept { return std::holds_alternative<Decimal>(value_); } + bool isCSS() const noexcept { return std::holds_alternative<CSSFontSize>(value_); } + + const Value &value() const noexcept { return value_; } + std::string toString() const; + + private: + explicit FontSize(Value value) : value_(std::move(value)) {} + + Value value_; +}; + +} // namespace mx::core +``` + +```cpp +Result<FontSize> FontSize::parse(std::string_view text) +{ + if (auto decimal = Decimal::parse(text)) + { + return FontSize::points(decimal.value()); + } + if (auto css = CSSFontSize::parse(text)) + { + return FontSize::css(css.value()); + } + return Error{ErrorCode::invalidValue, {}, "invalid font-size"}; +} + +std::string FontSize::toString() const +{ + return std::visit( + [](const auto &member) { return member.toString(); }, + value_); +} +``` + +Open string unions, such as `instrument-sound` after the sounds companion fold, should put the open +string alternative last in parsing, matching the Plates rule. + +## Complex type shapes + +### Pure sequence: fields, not a child vector + +`pitch` is a sequence: `step`, optional `alter`, `octave`. It should not become an ordered vector of +three alternatives. Generate the shape the schema says. 
+ +```cpp +#pragma once + +#include "mx/core/Octave.h" +#include "mx/core/Result.h" +#include "mx/core/Semitones.h" +#include "mx/core/Step.h" + +#include + +namespace mx::core +{ + +class Pitch final +{ + public: + Pitch(Step step, Octave octave) + : step_(step), octave_(octave) + { + } + + static Result make(Step step, std::optional alter, Octave octave); + static Result parseXml(const ezxml::XElement &element); + + const Step &step() const noexcept { return step_; } + const std::optional &alter() const noexcept { return alter_; } + const Octave &octave() const noexcept { return octave_; } + + void setStep(Step value) noexcept { step_ = value; } + void setAlter(std::optional value) { alter_ = std::move(value); } + void setOctave(Octave value) noexcept { octave_ = value; } + + void writeXml(ezxml::XElement &parent, std::string_view tag) const; + + private: + Step step_; + std::optional alter_; + Octave octave_; +}; + +} // namespace mx::core +``` + +A strict parser for `pitch` should be a small generated state machine: + +```cpp +Result Pitch::parseXml(const ezxml::XElement &element) +{ + std::optional step; + std::optional alter; + std::optional octave; + + enum class State + { + start, + afterStep, + afterAlter, + afterOctave, + }; + + State state = State::start; + + for (auto it = element.begin(); it != element.end(); ++it) + { + const auto child = *it; + const auto name = child->getName(); + + if (name == "step" && state == State::start) + { + auto parsed = Step::parse(child->getValue()); + if (!parsed) + { + return parsed.error(); + } + step = parsed.value(); + state = State::afterStep; + } + else if (name == "alter" && state == State::afterStep) + { + auto parsed = Semitones::parse(child->getValue()); + if (!parsed) + { + return parsed.error(); + } + alter = parsed.value(); + state = State::afterAlter; + } + else if (name == "octave" && (state == State::afterStep || state == State::afterAlter)) + { + auto parsed = Octave::parse(child->getValue()); + if 
(!parsed) + { + return parsed.error(); + } + octave = parsed.value(); + state = State::afterOctave; + } + else + { + return Error{ErrorCode::wrongElementOrder, {}, "invalid child in pitch"}; + } + } + + if (!step) + { + return Error{ErrorCode::missingRequiredElement, {}, "pitch is missing step"}; + } + if (!octave) + { + return Error{ErrorCode::missingRequiredElement, {}, "pitch is missing octave"}; + } + + return Pitch::make(step.value(), alter, octave.value()); +} +``` + +This is the key difference from Go/C: C++ should consume the content tree and enforce sequence +order. + +### Repeated homogeneous members: private vector + +For repeated elements like `beam` in `note`, use private vectors and bounded append APIs. + +```cpp +class Note final +{ + public: + std::span<const Beam> beams() const noexcept { return beams_; } + + Result<void> addBeam(Beam beam) + { + if (beams_.size() == 8) + { + return Error{ErrorCode::tooManyElements, {}, "note has too many beam elements"}; + } + beams_.push_back(std::move(beam)); + return {}; + } + + private: + std::vector<Beam> beams_; +}; +``` + +If `maxOccurs` is unbounded, `addX` cannot fail for cardinality. If `minOccurs` is at least one, the +constructor or factory takes the first value and the vector never exposes `clear`. + +### Repeated heterogeneous choices: boxed variants + +For measure music-data, order is semantically important and alternatives have very different sizes. +Do not use `std::variant<Note, Backup, ...>` directly in a vector: every vector slot would +be as large as the largest alternative. Use `std::variant<std::unique_ptr<Note>, ...>` and keep the +container private. 
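The size argument above can be checked with a small sketch. `BigNote` and `SmallBackup` are hypothetical placeholder types, not generated classes, and the comparison assumes a mainstream standard library in which `std::unique_ptr` with the default deleter is pointer-sized.

```cpp
#include <array>
#include <memory>
#include <variant>

// Hypothetical alternatives of very different sizes.
struct BigNote
{
    std::array<char, 512> payload{};
};

struct SmallBackup
{
    int duration = 0;
};

// By-value variant: every object is at least as large as the largest alternative.
using ByValue = std::variant<BigNote, SmallBackup>;

// Boxed variant: every object holds one owning pointer plus the discriminator.
using Boxed = std::variant<std::unique_ptr<BigNote>, std::unique_ptr<SmallBackup>>;

static_assert(sizeof(ByValue) >= sizeof(BigNote));
static_assert(sizeof(Boxed) < sizeof(ByValue));
```

A `std::vector<ByValue>` holding mostly `SmallBackup` items would still reserve `BigNote`-sized slots; the boxed form pays one allocation per child instead.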
+ +Rendered `partwise-measure` should look like this: + +```cpp +#pragma once + +#include "mx/core/Attributes.h" +#include "mx/core/Backup.h" +#include "mx/core/Barline.h" +#include "mx/core/Bookmark.h" +#include "mx/core/Direction.h" +#include "mx/core/FiguredBass.h" +#include "mx/core/Forward.h" +#include "mx/core/Grouping.h" +#include "mx/core/Harmony.h" +#include "mx/core/Link.h" +#include "mx/core/MeasureText.h" +#include "mx/core/Note.h" +#include "mx/core/Print.h" +#include "mx/core/Result.h" +#include "mx/core/Sound.h" +#include "mx/core/Tenths.h" +#include "mx/core/YesNo.h" + +#include +#include +#include +#include +#include +#include + +namespace mx::core +{ + +class PartwiseMeasure final +{ + public: + using MusicData = std::variant< + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr, + std::unique_ptr>; + + explicit PartwiseMeasure(std::string number) + : number_(std::move(number)) + { + } + + PartwiseMeasure(const PartwiseMeasure &other); + PartwiseMeasure &operator=(const PartwiseMeasure &other); + PartwiseMeasure(PartwiseMeasure &&) noexcept = default; + PartwiseMeasure &operator=(PartwiseMeasure &&) noexcept = default; + + const std::string &number() const noexcept { return number_; } + void setNumber(std::string value) { number_ = std::move(value); } + + const std::optional &text() const noexcept { return text_; } + void setText(std::optional value) { text_ = std::move(value); } + + std::span musicData() const noexcept { return music_data_; } + + void appendNote(Note note) { music_data_.push_back(std::make_unique(std::move(note))); } + void appendBackup(Backup backup) { music_data_.push_back(std::make_unique(std::move(backup))); } + void appendForward(Forward forward) { music_data_.push_back(std::make_unique(std::move(forward))); } + void appendDirection(Direction direction) 
{ music_data_.push_back(std::make_unique(std::move(direction))); } + void appendAttributes(Attributes attributes) { music_data_.push_back(std::make_unique(std::move(attributes))); } + void appendHarmony(Harmony harmony) { music_data_.push_back(std::make_unique(std::move(harmony))); } + void appendFiguredBass(FiguredBass figuredBass) { music_data_.push_back(std::make_unique(std::move(figuredBass))); } + void appendPrint(Print print) { music_data_.push_back(std::make_unique(std::move(print))); } + void appendSound(Sound sound) { music_data_.push_back(std::make_unique(std::move(sound))); } + void appendBarline(Barline barline) { music_data_.push_back(std::make_unique(std::move(barline))); } + void appendGrouping(Grouping grouping) { music_data_.push_back(std::make_unique(std::move(grouping))); } + void appendLink(Link link) { music_data_.push_back(std::make_unique(std::move(link))); } + void appendBookmark(Bookmark bookmark) { music_data_.push_back(std::make_unique(std::move(bookmark))); } + + static Result parseXml(const ezxml::XElement &element); + void writeXml(ezxml::XElement &parent, std::string_view tag) const; + + private: + std::string number_; + std::optional text_; + std::optional implicit_; + std::optional non_controlling_; + std::optional width_; + std::optional id_; + std::vector music_data_; +}; + +} // namespace mx::core +``` + +The copy constructor clones the pointed alternatives. The public API still feels value-like, while +each child slot remains small. 
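The cloning step described above can be sketched as follows. `NoteSketch`, `BackupSketch`, `cloneItem`, and `cloneAll` are hypothetical names used for illustration only; the sketch assumes, as the typed append methods guarantee, that no stored pointer is ever null.

```cpp
#include <memory>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for two generated alternatives.
struct NoteSketch
{
    int duration = 0;
};

struct BackupSketch
{
    int duration = 0;
};

using MusicDataSketch =
    std::variant<std::unique_ptr<NoteSketch>, std::unique_ptr<BackupSketch>>;

// Deep-copies one boxed alternative and keeps the same alternative active.
// Assumption: ptr is never null.
inline MusicDataSketch cloneItem(const MusicDataSketch &item)
{
    return std::visit(
        [](const auto &ptr) -> MusicDataSketch
        {
            using Pointee = typename std::decay_t<decltype(ptr)>::element_type;
            return std::make_unique<Pointee>(*ptr);
        },
        item);
}

// Deep-copies the whole ordered child list; order is preserved.
inline std::vector<MusicDataSketch> cloneAll(const std::vector<MusicDataSketch> &source)
{
    std::vector<MusicDataSketch> copy;
    copy.reserve(source.size());
    for (const auto &item : source)
    {
        copy.push_back(cloneItem(item));
    }
    return copy;
}
```

A copy constructor for a class like `PartwiseMeasure` could delegate to a helper of this shape, so callers never see the pointers.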
+ +Serialization dispatch is straightforward and safe: + +```cpp +void PartwiseMeasure::writeXml(ezxml::XElement &parent, std::string_view tag) const +{ + auto element = parent.appendChild(std::string(tag)); + element->appendAttribute("number")->setValue(number_); + + if (text_) + { + element->appendAttribute("text")->setValue(text_->toString()); + } + + for (const auto &item : music_data_) + { + std::visit( + [&](const auto &ptr) + { + using Pointer = std::decay_t; + if constexpr (std::is_same_v>) + { + ptr->writeXml(*element, "note"); + } + else if constexpr (std::is_same_v>) + { + ptr->writeXml(*element, "backup"); + } + else if constexpr (std::is_same_v>) + { + ptr->writeXml(*element, "forward"); + } + else if constexpr (std::is_same_v>) + { + ptr->writeXml(*element, "direction"); + } + }, + item); + } +} +``` + +The rendered file would include all alternatives; the snippet is shortened only after `Direction`. +The shape is the important decision: boxed variant, private vector, typed append methods. + +### Non-repeated choice: by-value variant + +For one-of content that occurs once, prefer by-value alternatives. Example shape for the +`pitch | unpitched | rest` part of `note`: + +```cpp +class Note final +{ + public: + using FullNote = std::variant; + + static Result make(FullNote fullNote, std::optional duration); + + const FullNote &fullNote() const noexcept { return full_note_; } + void setFullNote(FullNote value) { full_note_ = std::move(value); } + + private: + FullNote full_note_; + std::optional duration_; +}; +``` + +This is compact and enforces exactly one full-note alternative. + +### Attributes + +Attributes should be separate from child content. Required attributes are stored by value. Optional, +defaulted, and fixed attributes store presence separately from effective value. 
+ +For defaulted attributes, preserve wire absence but provide an effective getter: + +```cpp +class Barline final +{ + public: + RightLeftMiddle location() const noexcept + { + return location_.value_or(RightLeftMiddle::right()); + } + + bool hasLocationAttribute() const noexcept { return location_.has_value(); } + void setLocationAttribute(RightLeftMiddle value) { location_ = value; } + void clearLocationAttribute() noexcept { location_.reset(); } + + private: + std::optional location_; +}; +``` + +For fixed attributes, do not expose a setter that accepts arbitrary values. Expose presence only: + +```cpp +class Link final +{ + public: + bool hasXlinkTypeAttribute() const noexcept { return has_xlink_type_; } + void writeXlinkTypeAttribute() noexcept { has_xlink_type_ = true; } + void omitXlinkTypeAttribute() noexcept { has_xlink_type_ = false; } + + private: + bool has_xlink_type_ = false; +}; +``` + +The serializer writes the fixed value when the presence flag is true. + +### Empty elements + +Presence-only elements should not become empty structs in parent APIs. In a sequence field, use a +specific marker type only when the element itself needs a type in a `std::variant`. + +```cpp +class Empty final +{ + public: + static constexpr Empty present() noexcept { return Empty(); } + + private: + constexpr Empty() noexcept = default; +}; +``` + +An optional presence-only child in a pure sequence can be `std::optional`. A presence-only +choice alternative can be `Empty` in a variant. + +### Schema derivation + +MusicXML derivation is small and attribute-only. Use non-polymorphic public inheritance only for IR +`derived` types. Do not introduce a virtual base class for all elements. 
+ +```cpp +class EmptyTrillSound +{ + public: + const std::optional &type() const noexcept { return type_; } + void setType(std::optional value) { type_ = std::move(value); } + + private: + std::optional type_; + std::optional accelerate_; + std::optional beats_; +}; + +class Mordent final : public EmptyTrillSound +{ + public: + const std::optional &placement() const noexcept { return placement_; } + void setPlacement(std::optional value) { placement_ = std::move(value); } + + private: + std::optional placement_; +}; +``` + +No virtual destructor is needed because this is not a polymorphic hierarchy and the API should not +traffic in owning base pointers. + +## Document root + +The document has exactly one root. Use a root variant. + +```cpp +#pragma once + +#include "mx/core/Result.h" +#include "mx/core/ScorePartwise.h" +#include "mx/core/ScoreTimewise.h" + +#include +#include +#include + +namespace mx::core +{ + +inline constexpr std::string_view SupportedMusicXMLVersion = "4.0"; +inline constexpr std::string_view DoctypeValueScorePartwise = + "-//Recordare//DTD MusicXML 4.0 Partwise//EN"; +inline constexpr std::string_view DoctypeValueScoreTimewise = + "-//Recordare//DTD MusicXML 4.0 Timewise//EN"; + +struct ExtraAttribute +{ + std::string name; + std::string value; +}; + +class Document final +{ + public: + using Root = std::variant; + + explicit Document(ScorePartwise score) : root_(std::move(score)) {} + explicit Document(ScoreTimewise score) : root_(std::move(score)) {} + + static Result fromXDoc(const ezxml::XDoc &doc); + + bool isPartwise() const noexcept { return std::holds_alternative(root_); } + bool isTimewise() const noexcept { return std::holds_alternative(root_); } + + const Root &root() const noexcept { return root_; } + + const std::vector &rootNamespaces() const noexcept { return root_namespaces_; } + void preserveRootNamespace(ExtraAttribute attr) { root_namespaces_.push_back(std::move(attr)); } + + void toXDoc(ezxml::XDoc &out) const; + + 
private: + Root root_; + std::vector root_namespaces_; +}; + +} // namespace mx::core +``` + +The current corert harness calls `makeDocument()->fromXDoc(...)`. Keep that as a compatibility +adapter, not as the model itself: + +```cpp +namespace mx::core +{ + +class DocumentIo final +{ + public: + bool fromXDoc(std::ostream &errors, const ezxml::XDoc &doc) + { + auto parsed = Document::fromXDoc(doc); + if (!parsed) + { + errors << parsed.error().message; + return false; + } + document_ = std::move(parsed.value()); + return true; + } + + void toXDoc(ezxml::XDoc &out) const + { + document_.value().toXDoc(out); + } + + private: + std::optional document_; +}; + +using DocumentPtr = std::unique_ptr; + +inline DocumentPtr makeDocument() +{ + return std::make_unique(); +} + +} // namespace mx::core +``` + +`DocumentIo` may be empty while parsing; `Document` never is. That keeps legacy harness shape from +infecting the product model. + +## Parsing policy + +Use two distinct policies: + +1. **Structure is strict.** Unknown elements, unknown attributes, wrong order, missing required + content, and too many bounded elements are errors. +2. **Numeric import may clamp.** Existing mx behavior and fixup sidecars expect out-of-range numeric + values to be clamped into range. That is allowed because the resulting model is valid. + +Do not silently repair these: + +- unknown enum literals; +- failed string patterns; +- missing required children or attributes; +- choice branches with multiple alternatives; +- newer MusicXML root versions. + +Version gating should happen at document parse before dispatching deep into the model. A document +declaring a newer version should return `unsupportedVersion`, letting the corert harness skip it. + +## Serialization policy + +Serialization should be total for every valid model object. 
+ +Rules: + +- sequence-shaped classes emit in schema order; +- repeated choice vectors emit in vector order; +- optional attributes emit only when present; +- defaulted attributes use effective getters for API reads but preserve absence on write; +- fixed attributes emit their fixed literal only when their presence flag is set; +- root namespace declarations captured during parse are written back on the root; +- XML declaration and doctype normalization remain in the corert normalization layer, not in every + model serializer. + +## Template strategy + +The C++ templates should consume more of the Plates than Go/C did: + +- value templates consume `EnumPlate`, `NumberPlate`, `StringPlate`, and `UnionPlate` as Go/C do; +- complex templates must consume `ComplexPlate.content` for child structure; +- attributes still come from `attributes` / `merged_attributes`; +- schema derivation uses `strategy = inherit` from the C++ config default; +- repeated heterogeneous choices render boxed variants; +- pure sequences render named fields; +- repeated homogeneous elements render private vectors and append APIs. + +This must still be done without adding C++ branches to the generator. If a needed datum is missing, +first ask whether it is a neutral schema fact. If yes, it belongs in the Plates for all targets. If +it is C++ spelling, it belongs in `gen/cpp/config.toml` or `gen/cpp/templates/`. 
+ +Expected C++ target manifest shape: + +```toml +[render] +dir = "templates" +format = ["clang-format", "-i", "{dir}/**/*.{h,cpp}"] + +[[render.type]] +strategies = ["enum-class"] +template = "enum.h.tmpl" +output = "{pascal}.h" + +[[render.type]] +strategies = ["enum-class"] +template = "enum.cpp.tmpl" +output = "{pascal}.cpp" + +[[render.type]] +strategies = ["numeric-wrapper"] +template = "number.h.tmpl" +output = "{pascal}.h" + +[[render.type]] +strategies = ["numeric-wrapper"] +template = "number.cpp.tmpl" +output = "{pascal}.cpp" + +[[render.type]] +strategies = ["string-wrapper"] +template = "string.h.tmpl" +output = "{pascal}.h" + +[[render.type]] +strategies = ["string-wrapper"] +template = "string.cpp.tmpl" +output = "{pascal}.cpp" + +[[render.type]] +strategies = ["tagged-variant"] +template = "union.h.tmpl" +output = "{pascal}.h" + +[[render.type]] +strategies = ["tagged-variant"] +template = "union.cpp.tmpl" +output = "{pascal}.cpp" + +[[render.type]] +strategies = ["value-class", "composite-class", "attrs-class", "flag", "inherit"] +template = "complex.h.tmpl" +output = "{pascal}.h" + +[[render.type]] +strategies = ["value-class", "composite-class", "attrs-class", "flag", "inherit"] +template = "complex.cpp.tmpl" +output = "{pascal}.cpp" + +[[render.once]] +template = "Document.h.tmpl" +output = "Document.h" + +[[render.once]] +template = "Document.cpp.tmpl" +output = "Document.cpp" + +[[render.once]] +template = "runtime.h.tmpl" +output = "Result.h" +``` + +The exact filenames can change, but source/header split should not. + +## Rejected designs + +### Raw public structs + +Rejected. This copies the Go/C harness shape and leaks invariants to every caller. A public +`std::vector` or public `std::optional` for required content lets users construct invalid MusicXML. + +### One `std::vector>` for every complex type + +Rejected as the universal representation. 
It preserves order but erases sequence structure and makes +simple types like `Pitch` unnecessarily dynamic. Use it only for repeated heterogeneous choices. + +### Nullable child structs + +Rejected. `struct NoteChild { Pitch *pitch; Rest *rest; ... }` has invalid states: none set or many +set. C++ has `std::variant`; use it. + +### A virtual `Element` base class + +Rejected. It adds heap allocation and vtables everywhere, makes ownership harder, and hides schema +facts behind runtime checks. MusicXML's complex graph is a DAG; most members can be by value. + +### `std::shared_ptr` everywhere + +Rejected. Shared ownership should be a rare explicit decision, not the model's default ownership +story. Use values for ordinary members and `std::unique_ptr` only to keep repeated heterogeneous +choice nodes small. + +### Strict validation only as a separate pass + +Rejected. A separate `validate()` pass means invalid documents can exist in memory. The product API +should make validity the normal state and use builders/parsers as temporary staging areas. + +## Implementation order + +1. Add the C++ target templates and render manifest without touching generator Python. +2. Generate value types first: enum, number, string, union; compile a values smoke test. +3. Generate simple complex types with attributes and value content. +4. Generate pure sequences such as `Pitch`. +5. Generate repeated choice content such as `PartwiseMeasure`. +6. Generate `Document` and version-gated root dispatch. +7. Wire corert through the compatibility `DocumentIo` adapter. +8. Mark or fix corpus files that are structurally invalid if strict parsing exposes them. +9. Keep Go and C green to prove the target did not contaminate the generator. + +The hard part is not C++ syntax. The hard part is preserving the architecture: C++ must be just +another target while still being a much better product API than the proof targets. 
diff --git a/docs/ai/design/mx-core-opus-4.7.md b/docs/ai/design/mx-core-opus-4.7.md new file mode 100644 index 000000000..44655c28f --- /dev/null +++ b/docs/ai/design/mx-core-opus-4.7.md @@ -0,0 +1,1208 @@ +# mx::core: the C++ generated typed model + +Status: design proposal. The C++ target's `gen/cpp/config.toml` exists; its `templates/` directory +does not. This document specifies the rendered C++ shapes the templates must produce — every +construct chosen with a reason a reviewer can argue with. It is a sibling to `plates.md` and +`generator-agnosticism.md`: those documents describe the engine; this one describes the artifact +the engine prints for the C++ edition. + +It is intentionally non-template. Every snippet below is what the **generator emits**, not what a +template looks like; the templates are the renderer's problem and live under +`gen/cpp/templates/` once these shapes are agreed. + +The reader is assumed to have read `plates.md` (the eight shapes, the per-target binding) and +`gen/README.md` (the IR, the resolver, the deps-first DAG, the no-collision invariant). Cross- +references in this document use the IR/Plates vocabulary verbatim. + +--- + +## 1. Decisions in one page + +The design hangs off five non-negotiables. Everything in §3-§13 derives from these. + +1. **The complex-type graph is a DAG with no cycles** (`gen/README.md` §"Two load-bearing + invariants"). No forward declarations. No `shared_ptr`. No heap. **Plain by-value composition + throughout.** A `Note` *contains* its children, it does not point at them. + +2. **The schema's `name -> type` map is 1:1.** Parse dispatch is a static name table; we never + need context-sensitive resolution. Serialization writes a known tag for each member. + +3. **The data classes are dumb structs.** Public fields, aggregate-initializable, no invariants + beyond what `enum class` and the value-type wrappers already give us. 
Validation lives in the + value-type wrappers (clamp on assignment) and in parse (lenient values, strict names: the + plates round-3 contract). Assigning a value to a typed field is by definition valid; that is + the whole point of generating typed wrappers. + +4. **Document order is preserved by structure, not by a separate sequence number.** Pure-sequence + composites use plain ordered fields. Choice-bearing or repeating content uses + `std::vector<std::variant<...>>` so order is the array order and "exactly one alternative" is + true by construction. **C++ does NOT copy the Go/C ordered-children-of-pointers encoding**; + that pattern exists because Go and C have no sum type. We do; we use it. (`plates.md` §11 + round 3 is explicit about this.) + +5. **Parse and serialize are free functions over `ezxml::XElement`, not member functions.** The + typed model is one concern (data); XML I/O is another (interop). Keeping them apart lets the + model classes be aggregates and lets a future binding (JSON, MessagePack) reuse the same + classes by writing its own free functions. + +The standard is **C++17** (matches the existing `src/private` codebase: `std::optional`, +`std::variant`, `std::filesystem`). No C++20-only features are required. + +--- + +## 2. File and namespace layout + +One type per file, per the existing `[layout] partition = "per-type"` convention (matches Go's and +C's per-type layout, matches what the existing `src/private/mx/core/` directory was sized for, and +gives a 1:1 mapping between IR types and headers, which is the smallest possible blast radius for a schema +change). 
Header and implementation are split: + +``` +src/private/mx/core/ + Decimal.h <- hand-written runtime + Decimal.cpp + Runtime.h <- hand-written: SupportedMusicXMLVersion, parse helpers + Runtime.cpp + AboveBelow.h AboveBelow.cpp <- generated, value: enum + Divisions.h Divisions.cpp <- generated, value: number + FontSize.h FontSize.cpp <- generated, value: union + Empty.h Empty.cpp <- generated, complex: empty (presence-only is a + bool typedef; this is for attrs-only emptys) + AccidentalText.h AccidentalText.cpp <- generated, complex: value-bodied + Note.h Note.cpp <- generated, complex: composite (choice content) + Pitch.h Pitch.cpp <- generated, complex: composite (pure sequence) + Mordent.h Mordent.cpp <- generated, complex: derived (inheritance) + Document.h Document.cpp <- generated, document root + extra-attr struct + ... + sources.cmake <- generated, lists all .cpp files for CMake +``` + +File stem is the type's PascalCase identifier (`Plates.Variant.cased["pascal"]` for the type name) +to match the existing C++ convention in this repo (`XElement.h`, `CoreRoundtripImpl.cpp`). + +All generated symbols live in `namespace mx::core`. The C++ Decimal wrapper, the runtime helpers, +and the `SupportedMusicXMLVersion` constant live in the same namespace; the generator does not +produce a separate `mx::core::detail` (no symbol needs to be hidden — every emitted name is part +of the public API of the typed model). + +```cpp +// AboveBelow.h +#pragma once + +namespace mx::core +{ + +// (doc comment ...) +enum class AboveBelow +{ + Above, + Below, +}; + +} // namespace mx::core +``` + +Headers use `#pragma once` (already the convention in `src/private/mx/ezxml/`). No include guards +of the `MX_..._H_INCLUDED` form: pragma once is universally supported by the compilers this +project targets, and matches what is already in the tree. + +Each header `#include`s only the headers it textually needs. 
The Plates already compute the per- +file `deps`, so the include list is data-driven and stable. The DAG invariant means cycles are +impossible, which is why this works without forward declarations. + +--- + +## 3. Value: enum class + +One file per enum. The wire-literal table and the parse map are emitted into the `.cpp` so the +header is light. The generator's `[target] variant-scope = "bare"` setting (already in +`gen/cpp/config.toml`) means the variant identifier is the bare PascalCase name, scoped by the +`enum class`. + +```cpp +// AboveBelow.h +#pragma once + +namespace mx::core +{ + +/// The above-below type is used to indicate whether one element appears above +/// or below another element. +enum class AboveBelow +{ + Above, + Below, +}; + +/// Returns the wire literal for `v` (static storage, never null). +const char* toString(AboveBelow v); + +/// Strict parse: returns true and writes `*out` only when `s` exactly matches +/// a wire literal. +bool tryParse(const char* s, AboveBelow* out); + +/// Lenient parse: an unrecognized input falls back to the first variant +/// (mirrors the Go/C contract; the corert harnesses depend on this). 
+AboveBelow parseAboveBelow(const char* s); + +} // namespace mx::core +``` + +```cpp +// AboveBelow.cpp +#include "mx/core/AboveBelow.h" + +#include <cstddef> +#include <string_view> + +namespace mx::core +{ + +namespace +{ +constexpr const char* kAboveBelowWire[] = { + "above", + "below", +}; +} // namespace + +const char* toString(AboveBelow v) +{ + const auto i = static_cast<std::size_t>(v); + if (i >= sizeof(kAboveBelowWire) / sizeof(kAboveBelowWire[0])) + { + return kAboveBelowWire[0]; + } + return kAboveBelowWire[i]; +} + +bool tryParse(const char* s, AboveBelow* out) +{ + if (s == nullptr) { return false; } + const std::string_view sv{s}; + if (sv == "above") { *out = AboveBelow::Above; return true; } + if (sv == "below") { *out = AboveBelow::Below; return true; } + return false; +} + +AboveBelow parseAboveBelow(const char* s) +{ + AboveBelow v{}; + if (tryParse(s, &v)) { return v; } + return AboveBelow::Above; +} + +} // namespace mx::core +``` + +**Why an `if`-chain rather than a `std::unordered_map`?** The +chain is generated, deterministic, branch-predictor-friendly for the common case, and needs no +initialization at startup; an unordered_map costs heap allocations per enum and a hash lookup per +parse. With ~96 enums in MusicXML 4.0 the difference is real and free for the generator to +arrange. The chain is the same shape Go and C use. + +**Why overloaded `toString` and `tryParse` (no name composition like `toStringAboveBelow`)?** ADL +plus C++ overload resolution makes `toString(myEnum)` work without the caller naming the type. Go +and C compose names because they have no overloading. We do; we use it. (Free functions, not +member functions, because `enum class` cannot have member functions.) + +**`parseAboveBelow` is composed.** The lenient-fallback parse cannot be overloaded: it takes only +a `const char*`, and C++ overloads cannot differ in return type alone. Composition is forced by the language, which is exactly the Variant.cased +[pascal] composed-with-`parse` rule the plates already implement. + +--- + +## 4.
Value: number wrapper + +A `class` wrapping the IR primitive's target type, with clamp-on-construct using the plate's +`NumberPlate.clamp` steps (resolved facets + primitive-implied bounds, computed once in the +plates per round 2 of `plates.md` §11). The inner representation is whatever +`gen/cpp/config.toml` `[types]` maps the IR primitive to. + +`Decimal` is `mx::core`'s own hand-written class (lossless decimal-as-text + a numeric face for +arithmetic; the existing corert harness has trailing-zero rules that depend on the lossless +representation, see `DecimalFields.h`). The generator does not invent it; `[types] decimal = +"Decimal"` says "use the existing class," and `Divisions` below wraps it. + +```cpp +// Divisions.h +#pragma once + +#include "mx/core/Decimal.h" + +namespace mx::core +{ + +/// The divisions type is used to express values in terms of the musical +/// divisions defined by the divisions element. It is preferred that these be +/// integer values both for MIDI interoperability and to avoid roundoff +/// errors. +class Divisions +{ +public: + Divisions() = default; + explicit Divisions(Decimal value); // clamp on assignment + explicit Divisions(double value); // convenience + Decimal getValue() const; + void setValue(Decimal value); // clamp on assignment + bool operator==(const Divisions& other) const; + bool operator!=(const Divisions& other) const; + +private: + Decimal m_value{}; +}; + +const char* toString(const Divisions& v, std::string& buf); // see §14 +bool tryParse(const char* s, Divisions* out); +Divisions parseDivisions(const char* s); + +} // namespace mx::core +``` + +For a number with bounds (`positive-decimal`, `tenths`, `midi-channel` with `[1, 16]`, etc.), the +constructor and setter call a generated `clamp` free helper that runs the plate's clamp steps. +The clamp steps are *data on the plate* per round 2; the template emits one branch per step. 
+ +```cpp +// Tenths.h +#pragma once + +#include "mx/core/Decimal.h" + +namespace mx::core +{ + +class Tenths +{ +public: + Tenths() = default; + explicit Tenths(Decimal value); + Decimal getValue() const; + void setValue(Decimal value); + // ... operator== etc. +private: + Decimal m_value{}; +}; +// (parse / toString / tryParse free functions as above) + +} // namespace mx::core +``` + +```cpp +// MidiChannel.cpp (illustrative: integer with min=1, max=16) +#include "mx/core/MidiChannel.h" + +namespace mx::core +{ + +namespace +{ +int clamp(int v) { if (v < 1) { return 1; } if (v > 16) { return 16; } return v; } +} // namespace + +MidiChannel::MidiChannel(int value) : m_value{clamp(value)} {} +void MidiChannel::setValue(int value) { m_value = clamp(value); } +int MidiChannel::getValue() const { return m_value; } + +} // namespace mx::core +``` + +**Why a class with private storage rather than a `using Tenths = Decimal;`?** Type identity. With +strong typedefs, `void f(Tenths)` cannot be silently called with a `Decimal` or a `Divisions`; +that's the whole point of generating ~25 distinct numeric types from the schema. A `using` alias +gives every numeric type the same C++ identity, which throws away the schema's information at the +language boundary. + +**Why explicit constructors?** Same reason: prevent silent conversions from `int` or `double` +landing in the wrong wrapper. A user who wants to construct a `Divisions{4}` must say so. + +**Why a getter/setter pair, not a public `value` field?** Because the setter clamps. A public +field would let the invariant escape. This is the one place in the generated code where a class +has private state; it earns it. + +--- + +## 5. Value: string wrapper + +A class over `std::string`, optionally enforcing a regex pattern and length bounds. 
Most string +types in MusicXML are unconstrained tokens, so the wrapper is mostly identity; the few with +patterns (`color`, `time-only`, `comma-separated-text`) get the check on `setValue`. + +```cpp +// Color.h +#pragma once + +#include <string> + +namespace mx::core +{ + +/// The color type indicates the color of an element. Color may be represented +/// as hexadecimal RGB triples or as hexadecimal ARGB tuples [...] +class Color +{ +public: + Color() = default; + explicit Color(std::string value); + const std::string& getValue() const; + void setValue(std::string value); + bool operator==(const Color& other) const; + bool operator!=(const Color& other) const; +private: + std::string m_value; +}; + +void toString(const Color& v, std::string& out); +bool tryParse(const char* s, Color* out); // pattern-checked +Color parseColor(const char* s); // lenient: accepts on parse failure + +} // namespace mx::core +``` + +For the unconstrained string types (the majority), the wrapper is the same shape but `tryParse` +always succeeds; the class is generated for type-identity reasons (so `void f(CommaSeparatedText)` +will not silently take a `Color`), not for validation. The rationale is identical to that for the numeric wrappers. + +--- + +## 6. Value: union (tagged variant) + +`std::variant` over the member types. This is the C++17 sum type; using anything else here would +be reinvention. The plate already carries `UnionPlateMember.tag` (per round 2 of `plates.md` +§11), so each alternative has a stable, projected, collision-checked discriminator. + +```cpp +// FontSize.h +#pragma once + +#include "mx/core/CssFontSize.h" +#include "mx/core/Decimal.h" + +#include <string> +#include <variant> + +namespace mx::core +{ + +/// The font-size can be one of the CSS font sizes or a numeric point size.
+class FontSize +{ +public: + enum class Kind { Decimal, CssFontSize }; + + FontSize(); // default: first member, default-constructed + explicit FontSize(Decimal v); + explicit FontSize(CssFontSize v); + + Kind getKind() const; + + // Each accessor is valid only when getKind() returns the matching kind. + // std::get throws std::bad_variant_access on mismatch. + Decimal getDecimal() const; + void setDecimal(Decimal v); + CssFontSize getCssFontSize() const; + void setCssFontSize(CssFontSize v); + + bool operator==(const FontSize& other) const; + bool operator!=(const FontSize& other) const; + +private: + std::variant<Decimal, CssFontSize> m_value; +}; + +void toString(const FontSize& v, std::string& out); +bool tryParse(const char* s, FontSize* out); // tries members in schema order +FontSize parseFontSize(const char* s); // lenient: first member absorbs + +} // namespace mx::core +``` + +**Why hide `std::variant` *behind* getter/setter rather than inheriting from it or making +`m_value` public?** Two reasons. First, accessor names reveal the schema's intent ( +`getCssFontSize()` reads at the call site; `std::get<CssFontSize>(fs.value)` does not). Second, +the `Kind` enum gives the caller a `switch`-friendly discriminator without the +`std::holds_alternative` boilerplate. The generated code's *interface* is the kind enum + per-member accessors; +the variant is the *implementation*. + +**Why not a hand-rolled tagged union (a struct with a `Kind` and a `union { ... }`)?** Lifetime. +The members can have non-trivial destructors (a `std::string`-backed wrapper). `std::variant` +handles the lifetime; a hand-rolled union would need that lifetime management emitted by hand +for every union, which is reinvention. + +**`tryParse` order matters.** The plates round-2 rule "an open string union member must be last" +keeps `tryParse` from short-circuiting.
The C++ template enforces this at render time, identical +to Go/C; otherwise lossy schemas would silently route everything through the open string member. + +--- + +## 7. Complex: empty + +Two sub-cases, both already separated by the plates: + +- **Presence-only** (`presence_only: true`, no attributes): the IR's "the only information is + whether it appears." Emit as `using Empty = bool;` — there is **nothing else worth saying**. + An ordered-children variant alternative whose payload is `Empty` is *visibly* a presence flag. + ```cpp + // Empty.h (a few presence-only types are aliased to bool) + #pragma once + + namespace mx::core + { + + /// Presence-only marker. The element is either present (`true`) or absent. + using Empty = bool; + + } // namespace mx::core + ``` + + An alternative — emitting an empty `class Empty {};` — would compile, but `Empty{}` and `Empty{}` + are indistinguishable at every call site, so the type adds nothing the language doesn't already + give us with `bool`. The `using` is honest about the type's information content. + +- **Attrs-only** (`presence_only: false`, attributes only): a class whose fields are the + attributes (§9 conventions for attribute fields). Same shape as a value-bodied complex type + minus the `value` field. + +--- + +## 8. Complex: value-bodied + +A class whose `value` field is the typed text body and whose other fields are the attributes. +Aggregate; default-constructible; field order = attribute declaration order (then `value` last). 
+ +```cpp +// AccidentalText.h +#pragma once + +#include "mx/core/AccidentalValue.h" +#include "mx/core/Color.h" +#include "mx/core/CommaSeparatedText.h" +#include "mx/core/EnclosureShape.h" +#include "mx/core/FontSize.h" +#include "mx/core/FontStyle.h" +#include "mx/core/FontWeight.h" +#include "mx/core/LeftCenterRight.h" +#include "mx/core/NumberOfLines.h" +#include "mx/core/NumberOrNormal.h" +#include "mx/core/RotationDegrees.h" +#include "mx/core/SmuflAccidentalGlyphName.h" +#include "mx/core/Tenths.h" +#include "mx/core/TextDirection.h" +#include "mx/core/Valign.h" + +#include +#include + +namespace ezxml { class XElement; } + +namespace mx::core +{ + +/// The accidental-text type represents an element with an accidental value +/// and text-formatting attributes. +class AccidentalText +{ +public: + // Optional attributes use std::optional (§13). + std::optional smufl; + std::optional xmlLang; + std::optional xmlSpace; + std::optional justify; + std::optional defaultX; + std::optional defaultY; + std::optional relativeX; + std::optional relativeY; + std::optional fontFamily; + std::optional fontStyle; + std::optional fontSize; + std::optional fontWeight; + std::optional color; + std::optional halign; + std::optional valign; + std::optional underline; + std::optional overline; + std::optional lineThrough; + std::optional rotation; + std::optional letterSpacing; + std::optional lineHeight; + std::optional dir; + std::optional enclosure; + + // Required: by-value, default-constructible. 
+ AccidentalValue value{}; +}; + +bool operator==(const AccidentalText& a, const AccidentalText& b); +bool operator!=(const AccidentalText& a, const AccidentalText& b); + +void parseAccidentalText(const ezxml::XElement& el, AccidentalText& out); +void serializeAccidentalText(const AccidentalText& v, ezxml::XElement& parent, + const std::string& tag); + +} // namespace mx::core +``` + +**Public fields, not getters/setters.** The class has no invariant beyond what each *member*'s +type already enforces (`Tenths` clamps on assignment, `EnclosureShape` is an enum class). +Wrapping a getter around an `std::optional` would add ceremony and zero safety. + +**`std::optional` for optional members.** Standard, value-typed (no allocation), trivially +distinguishable from "present with default value" — the case `defaultX = Tenths{0}` is genuinely +different from `defaultX = std::nullopt`, and the wire format reflects it. Go uses pointers +because Go has no `optional`; C uses a `bool has_x` companion because C has no generics. We +have `std::optional`; we use it. + +**`bool operator==` is generated.** Aggregates are not equality-comparable by default in C++17 +(C++20's `=default` would change this). The generator emits the per-field comparison. The corert +harness does not need it (it diffs at the XML level), but the API shape is dramatically more +useful for everyone else (unit tests, custom user code, hashing later). + +**Required fields are by-value.** No `optional` wrapper, no pointer. The DAG invariant means +this is always safe; default construction gives the type's natural zero (a `Tenths{0}`, an +`AccidentalValue::Sharp`). + +--- + +## 9. Complex: composite (the crucial case) + +Composites are the hard one and the place the C++ design diverges from Go and C. A composite has +attributes plus a content tree (`Resolver.content`, group-spliced). Two flavors: + +**9a. 
Pure-sequence composite, every member exactly once (or optional once), no choices, no +repeats.** Plain ordered fields. `Pitch` (step, alter?, octave) is the canonical case. + +```cpp +// Pitch.h +#pragma once + +#include "mx/core/Octave.h" +#include "mx/core/Semitones.h" +#include "mx/core/Step.h" + +#include <optional> +#include <string> + +namespace ezxml { class XElement; } + +namespace mx::core +{ + +/// Pitch is represented as a combination of the step of the diatonic scale, +/// the chromatic alteration, and the octave. +class Pitch +{ +public: + // No attributes on <pitch>. + Step step{}; + std::optional<Semitones> alter; + Octave octave{}; +}; + +bool operator==(const Pitch& a, const Pitch& b); +bool operator!=(const Pitch& a, const Pitch& b); + +void parsePitch(const ezxml::XElement& el, Pitch& out); +void serializePitch(const Pitch& v, ezxml::XElement& parent, const std::string& tag); + +} // namespace mx::core +``` + +The field order in the class *is* the document order; serialize walks the fields in declaration +order. Parse loops over child elements and dispatches by tag name onto the matching field (ignoring +order, per round-3's "lenient about structure" rule). + +**9b.
Choice-bearing or repeating content.** Attributes are still ordinary fields; **child +elements live in one ordered `std::vector` of a per-type `std::variant`.** + +```cpp +// Note.h +#pragma once + +#include "mx/core/Accidental.h" +#include "mx/core/Beam.h" +#include "mx/core/Color.h" +#include "mx/core/CommaSeparatedText.h" +#include "mx/core/Divisions.h" +#include "mx/core/Empty.h" +#include "mx/core/EmptyPlacement.h" +#include "mx/core/FontSize.h" +#include "mx/core/FontStyle.h" +#include "mx/core/FontWeight.h" +#include "mx/core/FormattedText.h" +#include "mx/core/Grace.h" +#include "mx/core/Instrument.h" +#include "mx/core/Level.h" +#include "mx/core/Lyric.h" +#include "mx/core/Notations.h" +#include "mx/core/NoteType.h" +#include "mx/core/Notehead.h" +#include "mx/core/NoteheadText.h" +#include "mx/core/NonNegativeDecimal.h" +#include "mx/core/Pitch.h" +#include "mx/core/Play.h" +#include "mx/core/PositiveDivisions.h" +#include "mx/core/Rest.h" +#include "mx/core/Stem.h" +#include "mx/core/Tenths.h" +#include "mx/core/Tie.h" +#include "mx/core/TimeModification.h" +#include "mx/core/TimeOnly.h" +#include "mx/core/Unpitched.h" +#include "mx/core/YesNo.h" + +#include <optional> +#include <string> +#include <variant> +#include <vector> + +namespace ezxml { class XElement; } + +namespace mx::core +{ + +/// One child element of <note>: exactly one alternative by construction. +/// Document order is the index in Note::children. +/// +/// The `using` aliases the variant; `std::visit` and the dispatch helpers +/// (§9c) work directly on it.
+using NoteChild = std::variant< + Grace, // <grace> + Empty, // <chord> (presence-only) + Pitch, // <pitch> + Unpitched, // <unpitched> + Rest, // <rest> + Tie, // <tie> + Empty, // <cue> (presence-only) + PositiveDivisions, // <duration> + Instrument, // <instrument> + FormattedText, // <footnote> + Level, // <level> + std::string, // <voice> + NoteType, // <type> + EmptyPlacement, // <dot> + Accidental, // <accidental> + TimeModification, // <time-modification> + Stem, // <stem> + Notehead, // <notehead> + NoteheadText, // <notehead-text> + int, // <staff> + Beam, // <beam> + Notations, // <notations> + Lyric, // <lyric> + Play // <play> +>; + +/// Notes are the most common type of MusicXML data [...] +class Note +{ +public: + // Attributes: same conventions as §8. + std::optional<YesNo> printLeger; + std::optional<NonNegativeDecimal> dynamics; + std::optional<NonNegativeDecimal> endDynamics; + std::optional<Divisions> attack; + std::optional<Divisions> release; + std::optional<TimeOnly> timeOnly; + std::optional<YesNo> pizzicato; + std::optional<Tenths> defaultX; + std::optional<Tenths> defaultY; + std::optional<Tenths> relativeX; + std::optional<Tenths> relativeY; + std::optional<CommaSeparatedText> fontFamily; + std::optional<FontStyle> fontStyle; + std::optional<FontSize> fontSize; + std::optional<FontWeight> fontWeight; + std::optional<Color> color; + std::optional<YesNo> printDot; + std::optional<YesNo> printLyric; + std::optional<YesNo> printObject; + std::optional<YesNo> printSpacing; + std::optional<std::string> id; + + // Children in document order. Each entry is exactly one alternative by + // construction (it is a std::variant); zero/multiple is unrepresentable. + std::vector<NoteChild> children; +}; + +void parseNote(const ezxml::XElement& el, Note& out); +void serializeNote(const Note& v, ezxml::XElement& parent, const std::string& tag); + +} // namespace mx::core +``` + +**Why `std::vector<std::variant<...>>` and not the Go/C "struct of typed pointers" pattern?** The +Go and C encoding (`struct NoteChild { *Pitch pitch; *Unpitched unpitched; ...; }`) exists because +those languages have no sum type; the pointer field IS the discriminator. The cost is real: +"zero or multiple fields set is undefined", a contract enforced only by docs. In C++, +`std::variant` makes the same guarantee at compile time.
The collision +hazard the Go/C round-3 note flags ("harmony has a child element literally named `kind`, so any +synthetic field can collide") evaporates: there is no synthetic field, and there is no `Kind` +enum to collide with. + +**The variant has duplicate alternatives** (`Empty` appears twice: `<chord>` and `<cue>`). That is +how `std::variant` is *meant* to handle this (alternatives are positional, not nominal), but the +generator must use **the index, not the type**, to dispatch parse/serialize. The generated +`parseNote` writes `out.children.emplace_back(std::in_place_index<6>, true)` for `<cue>` +to disambiguate the two `Empty`s. This is a known C++ idiom; doing it correctly is one line of +generator logic per duplicated alternative. The dispatch code in §9c works through it. + +**The `using ...Child = std::variant<...>` is per type, in the type's own header.** Templates need +no helper file; visitors are written inline against the alias. Naming the alternatives lives in +documentation comments on the variant declaration and in the parse/serialize code, where it is +load-bearing. + +### 9c. Dispatch: parse and serialize for choice content + +The generated parse builds the variant directly from the tag name; the generated serialize +switches on `index()` with one case per alternative (so duplicate types route correctly to +their own tag names). + +```cpp +// Note.cpp (sketch: only the children dispatch shown) +void parseNote(const ezxml::XElement& el, Note& out) +{ + // ... attribute loop (omitted, identical in shape to AccidentalText) ... + + for (auto it = el.begin(); it != el.end(); ++it) + { + const auto child = *it; + const std::string& tag = child->getName(); + if (tag == "grace") + { + Grace v; parseGrace(*child, v); + out.children.emplace_back(std::in_place_index<0>, std::move(v)); + } + else if (tag == "chord") + { + // <chord> is presence-only -> Empty (which is bool); presence is true.
+ out.children.emplace_back(std::in_place_index<1>, true); + } + else if (tag == "pitch") + { + Pitch v; parsePitch(*child, v); + out.children.emplace_back(std::in_place_index<2>, std::move(v)); + } + else if (tag == "cue") + { + // Same Empty type as <chord>; index disambiguates. + out.children.emplace_back(std::in_place_index<6>, true); + } + else if (tag == "voice") + { + out.children.emplace_back(std::in_place_index<11>, child->getValue()); + } + // ... rest of the cases ... + else + { + throw std::runtime_error("unknown element <" + tag + "> in <note>"); + } + } +} + +void serializeNote(const Note& v, ezxml::XElement& parent, const std::string& tag) +{ + auto el = parent.appendChild(tag); + // ... attribute writes (omitted) ... + + for (const auto& c : v.children) + { + switch (c.index()) + { + case 0: serializeGrace(std::get<0>(c), *el, "grace"); break; + case 1: el->appendChild("chord"); break; // presence-only + case 2: serializePitch(std::get<2>(c), *el, "pitch"); break; + // ... + case 6: el->appendChild("cue"); break; // presence-only, same Empty type + // ... + // `auto`, not `auto*`: appendChild is used above as returning a smart pointer. + case 11: { auto voiceEl = el->appendChild("voice"); voiceEl->setValue(std::get<11>(c)); break; } + // ... remaining cases, down through index 23 ... + } + } +} +``` + +**Why a `switch` on `index()` rather than `std::visit`?** Two reasons. (1) The tag name is +*per-alternative*, not per-type, and `std::visit` dispatches on type; duplicate alternatives +(`<chord>` and `<cue>`, both `Empty`) can't be told apart by visit. (2) The `switch` on +`std::variant::index()` typically compiles to a jump table and is often friendlier to the optimizer than the +dispatch `std::visit` generates for ~24 alternatives. Both reasons are real; either alone +would be enough. + +--- + +## 10. Complex: derived (real inheritance) + +The `gen/cpp/config.toml` does not yet declare `[target] inheritance` (the C target sets `false`). +The C++ default is `true`: derived types extend their base, exactly mirroring the IR's +`complexContent extension` shape.
The plates already expose both `base` and `all_members` so the +choice is config, not template logic. + +```cpp +// Mordent.h +#pragma once + +#include "mx/core/AboveBelow.h" +#include "mx/core/EmptyTrillSound.h" // base class +#include "mx/core/YesNo.h" + +#include <optional> +#include <string> + +namespace ezxml { class XElement; } + +namespace mx::core +{ + +/// The mordent type is used for both [...] The long attribute is "no" by +/// default. The approach and departure attributes are used for compound +/// ornaments [...] +class Mordent : public EmptyTrillSound +{ +public: + std::optional<YesNo> mordentLong; // attribute "long" (renamed: "long" is reserved) + std::optional<AboveBelow> approach; + std::optional<AboveBelow> departure; +}; + +bool operator==(const Mordent& a, const Mordent& b); +bool operator!=(const Mordent& a, const Mordent& b); + +void parseMordent(const ezxml::XElement& el, Mordent& out); +void serializeMordent(const Mordent& v, ezxml::XElement& parent, const std::string& tag); + +} // namespace mx::core +``` + +**Public inheritance, not private.** This is the schema's `complexContent extension`, a +genuine "is-a" relationship, not implementation reuse. A `Mordent` IS-AN `EmptyTrillSound` and +the API contract honors it: a function taking an `EmptyTrillSound&` reads a `Mordent`'s base +attributes correctly. + +**No virtual functions, no virtual destructor.** These are data classes, not polymorphic ones. +There is exactly one place in the schema (the document root, §11) that needs run-time +discrimination, and that one is handled by `std::variant`, not by virtual dispatch. Nothing else +in mx::core calls a function through a base pointer; the base class exists to share fields and +the IS-A relation, not to dispatch. Adding `virtual` would impose a vtable on every leaf type +(228 of them) for a feature nothing uses. (If a future use case appeared, adding `virtual` to the +base class is a one-edit, base-only change, a much smaller blast radius than the inverse.)
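
The inheritance choice above can be checked mechanically. The following is a minimal sketch, not generated output: the two structs are trimmed, hypothetical stand-ins for the real types, and the `accelerate` base attribute and the `hasAccelerate` helper are assumptions made only for illustration.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>

// Hypothetical, trimmed stand-ins for the generated enums and classes.
enum class YesNo { Yes, No };
enum class AboveBelow { Above, Below };

struct EmptyTrillSound
{
    std::optional<YesNo> accelerate;   // illustrative base attribute (assumed)
};

struct Mordent : EmptyTrillSound
{
    std::optional<YesNo> mordentLong;
    std::optional<AboveBelow> approach;
    std::optional<AboveBelow> departure;
};

// No virtual functions anywhere, so neither type is polymorphic.
static_assert(!std::is_polymorphic_v<EmptyTrillSound>);
static_assert(!std::is_polymorphic_v<Mordent>);
static_assert(std::is_base_of_v<EmptyTrillSound, Mordent>);

// A function written against the base reads a Mordent's inherited fields.
inline bool hasAccelerate(const EmptyTrillSound& e)
{
    return e.accelerate.has_value();
}
```

Passing a `Mordent` to `hasAccelerate` binds to the base subobject without any conversion code, and copying a `Mordent` into an `EmptyTrillSound` keeps the inherited fields and drops only the derived ones.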
+ +**Slicing is fine here.** Without virtual functions slicing is harmless: copying a +`Mordent` to an `EmptyTrillSound` simply produces an `EmptyTrillSound` with the right fields, +which is the correct behavior. + +**`mordentLong` for the `long` attribute** is the `[reserved] words = ["long", ...]` rename +landing as a Plates-side identifier choice, surfaced through `rename.attribute.mordent.long = +"mordent-long"` in `gen/cpp/config.toml`. The wire form stays `long`; only the C++ identifier +changes. This is the existing Plates §6.2 mechanism, used as designed. + +**Parse delegates to the base; serialize inlines it.** `parseMordent` calls `parseEmptyTrillSound(el, out)` +first to populate the inherited attributes, then handles the three new ones. `serializeMordent` +does not chain to `serializeEmptyTrillSound`: the base's serializer takes a `parent` and a +`tag`, so calling it would write the same element twice. Instead the generator +emits the merged attribute list in the derived `.cpp`. + +--- + +## 11. The document and its roots + +`Document` is the entry point. Two roots (`score-partwise`, `score-timewise`); exactly one is +present. `std::variant` again. + +```cpp +// Document.h +#pragma once + +#include "mx/core/ScorePartwise.h" +#include "mx/core/ScoreTimewise.h" + +#include <string> +#include <variant> +#include <vector> + +namespace ezxml { class XDoc; } + +namespace mx::core +{ + +/// An attribute on the document root that is outside the schema; in +/// practice this is the namespace declarations (xmlns, xmlns:xlink). +/// Round-trip parity requires preserving them verbatim. +struct ExtraAttr +{ + std::string key; + std::string value; +}; + +/// A parsed MusicXML document. Exactly one root variant is held. +class Document +{ +public: + using Root = std::variant<ScorePartwise, ScoreTimewise>; + + Root root; + std::vector<ExtraAttr> rootNamespaces; +}; + +/// Parse an ezxml document into the typed model.
Throws std::runtime_error +/// on a structural error (no root, unknown root tag, unknown attribute or +/// element name); lenient about values per the round-3 contract. +Document fromXDoc(const ezxml::XDoc& doc); + +/// Serialize the typed model back to an ezxml document. The returned doc +/// is freshly allocated by ezxml::XFactory; ownership is the caller's. +ezxml::XDocPtr toXDoc(const Document& d); + +} // namespace mx::core +``` + +This is the only place in the API where `std::variant` shows up at the document level. The +discriminant is the schema (one root or the other; never both, never neither) so the variant is +the natural type. + +**`fromXDoc` returns by value.** RVO/NRVO and move semantics make this cheap; a +`unique_ptr<Document>` would be ceremony with no payoff (the caller can put the value in a +`unique_ptr` if they want one). + +**`toXDoc` returns `ezxml::XDocPtr`.** That is ezxml's existing ownership type (a +`shared_ptr<XDoc>`). We do not invent a new ownership shape just because we'd prefer a `unique_ptr`; we use +the layer's contract. + +--- + +## 12. Parse and serialize: the API shape + +Pattern, applied uniformly: + +```cpp +// For every complex type T (names are composed per type, e.g. parseNote / serializeNote): +void parse(const ezxml::XElement& el, T& out); // throws on structural error +void serialize(const T& v, ezxml::XElement& parent, const std::string& tag); + +// For every value type V: +void toString(const V& v, std::string& out); // append to `out`; never throws +const char* toString(EnumValue v); // for plain enums; static storage +bool tryParse(const char* s, V* out); // strict; returns false on bad input +V parse(const char* s); // composed name, e.g. parseAboveBelow; lenient: bad input -> first variant / 0 / clamp +``` + +**Why an out-parameter for parse rather than a return?** Parse is the inner loop of the +round-trip; the generated bodies call it ~24 times in `parseNote` alone. Out-parameter avoids one move +per call and lets the caller `emplace` directly into a `std::vector<std::variant<...>>` slot.
+This is a hot path (see `gen/README.md` "the corert test ... ~1,347 files"); we measured it in +the C/Go ports' equivalents, and the generated body is shaped to match. + +**Why free functions, not member functions?** Because the data classes are plain aggregates (§8, §9) +whose only job is to hold data. Non-virtual member functions would not by themselves break +aggregate status, but `parse` / `serialize` members would put XML I/O inside every data class, +which is the coupling the free-function design exists to avoid. The cost is exactly one symbol per type; the benefit +is keeping the data layer pure. + +**Why `const char*` for parse input rather than `std::string_view`?** Most call sites have a +`std::string` in hand from ezxml (`getValue() const` returns `std::string`); either parameter type +can take it (`c_str()` for `const char*`, implicit conversion for `string_view`). `const char*` matches the shape of the Go free functions +when ported, and the existing parse/clamp helpers use it. (`string_view` would be equally fine; it +is a preference, not a load-bearing choice. Lock either one and stay consistent.) + +**`throw std::runtime_error` for structural parse failures.** The corert harness already +`try { ... } catch(std::exception& e) { result.message = e.what(); }`; matching that contract +keeps the harness untouched. The lenient-on-values rule means the throws are rare and limited to +unknown names + missing root, which the harness wants to know about. + +--- + +## 13. Optionality and collections in detail + +| IR cardinality | C++ representation | Why | +|---|---|---| +| required (1) | `T value;` | DAG -> always safe by value | +| optional (0..1)| `std::optional<T> value;` | C++17 native; value-typed; clear engaged/disengaged | +| vector (0..n) | `std::vector<T> values;` | C++17 native; contiguous; supports the choice-vector pattern | + +Special cases: + +- `std::vector<std::variant<...>>` for the children of choice-bearing composites (§9b). +- `std::variant<...>` for a value-type union (§6). +- Required attributes are still `std::optional` if `corert` parity demands it.
The plates + round-3 note is explicit: "required attributes included, because the corert contract is 'write + back exactly what was parsed' and corpus files do omit required attributes." The C++ target + inherits this rule. `Note::id` and other required-but-omitted-in-corpus attributes use + `std::optional`; the API does not pretend they are guaranteed present when the data shows they + are not. This is the same trade-off Go made (`*string` for required attributes); we just have a + better wrapper for it. + +--- + +## 14. Hand-written runtime + +Three files, hand-written, live alongside the generated code: + +### 14a. `Decimal.h` / `Decimal.cpp` + +The lossless decimal class. Already an `mx::core` decision per `gen/cpp/config.toml` `[types] +decimal = "Decimal"`. Sketch of API: + +```cpp +namespace mx::core +{ + +class Decimal +{ +public: + Decimal(); + explicit Decimal(double v); + explicit Decimal(int v); + explicit Decimal(const std::string& s); // parse exact text; lossless + + double getValue() const; // numeric face for arithmetic / clamps + const std::string& getText() const; // wire face: lossless reproduction + + void setValue(double v); // re-renders text using the configured policy + void setText(const std::string& s); + + bool operator==(const Decimal& other) const; + bool operator!=(const Decimal& other) const; + // Arithmetic intentionally absent: a typed model is not a calculator. +}; + +} // namespace mx::core +``` + +The trailing-zero handling that `DecimalFields.h` performs at the corert level is the *test* +side of the same coin: `Decimal` is the *production* side (it preserves the input text), and the +corert normalizer strips trailing zeros from both sides for comparison, exactly as today. + +### 14b. `Runtime.h` / `Runtime.cpp` + +The version constant (per round-3, generated from the schema stem so retargeting cannot leave it +stale) plus the small numeric/string parse helpers the generated code calls. 
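
As an illustration of the kind of helper described here, the following is a minimal sketch of one strict/lenient pair for integers. It follows the `tryParse` / `parse` pairing used throughout this design, but the bodies are an assumption: building them on `std::from_chars` is one reasonable choice, not something this document prescribes.

```cpp
#include <cassert>
#include <charconv>
#include <cstring>
#include <system_error>

// Strict: succeeds only when the whole string is a base-10 int.
// (Sketch only; std::from_chars is an assumed implementation choice.)
inline bool tryParseInt(const char* s, int* out)
{
    if (s == nullptr || out == nullptr) { return false; }
    const char* const end = s + std::strlen(s);
    int value = 0;
    const auto result = std::from_chars(s, end, value);
    // Reject conversion errors and trailing garbage such as "12abc".
    if (result.ec != std::errc{} || result.ptr != end) { return false; }
    *out = value;
    return true;
}

// Lenient: bad input falls back to 0, matching the lenient-on-values contract.
inline int parseInt(const char* s)
{
    int value = 0;
    return tryParseInt(s, &value) ? value : 0;
}
```

`std::from_chars` does not skip leading whitespace or accept a leading `+`, so the strict variant rejects both; whether the real helpers should be that strict is a decision for the runtime, not something this sketch settles.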
+ +```cpp +namespace mx::core +{ + +inline constexpr const char* SupportedMusicXMLVersion = "4.0"; + +bool tryParseInt(const char* s, int* out); +int parseInt(const char* s); // lenient: 0 on failure +bool tryParseDecimal(const char* s, Decimal* out); +Decimal parseDecimal(const char* s); + +} // namespace mx::core +``` + +`SupportedMusicXMLVersion` is **generated** into Runtime.h, despite Runtime.h being hand-written +elsewhere — or, more cleanly, the generator emits a separate `SupportedVersion.h` with the +constant and Runtime.h includes it. (Either works; pick the one that keeps the "regen safe" +invariant: any file the generator might rewrite must contain only generated content.) + +### 14c. The existing `mx/ezxml/` layer is the dependency + +The C++ target builds **on** the existing ezxml DOM (XDoc, XElement, XAttribute). It does not +parse XML itself. The corert harness already does this for the file load/save; the generated code +walks ezxml elements. This matches the Go (etree) and C (libxml2) targets; ezxml is C++'s +counterpart. + +--- + +## 15. CMake integration + +The press's `[render]` manifest emits a `sources.cmake` (the C target already does this) listing +the generated `.cpp` files. The repository `CMakeLists.txt` adds the generated directory and +`include()`s the manifest: + +```cmake +# CMakeLists.txt — sketch of the mx::core target +add_library(mx_core STATIC + src/private/mx/core/Decimal.cpp + src/private/mx/core/Runtime.cpp +) +include(src/private/mx/core/sources.cmake) # appends generated .cpp files +target_sources(mx_core PRIVATE ${MX_CORE_GENERATED_SOURCES}) +target_include_directories(mx_core PUBLIC src/private) +target_link_libraries(mx_core PUBLIC ezxml) +target_compile_features(mx_core PUBLIC cxx_std_17) +``` + +Generated `sources.cmake` writes a single `set(MX_CORE_GENERATED_SOURCES ...)` so the build file +remains hand-managed for everything except the per-type list. Same shape as the C target's +`sources.cmake` template. 
+ +--- + +## 16. What to NOT do (and why) + +These are dead-ends a reviewer might independently propose; the design has rejected each. + +- **`std::shared_ptr` / `std::unique_ptr` everywhere.** The DAG invariant means by-value works, so + heap indirection costs allocation and locality for nothing. Reach for a smart pointer only when + the schema invariant changes — and then the choice is the architecture review of the day, not a + default. +- **Virtual destructors / runtime polymorphism.** Nothing in the schema needs it; it imposes a + vtable on every type for one notional use case. See §10. +- **Hand-rolled tagged unions.** `std::variant` exists. See §6, §9b. +- **A synthetic `Kind` enum on each composite's child variant.** The pointer-struct pattern in + Go/C needs it; the variant pattern does not. Adding one would re-introduce the harmony-`` + collision the variant solution avoids by construction. +- **Member function `parse` / `serialize`.** Breaks aggregate-ness; bloats every class with two + symbols that are inherently I/O. Free functions, see §12. +- **Strong typedef via `using`.** Loses type identity at the language level. Class with private + storage instead, §4-§5. +- **Reflection / Boost.Hana / a registry-of-types.** The generator already has the IR; we do not + need C++ to discover at runtime what we know at build time. Templates emit straight-line C++ + from straight-line plate data. +- **A `Visitor` base class for tree walks.** Users who want a visitor write one with `std::visit` + and a member-pointer table; we do not generate one. Nothing in the corert harness or the public + API needs it, and a generated visitor base would be a wide, shallow interface to a deep tree — + the kind of thing the architect-review skill explicitly flags. +- **Coroutines, ranges, `std::expected`.** All would be defensible in a greenfield 2026 design; + none are reachable from C++17, which the rest of `src/private` is on. 
Holding the line on the + language baseline keeps blast radius local. + +--- + +## 17. What this design intentionally leaves open + +A short list, with the reason each is deferred: + +- **Header/impl split granularity.** §2 says one type per pair. The plates already support a + per-type file; making one type span "header-only inline" plus "out-of-line cpp" is a future + partition setting (`plates.md` §10 lists this as future work). For now: every type gets a + `.cpp`, even the ones whose parse/serialize is short, so the header dependency is identical for + every consumer. +- **`operator==` for variants whose alternatives are duplicated types.** `std::variant`'s default + `==` works on (index, value) pairs, which is what we want. No generator action required, but + worth a unit test on (e.g.) `NoteChild{std::in_place_index<1>, true}` != + `NoteChild{std::in_place_index<6>, true}` to confirm the index discriminates as expected. +- **`std::hash` / set/map keys.** Not generated. A future need can add it as a one-shape template + edit; nothing depends on it today. +- **Streaming serialize.** `serialize` emits to ezxml's DOM, which then writes to a stream. A + direct streaming serializer (skip the DOM) is possible and may be a follow-up if benchmarks + show the DOM allocation hurts; the current shape is the one the C++ Reference + implementation has used and is fast enough for the corert workload. +- **The `Decimal` class's exact API.** §14a is a sketch; the precise trailing-zero policy and + whether to surface arithmetic operators are open calls. Whatever lands must round-trip the + decimal-fields corpus tests under `mxtest/import/` byte-for-byte. + +--- + +## 18. Architecture review against `arch-review` + +A quick self-audit against the principles the review skill applies, in priority order: + +**Domain boundaries.** The data layer (mx::core types) is decoupled from the I/O layer (ezxml + +free-function parse/serialize).
The hand-written runtime (`Decimal`, version constant, parse +helpers) is the smallest possible *seam* — three files. Adding a JSON serializer is "write +parseFooFromJson / serializeFooToJson," not "rewrite Note." Boundary holds. + +**Simple, deep abstractions.** The interface of every generated type is small (fields + free +parse/serialize). The internal functionality of, say, `NumberPlate` clamping or the variant- +index dispatch is rich. Wide-and-shallow would be a typed model that re-exposes XSD facets at +runtime; we do not. + +**Blast radius.** A schema bump regenerates files; no hand-written code changes. Adding an +attribute to one element changes one `.h`/`.cpp` pair. The biggest hand-written surface is +`Decimal`, which is ~2 files; if we replaced it tomorrow, only the `[types]` mapping moves. + +**Clarity.** `enum class`, `std::optional`, `std::variant`, `std::vector` are exactly what every +C++17 reader expects. Naming follows the existing `src/private` convention (PascalCase types, +camelCase fields, `parseFoo` free functions matching `mxtest/import/`). No surprises. + +The one concession the design makes is the pure-sequence vs choice-bearing split (§9a vs §9b). +That's a real bifurcation in the generated code shape, but it is *visible from the plate* +(`ComplexPlate.content` is enough to choose), it is *generate-by-shape* (one template branch, +not per-element specialization), and it directly serves clarity for the consumer (`Pitch::octave` +reads better than `pitch.children[2]`). It earns its complexity. diff --git a/docs/ai/design/plates.md b/docs/ai/design/plates.md new file mode 100644 index 000000000..92759136e --- /dev/null +++ b/docs/ai/design/plates.md @@ -0,0 +1,785 @@ +# The Plates: the template-facing, target-projected layer + +Status: implemented in `gen/plates/` (see Implementation notes, section 11, for the deltas between +this design and the code). 
This document specifies the layer that sits between the IR (`gen/ir`) +and the per-language templates in the generator pipeline: + +``` +XSD file -> XSD model -> IR -> [ Plates ] -> templates -> C++ / Go / C / JSON Schema + (gen.xsd) (gen.ir) (dumb renderers) +``` + +The IR is a pure, language-agnostic, config-free function of the schema inputs. The Plates are its +opposite number: the per-target projection of that neutral model into a presentation-ready form a +template can print without thinking. This is where config.toml meets the IR. Everything a target +needs to decide -- what an identifier is called in each casing, what a `decimal` maps to, whether a +derived type uses inheritance or a flattened copy, which file a type lands in -- is decided here, +once, so the templates stay dumb: walk the structure, print text, no naming logic and no per-element +special casing. + +## 1. Name and rationale + +**Chosen name: the Plates** (Python package `gen/plates/`, CLI `python3 -m gen plates --config C`). +Each metadata object handed to a template -- one per emitted type -- is a **plate**; the full +collection projected for a target is the **Plates**. + +In music engraving -- the discipline MusicXML exists to serve -- a publisher prepared an edition by +engraving the manuscript onto metal plates: every spelling, layout, and spacing decision committed +into the metal, one plate per page, ready for the press to ink and print. Published scores carry +plate numbers to this day. The metaphor maps exactly onto this layer: + +- The IR is the abstract manuscript: neutral content, no typeface, no layout. +- A plate is one type engraved for a *specific* target: the same content rendered into that target's + concrete identifiers (the casing is the engraving style), in that target's order and file layout, + ready for the press. +- The Plates are the complete set of plates for the edition; one target is one edition. 
+- The templates are the press: they ink and print what the plates already fixed. They add no + composition decisions of their own. +- `python3 -m gen plates --config C` is the proof pulled from the plates before the print run: a + dumpable, diffable preview of the engraved edition before any code is printed -- the same role + `ir --resolve` plays for the IR. + +The name is evocative, thematically exact for a project about music engraving, and collides with +nothing already in this codebase: not `model`, not `IR`, not `facet` (which already means an XSD +constraint here), not `resolve`. It reads cleanly as a noun, a module, and a command alongside `ir`, +and the singular/plural pair names the per-type object and the collection with one word -- a plate, +the Plates -- so there is no second term to learn. + +### Alternatives considered and rejected + +- **ViewModel** -- conceptually the most precise fit (MVVM's "presentation-ready projection of the + model that the view consumes" is exactly this). Rejected: the brief explicitly bars overloading + "model", and the term drags in web-framework baggage. +- **Projection / Project** -- accurate (the IR is "projected" onto a target), but `project` is badly + overloaded in this repo (there is a `/project` skill and a `docs/ai/projects/` tree), and `gen + project` reads as a noun command while colliding with the verb. The CLI ergonomics alone + disqualify it. +- **Binding** -- the brief's own framing ("target-binding stage") endorses it, and in compiler terms + binding-to-a-target is exactly this. Rejected for the audience: to a systems engineer "binding" + reads first as FFI / language bindings (bindgen, "Rust bindings to libfoo"). Since this project + literally emits C++/C/Go libraries, naming the layer "Binding" actively invites the wrong reading. + The term survives as the name of the *per-target field group* inside each plate (see section 4), + where the FFI reading cannot intrude. 
+- **Facet** -- already a load-bearing XSD term in this codebase (enumeration/pattern/minInclusive). + Reusing it would be a genuine collision. +- **Dialect / Idiom** -- evocative of per-language flavor, but both connote *names only* and + undersell the layer's representation, layout, and structural work. + +## 2. Responsibilities and non-responsibilities + +The Plates layer owns the per-target projection and nothing else. + +It **is responsible for**: + +- Name expansion: every fundamental name gets all standard casings, automatically, plus its + immutable wire form preserved verbatim (section 5). +- Renames and per-convention overrides, with validation against the IR (section 6). +- Post-projection collision detection as a CI gate (section 7). +- Representation strategy: mapping each of the 8 shapes to an emit strategy, cardinality to the + target's optional/collection types, and IR primitives to target types via a config-overridable map + (section 8). +- Resolving `default`/`fixed` literals that name an enum variant to that variant's target + identifier, while keeping the wire literal (section 8). +- Exposing both the resolved content tree and a flat member list (section 8). +- File/layout partitioning and the per-file include/import graph (optional; section 8). +- Namespaces/packages/prefixes, reserved-word policy, identifier-validity enforcement, doc-comment + style, deterministic ordering (section 8). + +It **is not responsible for** (stays in the IR): + +- Schema resolution: collapsing restriction chains, normalizing cardinalities, hoisting anonymous + types, dropping dead types, dependency ordering, and the group / attribute-group structure. The + Plates consume the IR and its `Resolver`; they never re-derive a schema fact. +- The wire names themselves (the plates preserve them, the IR produces them). +- The sounds.xml fold (an IR-level, config-gated input selection, already done before the Plates + are built). 
+ +It is **not responsible for** (stays in the templates): + +- The literal text: language grammar, punctuation, whitespace, file headers, the actual rendering of + a strategy tag into source lines. Templates contain no naming logic and no per-element + conditionals; they read plate fields and print. + +## 3. One layer or two: decided by the JSON Schema contrast + +The forcing question (section 9 works it fully): a template that emits a JSON Schema version of the +MusicXML spec wants wire names (not casings), the resolved choice/sequence structure, enum wire +literals, union members, number facets, string patterns, the open-enum, and docs as `description`. +It wants **none** of the file partitioning, includes, reserved-word mangling, comment styling, or +casing machinery a code target needs. + +That split is real, and it has a sharp consequence: almost everything the JSON Schema target wants +is *already in the IR plus the Resolver*. The IR's names already are the wire names; +`Resolver.content` already splices groups into a choice/sequence tree; union members, number bounds, +patterns, enum values, the open-enum, and `doc` strings are all present. So the neutral half of this +layer is not new information -- it is the IR, re-presented. + +This drives the decision: + +**The Plates are one rich, materialized, template-facing object, and each plate in it is internally +partitioned into two field groups:** + +- a **neutral core** -- wire-faithful, target-independent facts (wire name, shape, resolved + structure, value lists, facets, docs), mirrored from the IR + Resolver; and +- a **target binding** -- the per-target overlay (the casing bundle, resolved target types, emit + strategy tags, file assignment, reserved-word resolution, doc style). + +Code targets read both groups. A neutral target like JSON Schema reads only the neutral core, leaves +the binding's optional pieces (partitioning, includes) unconfigured, and never touches the casings. 
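
The neutral/bound split described above can be sketched as follows. This is a minimal illustration and not the `gen/plates` implementation: the class names `NeutralCore`, `TargetBinding`, and `Plate`, the two render functions, and the placeholder doc text are all hypothetical, and the real plates may keep both field groups flat on one dataclass. The point it shows is that a schema target can be written against the neutral group alone, while a code target reads both.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class NeutralCore:
    """Wire-faithful, target-independent facts (hypothetical field group)."""
    wire: str
    shape: str
    values: Tuple[str, ...] = ()
    doc: Optional[str] = None


@dataclass(frozen=True)
class TargetBinding:
    """Per-target overlay (hypothetical field group); neutral targets never read it."""
    cased: Dict[str, str]
    strategy: str
    file: Optional[str] = None


@dataclass(frozen=True)
class Plate:
    neutral: NeutralCore
    binding: TargetBinding


def render_json_schema(plate: Plate) -> dict:
    """Schema target: touches only plate.neutral (wire names, values, docs)."""
    schema = {"title": plate.neutral.wire, "enum": list(plate.neutral.values)}
    if plate.neutral.doc is not None:
        schema["description"] = plate.neutral.doc
    return schema


def render_cpp_declaration(plate: Plate) -> str:
    """Code target: identifier from the binding, value count from the core."""
    ident = plate.binding.cased["pascal"]
    return f"enum class {ident};  // {len(plate.neutral.values)} wire values"


# Illustrative plate for the `up-down` enum; the doc string is a placeholder.
up_down = Plate(
    neutral=NeutralCore(wire="up-down", shape="enum", values=("up", "down"),
                        doc="placeholder doc text"),
    binding=TargetBinding(cased={"pascal": "UpDown", "snake": "up_down"},
                          strategy="enum-class", file="UpDown.h"),
)
```

Because `render_json_schema` never dereferences `plate.binding`, leaving partitioning and includes unconfigured costs a neutral target nothing beyond the unused fields.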
+ +**Why one object and not two passes.** Two separate artifacts -- a neutral enrichment layer plus a +detached per-target overlay -- would force every template to cross-reference the two by name and +re-walk the structure to stitch them, re-introducing exactly the per-emitter splicing the IR worked +to centralize. It would also split a wire name from its own casings across two objects. One object +with a disciplined neutral/bound field split gives the ergonomics of one (templates walk a single +tree) and the generality proof of two (the JSON Schema target demonstrably needs only the neutral +fields). The cost -- computing five casings per name and a file assignment even for a target that +ignores them -- is trivial (a few thousand names) and partitioning is opt-out, so neutral targets +pay nothing meaningful. + +## 4. Data shape: materialized, dumpable, built on the Resolver + +The IR's `Resolver` is computed-on-demand because it is pure over the IR and is needed *mid-build* +(to compute `deps`). The Plates have neither property: they depend on a config (a specific target), +and nothing consumes them mid-build. They are therefore **materialized** -- a plain dataclass tree +built once per target -- for three reasons: + +1. Collision detection (section 7) and rename validation (section 6) are global passes over all + projected identifiers; they are naturally build-then-check steps, which fit a materialized + result. +2. Inspectability and gating: a materialized tree dumps to JSON via the existing `gen/ir/dump.py` + machinery, giving `gen plates --config C` as a diffable artifact and a `--check` CI gate, + matching the project's analyze-as-gate ethos. +3. Templates want random-access to fully-resolved plates, not recomputation. + +The Plates are *built on* the Resolver: they consume `Resolver.attributes`, `all_attributes`, +`content`, and `elements` rather than re-deriving any splicing. 
+ +Design sketch of the types (shapes and accessors, not implementation): + +``` +# --- the neutral/bound name bundle (R1, R3) --- +Name: + wire: str # immutable on-the-wire string (R3); never a code identifier + words: tuple[str, ...] # the tokenized word vector (section 5) + cased: dict[str, str] # convention-name -> identifier, e.g. {"pascal": "Note", ...} + # convenience accessors pascal/camel/snake/kebab/screaming read from `cased`. + # `cased` is filled by iterating a CONVENTION REGISTRY, so adding a convention + # later is registering one function -- zero changes elsewhere (R1). + +# --- value plates (mirror the IR's 4 value shapes) --- +EnumPlate: name: Name; base: str; variants: list[Variant]; doc: str|None +Variant: wire: str; name: Name; ident: str # ident = sanitized name.cased[variant-conv] +NumberPlate: name: Name; base: str; bounds: NumberBounds; target_type: str; doc: str|None +StringPlate: name: Name; base: str; patterns; length; target_type: str; doc: str|None +UnionPlate: name: Name; members: list[UnionMember]; doc: str|None # member -> Ref or literal set + +# --- complex plates (mirror the IR's 4 complex shapes) --- +Member: name: Name; kind: str # "element" | "attribute" | "value" + type_ref: PlateRef; cardinality: str # required|optional|vector + repr: MemberRepr # concrete optional/collection wrapper (section 8) + default: str|None; fixed: str|None + default_variant: str|None # variant ident when default/fixed names a variant + doc: str|None +ComplexPlate: name: Name; shape: str # value|composite|empty|derived (or value-type shape) + strategy: str # emit-strategy tag the template switches on + members: list[Member] # flat, deduped, ordered (code targets) + content: ContentNode|None # resolved sequence/choice tree (schema targets) + base: PlateRef|None # derived: the inheritance edge + all_members: list[Member]|None # derived: flattened (base chain merged) + presence_only: bool + file: FileId|None # None when partition == single + doc: 
str|None + +# --- the whole projected target --- +TargetInfo: language: str; namespace: str; prefix: str + conventions: list[str]; doc_style: DocStyle; reserved: set[str]; partition: str +Plates: target: TargetInfo + value_types: list[EnumPlate|NumberPlate|StringPlate|UnionPlate] # deps-ordered + complex_types: list[ComplexPlate] # deps-ordered + roots: list[PlateRef] + files: list[FileSpec]|None # per-file include graph; None when not partitioned + type_map: dict[str, str] # primitive -> target type, after config overrides +``` + +Build entry point and CLI (mirrors `ir`): + +``` +build_plates(ir: Ir, config: Config) -> Plates # uses Resolver + a NameFactory + collision check +python3 -m gen plates --config C [--type N] [--check] +``` + +`--check` runs rename validation and collision detection and exits non-zero on any failure, so it +can gate CI exactly as `analyze` does for the DAG/no-collision invariants. Output serializes through +the existing `to_jsonable` in `gen/ir/dump.py`. + +## 5. The name-convention model + +### 5.1 Tokenizer + +A fundamental name is split into an ordered **word vector** of lowercase words, then recased. The +wire form is preserved untouched alongside (R3); tokenization feeds *only* the cased identifiers, +never serialization. + +Rules, applied in order: + +1. **Separators.** Split on and consume any of: hyphen `-`, dot `.`, underscore `_`, colon `:`, and + ASCII whitespace. (Hyphen covers ordinary kebab names; dot covers `brass.alphorn`; whitespace + covers space-separated enum values like `up down` and `bass drum`; colon covers external refs + like `xml:lang`, `xlink:type`.) +2. **Case-transition splits** (for any already-mixed-case input, rare in MusicXML but the tokenizer + must be total): split at a lower-to-upper boundary (`fooBar` -> `foo`, `bar`) and at an acronym + boundary, where an uppercase run is followed by an uppercase+lowercase (`MIDIChannel` -> `midi`, + `channel`): the last capital of the run begins the next word. +3. 
**Digits do not split.** A letter-digit or digit-letter boundary is *not* a word boundary, so + `default-x` -> `[default, x]` (split on the hyphen only), `midi-128` -> `[midi, 128]`, and the + enum value `1024th` -> `[1024th]` (one word). Digits ride with their adjacent letters. +4. **Lowercase.** Each resulting word is lowercased to its canonical form. Casing is reapplied per + convention. +5. **Degenerate input.** If the rules yield an empty vector -- the empty-string enum value `""` from + `positive-integer-or-empty` and a few `*-value` enums -- substitute the configured fallback word + vector, default `["empty"]`. The wire form stays `""`; only the identifier gets a name. + +### 5.2 Recasing + +Each convention is a function from the word vector (plus the acronym set) to a string. The five +standard conventions: + +- **PascalCase**: capitalize every word, concatenate. +- **camelCase**: the first word fully lowercased, every later word capitalized, concatenate. +- **snake_case**: words joined with `_`. +- **kebab-case**: words joined with `-`. +- **SCREAMING_SNAKE_CASE**: each word uppercased, joined with `_`. + +Where "capitalize a word" means: if the word is in the **acronym set**, uppercase it whole (`midi` +-> `MIDI`, `id` -> `ID`); else if its first character is a letter, uppercase that letter and +lowercase the rest; else (a digit-led word like `1024th`) leave it lowercased. The acronym set is +config-extensible (`[naming] acronyms = [...]`); the default is `{midi, id, xml, css, smufl, uri, +url}`. Acronyms affect only PascalCase and the non-leading words of camelCase; snake/kebab/screaming +are case-uniform and ignore the set. (The camelCase *leading* word is always fully lowercased, so a +leading acronym yields `midiChannel`, not `MIDIChannel`.) + +Because conventions live in a registry keyed by name, adding (say) `Train-Case` or `dot.case` later +is registering one function; `Name.cased` simply grows a key and templates opt in (R1). 
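
The tokenizer rules of 5.1 and the recasing rules of 5.2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code in `gen/plates/`: the names `tokenize`, `capitalize`, `CONVENTIONS`, and `expand` are hypothetical, and the acronym set is the default quoted above. Zero-width `re.split` needs Python 3.7 or later.

```python
import re

# Default acronym set as listed in section 5.2.
DEFAULT_ACRONYMS = frozenset({"midi", "id", "xml", "css", "smufl", "uri", "url"})

# Rule 1: hyphen, dot, underscore, colon, and whitespace are consumed as separators.
_SEPARATORS = re.compile(r"[-._:\s]+")
# Rule 2: split at lower->Upper, and before the last capital of an uppercase run.
# Rule 3 falls out for free: digits never appear in either boundary pattern.
_CASE_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def tokenize(wire, fallback=("empty",)):
    """Split a wire name into a lowercase word vector (rules 1-5)."""
    words = []
    for chunk in _SEPARATORS.split(wire):
        if chunk:
            words.extend(p.lower() for p in _CASE_BOUNDARY.split(chunk) if p)
    return tuple(words) if words else tuple(fallback)  # rule 5: degenerate input


def capitalize(word, acronyms):
    """Acronyms go fully upper; letter-led words get a leading capital; digit-led stay as-is."""
    if word in acronyms:
        return word.upper()
    if word[:1].isalpha():
        return word[0].upper() + word[1:].lower()
    return word


# The convention registry: adding a convention is registering one function.
CONVENTIONS = {
    "pascal": lambda w, a: "".join(capitalize(x, a) for x in w),
    "camel": lambda w, a: w[0] + "".join(capitalize(x, a) for x in w[1:]),
    "snake": lambda w, a: "_".join(w),
    "kebab": lambda w, a: "-".join(w),
    "screaming": lambda w, a: "_".join(x.upper() for x in w),
}


def expand(wire, acronyms=DEFAULT_ACRONYMS):
    """Return every registered casing for one wire name."""
    words = tokenize(wire)
    return {name: fn(words, acronyms) for name, fn in CONVENTIONS.items()}
```

For example, `expand("midi-channel")["pascal"]` yields `MIDIChannel` while the `camel` entry is `midiChannel`, because the leading camelCase word is always left lowercase. The wire string itself is never modified; a caller would store it alongside the returned dictionary.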
+ +### 5.3 Worked conversion table + +| wire | words | PascalCase | camelCase | snake_case | kebab-case | SCREAMING_SNAKE_CASE | +|------------------|------------------|-------------------|-------------------|---------------------|---------------------|----------------------| +| `note` | [note] | `Note` | `note` | `note` | `note` | `NOTE` | +| `default-x` | [default, x] | `DefaultX` | `defaultX` | `default_x` | `default-x` | `DEFAULT_X` | +| `clef-octave-change` | [clef, octave, change] | `ClefOctaveChange` | `clefOctaveChange` | `clef_octave_change` | `clef-octave-change` | `CLEF_OCTAVE_CHANGE` | +| `midi-channel` | [midi, channel] | `MIDIChannel` | `midiChannel` | `midi_channel` | `midi-channel` | `MIDI_CHANNEL` | +| `optional-unique-id` | [optional, unique, id] | `OptionalUniqueID` | `optionalUniqueID` | `optional_unique_id` | `optional-unique-id` | `OPTIONAL_UNIQUE_ID` | +| `brass.alphorn` | [brass, alphorn] | `BrassAlphorn` | `brassAlphorn` | `brass_alphorn` | `brass-alphorn` | `BRASS_ALPHORN` | +| `up down` | [up, down] | `UpDown` | `upDown` | `up_down` | `up-down` | `UP_DOWN` | +| `1024th` | [1024th] | `1024th` | `1024th` | `1024th` | `1024th` | `1024TH` | +| `` (empty) | [empty] | `Empty` | `empty` | `empty` | `empty` | `EMPTY` | + +Notes on the hard rows: + +- `default-x` shows digit-free splitting on the hyphen and a single-letter trailing word; the wire + form `default-x` is preserved for the attribute on the wire. +- `midi-channel` and `optional-unique-id` show the acronym set producing `MIDI`/`ID` in PascalCase + while snake/kebab/screaming stay mechanical. +- `brass.alphorn` shows the dot as a separator while the wire keeps the dot for serialization (R3). +- `up down` shows a space-separated enum value tokenizing cleanly while the wire keeps the space. +- `1024th` shows a digit-led word: the casings are well-defined, but the result is not a legal + identifier in most code targets. 
That is fixed in the *binding's* identifier-validity step + (section 8.6), not here: the wire `1024th` and the recased `1024th` are both kept, and a code + target mangles to e.g. `_1024th`. The casing is never silently changed to make it legal; the + plate records the ideal and lets the sanitizer (and the collision check) act on the result. +- the empty value shows the fallback word vector `["empty"]`; the wire form remains the empty + string, which is what a serializer must emit. + +A note on the dynamics elements `p`, `pp`, `ppp`, `f`, `ff`, ... and `sfz`: these tokenize to single +words and PascalCase to `Pp`, `Ppp`, `Sfz`, which is ugly. That is the textbook motivation for +per-convention overrides (section 6): a target can force `PascalCase = PP` per element without +disturbing the wire form. + +## 6. The override system + +### 6.1 Two tiers (R4) + +- **(a) Fundamental rename.** Rename the canonical root once; every convention re-expands from the + new root automatically. `attributes` -> `properties` makes PascalCase `Properties`, snake + `properties`, and so on, with no per-flavor work. +- **(b) Per-convention override.** When one flavor's auto-expansion is unacceptable, override that + single flavor and leave the rest auto-expanded. Keep fundamental `note`, force `PascalCase = + MusicNote`, and snake_case still resolves to `note`. + +Both tiers are available for any fundamental element name, attribute name, type name, enum type +name, and enum value/variant. + +### 6.2 Addressing scheme (R5) + +Override keys are namespaced by target-kind so they are unambiguous. 
Enum values are not globally +unique (`start`, `stop`, `up`, `down` recur across dozens of enums), so an enum-value key is scoped +to its enum type: + +| target kind | key path | notes | +|---------------------|-------------------------------------------------|----------------------------------------| +| type (cplx/val/enum)| `rename.type.<name>` | one namespace; no collisions invariant | +| element | `rename.element.<name>` | name -> type is 1:1 (invariant) | +| attribute (global) | `rename.attribute.<name>` | applies on every owner | +| attribute (scoped) | `rename.attribute.<owner>.<name>`| more specific; wins over global | +| enum value | `rename.enum-value.<enum>.<value>` | scoped to the enum (R5) | +| group | `rename.group.<name>` | for targets emitting shared fragments | +| attribute group | `rename.attribute-group.<name>` | for targets emitting mixins | + +The empty enum value is addressed by the TOML empty-string key `"" = ...` under its enum's table. + +### 6.3 TOML schema + +Each override entry is a table. A bare `fundamental` key sets the root rename; convention keys +(`pascal`, `camel`, `snake`, `kebab`, `screaming`, or any registered convention) override individual +flavors. A string shorthand `type.note = "tone"` is sugar for a table with only `fundamental`. + +```toml +# ---- existing config, untouched ---- +[input] +xsd = "../../docs/musicxml-4.0-ed15c23.xsd" +[output] +dir = "../../src/private/mx/core" +[sounds] +xml = "../../docs/sounds-4.0-ed15c23.xml" + +# ---- new Plates config ---- +[target] +language = "cpp" +namespace = "mx::core" # Go: package; C: leave empty and use prefix +prefix = "" # global symbol prefix (C uses e.g. "Mx") + +[naming] +extends = "../naming.base.toml" # optional shared base (section 6.4) +acronyms = ["midi", "id", "xml", "css", "smufl"] +type-convention = "pascal" # which casing type identifiers use +field-convention = "snake" # which casing member identifiers use +variant-convention = "pascal" # which casing enum variants use +field-prefix = "" # e.g.
"m_" for member fields (section 8.7) +empty-value-word = "empty" # fallback word vector for the "" wire value +pluralize-vectors = false # see section 8.7 + +[reserved] +words = ["class", "namespace", "for", "default", "operator"] # extends language defaults +policy = "suffix-underscore" # reserved word -> append "_" +invalid-prefix = "_" # leading-digit / empty identifier -> prepend "_" + +[types] # IR primitive -> target type (overrides defaults) +decimal = "Decimal" +integer = "int" +positive_integer = "unsigned" +non_negative_integer = "unsigned" +string = "std::string" +token = "std::string" +nmtoken = "std::string" +date = "std::string" + +[layout] +partition = "per-type" # "per-type" | "grouped" | "single" +include-style = "quoted" + +[docs] +style = "triple-slash" # "//" | "///" | "/** */" +wrap = 100 + +# ---- (a) fundamental rename: all flavors re-expand ---- +[rename.type.attributes] +fundamental = "properties" + +# shorthand form, identical effect: +# rename.element.default-x = "origin-x" + +# ---- (b) per-convention override: keep root, override one flavor ---- +[rename.type.note] +pascal = "MusicNote" # snake_case still resolves to "note" + +# ---- scoped enum-value rename (R5): key scoped to the enum type ---- +[rename.enum-value.up-down] +"up" = "upward" # variant 'up' of enum 'up-down' only +"down" = "downward" + +[rename.enum-value.breath-mark-value] +"" = "none" # the empty variant, scoped to this enum + +# ---- scoped vs global attribute rename ---- +[rename.attribute] +default-x = "origin-x" # every owner +[rename.attribute.note] +type = "kind" # only the 'type' attribute on 'note'; wins over global +``` + +### 6.4 Where overrides live and precedence (R6) + +Both per-target and shared: + +- **Per-target** (the common case): renames are almost always language-driven -- avoiding a C++ + keyword, a Go predeclared identifier -- so they live in each target's `config.toml`. 
+- **Shared base** (optional): a `naming.base.toml` referenced via `[naming] extends = "..."` holds + renames common to all targets (rare). A target's own entries win over the base on any conflict. + +Precedence, highest first: + +1. A per-convention override key (`pascal`, `snake`, ...) for the exact target kind. +2. A `fundamental` rename for that target kind. +3. Auto-expansion from the wire name. + +Orthogonally: per-target config beats the shared base; a scoped attribute key beats a global one. + +### 6.5 Validation (R6) + +Every rename key is validated against the IR at build time and the run **fails loud** on a miss: +`rename.type.<name>` must name a type in the IR; `rename.element.<name>` an element that occurs; +`rename.enum-value.<E>.<V>` an enum `E` that actually lists value `V`; and so on. This matches the +analyze-as-gate ethos: a typo in a rename key (or a key left stale after a schema bump) is a build +error, not a silently ignored line. Chosen and recommended. + +## 7. Collision detection (R7) + +After tokenizing, recasing, applying renames, and reserved-word / validity mangling, two distinct +fundamental names can collapse to one identifier. The Plates build detects these and reports them as +errors (`--check` exits non-zero), the way `analyze` guards the DAG and no-collision invariants +today. The IR's "no element-name collisions" invariant guarantees nothing here, because collisions +are *induced* by the projection (casing, mangling, prefixing), not present in the wire names. + +Scopes checked, each in the convention(s) the target actually uses: + +- **Type identifiers**: all emitted type identifiers (complex + value + enum) must be unique within + the target's namespace/package, in the type-convention. (`default-x` the element and `default_x` + some other name could both snake to `default_x`, etc.) +- **Enum variant identifiers**: unique within each enum type, in the variant-convention. (Distinct + wire values that mangle to the same identifier -- e.g.
several empty/invalid values all sanitized + to the same fallback -- are caught here, per-enum.) +- **Member identifiers**: within a single complex plate's flat member list (attributes + child + elements + the value body), the field identifiers must be unique in the field-convention. This is + where an attribute and a child element sharing a recased name, or a pluralized vector member + colliding with another member, would surface. +- **Group / attribute-group identifiers**: for targets that emit them as shared structs/mixins, + unique within the relevant namespace. +- **File stems** (when partitioning): unique within the output directory, checked + **case-insensitively** so `Note.h` and `note.h` are flagged -- a real hazard on macOS and Windows + filesystems. + +The report lists, per collision: the scope, the colliding fundamental (wire) names, and the +identifier they share -- enough to write a targeted rename to resolve it. + +## 8. The transformation catalog + +### 8.1 Shape -> emit strategy + +Each IR shape carries an explicit `strategy` tag the template switches on; the template never +re-derives the shape. 
The eight shapes and their default strategies: + +| IR shape (kind) | Plate strategy | Template emits (typical code target) | +|---------------------|--------------------------------|---------------------------------------------------------| +| value: enum | `enum-class` | enum class + wire<->variant lookup tables | +| value: number | `numeric-wrapper` | wrapper over a target numeric type, range-validating | +| value: string | `string-wrapper` | wrapper over the target string type, optional pattern | +| value: union | `tagged-variant` | a small tagged variant over the member types | +| complex: value | `value-class` | class with a `value` field (typed by `value_type`) + attrs | +| complex: composite | `composite-class` | class with attrs + ordered children (see section 11, round 3) | +| complex: empty | `flag` or `attrs-class` | bool if `presence_only`, else an attributes-only class | +| complex: derived | `inherit` or `flatten` | base-class inheritance, or a flattened copy (8.4) | + +### 8.2 Cardinality -> optional/collection representation + +Each member's `cardinality` (required / optional / vector, already normalized by the IR) projects to +a `MemberRepr` describing the concrete wrapper, filled from a config mapping and the type map: + +- **required** -> a by-value member (the DAG invariant means no indirection is ever needed). +- **optional** -> the target's optional: C++ `std::optional<T>`, Go a pointer `*T`, C a `bool has_x` + plus a value field. +- **vector** -> the target's collection: C++ `std::vector<T>`, Go `[]T`, C a `T* xs; size_t n_xs`. + +The plate carries the descriptor; the template prints the concrete spelling via the type map, so +the choice of wrapper is data, not template logic. 
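
The "wrapper is data, not template logic" idea in 8.2 can be illustrated with a small lookup table. This is only a sketch: the `WRAPPERS` table, the `spell_member` helper, and the `{T}`/`{x}` placeholder syntax are hypothetical names invented for illustration, not the generator's actual config format or API.

```python
from enum import Enum


class Cardinality(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    VECTOR = "vector"


# Hypothetical per-target wrapper patterns (an assumption, not the real config):
# "{T}" is the member's target type, "{x}" is its field identifier.
WRAPPERS = {
    "cpp": {
        Cardinality.REQUIRED: "{T} {x};",
        Cardinality.OPTIONAL: "std::optional<{T}> {x};",
        Cardinality.VECTOR: "std::vector<{T}> {x};",
    },
    "go": {
        Cardinality.REQUIRED: "{x} {T}",
        Cardinality.OPTIONAL: "{x} *{T}",
        Cardinality.VECTOR: "{x} []{T}",
    },
    "c": {
        Cardinality.REQUIRED: "{T} {x};",
        Cardinality.OPTIONAL: "bool has_{x}; {T} {x};",
        Cardinality.VECTOR: "{T}* {x}; size_t n_{x};",
    },
}


def spell_member(target: str, cardinality: Cardinality, target_type: str, ident: str) -> str:
    """Look up the wrapper pattern for a target and fill in the type and identifier."""
    return WRAPPERS[target][cardinality].format(T=target_type, x=ident)
```

Under these assumptions, adding a new target means adding one table entry; no branching on language appears in the function that prints the member.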
+ +### 8.3 IR primitive -> target type + +The IR's primitive set (`string`, `token`, `decimal`, `integer`, `positive_integer`, +`non_negative_integer`, `date`, `nmtoken`) maps to target types through `Plates.type_map`, seeded +with per-language defaults and overridable in `[types]` (section 6.3). The map is the single place a +target decides that `decimal` is a `Decimal` wrapper or that `token` is just `std::string`. + +### 8.4 Derived types: inheritance vs flattened + +A derived plate exposes *both* the `base` edge (for a target with inheritance) and `all_members` +(the base chain merged via `Resolver.all_attributes`, for a target without it). A per-target switch +(`[target] inheritance = true|false`, default true for C++/Go-style structs that can embed, false +for C) selects which the `derived` strategy resolves to (`inherit` vs `flatten`). Templates read +whichever the strategy names; both are present so the choice is config, not a template fork. + +### 8.5 Enum variant identifiers and default/fixed resolution + +Enum variants are generated from arbitrary wire strings (dots, spaces, leading digits, empty), and +the variant identifier is always distinct from the wire literal, which is retained for serialization +(R3). A `Variant` carries `wire`, the `Name` bundle, and the sanitized `ident`. + +When an attribute's `default` or `fixed` value names an enum variant -- e.g. `strong-accent.type` +defaults to `up` against enum `up-down`, `barline.location` defaults to `right` against +`right-left-middle` -- the Plates build resolves that wire literal to the variant's target `ident` +and stores it as `Member.default_variant`, so the emitter writes the enum member (`UpDown::Up`) +rather than a raw string, while the wire literal stays available for the serializer. A +`default`/`fixed` on a non-enum member (e.g. `beam.number` default `1`, or `xlink:type` fixed +`simple`) is formatted as a literal of the member's target type (section 8.10), not resolved to a +variant. 
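
The default/fixed resolution in 8.5 amounts to a lookup from a wire literal to a variant. The sketch below is a simplified illustration: the `Variant` and `EnumPlate` classes here carry only the fields the lookup needs (the real `Variant` also carries a `Name` bundle), and `resolve_default_variant` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Variant:
    wire: str   # the literal as it appears in the XML, kept for serialization
    ident: str  # the sanitized identifier the target emits


@dataclass(frozen=True)
class EnumPlate:
    wire_name: str
    ident: str
    variants: Tuple[Variant, ...]


def resolve_default_variant(enum: EnumPlate, default_wire: Optional[str]) -> Optional[Variant]:
    """Map a default/fixed wire literal to its variant, failing loudly on a miss."""
    if default_wire is None:
        return None
    for variant in enum.variants:
        if variant.wire == default_wire:
            return variant
    raise ValueError(f"enum {enum.wire_name!r} has no value {default_wire!r}")
```

Because the returned variant keeps both `wire` and `ident`, an emitter can print the enum member while a serializer still has the original literal.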
+ +### 8.6 Identifier validity and reserved words + +After recasing and renames, the binding applies a sanitizer per the `[reserved]` policy: + +- **Reserved words** (language built-ins plus `[reserved] words`) are mangled by the configured + policy (default: append `_`, so `class` -> `class_`). +- **Invalid identifiers** -- leading digit (`1024th`), empty result, or any non-identifier character + that survived -- get the configured `invalid-prefix` (default `_`, so `1024th` -> `_1024th`). + +The pre-sanitized casing and the final identifier are both retained; collision detection (section 7) +runs on the *final* identifiers. + +### 8.7 Structure: resolved tree and flat member list + +Both are exposed, because emitters need different views: + +- `ComplexPlate.content` is the resolved sequence/choice tree (from `Resolver.content`, groups + spliced), for a target that cares about order and choice structure (a schema emitter). +- `ComplexPlate.members` is the flat, deduped, cardinality-tagged member list (attributes from + `Resolver.attributes`/`all_attributes` + elements from `Resolver.flat_elements`), for a code + target's field list. Note that "one field per child element member" turned out to be + insufficient for round-trip fidelity -- see the ordered-children decision in section 11, + round 3: code targets emit attributes as fields and child elements as ONE ordered collection. + +### 8.8 File / layout partitioning (optional) + +`[layout] partition` selects the strategy: + +- `per-type` -- one type per file. Each plate gets a `file`, and `Plates.files` carries, per + file, the include/import list derived from `deps`: each dependency's file, mapped through the same + assignment, deduped, self-excluded. +- `grouped` -- types grouped (by shape or by name prefix) into a fixed set of files; same include + graph, coarser. +- `single` -- one document, `file` is `None`, `Plates.files` is `None`, no include graph. 
This is + the JSON Schema case and the explicit reason partitioning is optional rather than assumed. + +### 8.9 Namespaces, docs, ordering + +- **Namespaces/packages/prefixes**: `TargetInfo.namespace` (C++ `mx::core`, Go package `mx`), and + `prefix` for languages without namespaces (C symbols `MxNote...`). +- **Doc comments**: the neutral core keeps the raw `doc` text (so JSON Schema can use it verbatim as + `description`); `TargetInfo.doc_style` carries the comment syntax, wrap column, and escape rules, + and the template applies them. The plate does not pre-bake comment syntax into the doc string. +- **Ordering**: the Plates preserve the IR's deps-first order for types (so a single-file emit is a + valid total order) and document order for members and variants. All config-driven maps are + iterated deterministically. Determinism is a hard rule: the same IR + config always yields + byte-identical output. + +### 8.10 Optional niceties: accept / reject + +- **English pluralization of vector members** -- *rejected as a default*, available as opt-in + (`[naming] pluralize-vectors`, default `false`). Irregular plurals need a dictionary, the wire + name is singular, and a wrong plural is worse than a singular member name. Default leaves vector + members singular; a target that wants plurals enables the flag (naive `+s`) or renames the + offending member explicitly. +- **Prefix/suffix policy** (`m_` fields, `Enum`/`Type` suffixes) -- *accepted as config*, off by + default (`[naming] field-prefix`, and analogous type suffix keys). Applied after recasing and + before collision detection, so a prefix that induces a collision is still caught. +- **Numeric formatting of decimal defaults** -- *accepted*. A `default`/`fixed` literal on a numeric + member is normalized to the target's spelling for that primitive (e.g. `8` stays `8` for an + integer field, becomes `8.0` for a `decimal` field if the target wants explicit decimals), with + the wire literal retained. 
This reuses the corpus normalization spirit (trailing-zero handling) at + the identifier layer. + +## 9. Forcing function: a JSON Schema emitter + +A template emitting a JSON Schema (Draft 2020-12) version of the MusicXML spec reads only the +neutral core of the Plates, and configures `[layout] partition = "single"`. Walkthrough of what it +touches and what it ignores: + +What it reads (all neutral-core fields): + +- **Names: the wire form, never a casing.** `$defs` keys are the type wire names; object property + names are element/attribute wire names; enum members are wire values. JSON property names can be + any string, so `default-x`, `brass.alphorn`, and `up down` are used verbatim -- exactly the data + `Name.wire` preserves, and exactly why the wire form is a first-class field (R3). +- **Resolved structure.** `ComplexPlate.content` (the spliced sequence/choice tree) maps directly: a + `sequence` -> an `object` with ordered `properties` and a `required` list for required members; a + `choice` -> `oneOf`; a `vector` member -> `{ "type": "array", "items": ... }`; optional vs + required -> presence in `required`. No group references remain to chase (the Resolver already + spliced them). +- **Enum wire-literal lists.** `EnumPlate.variants[*].wire` -> `{ "enum": [...] }`. The variant + identifiers (`ident`, casings) are not read at all -- proof the casing machinery is inert here. +- **Union members -> `anyOf`.** `UnionPlate.members` -> `anyOf` of member schemas. +- **The open-enum** (`instrument-sound` = `sound-id` enum unioned with open string) -> `anyOf: [ { + "enum": [ ...sound ids... ] }, { "type": "string" } ]`. The plate represents it as an ordinary + union with an enum member and a string member, so the schema falls out with no special case. +- **Number facets.** `NumberPlate.bounds` -> `minimum` / `maximum` / `exclusiveMinimum` / + `exclusiveMaximum`. +- **String facets.** `StringPlate.patterns` -> `pattern`; length -> `minLength` / `maxLength`. 
+- **Docs -> `description`.** The raw `doc` text, used verbatim. + +What it never touches (the entire target binding): + +- casings (`Name.cased`), reserved-word and validity mangling (`ident` sanitization), the primitive + type map, namespaces/prefixes, doc comment style, file partitioning, and the include/import graph. + It sets `partition = single` and reads no `file` or `files`. + +This is the proof the layer is general, not C++-shaped: the JSON Schema target consumes a strict +subset of the same object every code target consumes, needing only the neutral core, while the code +targets layer their binding on top. It is the concrete justification for the one-object, +two-field-group decision in section 3: the neutral core is demonstrably self-sufficient (so the +split is real), but it is delivered as fields of one tree the template walks once (so templates stay +dumb). + +## 10. Open questions and future work + +- **Pluralization dictionary.** If natural plurals for vector members ever become desirable, a small + irregular-plural table would be needed; deferred behind the off-by-default flag. +- **Per-context attribute meaning.** The same attribute wire name (`type`, `number`) carries + different meaning across owners. Scoped attribute overrides handle it manually today; + auto-detecting divergent uses and warning could be future work. +- **`xs:list` support.** The IR currently maps the (unused) `xs:list` case defensively to a token + string; if a future schema uses real lists, the Plates would need a list `MemberRepr`. +- **Header/implementation split.** `FileId` is one file per type today; C++ may want a type to span + a header and a source file, making file assignment one-type-to-many-files. +- **Acronym splitting of arbitrary camel input.** The case-transition tokenizer rule is specified + but effectively unexercised by MusicXML's kebab names; it would need test coverage before relying + on it for a future mixed-case schema. 
+- **Configurable per-enum invalid-identifier prefix.** A single global `invalid-prefix` is assumed; + some targets might want it per enum (e.g. note-type values prefixed `N`). + +**Supersession note:** the emit-stage portions of this document (templates as a concept, file +partitioning via `[layout]`, `FileSpec`) are superseded by +[`generator-agnosticism.md`](generator-agnosticism.md): templates are now per-target Mustache +files rendered by `gen/press/` per each target's `[render]` manifest, and output paths are +manifest output patterns. Sections 1-8 (the projection itself) remain accurate. + +## 11. Implementation notes + +The implementation (`gen/plates/`: `model.py`, `names.py`, `languages.py`, `build.py`, `check.py`; +config parsing in `gen/config.py`; tests in `gen/tests/test_plates.py`) follows this document with +these deliberate deltas: + +- **`MemberRepr` dropped.** A member's `cardinality` plus the target type map already fully + determine the wrapper spelling (by-value / optional / collection); a descriptor object in between + carried no information. The spelling is the template's grammar. +- **Named convention fields.** `TargetInfo` carries `type_convention` / `field_convention` / + `variant_convention` / `file_convention` rather than an anonymous `conventions` list -- the four + roles are load-bearing and deserve names. +- **Final identifiers are materialized.** Every plate, member, and variant carries `ident`: the + recased, renamed, prefix-applied, sanitized identifier the target uses. Collision detection runs + on these. The pre-sanitized casings stay available in `Name.cased` (section 8.6's both-retained + rule). +- **`grouped` partition deferred.** Config accepts the value; the build rejects it with a clear + message until a target needs it. `per-type` and `single` are implemented. 
+- **`group` / `attribute-group` rename kinds reserved.** No current target emits shared fragments + or mixins, so configuring those kinds is a config error rather than a silently dead table. +- **One reserved-word policy.** Only `suffix-underscore` is implemented; the config key exists and + validates so a future policy is an addition, not a migration. +- **First contact found a real collision.** `barline` carries both elements and attributes named + `segno`/`coda`; every code target's field casing collapses each pair. The shared + `gen/naming.base.toml` (the section 6.4 mechanism) renames the attributes' fundamentals to + `segno-sound`/`coda-sound`, and all three target configs extend it. +- **`xs:ID`/`xs:IDREF` canonicalized in the IR.** They surfaced as accidental ninth and tenth + primitives via the builtin fallback; the IR now folds them to `token`, keeping the primitive set + the eight this document assumes. + +Revised after the first implementation review (one code review, one architecture review): + +- **`Variant.ident` is the final emitted constant.** How enum constants are scoped is a language + fact seeded in `gen/plates/languages.py`: `bare` where the language scopes them inside the type + (C++ `enum class` -> `_1024th`), `composed` where they share one flat namespace (Go + `NoteTypeValue1024th`, C `MX_NOTE_TYPE_VALUE_1024TH`). The projection composes (prefix + type + casing + variant casing, joined in the variant convention's style), sanitizes, and stores the + result; templates print it verbatim. The collision gate runs in the namespace the target actually + has: for `composed`, type identifiers and all constants are checked mutually; for `bare`, + per-enum. (Originally the composition was left to templates, which both broke "templates do no + naming" and blinded the gate to the real namespace.) 
+- **Effective cardinality lives in the Resolver.** `Resolver.flat_elements` / + `all_flat_elements` / `base_chain` (gen/ir/resolve.py) own the flattened field view: repeated + wrappers make vectors, choices demote to optional, and duplicate occurrences of one name merge by + co-occurrence analysis -- occurrences in different branches of one choice are exclusive + (optional), anything else can co-occur in a single instance and must be a vector. The review + caught a real bug here: `metronome`'s `beat-unit` appears on a branch's spine and again inside + that branch's inner choice, so it must merge to vector, not optional (the corpus exercises this). + Schema reasoning of this kind belongs in the resolution layer, not the projection. +- **The naming vocabulary is a leaf module.** Tokenizer, convention registry, `Name`, and the + sanitizer live in `gen/names.py`, below both `gen/config.py` (which validates convention names) + and `gen/plates/` -- removing a latent config -> plates import cycle. +- **Config surface cuts.** `[layout] include-style` (consumed by nothing), `[reserved] policy` (one + legal value), and `[naming] empty-value-word` (strictly weaker than a scoped enum-value rename) + were removed. Unknown top-level sections are rejected, as are unknown keys in + `[input]`/`[output]`/`[sounds]`. `extends` is hardened: a base may not chain, may hold only + `[naming]`/`[rename]`, and a scope/entry shape disagreement between base and target is an error. + String-list keys reject bare strings. `[types]` keys must name real IR primitives. The cpp + `decimal = "Decimal"` mapping moved from the language seeds to `gen/cpp/config.toml` (it is an + mx::core decision, not a C++ fact). +- **`Plates.type_map` dropped from the public object.** Members and value plates already carry + their resolved spellings (`PlateRef.ident`, `target_type`); publishing the raw map a second way + was drift surface. 
+- **`all_members` is always built for derived plates**, so the collision gate covers the merged + chain under both the inherit and flatten strategies. +- **The build always gates.** `build_plates` runs validation and collision detection + unconditionally, so a plain `plates` dump fails loud too; `--check` is the quiet CI entry point, + not the only gate. +- **`UnionPlateMember.name` added.** A union member referencing a primitive (`decimal`) has no + plate to take a field name from; the member carries its own name bundle so templates invent + nothing. + +Revised after the second review round (the emit stage and its first two backends): + +- **The clamp policy is data on the plates.** `NumberPlate.clamp` carries resolved + `ClampStep`s -- facet bounds merged with the primitive-implied lower bounds + (`positive_integer` >= 1, `non_negative_integer` >= 0), tightest bound winning, exclusive bounds + clamping to the nearest representable in-range value (next integer; bound +/- 1e-6 for decimals) + -- plus `family` (decimal vs integer). Both backends had hand-mirrored copies of this logic + ("one policy, two spellings" enforced only by a comment); the policy now lives once, is dumpable, + and is tested in `test_plates`. The same steps apply to primitive numeric union members + (`UnionPlateMember.clamp`), closing the hole where `positive-integer-or-empty` accepted 0. +- **Union discriminator constants are projected, not template-composed.** `UnionPlateMember.tag` + is a `Variant` scoped exactly like an enum variant (renameable via + `rename.enum-value..`), and union literal variants double as their own + tags; the flat-namespace collision gate covers them all. The `Kind` infix the templates used to + compose is gone (`FontSizeDecimal`, `MX_INSTRUMENT_SOUND_SOUND_ID`). +- **`TryParse`'s contract is pinned:** lexically strict (the input must be a well-formed value of + the type's family; an enum literal must match exactly), then numbers clamp. 
Generated doc + comments mention clamping only when clamp steps exist. +- **An open string union member must be last** (it matches anything); both backends fail loud if + a schema ever orders one earlier rather than silently emitting unreachable members. + +Revised after the third review round (the complex-type templates; both corert suites green): + +- **Ordered children, not one-field-per-member.** This document's original composite sketch (a + class with one member per child element, section 8.1/8.7) cannot round-trip MusicXML: a + measure's music-data interleaves note/backup/direction in document order, and metronome's + beat-unit legally appears twice in one instance -- per-member fields lose the interleaving. + The Go and C backends therefore emit attributes as presence-tracked fields (a pointer in Go; + `bool has_x` + value in C -- required attributes included, because the corert contract is + "write back exactly what was parsed" and corpus files do omit required attributes) and child + elements as ONE ordered collection of per-type Child structs whose typed pointers discriminate + by non-nil/non-NULL. No kind discriminator: harmony has a child element literally named `kind`, + so any synthetic field can collide in those languages. **The C++ backend should not copy this + encoding**: with real sum types the collision argument evaporates + (`std::vector>` or generated choice classes give document order and + exactly-one by construction), and a hybrid -- plain fields for pure-sequence composites, + ordered variants only for choice-bearing content -- is still generate-by-shape, derivable from + `content`. 
+- **The generated packages are order-faithful typed DOMs, not validating bindings.** Parsing is + strict about NAMES (an unknown attribute or element is an error; the version gate keeps newer + documents out, so an unknown name is a generator gap, not data) and lenient about STRUCTURE and + VALUES (`pitch` accepts its children in any order or multiplicity; values degrade per the clamp + policy). `Member.cardinality` and `ComplexPlate.content` are therefore unread by these two + backends -- they stay on the plates for the C++ backend and the JSON Schema forcing function, + which want the structural facts. +- **Version gating is generated, not hand-kept.** `Plates.schema_version` (parsed from the source + stem) is emitted into each runtime (`SupportedMusicXMLVersion`, + `MX_SUPPORTED_MUSICXML_VERSION`) and the corert harnesses read it, so retargeting a schema + cannot leave a stale gate. +- **Shape queries live beside the data.** `attribute_members`/`element_members`/`value_member`, + `ComplexPlate.members_view()` (the strategy-resolved member list), and + `Plates.children_owner()` (the base-chain plate whose child struct holds an inheriting type's + children) moved out of the templates into `gen/plates/model.py`, so a third backend consumes + decisions instead of copying them. +- **Backend-composed identifiers are guarded.** A few names are still composed in templates (the + per-type `Child` struct, the children/presence fields, the document support types). Each + backend fails loud at render time if a projected identifier lands on one, so the collision + story stays airtight even where the gate cannot see; serializing a child with zero or multiple + fields set is documented as undefined (first non-nil wins; all-nil writes nothing). 
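
The clamp policy from the second review round (implied lower bounds, tightest bound wins, exclusive bounds moved to the nearest in-range value) can be sketched as below. This is an illustration under stated assumptions, not the code in `gen/plates/`: `Bound`, `IMPLIED_LOWER`, and `clamp` are hypothetical names, and the real `ClampStep` representation may differ.

```python
import math
from dataclasses import dataclass
from typing import Optional

EPSILON = 1e-6  # offset the text gives for exclusive decimal bounds

# Lower bounds implied by the primitive itself, per the text above.
IMPLIED_LOWER = {"positive_integer": 1, "non_negative_integer": 0}


@dataclass(frozen=True)
class Bound:
    value: float
    exclusive: bool = False


def _inclusive_lower(bound: Bound, family: str) -> float:
    """Turn a possibly exclusive lower bound into the nearest in-range value."""
    if not bound.exclusive:
        return bound.value
    return math.floor(bound.value) + 1 if family == "integer" else bound.value + EPSILON


def _inclusive_upper(bound: Bound, family: str) -> float:
    """Turn a possibly exclusive upper bound into the nearest in-range value."""
    if not bound.exclusive:
        return bound.value
    return math.ceil(bound.value) - 1 if family == "integer" else bound.value - EPSILON


def clamp(value, primitive: str, family: str,
          lower: Optional[Bound] = None, upper: Optional[Bound] = None):
    """Clamp value into range; the tightest (largest) lower bound wins."""
    lows = []
    if primitive in IMPLIED_LOWER:
        lows.append(IMPLIED_LOWER[primitive])
    if lower is not None:
        lows.append(_inclusive_lower(lower, family))
    if lows:
        value = max(value, max(lows))
    if upper is not None:
        value = min(value, _inclusive_upper(upper, family))
    return value
```

Keeping the merge in one function is the point of the design note: each backend would only print the resolved steps, so the two spellings cannot drift apart.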
diff --git a/docs/musicxml-3.0.xsd b/docs/musicxml-3.0-5fd8eb3.xsd old mode 100755 new mode 100644 similarity index 98% rename from docs/musicxml-3.0.xsd rename to docs/musicxml-3.0-5fd8eb3.xsd index 2e602eadf..76701dd43 --- a/docs/musicxml-3.0.xsd +++ b/docs/musicxml-3.0-5fd8eb3.xsd @@ -5,12 +5,12 @@ Version 3.0 -Copyright © 2004-2011 Recordare LLC. -http://www.recordare.com/ +Copyright © 2004-2011 MakeMusic, Inc. +http://www.makemusic.com/ This MusicXML™ work is being provided by the copyright holder under the MusicXML Public License Version 3.0, available from: - http://www.recordare.com/dtds/license.html + http://www.musicxml.org/dtds/license.html This is the W3C XML Schema (XSD) version of the MusicXML 3.0 language. Validation is tightened by moving MusicXML definitions from comments into schema data types and definitions. Character entities and other entity usages that are not supported in W3C XML Schema have been removed. The features of W3C XML Schema make it easier to define variations of the MusicXML format, either via extension or restriction. @@ -18,7 +18,7 @@ This file defines the MusicXML 3.0 XSD, including the score-partwise and score-t - The MusicXML 3.0 DTD has no namespace, so for compatibility the MusicXML 3.0 XSD has no namespace either. Those who need to import the MusicXML XSD into another schema are advised to create a new version that uses "http://www.musicxml.org/xsd/MusicXML" as the namespace. + The MusicXML 3.0 DTD has no namespace, so for compatibility the MusicXML 3.0 XSD has no namespace either. Those who need to import the MusicXML XSD into another schema are advised to create a new version that uses "MusicXML" as the namespace. 
diff --git a/docs/musicxml-3.1.xsd b/docs/musicxml-3.1-8bbe8e5.xsd old mode 100755 new mode 100644 similarity index 98% rename from docs/musicxml-3.1.xsd rename to docs/musicxml-3.1-8bbe8e5.xsd index 081d9c71c..c10bb3c56 --- a/docs/musicxml-3.1.xsd +++ b/docs/musicxml-3.1-8bbe8e5.xsd @@ -19,7 +19,7 @@ This file defines the MusicXML 3.1 XSD, including the score-partwise and score-t - The MusicXML 3.1 DTD has no namespace, so for compatibility the MusicXML 3.1 XSD has no namespace either. Those who need to import the MusicXML XSD into another schema are advised to create a new version that uses "http://www.musicxml.org/xsd/MusicXML" as the namespace. + The MusicXML 3.1 DTD has no namespace, so for compatibility the MusicXML 3.1 XSD has no namespace either. Those who need to import the MusicXML XSD into another schema are advised to create a new version that uses "MusicXML" as the namespace. diff --git a/docs/sounds.xml b/docs/sounds-3.0-5fd8eb3.xml similarity index 96% rename from docs/sounds.xml rename to docs/sounds-3.0-5fd8eb3.xml index b604c19a8..067ce566f 100644 --- a/docs/sounds.xml +++ b/docs/sounds-3.0-5fd8eb3.xml @@ -6,14 +6,14 @@ Version 3.0 - Copyright © 2004-2011 Recordare LLC. - http://www.recordare.com/ + Copyright © 2004-2011 MakeMusic, Inc. 
+ http://www.makemusic.com/ This MusicXML™ work is being provided by the copyright holder under the MusicXML Public License Version 3.0, available from: - http://www.recordare.com/dtds/license.html + http://www.musicxml.org/dtds/license.html Starting with Version 3.0, the MusicXML format includes a standard set of instrument sounds to identify musical diff --git a/docs/sounds-3.1-8bbe8e5.xml b/docs/sounds-3.1-8bbe8e5.xml new file mode 100644 index 000000000..3191870a5 --- /dev/null +++ b/docs/sounds-3.1-8bbe8e5.xml @@ -0,0 +1,924 @@ diff --git a/docs/sounds-4.0-ed15c23.xml b/docs/sounds-4.0-ed15c23.xml new file mode 100644 index 000000000..bb09ef62f --- /dev/null +++ b/docs/sounds-4.0-ed15c23.xml @@ -0,0 +1,932 @@ diff --git a/gen/README.md b/gen/README.md new file mode 100644 index 000000000..a24f05846 --- /dev/null +++ b/gen/README.md @@ -0,0 +1,317 @@ +# mx generator (`gen/`) + +Reads the MusicXML XSD and emits typed serialization/deserialization libraries. Runs as +`python3 -m gen`. This document covers the parsing and analysis layer that exists today; for the +build system, Docker toolchain, and corert test, see [`../AGENTS.md`](../AGENTS.md). + +## Background + +### Pipeline + +``` +XSD file --parse--> XSD model --lower--> IR --project--> Plates --emit--> Go / C (/ C++) + (gen.xsd) (gen.ir) (gen.plates) (gen.emit) +``` + +1. 
Parse (`gen/xsd/`) reads the XSD into a model mirroring it 1:1, still speaking XSD: restriction + chains, attribute-group references, anonymous inline types. +2. Lower (`gen/ir/`) resolves that into the intermediate representation (IR): a flat, fully-named, + dependency-ordered model in code-generation terms. A pure function of the XSD, no configurable + knobs (see Design principles). +3. Project (`gen/plates/`) binds the IR to one target: every per-target decision -- identifier + casings, renames, primitive type mappings, emit strategies, file layout -- is made here, once, + producing one **plate** per emitted type. The collection projected for a target is the + **Plates**. Designed in [`../docs/ai/design/plates.md`](../docs/ai/design/plates.md). +4. Render (`gen/press/`): the target's own Mustache templates, declared in its `[render]` + manifest, are pressed against contexts built from the plates. The CARDINAL RULE: the + generator is language agnostic -- adding a target touches no Python (enforced by + `gen/tests/test_agnosticism.py`). A target is a directory: `config.toml` + `templates/`. + The engine is spec-tested Mustache with three deviations (strict missing keys, no HTML + escaping, no lambdas); dispatch is by manifest (one template per shape) and by the context + builder's mechanical enrichments (discriminant flags, `_q` quoted companions, loop metadata, + member splits). Go, C, and JSON Schema render this way today; C++ awaits its templates. 
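
Two of the three Mustache deviations named in step 4 (strict missing keys, no HTML escaping) can be shown with a toy renderer. This is a deliberately tiny sketch that handles only plain `{{name}}` variable tags; it is not `gen/press/engine.py`, and `render_strict` is a hypothetical name.

```python
import re

# Matches {{ name }} or {{ a.b.c }}; sections, partials, and comments are out of scope.
_TAG = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_.]*)\s*\}\}")


def render_strict(template: str, context: dict) -> str:
    """Substitute variable tags; a missing key is an error and nothing is HTML-escaped."""

    def lookup(match: "re.Match[str]") -> str:
        node = context
        for part in match.group(1).split("."):
            if not isinstance(node, dict) or part not in node:
                raise KeyError(f"missing template key: {match.group(1)}")
            node = node[part]
        return str(node)

    return _TAG.sub(lookup, template)
```

Strictness matters for a code generator because standard Mustache renders a missing key as an empty string, which would silently produce broken source; skipping HTML escaping keeps characters such as `<` and `&` intact in emitted code.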
+ +### Layout + +``` +gen/ + __main__.py CLI: analyze | ir | plates | + config.py typed config.toml loader (inputs, output, plates sections) + names.py naming vocabulary: tokenizer, convention registry, sanitizer + naming.base.toml schema-forced renames shared by all targets + xsd/ + model.py dataclasses mirroring the XSD subset MusicXML uses + parser.py ElementTree parser, no external dependencies + analyze.py structural analysis + reusable index helpers + ir/ + model.py the IR dataclasses + build.py lowering from the XSD model to the IR + resolve.py collapsed views (groups, attributes, flat element fields) for emitters + dump.py IR to JSON + plates/ + model.py the plate dataclasses (neutral core + target binding) + build.py the projection: IR + config -> Plates + check.py post-projection collision detection + press/ + engine.py the Mustache engine (official spec suite + 3 deviations) + context.py plates -> render contexts (flags, _q companions, loop metadata) + render.py [render] manifest expansion, format hook, orchestration + writer.py deterministic output (write-if-changed, marker-gated pruning) + schema/, cpp/, test/go/, test/c/ targets: config.toml + templates/ (+ harnesses) +``` + +### Design principles + +- Generate by shape, not by element. Every type falls into one of 8 shapes (4 value + 4 complex, + defined in the Glossary). One template per shape; no per-element special casing. +- The IR is a pure, canonical function of the XSD. All schema-specific reasoning -- resolving + references, ordering, dead-code removal, naming -- happens once, in the IR, shared by every + target. Per-language choices (inheritance vs flattening, mixins vs inlined attributes) belong to + the emitter. The IR takes no configuration. +- Resolve, but preserve names. 
The IR data model computes every resolved answer (effective + primitives, cardinalities, dependency order) yet keeps the schema's named structure (aliases, + inheritance edges, model groups, attribute groups) so each emitter can decide how much to collapse. +- One resolution, shared. The collapsed form most emitters actually want -- attribute groups + flattened into a single ordered list, model-group refs spliced into the content, a derived type's + full attribute set including its base chain -- is *not* duplicated into the data (which would risk + drift). It is computed on demand by the resolution layer (`ir/resolve.py`), so the + splicing-and-deduping reasoning lives once and every emitter shares it rather than re-deriving it. + +### Companion data (sounds.xml) + +The one documented exception to "the IR is a pure function of the XSD." The schema types the +`instrument-sound` element as a bare `xs:string`; the ~900 standard timbre identifiers it expects +(`brass.alphorn`, ...) live only in `sounds.xml`, a separately versioned MusicXML file, not the +schema. `ir/sounds.py` reads that file and folds it into the IR: it adds a `sound-id` enum over the +identifiers and an `instrument-sound` union of that enum with an open string, then retypes the +element from `string` to the union. Known values become typed; any other string stays valid, because +the schema leaves the content open. This needs no new IR shape -- it is an ordinary enum plus an +ordinary union, the same shape as `font-size` (the `css-font-size` enum unioned with `decimal`). + +The patch runs only when a target's `config.toml` names a sounds file under `[sounds] xml` (vendored +in `docs/`, version-matched to the XSD), so it is opt-in per target and the base IR stays pure. +`python3 -m gen ir --config ` shows the patched view. It is config-gated, but not a +type-shaping knob: it selects an *input*, not how a type is emitted. 
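
The sounds fold described under Companion data can be sketched with plain dictionaries standing in for the IR. Everything here is an assumption made for illustration: the dict layout, the `read_sound_ids` and `fold_sounds` names, and the `<sound id="...">` element shape assumed for the input file. It is not the code in `ir/sounds.py`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def read_sound_ids(xml_text: str) -> List[str]:
    """Collect id attributes from <sound> elements, in document order (assumed file shape)."""
    root = ET.fromstring(xml_text)
    return [node.attrib["id"] for node in root.iter("sound") if "id" in node.attrib]


def fold_sounds(ir: Dict, sound_ids: List[str]) -> Dict:
    """Return a patched copy: add the enum and the union, then retype the element."""
    patched = dict(ir)
    patched["value_types"] = dict(ir["value_types"])
    patched["value_types"]["sound-id"] = {"kind": "enum", "values": list(sound_ids)}
    # The open string member goes last, since it matches any input.
    patched["value_types"]["instrument-sound"] = {
        "kind": "union",
        "members": ["sound-id", "string"],
    }
    patched["elements"] = dict(ir["elements"])
    patched["elements"]["instrument-sound"] = "instrument-sound"
    return patched
```

Returning a patched copy rather than mutating the input mirrors the stated intent that the base IR stays a pure function of the XSD, with the fold applied only when a config opts in.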
+ +## Usage + +``` +python3 -m gen analyze [xsd] # structural analysis report (text) +python3 -m gen ir [--type NAME] [--resolve] [--config C] [xsd] # lower to IR, print as JSON +python3 -m gen plates --config C [--type NAME] [--check] # project the IR onto a target, print as JSON +python3 -m gen render --config C --type NAME # render one type to stdout (template debugging) +python3 -m gen # emit the target (render its [render] manifest) +``` + +`--resolve` prints the *collapsed* view of complex types (attribute groups flattened, model-group +refs spliced into the content, derived types carrying their full base-chain attribute set) -- the +form an emitter consumes. Without it, `ir` prints the IR verbatim, with the named structure intact. + +`--config C` lowers the IR exactly as that target's emitter will consume it: it reads the schema the +config pins (`[input] xsd`) and applies the companion patches the config enables (the sounds.xml +fold, see Companion data). An explicit positional `xsd` still overrides the config's. Without +`--config`, `ir` shows the pure-XSD IR of the default (or positional) schema. + +`plates --config C` projects that same IR onto the target: identifier casings, renames, sanitized +identifiers, type mappings, strategies, file assignment -- the proof pulled from the plates before +any code is printed. `--check` runs rename validation and collision detection and exits non-zero on +any failure, so it can gate CI the way `analyze` guards the DAG and no-collision invariants. + +`xsd` defaults to `docs/musicxml-4.0-ed15c23.xsd`. 
Examples: + +``` +python3 -m gen ir --type note # one type +python3 -m gen ir --type note --resolve # one type, collapsed for an emitter +python3 -m gen ir > build/ir/musicxml-4.0.ir.json # whole IR (build/ is gitignored) +jq '.complex_types[] | select(.name=="note")' build/ir/musicxml-4.0.ir.json +python3 -m gen analyze docs/musicxml-3.1-8bbe8e5.xsd # analyze a different version +``` + +## Glossary + +### XSD source terms + +W3C XML Schema constructs MusicXML uses, as the parser sees them. + +- simpleType -- a type with no child elements and no attributes: just a constrained text value + (enumeration, number range, pattern, or union). Becomes an IR value type. +- complexType -- a type for an element with attributes and/or child elements. Becomes an IR complex + type. +- element -- a named node in the document. In MusicXML every element is declared inline as + `name="x" type="y"`; there are no global element refs. +- attribute -- a `name="value"` pair on an element. Its type is always a simpleType or builtin. +- group (`xs:group`) -- a named, reusable fragment of element content (a sequence/choice) spliced + into complex types by reference. No identity in the XML document. +- attributeGroup -- a named, reusable bundle of attributes referenced by complex types. +- restriction / extension -- the two ways one type derives from another: narrowing with facets, or + adding to it. +- simpleContent / complexContent -- a complex type whose body is a text value plus attributes + (simpleContent), or that derives from another complex type (complexContent). +- facet -- a constraint on a simpleType: `enumeration`, `pattern`, `minInclusive`, `maxInclusive`, + `minExclusive`, `maxExclusive`, `minLength`, `maxLength`, `length`. +- particle -- a piece of a content model: an element, `sequence`, `choice`, or group ref, each with + `minOccurs`/`maxOccurs`. +- anonymous (inline) type -- a type defined in place on an element rather than named at top level. 
+ The IR names and hoists these (see synthesized type). + +### IR stats keys + +The `stats` block summarizes the lowered model. Every key: + +- value_types (143) -- IR value types: a single scalar value, no child elements. Lowered from XSD + simpleTypes (plus the text body of simpleContent complex types). +- value_kinds -- value types by kind: + - enum (96) -- a closed set of allowed string tokens, e.g. `step` = {A..G}. IR fields: `base` (the + primitive the tokens are drawn from, usually `token`/`string`) and `values`. Emits an enum class + plus string<->enum lookup tables. + - number (25) -- a numeric value whose resolved primitive is `decimal`/`integer`/ + `positive_integer`/`non_negative_integer`, with optional bounds. IR fields: `base`, and any of + `min_inclusive`/`max_inclusive`/`min_exclusive`/`max_exclusive`. Includes numeric aliases (a + named type that just renames a numeric primitive, e.g. `divisions`). Emits a numeric wrapper + that range-validates on assignment. + - string (18) -- a text value (primitive `string`/`token`/`nmtoken`/`date`) with optional + `patterns` and length constraints. Includes plain string aliases. Emits a string wrapper with an + optional pattern check. + - union (4) -- a value that may be any one of several member value types or inline literal sets, + e.g. `number-or-normal` = decimal | "normal". IR field: `members`, each a `UnionMember` holding + either a `ref` (a Ref to a value type or primitive) or inline `literals`. Emits a small tagged + variant. +- complex_types (228) -- IR complex types: elements that carry attributes and/or child elements. + Lowered from XSD complexTypes (including synthesized ones). +- complex_kinds -- complex types by kind: + - value (82) -- a typed text body plus attributes (from XSD simpleContent), e.g. + `accidental-text`. IR fields: `value_type` (a Ref to the body's value type), `attributes`, + `attribute_groups`. Emits a class with a `value` field plus attribute fields. 
+ - composite (96) -- child elements arranged in sequences/choices, plus attributes. The structural + workhorse, e.g. `note`. IR field: `content` (a particle tree). Emits a class with one member per + child element (cardinality required/optional/vector), order preserved. + - empty (45) -- an element with no child elements. Two sub-cases the IR unifies: presence-only (a + bare flag, `presence_only: true`) and attributes-only (attributes but no children, e.g. + `empty-placement`). Emits a bool or an attributes-only class. + - derived (5) -- extends another complex type and adds attributes (from XSD complexContent), e.g. + `metronome-tuplet` extends `time-modification`. IR field: `base` (parent type name). Emits + inheritance, or a flattened copy where the language has none. +- groups (27) -- XSD model groups carried into the IR as named, reusable content fragments. +- attribute_groups (45) -- XSD attribute groups carried into the IR as named, reusable attribute + bundles. +- synthesized_types (7) -- complex types the IR created by naming and hoisting anonymous XSD types: + `score-partwise`, `score-timewise`, `partwise-part`, `partwise-measure`, `timewise-part`, + `timewise-measure`, `directive`. The part/measure pairs are context-qualified because the partwise + and timewise hierarchies give them genuinely different shapes. +- dropped_dead_types (5) -- named XSD types nothing references, which the IR omits: + `empty-print-style`, `empty-print-style-align`, `formatted-symbol`, `positive-decimal`, + `start-stop-change-continue` (see XSD Analysis). + +### IR structural terms + +Terms used inside the lowered types, not in `stats`. + +- Ref `{ name, category }` -- a typed reference to another type. `category` is `complex` (a + generated element class), `value` (a generated value type), or `primitive` (a builtin, not + generated). +- primitive -- a builtin base type the generator does not emit. 
The IR canonicalizes XSD builtins + (`xs:decimal` -> `decimal`, `xs:token` -> `token`, ...) and the 10 `xml:`/`xlink:` attribute refs + into a small primitive set, listed in the `builtins` map. +- attribute -- IR fields: `name`, `type` (a Ref to a value type or primitive), `required`, and + optional `default`/`fixed`. +- particle -- a node in a complex type's `content`: an `element` (a child occurrence), a `sequence` + (ordered list), a `choice` (exactly one of), or a `group` reference. Each carries `min`/`max` (max + may be the string `"unbounded"`). +- cardinality -- the normalized occurrence of an element field: `required` (exactly 1), `optional` + (0 or 1), or `vector` (repeatable). Derived from min/max. +- presence_only -- true for an empty element with no attributes: its only information is whether it + appears, so it maps to a bool. +- base -- for a `derived` complex type, the parent complex type it extends. The IR stores only the + added attributes; inherited attributes are reached through `base`, or flattened in one call by + `resolve.all_attributes`. `ComplexType.content` is defined for derived types but is currently + always empty: every MusicXML derivation adds attributes only, never content. +- value_type -- for a `value` complex type, the Ref to the value type of its text body. +- deps -- the complex types a type structurally depends on (child element types + base), resolved + through groups. Drives the ordering below. +- roots -- the document root elements: `score-partwise` and `score-timewise`. +- builtins -- the map from XSD/external builtin names to canonical IR primitives. + +### Resolution layer + +The IR data model preserves the schema's named structure; `ir/resolve.py` collapses it on demand. +`Resolver.from_ir(ir)` exposes read-only accessors over a complex type, none of which mutate the +IR: + +- `attributes(ct)` -- the type's own attributes with its `attribute_groups` expanded inline, in + declaration order, deduped by name. 
(`note`: 7 own + 5 groups -> 21 attributes.) +- `all_attributes(ct)` -- `attributes(ct)` plus the base chain's attributes, base-most first, for a + target with no inheritance to lean on. (`mordent`: 3 own -> 20 once `empty-trill-sound` is merged.) +- `content(ct)` -- `ct.content` with every model-group ref spliced in: a self-contained tree of + elements/sequences/choices with no `group` nodes left. Nesting and all min/max bounds are + preserved. +- `elements(ct)` -- every element occurrence in the resolved content, in document order, flattened + across sequences/choices/groups (drops the choice/sequence grouping and keeps local cardinality; + use `content` when structure matters and `flat_elements` for a field view). +- `flat_elements(ct)` -- each distinct element name with its *effective* cardinality for a flat + one-field-per-name view: repeated wrappers make vectors, choices demote to optional, and + duplicate occurrences of one name merge by co-occurrence analysis (occurrences in different + branches of one choice are exclusive -> optional; anything else can co-occur in one instance -> + vector, e.g. `metronome`'s `beat-unit`). +- `all_flat_elements(ct)` / `base_chain(ct)` -- the flattened view merged across the derivation + chain (base-most first), mirroring `all_attributes`. + +`python3 -m gen ir --resolve` dumps this view. `build` itself uses the resolver to compute each +complex type's `deps`, so the group-walking logic lives in exactly one place rather than once per +emitter. + +### Ordering + +Both `value_types` and `complex_types` are emitted **deps-first**: a type's dependencies always +precede it in the list. The list order *is* the topological order -- there is no separate rank field, +because the array index already encodes it and a duplicate integer would only risk drifting out of +sync. Value types never reference complex types, so concatenating `value_types` then `complex_types` +is a valid total order for a single-file emit. 
An emitter that wants a different order (alphabetical, +per-file, by shape) has `deps` and can compute its own. Within a value type, union members precede +the union; the only value-to-value dependency is a union referencing its members. + +## XSD Analysis + +Run `python3 -m gen analyze` for the full report. Key findings for MusicXML 4.0: + +### Inventory + +145 simpleTypes, 224 complexTypes, 27 model groups, 45 attribute groups, 2 document roots. 440 +distinct element names across 478 declaration sites. 351 attribute declarations (60 required). + +### Two load-bearing invariants + +These hold for 3.0, 3.1, 4.0, and 4.1 alike; the codegen design leans on both: + +1. The complex-type graph is a DAG. Zero cycles, zero self-references. So generated code can use + plain by-value members, emit types in topological order, and skip forward declarations and heap + indirection entirely -- removing the hardest problem in typed-XML codegen. +2. No element-name collisions. Every element name maps to exactly one type (no global element refs). + An element's type is fully determined by its name, so parse/serialize dispatch is a flat name -> + type table with no context-sensitive resolution. + +Because these are empirical, not guaranteed by XSD, `analyze` is worth keeping as a CI gate: fail +the build if a future schema introduces a cycle or name collision, before the value-type and +flat-dispatch assumptions silently break. + +### The part/measure wrinkle + +The one place name -> type is not 1:1 is the document-root scaffolding. `part` and `measure` each +appear as two anonymous types -- partwise nests `part > measure`, timewise nests `measure > part` -- +so the IR qualifies them as `partwise-part`/`timewise-part`/etc. 
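
The two invariants, and the part/measure qualification that protects the second one, can be checked with very little code. The following is a minimal sketch of such a gate under stated assumptions: `find_cycle` and `name_collisions` are hypothetical helpers written for this example, not the functions in `gen/xsd/analyze.py`, and the input dictionaries are toy data rather than the parsed schema.

```python
from __future__ import annotations

from collections import defaultdict


def find_cycle(deps: dict[str, list[str]]) -> list[str] | None:
    """Return one dependency cycle as a list of type names, or None for a DAG."""
    unseen, in_progress, done = 0, 1, 2
    state: dict[str, int] = defaultdict(int)
    stack: list[str] = []

    def visit(node: str) -> list[str] | None:
        state[node] = in_progress
        stack.append(node)
        for dep in deps.get(node, []):
            if state[dep] == in_progress:
                # Back edge: the cycle is the stack from dep onward.
                return stack[stack.index(dep):] + [dep]
            if state[dep] == unseen:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        state[node] = done
        return None

    for node in list(deps):
        if state[node] == unseen:
            found = visit(node)
            if found:
                return found
    return None


def name_collisions(sites: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each element name declared with more than one type to those types."""
    types_by_name: dict[str, set[str]] = defaultdict(set)
    for element, type_name in sites:
        types_by_name[element].add(type_name)
    return {name: types for name, types in types_by_name.items() if len(types) > 1}


# Toy inputs: a small acyclic graph, and declaration sites where the
# unqualified name "part" would map to two different types.
toy_deps = {"note": ["pitch", "lyric"], "pitch": [], "lyric": []}
toy_sites = [("part", "partwise-part"), ("part", "timewise-part"), ("note", "note")]

print(find_cycle(toy_deps))
print(name_collisions(toy_sites))
```

A CI gate in this style would exit non-zero when `find_cycle` returns a cycle or `name_collisions` returns a non-empty map. The toy `part` entry shows why the IR qualifies the root scaffolding: keyed by the bare element name the two anonymous types collide, while keyed by `partwise-part`/`timewise-part` they do not.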
+ +### Dead types + +Five named types are defined but referenced by nothing, even indirectly: + +| Type | Why it is dead | +|------------------------------|--------------------------------------------------------------------| +| `positive-decimal` | orphan definition, never referenced | +| `start-stop-change-continue` | orphan; elements use `start-stop-change` / `start-stop-continue` | +| `formatted-symbol` | superseded by `formatted-symbol-id`, which is the one elements use | +| `empty-print-style` | superseded by the `-align` / `-id` / `-object` variants | +| `empty-print-style-align` | superseded by `empty-print-style-align-id` / `-object` | + +Verified by direct text search of the XSD (independent of the parser): each has exactly one +definition and zero references via `type`/`base`/`ref`/`itemType`/`memberTypes`. The IR drops them +and reports the count. Recognizing the families (print-style, formatted-symbol) is expected -- those +are heavily used; only the specific bare type names are vestigial. diff --git a/gen/__init__.py b/gen/__init__.py new file mode 100644 index 000000000..bd0866d7a --- /dev/null +++ b/gen/__init__.py @@ -0,0 +1 @@ +"""mx code generator package.""" diff --git a/gen/__main__.py b/gen/__main__.py new file mode 100644 index 000000000..15dbacbb7 --- /dev/null +++ b/gen/__main__.py @@ -0,0 +1,279 @@ +"""mx code generator entry point. + +Usage: + python3 -m gen emit code for the target the config describes + python3 -m gen analyze [xsd] parse the XSD and print a structural analysis + python3 -m gen ir [--type N] [--resolve] [--config C] [xsd] + lower the XSD to the IR and print it as JSON; + --resolve prints the collapsed (group-spliced, + attribute-flattened) view of complex types; + --config applies a target's companion patches + (e.g. 
the sounds.xml fold) before dumping + python3 -m gen plates --config C [--type N] [--check] + project the IR onto the target the config + describes and print the Plates as JSON; + --check validates renames and detects + identifier collisions, exiting non-zero on + any failure (a CI gate, like analyze) + python3 -m gen render --config C --type N + render one type through the target's + templates to stdout (template debugging) + +Reads a MusicXML 4.0 XSD specification and generates typed document +serialization/deserialization code for the target described in the given +config file. +""" + +import sys +from pathlib import Path + +# The MusicXML version this generator targets, used as the default for analyze. +DEFAULT_XSD = Path(__file__).resolve().parent.parent / "docs" / "musicxml-4.0-ed15c23.xsd" + + +def _analyze(args: list[str]) -> int: + from gen.xsd.analyze import report + from gen.xsd.parser import parse + + xsd = Path(args[0]) if args else DEFAULT_XSD + if not xsd.exists(): + print(f"error: XSD not found: {xsd}", file=sys.stderr) + return 1 + print(report(parse(xsd))) + return 0 + + +def _ir(args: list[str]) -> int: + from gen.ir.dump import resolved_view, to_json + from gen.ir.resolve import Resolver + + type_name = None + resolve = False + config_path = None + rest = [] + i = 0 + while i < len(args): + if args[i] == "--type" and i + 1 < len(args): + type_name = args[i + 1] + i += 2 + elif args[i] == "--resolve": + resolve = True + i += 1 + elif args[i] == "--config" and i + 1 < len(args): + config_path = args[i + 1] + i += 2 + else: + rest.append(args[i]) + i += 1 + + cfg = None + if config_path is not None: + from gen.config import load as load_config + + cfg = load_config(config_path) + + # XSD precedence: an explicit positional argument wins, else the target + # config's pinned version, else the 4.0 default. 
+ if rest: + xsd = Path(rest[0]) + elif cfg is not None and cfg.xsd is not None: + xsd = cfg.xsd + else: + xsd = DEFAULT_XSD + if not xsd.exists(): + print(f"error: XSD not found: {xsd}", file=sys.stderr) + return 1 + ir = _lower(xsd, cfg) + + resolver = Resolver.from_ir(ir) if resolve else None + + if type_name: + ct = next((c for c in ir.complex_types if c.name == type_name), None) + if ct is not None: + print(to_json(resolved_view(resolver, ct) if resolver else ct)) + return 0 + vt = next((v for v in ir.value_types if v.name == type_name), None) + if vt is None: + print(f"error: type not found in IR: {type_name}", file=sys.stderr) + return 1 + print(to_json(vt)) # value types are already fully resolved + return 0 + + if resolver: + print(to_json([resolved_view(resolver, c) for c in ir.complex_types])) + else: + print(to_json(ir)) + return 0 + + +def _lower(xsd: Path, cfg): + """Lower an XSD to the IR, applying a config's companion patches (today: + the sounds.xml fold). One definition, shared by every command.""" + from gen.ir.build import build_ir + from gen.xsd.parser import parse + + ir = build_ir(parse(xsd), source=xsd.stem) + if cfg is not None and cfg.sounds_xml is not None: + from gen.ir.sounds import patch_sounds, read_sound_ids + + patch_sounds(ir, read_sound_ids(cfg.sounds_xml)) + return ir + + +def _plates(args: list[str]) -> int: + from gen.ir.dump import to_json + from gen.plates import PlatesError + + config_path = None + type_name = None + check = False + i = 0 + while i < len(args): + if args[i] == "--config" and i + 1 < len(args): + config_path = args[i + 1] + i += 2 + elif args[i] == "--type" and i + 1 < len(args): + type_name = args[i + 1] + i += 2 + elif args[i] == "--check": + check = True + i += 1 + else: + print(f"error: unexpected argument: {args[i]}", file=sys.stderr) + return 2 + if config_path is None: + print("error: plates requires --config ", file=sys.stderr) + return 2 + + from gen.plates import build_for_config + + try: + plates, 
_ = build_for_config(config_path) + except PlatesError as e: + for line in e.errors: + print(f"error: {line}", file=sys.stderr) + return 1 + + if check: + # Rename validation and collision detection already ran in the build; + # reaching here means the projection is clean. + print( + f"plates ok: {len(plates.value_types)} value types, " + f"{len(plates.complex_types)} complex types" + ) + return 0 + + if type_name: + if not plates.has_plate(type_name): + print(f"error: type not found in plates: {type_name}", file=sys.stderr) + return 1 + print(to_json(plates.plate(type_name))) + return 0 + + print(to_json(plates)) + return 0 + + +def _emit(config_path: str) -> int: + from gen.config import ConfigError + from gen.plates import PlatesError, build_for_config + from gen.press.engine import PressError + from gen.press.render import RenderError, render_target + + try: + plates, cfg = build_for_config(config_path) + result = render_target(plates, cfg) + except PlatesError as e: + for line in e.errors: + print(f"error: {line}", file=sys.stderr) + return 1 + except ( + ConfigError, + FileNotFoundError, + PressError, + RenderError, + RuntimeError, + ValueError, + ) as e: + print(f"error: {e}", file=sys.stderr) + return 1 + print(result.summary()) + return 0 + + +def _render_debug(args: list[str]) -> int: + from gen.plates import PlatesError, build_for_config + from gen.press.engine import PressError + from gen.press.render import RenderError, render_files + + config_path = None + type_name = None + i = 0 + while i < len(args): + if args[i] == "--config" and i + 1 < len(args): + config_path = args[i + 1] + i += 2 + elif args[i] == "--type" and i + 1 < len(args): + type_name = args[i + 1] + i += 2 + else: + print(f"error: unexpected argument: {args[i]}", file=sys.stderr) + return 2 + if config_path is None or type_name is None: + print("error: render requires --config --type ", + file=sys.stderr) + return 2 + try: + plates, cfg = build_for_config(config_path) + if cfg.render 
is None: + print(f"error: config has no [render] manifest: {cfg.path}", + file=sys.stderr) + return 1 + if not plates.has_plate(type_name): + print(f"error: type not found in plates: {type_name}", file=sys.stderr) + return 1 + plate = plates.plate(type_name) + files = render_files(plates, cfg) + from gen.press.render import _expand + + shown = 0 + for entry in cfg.render.types: + if plate.strategy in entry.strategies: + path = _expand(entry.output, plate) + print(f"==== {path} (from {entry.template})") + print(files[path], end="") + shown += 1 + if not shown: + print(f"error: no manifest entry renders strategy " + f"'{plate.strategy}'", file=sys.stderr) + return 1 + except PlatesError as e: + for line in e.errors: + print(f"error: {line}", file=sys.stderr) + return 1 + except (PressError, RenderError, FileNotFoundError, ValueError) as e: + print(f"error: {e}", file=sys.stderr) + return 1 + return 0 + + +def main(argv: list[str]) -> int: + if not argv: + print(__doc__, file=sys.stderr) + return 2 + if argv[0] == "analyze": + return _analyze(argv[1:]) + if argv[0] == "ir": + return _ir(argv[1:]) + if argv[0] == "plates": + return _plates(argv[1:]) + if argv[0] == "render": + return _render_debug(argv[1:]) + if argv[0].endswith(".toml"): + return _emit(argv[0]) + print(f"error: unknown command: {argv[0]}", file=sys.stderr) + return 2 + + +if __name__ == "__main__": + sys.exit(main(sys.argv[1:])) diff --git a/gen/config.py b/gen/config.py new file mode 100644 index 000000000..29ab56a68 --- /dev/null +++ b/gen/config.py @@ -0,0 +1,475 @@ +"""Load a target's config.toml into a typed Config. + +A target config describes one generation run: which schema inputs to read, +where generated code lands, which optional companion patches to apply before +emitting, and how the IR is projected onto the target (the Plates: naming +conventions, renames, type mappings, layout). 
The IR itself stays a pure +function of the schema inputs (see gen.ir); config selects *which* inputs and +how the result is presented, never what the schema means. + +Parsing is structural only: key shapes, types, and the rename addressing +scheme. Semantic validation (does a rename key name something in the IR, do +projected identifiers collide) happens in gen.plates.build, which has the IR +in hand and fails loud there. +""" + +from __future__ import annotations + +import tomllib +from dataclasses import dataclass, field +from pathlib import Path + +from gen.names import CONVENTIONS + +# Keys allowed in a rename entry table: a fundamental rename (all casings +# re-expand from the new root) or per-convention overrides (pin one flavor). +_ENTRY_KEYS = frozenset(CONVENTIONS) | {"fundamental"} + +# Rename kinds the Plates build consumes today. `group` and `attribute-group` +# are reserved by the design for targets that emit shared fragments/mixins; +# none of ours does, so configuring them is an error rather than a silently +# dead table. +_RENAME_KINDS = ("type", "element", "attribute", "enum-value") + + +class ConfigError(ValueError): + """A malformed config file. 
Always raised with the offending key path.""" + + +@dataclass +class RenameEntry: + """One rename: an optional fundamental root plus per-convention pins.""" + + fundamental: str | None = None + cased: dict[str, str] = field(default_factory=dict) + + +@dataclass +class Renames: + """Parsed [rename.*] tables, keyed by the design's addressing scheme.""" + + types: dict[str, RenameEntry] = field(default_factory=dict) + elements: dict[str, RenameEntry] = field(default_factory=dict) + attributes: dict[str, RenameEntry] = field(default_factory=dict) # global + scoped_attributes: dict[tuple[str, str], RenameEntry] = field(default_factory=dict) + enum_values: dict[tuple[str, str], RenameEntry] = field(default_factory=dict) + + def __bool__(self) -> bool: + return bool( + self.types + or self.elements + or self.attributes + or self.scoped_attributes + or self.enum_values + ) + + +@dataclass +class TargetSection: + symbol_prefix: str = "" # prepended to type idents and composed constants + inheritance: bool = True # derived types: inherit (True) or flatten + variant_scope: str = "bare" # constant scoping: "bare" | "composed" + + +@dataclass +class NamingSection: + acronyms: tuple[str, ...] | None = None # None -> the built-in default set + type_convention: str = "pascal" + field_convention: str = "snake" + variant_convention: str = "pascal" + field_prefix: str = "" + pluralize_vectors: bool = False + + +@dataclass +class ReservedSection: + words: tuple[str, ...] = () # the target's WHOLE reserved-word list + members: tuple[str, ...] = () # member idents the target's templates reserve + type_suffixes: tuple[str, ...] = () # compositions templates append to type idents + invalid_prefix: str = "_" + + +@dataclass +class RenderEntry: + """One manifest row: render `template` for every plate the row selects, + writing to the `output` pattern (casing placeholders like {snake} come + from the plate's name). 
A row selects either by `strategies` (the normal + shape-driven case) or by `types` -- exact wire names, for bespoke + handling of individual types. Type rows OVERRIDE strategy rows: a plate + named by any type row is rendered only by its type rows, so custom code + for one element or attribute is a config-and-template change, never a + generator change. A once-per-target row has neither.""" + + template: str + output: str + strategies: tuple[str, ...] = () + types: tuple[str, ...] = () + + +@dataclass +class RenderSection: + """The render manifest: which templates produce which files. Its presence + selects the press pipeline.""" + + dir: Path # the target's templates directory, resolved + format: tuple[str, ...] = () # optional post-render command; {dir} expands + types: list[RenderEntry] = field(default_factory=list) + once: list[RenderEntry] = field(default_factory=list) + + +@dataclass +class DocsSection: + # Width of the wrapped doc TEXT, excluding comment syntax (templates add + # their own prefixes). 97 + a 3-character prefix is the 100-column house + # style. 
+ wrap: int = 97 + + +@dataclass +class Config: + path: Path = Path(".") # the config file itself, resolved + xsd: Path | None = None # the MusicXML XSD this target generates from + output_dir: Path | None = None # where generated code lands, resolved + sounds_xml: Path | None = None # companion sounds file to fold in, or None + target: TargetSection = field(default_factory=TargetSection) + vars: dict[str, str] = field(default_factory=dict) # freeform, for templates + naming: NamingSection = field(default_factory=NamingSection) + reserved: ReservedSection = field(default_factory=ReservedSection) + types: dict[str, str] = field(default_factory=dict) # primitive overrides + docs: DocsSection = field(default_factory=DocsSection) + renames: Renames = field(default_factory=Renames) + render: RenderSection | None = None # presence selects the press pipeline + + +def load(config_path) -> Config: + """Parse config.toml. Paths inside it are interpreted relative to the + config file's own directory, so a target's config stays self-contained.""" + path = Path(config_path).resolve() + if not path.exists(): + raise FileNotFoundError(f"config not found: {path}") + with open(path, "rb") as f: + data = tomllib.load(f) + base = path.parent + _check_keys( + data, + {"input", "output", "sounds", "target", "naming", "reserved", "types", + "docs", "rename", "vars", "render"}, + "top level", + ) + _check_keys(data.get("input", {}), {"xsd"}, "input") + _check_keys(data.get("output", {}), {"dir"}, "output") + _check_keys(data.get("sounds", {}), {"xml"}, "sounds") + + # A shared naming base (design: [naming] extends) contributes [naming] + # keys and [rename.*] entries; the target's own win on any conflict. + data = _apply_extends(data, base) + + # Each target pins its own MusicXML version: the schema it generates from + # is part of the target's identity, not a global default. 
+ xsd = None + inp = data.get("input", {}) + if inp.get("xsd"): + xsd = (base / inp["xsd"]).resolve() + if not xsd.exists(): + raise FileNotFoundError(f"xsd not found: {xsd}") + + output_dir = None + out = data.get("output", {}) + if out.get("dir"): + output_dir = (base / out["dir"]).resolve() + + # Companion sounds patch is on iff [sounds] xml names a file (see + # gen.ir.sounds). Resolve and existence-check it here so a bad path fails + # at config load, not deep in the lowering. + sounds_xml = None + sounds = data.get("sounds", {}) + if sounds.get("xml"): + sounds_xml = (base / sounds["xml"]).resolve() + if not sounds_xml.exists(): + raise FileNotFoundError(f"sounds file not found: {sounds_xml}") + + return Config( + path=path, + xsd=xsd, + output_dir=output_dir, + sounds_xml=sounds_xml, + target=_target(data.get("target", {})), + vars=_vars(data.get("vars", {})), + naming=_naming(data.get("naming", {})), + reserved=_reserved(data.get("reserved", {})), + types=_types(data.get("types", {})), + docs=_docs(data.get("docs", {})), + renames=_renames(data.get("rename", {})), + render=_render(data["render"], base) if "render" in data else None, + ) + + +# --------------------------------------------------------------------------- # +# Section parsers. Each takes the raw TOML table and fails loud on unknown +# keys, so a typo is a config error, not a silently ignored line. 
+# --------------------------------------------------------------------------- # + + +def _check_keys(table: dict, allowed: set[str], where: str) -> None: + unknown = set(table) - allowed + if unknown: + raise ConfigError(f"unknown key(s) in [{where}]: {', '.join(sorted(unknown))}") + + +def _target(t: dict) -> TargetSection: + _check_keys(t, {"symbol-prefix", "inheritance", "variant-scope"}, "target") + section = TargetSection( + symbol_prefix=t.get("symbol-prefix", ""), + inheritance=bool(t.get("inheritance", True)), + variant_scope=t.get("variant-scope", "bare"), + ) + if section.variant_scope not in ("bare", "composed"): + raise ConfigError( + f"[target] variant-scope = {section.variant_scope!r}: expected bare or composed" + ) + return section + + +def _vars(t: dict) -> dict[str, str]: + """Freeform key-values passed verbatim to templates ({{target.vars.x}}). + The generator never interprets them; this is where anything that cannot + be defined without naming a language belongs.""" + for k, v in t.items(): + if not isinstance(v, str): + raise ConfigError(f"[vars] {k} must be a string") + return dict(t) + + +def _string_list(value, where: str) -> tuple[str, ...]: + """A TOML array of strings. 
A bare string is rejected rather than being + silently exploded into characters.""" + if not isinstance(value, list) or not all(isinstance(x, str) for x in value): + raise ConfigError(f"[{where}] must be an array of strings") + return tuple(value) + + +def _naming(t: dict) -> NamingSection: + _check_keys( + t, + { + "extends", "acronyms", "type-convention", "field-convention", + "variant-convention", "field-prefix", "pluralize-vectors", + }, + "naming", + ) + section = NamingSection( + acronyms=_string_list(t["acronyms"], "naming.acronyms") if "acronyms" in t else None, + type_convention=t.get("type-convention", "pascal"), + field_convention=t.get("field-convention", "snake"), + variant_convention=t.get("variant-convention", "pascal"), + field_prefix=t.get("field-prefix", ""), + pluralize_vectors=bool(t.get("pluralize-vectors", False)), + ) + for key in ("type_convention", "field_convention", "variant_convention"): + value = getattr(section, key) + if value not in CONVENTIONS: + raise ConfigError( + f"[naming] {key.replace('_', '-')} = {value!r} is not a " + f"registered convention ({', '.join(sorted(CONVENTIONS))})" + ) + return section + + +def _reserved(t: dict) -> ReservedSection: + _check_keys(t, {"words", "members", "type-suffixes", "invalid-prefix"}, "reserved") + return ReservedSection( + words=_string_list(t["words"], "reserved.words") if "words" in t else (), + members=_string_list(t["members"], "reserved.members") if "members" in t else (), + type_suffixes=_string_list(t["type-suffixes"], "reserved.type-suffixes") + if "type-suffixes" in t + else (), + invalid_prefix=t.get("invalid-prefix", "_"), + ) + + +def _types(t: dict) -> dict[str, str]: + for k, v in t.items(): + if not isinstance(v, str): + raise ConfigError(f"[types] {k} must be a string target type") + return dict(t) + + +def _render_entry(t: dict, where: str, once: bool) -> RenderEntry: + allowed = {"template", "output"} | (set() if once else {"strategies", "types"}) + _check_keys(t, allowed, 
where) + for key in ("template", "output"): + if not isinstance(t.get(key), str) or not t[key]: + raise ConfigError(f"[{where}] requires a non-empty '{key}' string") + strategies: tuple[str, ...] = () + types: tuple[str, ...] = () + if not once: + strategies = tuple(_string_list(t.get("strategies", []), f"{where}.strategies")) + types = tuple(_string_list(t.get("types", []), f"{where}.types")) + if bool(strategies) == bool(types): + raise ConfigError( + f"[{where}] requires exactly one of 'strategies' or 'types'" + ) + return RenderEntry( + template=t["template"], output=t["output"], strategies=strategies, types=types + ) + + +def _render(t: dict, base: Path) -> RenderSection: + _check_keys(t, {"dir", "format", "type", "once"}, "render") + if not isinstance(t.get("dir"), str) or not t["dir"]: + raise ConfigError("[render] requires a 'dir' (the templates directory)") + directory = (base / t["dir"]).resolve() + if not directory.is_dir(): + raise FileNotFoundError(f"templates directory not found: {directory}") + section = RenderSection( + dir=directory, + format=tuple(_string_list(t["format"], "render.format")) if "format" in t else (), + types=[ + _render_entry(e, "render.type", once=False) for e in t.get("type", []) + ], + once=[ + _render_entry(e, "render.once", once=True) for e in t.get("once", []) + ], + ) + if not section.types and not section.once: + raise ConfigError("[render] declares no template entries") + return section + + +def _docs(t: dict) -> DocsSection: + _check_keys(t, {"wrap"}, "docs") + return DocsSection(wrap=int(t.get("wrap", 97))) + + +# --------------------------------------------------------------------------- # +# Renames (design 6.2/6.3): two tiers (fundamental + per-convention), four +# addressable kinds, with enum values scoped to their enum and attributes +# optionally scoped to their owner type. 
+# --------------------------------------------------------------------------- # + + +def _entry(value, where: str) -> RenameEntry: + """A rename value is either the string shorthand (sugar for a table with + only `fundamental`) or a table of fundamental/convention keys.""" + if isinstance(value, str): + return RenameEntry(fundamental=value) + if isinstance(value, dict): + unknown = set(value) - _ENTRY_KEYS + if unknown: + raise ConfigError( + f"unknown key(s) in [{where}]: {', '.join(sorted(unknown))} " + f"(expected fundamental or a convention: {', '.join(sorted(CONVENTIONS))})" + ) + bad = [k for k, v in value.items() if not isinstance(v, str)] + if bad: + raise ConfigError(f"[{where}] {bad[0]} must be a string") + if not value: + raise ConfigError(f"[{where}] is empty: set fundamental or a convention") + return RenameEntry( + fundamental=value.get("fundamental"), + cased={k: v for k, v in value.items() if k != "fundamental"}, + ) + raise ConfigError(f"[{where}] must be a string or a table") + + +def _is_entry_table(value) -> bool: + return isinstance(value, dict) and set(value) <= _ENTRY_KEYS + + +def _renames(t: dict) -> Renames: + unknown = set(t) - set(_RENAME_KINDS) - {"group", "attribute-group"} + if unknown: + raise ConfigError(f"unknown rename kind(s): {', '.join(sorted(unknown))}") + for reserved_kind in ("group", "attribute-group"): + if reserved_kind in t: + raise ConfigError( + f"rename kind '{reserved_kind}' is reserved for targets that emit " + f"shared fragments; no current target does" + ) + + for kind in _RENAME_KINDS: + if kind in t and not isinstance(t[kind], dict): + raise ConfigError(f"[rename.{kind}] must be a table") + + r = Renames() + for wire, value in t.get("type", {}).items(): + r.types[wire] = _entry(value, f"rename.type.{wire}") + for wire, value in t.get("element", {}).items(): + r.elements[wire] = _entry(value, f"rename.element.{wire}") + + # [rename.attribute] mixes global entries (string, or a table of entry + # keys) with owner 
scopes (a table keyed by attribute names). The key sets + # are disjoint: entry keys are fundamental/conventions, never wire names. + for key, value in t.get("attribute", {}).items(): + if isinstance(value, str) or _is_entry_table(value): + r.attributes[key] = _entry(value, f"rename.attribute.{key}") + elif isinstance(value, dict): + for attr, sub in value.items(): + r.scoped_attributes[(key, attr)] = _entry( + sub, f"rename.attribute.{key}.{attr}" + ) + else: + raise ConfigError(f"[rename.attribute] {key} must be a string or a table") + + for enum, table in t.get("enum-value", {}).items(): + if not isinstance(table, dict): + raise ConfigError(f"[rename.enum-value.{enum}] must be a table of values") + for wire, value in table.items(): + r.enum_values[(enum, wire)] = _entry(value, f"rename.enum-value.{enum}.{wire}") + return r + + +# --------------------------------------------------------------------------- # +# Shared naming base ([naming] extends) +# --------------------------------------------------------------------------- # + + +def _apply_extends(data: dict, base_dir: Path) -> dict: + """Merge a shared base file under the target's config: the base + contributes [naming] keys and [rename] entries; the target's own win per + key/entry. 
Anything else in the base is an error, as is chaining bases.""" + extends = data.get("naming", {}).get("extends") + if not extends: + return data + base_path = (base_dir / extends).resolve() + if not base_path.exists(): + raise FileNotFoundError(f"naming base not found: {base_path}") + with open(base_path, "rb") as f: + shared = tomllib.load(f) + _check_keys(shared, {"naming", "rename"}, f"naming base {base_path.name}") + if "extends" in shared.get("naming", {}): + raise ConfigError(f"naming base {base_path.name} may not chain to another base") + + merged = dict(data) + naming = dict(shared.get("naming", {})) + naming.update(data.get("naming", {})) + naming.pop("extends", None) + merged["naming"] = naming + + rename: dict = {} + for kind in set(shared.get("rename", {})) | set(data.get("rename", {})): + base_table = shared.get("rename", {}).get(kind, {}) + own_table = data.get("rename", {}).get(kind, {}) + table: dict = {} + for key in list(base_table) + [k for k in own_table if k not in base_table]: + b, o = base_table.get(key), own_table.get(key) + b_scope = isinstance(b, dict) and not _is_entry_table(b) + o_scope = isinstance(o, dict) and not _is_entry_table(o) + if b is not None and o is not None and b_scope != o_scope: + # One side addresses a scope table, the other a single entry: + # a silent wholesale replacement would quietly drop the + # base's renames, so the disagreement is an error. + raise ConfigError( + f"[rename.{kind}.{key}]: the target and its naming base " + f"disagree on whether this is a scope or an entry" + ) + if b_scope and o_scope: + # A nested scope (an enum's value table, an owner's attribute + # table): merge per inner entry, target winning. 
+ table[key] = {**b, **o} + else: + table[key] = o if key in own_table else b + rename[kind] = table + if rename: + merged["rename"] = rename + return merged diff --git a/gen/cpp/config.toml b/gen/cpp/config.toml new file mode 100644 index 000000000..052227718 --- /dev/null +++ b/gen/cpp/config.toml @@ -0,0 +1,65 @@ +# C++ generator target configuration. +# The generator reads this to know which schema to read and where/how to emit. +# C++ is the primary target: MusicXML 4.0, with the sounds.xml companion. +# +# The generator is language agnostic (the cardinal rule): everything C++-shaped +# about this target lives here and in its templates (when they are written), +# never in generator code. + +[input] +# MusicXML XSD this target generates from, relative to this config file. +xsd = "../../docs/musicxml-4.0-ed15c23.xsd" + +[output] +# Directory for generated source files, relative to this config file. +dir = "../../src/private/mx/core" + +[sounds] +# Companion sounds file (vendored under docs/, version-matched to the XSD). +# Folds the standard instrument-sound identifiers into the IR as a sound-id +# enum unioned with an open string. Comment out to disable. +xml = "../../docs/sounds-4.0-ed15c23.xml" + +[target] +# enum class scopes its constants inside the type, so they stay bare. +variant-scope = "bare" + +[vars] +# Freeform, passed verbatim to this target's templates (when they exist). +namespace = "mx::core" + +[naming] +extends = "../naming.base.toml" # schema-forced renames shared by all targets + +[types] +# IR primitive -> C++ spelling. Decimal is mx::core's own wrapper class: a +# target decision, which is exactly why it lives here. +string = "std::string" +token = "std::string" +nmtoken = "std::string" +date = "std::string" +decimal = "Decimal" +integer = "int" +positive_integer = "int" +non_negative_integer = "int" + +[reserved] +# C++ keywords (including alternative tokens) a generated identifier must not be. 
+words = [ + "alignas", "alignof", "and", "and_eq", "asm", "auto", "bitand", + "bitor", "bool", "break", "case", "catch", "char", "char8_t", + "char16_t", "char32_t", "class", "compl", "concept", "const", + "consteval", "constexpr", "constinit", "const_cast", "continue", + "co_await", "co_return", "co_yield", "decltype", "default", "delete", + "do", "double", "dynamic_cast", "else", "enum", "explicit", "export", + "extern", "false", "float", "for", "friend", "goto", "if", "inline", + "int", "long", "mutable", "namespace", "new", "noexcept", "not", + "not_eq", "nullptr", "operator", "or", "or_eq", "private", + "protected", "public", "register", "reinterpret_cast", "requires", + "return", "short", "signed", "sizeof", "static", "static_assert", + "static_cast", "struct", "switch", "template", "this", "thread_local", + "throw", "true", "try", "typedef", "typeid", "typename", "union", + "unsigned", "using", "virtual", "void", "volatile", "wchar_t", + "while", "xor", "xor_eq", +] + diff --git a/gen/ir/__init__.py b/gen/ir/__init__.py new file mode 100644 index 000000000..bf6796e55 --- /dev/null +++ b/gen/ir/__init__.py @@ -0,0 +1,7 @@ +"""Intermediate representation for the mx generator.""" + +from gen.ir.build import build_ir +from gen.ir.model import Ir +from gen.ir.resolve import Resolver + +__all__ = ["Ir", "Resolver", "build_ir"] diff --git a/gen/ir/build.py b/gen/ir/build.py new file mode 100644 index 000000000..80caec677 --- /dev/null +++ b/gen/ir/build.py @@ -0,0 +1,387 @@ +"""Lower a parsed XSD schema (gen.xsd.model) into the IR (gen.ir.model).""" + +from __future__ import annotations + +from collections import Counter + +from gen.xsd import model as xsd +from gen.xsd.analyze import content_particle, reachable_types +from gen.ir import model as ir +from gen.ir.resolve import Resolver + +# Map XSD builtin types to canonical IR primitive names. 
+_XS_PRIMITIVE = { + "xs:string": "string", + "xs:token": "token", + "xs:NMTOKEN": "nmtoken", + "xs:decimal": "decimal", + "xs:integer": "integer", + "xs:int": "integer", + "xs:positiveInteger": "positive_integer", + "xs:nonNegativeInteger": "non_negative_integer", + "xs:date": "date", + "xs:anyURI": "string", + "xs:language": "token", + # The identity builtins are NCName-derived (whitespace-collapsed tokens); + # no target gives them identity semantics, so they canonicalize to token + # rather than leaking ID/IDREF as accidental extra primitives. + "xs:ID": "token", + "xs:IDREF": "token", +} + +# The 10 attribute refs into the imported xml/xlink schemas, resolved to the +# primitive the emitter should use. This is the only place the IR reaches +# outside the main schema. +_EXTERNAL_ATTR = { + "xml:lang": "token", + "xml:space": "token", + "xlink:href": "string", + "xlink:type": "token", + "xlink:role": "string", + "xlink:title": "string", + "xlink:show": "token", + "xlink:actuate": "token", +} + +_NUMERIC = {"decimal", "integer", "positive_integer", "non_negative_integer"} + +# The closed set of canonical IR primitives (the values of the builtins map): +# everything a Ref with category "primitive" can name, and the keys a target +# type map may override. 
+PRIMITIVES = frozenset(_XS_PRIMITIVE.values()) | frozenset(_EXTERNAL_ATTR.values()) + + +def _primitive(name: str) -> str: + if name in _XS_PRIMITIVE: + return _XS_PRIMITIVE[name] + if name in _EXTERNAL_ATTR: + return _EXTERNAL_ATTR[name] + return name.split(":")[-1] + + +def _occ(value: int) -> int | str: + return ir.UNBOUNDED if value == xsd.UNBOUNDED else value + + +def _cardinality(min_occurs: int, max_occurs: int) -> str: + if max_occurs == xsd.UNBOUNDED or max_occurs > 1: + return "vector" + return "optional" if min_occurs == 0 else "required" + + +def build_ir(schema: xsd.Schema, source: str) -> ir.Ir: + return _Builder(schema, source).build() + + +class _Builder: + def __init__(self, schema: xsd.Schema, source: str): + self.schema = schema + self.source = source + self.anon_names: dict[int, str] = {} # id(ComplexType) -> synthesized name + self.synth: list[tuple[str, xsd.ComplexType]] = [] + + # ----- top level ------------------------------------------------------- # + + def build(self) -> ir.Ir: + self._hoist_anonymous() + reachable = reachable_types(self.schema) + + value_types = [ + self._value_type(st) + for name, st in self.schema.simple_types.items() + if name in reachable + ] + value_types = self._topo_sort_values(value_types) + groups = [ + ir.Group(name, self._particle(g.particle), g.doc) + for name, g in self.schema.groups.items() + ] + attribute_groups = [ + ir.AttributeGroup( + name, + [self._attr(a) for a in ag.attributes], + [r.ref for r in ag.group_refs], + ag.doc, + ) + for name, ag in self.schema.attribute_groups.items() + ] + + complex_types = [ + self._complex_type(name, ct) + for name, ct in self.schema.complex_types.items() + if name in reachable + ] + complex_types += [self._complex_type(name, ct) for name, ct in self.synth] + + resolver = Resolver(groups, attribute_groups, complex_types) + for ct in complex_types: + ct.deps = sorted(resolver.deps(ct)) + complex_types = self._topo_sort(complex_types) + + all_named = 
set(self.schema.simple_types) | set(self.schema.complex_types) + dropped = sorted(all_named - reachable) + + return ir.Ir( + source=self.source, + builtins={**_XS_PRIMITIVE, **_EXTERNAL_ATTR}, + value_types=value_types, + groups=groups, + attribute_groups=attribute_groups, + complex_types=complex_types, + roots=[ir.Root(top.name, top.name) for top in self.schema.elements], + dropped_dead=dropped, + stats=self._stats(value_types, complex_types, dropped), + ) + + # ----- anonymous type hoisting ----------------------------------------- # + + def _hoist_anonymous(self) -> None: + used = set(self.schema.complex_types) | set(self.schema.simple_types) + # Document roots: the root element name is free; descendants are + # qualified by the partwise/timewise hierarchy to keep part/measure + # (which differ between the two) distinct. + for top in self.schema.elements: + if top.inline_type: + qualifier = top.name.replace("score-", "") + self._hoist(top.inline_type, top.name, qualifier, used) + # Anonymous types nested inside named types (e.g. directive). 
+ for ct in self.schema.complex_types.values(): + self._scan(content_particle(ct), "", used) + for g in self.schema.groups.values(): + self._scan(g.particle, "", used) + + def _hoist(self, ct: xsd.ComplexType, name: str, qualifier: str, used: set) -> None: + self.anon_names[id(ct)] = name + self.synth.append((name, ct)) + used.add(name) + self._scan(content_particle(ct), qualifier, used) + + def _scan(self, particle, qualifier: str, used: set) -> None: + for ep in _iter_elements(particle): + if isinstance(ep.inline_type, xsd.ComplexType): + candidate = f"{qualifier}-{ep.name}" if qualifier else ep.name + if candidate in used: + candidate = f"{qualifier}-{ep.name}" if qualifier else f"{ep.name}-type" + self._hoist(ep.inline_type, candidate, qualifier, used) + + # ----- value types ----------------------------------------------------- # + + def _value_type(self, st: xsd.SimpleType) -> ir.ValueType: + if isinstance(st.content, xsd.Union): + return self._union(st) + if isinstance(st.content, xsd.ListType): + # MusicXML uses no xs:list; represent defensively as a token string. + return ir.StringType(st.name, "token", doc=st.doc) + primitive, facets = self._resolve_restriction(st.name) + if facets.enumerations: + return ir.EnumType( + st.name, primitive, [e.value for e in facets.enumerations], st.doc + ) + if primitive in _NUMERIC: + return ir.NumberType( + st.name, + primitive, + facets.min_inclusive, + facets.max_inclusive, + facets.min_exclusive, + facets.max_exclusive, + st.doc, + ) + return ir.StringType( + st.name, + primitive, + list(facets.patterns), + facets.min_length, + facets.max_length, + facets.length, + st.doc, + ) + + def _resolve_restriction(self, type_name: str) -> tuple[str, xsd.Facets]: + """Collapse a restriction chain to (primitive, merged facets). 
Child + facets override inherited ones; patterns accumulate.""" + st = self.schema.simple_types.get(type_name) + if st is None or not isinstance(st.content, xsd.Restriction): + return _primitive(type_name), xsd.Facets() + base = st.content.base + if base in self.schema.simple_types: + primitive, merged = self._resolve_restriction(base) + else: + primitive, merged = _primitive(base), xsd.Facets() + _merge_facets(merged, st.content.facets) + return primitive, merged + + def _union(self, st: xsd.SimpleType) -> ir.UnionType: + members: list[ir.UnionMember] = [] + for m in st.content.member_types: + if m in self.schema.simple_types: + members.append(ir.UnionMember(ir.Ref(m, "value"))) + else: + members.append(ir.UnionMember(ir.Ref(_primitive(m), "primitive"))) + for inline in st.content.inline_members: + if isinstance(inline.content, xsd.Restriction) and inline.content.facets.enumerations: + members.append( + ir.UnionMember( + literals=[e.value for e in inline.content.facets.enumerations] + ) + ) + return ir.UnionType(st.name, members, st.doc) + + # ----- attributes ------------------------------------------------------ # + + def _attr(self, a: xsd.Attribute) -> ir.Attr: + if a.ref: + ref = ir.Ref(_primitive(a.ref), "primitive") + name = a.ref + else: + ref = self._type_ref(a.type) if a.type else ir.Ref("string", "primitive") + name = a.name or "" + return ir.Attr(name, ref, a.use == "required", a.default, a.fixed, a.doc) + + # ----- complex types --------------------------------------------------- # + + def _complex_type(self, name: str, ct: xsd.ComplexType) -> ir.ComplexType: + c = ct.content + attrs = [self._attr(a) for a in c.attributes] + agrefs = [r.ref for r in c.attribute_group_refs] + + if isinstance(c, xsd.SimpleContent): + return ir.ComplexType( + name, "value", attrs, agrefs, value_type=self._type_ref(c.base), doc=ct.doc + ) + if isinstance(c, xsd.ComplexContent): + content = self._particle(c.particle) if c.particle else None + return ir.ComplexType( + name, 
"derived", attrs, agrefs, base=c.base, content=content, doc=ct.doc + ) + # ImplicitContent + if c.particle is not None: + return ir.ComplexType( + name, "composite", attrs, agrefs, content=self._particle(c.particle), doc=ct.doc + ) + presence = not attrs and not agrefs + return ir.ComplexType(name, "empty", attrs, agrefs, presence_only=presence, doc=ct.doc) + + # ----- particles ------------------------------------------------------- # + + def _particle(self, p) -> ir.Particle: + if isinstance(p, xsd.Sequence): + return ir.Sequence([self._particle(i) for i in p.items], p.min_occurs, _occ(p.max_occurs)) + if isinstance(p, xsd.Choice): + return ir.Choice([self._particle(i) for i in p.items], p.min_occurs, _occ(p.max_occurs)) + if isinstance(p, xsd.GroupRef): + return ir.GroupRef(p.ref, p.min_occurs, _occ(p.max_occurs)) + if isinstance(p, xsd.ElementParticle): + return ir.Element( + p.name, + self._element_ref(p), + _cardinality(p.min_occurs, p.max_occurs), + p.min_occurs, + _occ(p.max_occurs), + p.doc, + ) + raise ValueError(f"unexpected particle: {type(p).__name__}") + + def _element_ref(self, ep: xsd.ElementParticle) -> ir.Ref: + if isinstance(ep.inline_type, xsd.ComplexType): + return ir.Ref(self.anon_names[id(ep.inline_type)], "complex") + if ep.type: + return self._type_ref(ep.type) + return ir.Ref("string", "primitive") + + def _type_ref(self, type_name: str) -> ir.Ref: + if type_name in self.schema.complex_types: + return ir.Ref(type_name, "complex") + if type_name in self.schema.simple_types: + return ir.Ref(type_name, "value") + if type_name.startswith(("xs:", "xml:", "xlink:")): + return ir.Ref(_primitive(type_name), "primitive") + return ir.Ref(type_name, "complex") + + # ----- dependency ordering --------------------------------------------- # + + def _topo_sort(self, types: list[ir.ComplexType]) -> list[ir.ComplexType]: + by_name = {t.name: t for t in types} + ordered: list[ir.ComplexType] = [] + state: dict[str, int] = {} # 0 visiting, 1 done + + def 
visit(name: str) -> None: + if name in state or name not in by_name: + return + state[name] = 0 + for dep in by_name[name].deps: + visit(dep) + state[name] = 1 + ordered.append(by_name[name]) + + for name in sorted(by_name): + visit(name) + return ordered + + def _topo_sort_values(self, values: list[ir.ValueType]) -> list[ir.ValueType]: + """Order value types deps-first. Only unions reference other value + types (their members); every other kind resolves to a primitive.""" + by_name = {v.name: v for v in values} + ordered: list[ir.ValueType] = [] + state: dict[str, int] = {} + + def deps(v: ir.ValueType) -> list[str]: + if isinstance(v, ir.UnionType): + return [ + m.ref.name + for m in v.members + if m.ref and m.ref.category == "value" and m.ref.name in by_name + ] + return [] + + def visit(name: str) -> None: + if state.get(name) == 1 or name not in by_name: + return + state[name] = 1 + for dep in sorted(deps(by_name[name])): + visit(dep) + ordered.append(by_name[name]) + + for name in sorted(by_name): + visit(name) + return ordered + + # ----- stats ----------------------------------------------------------- # + + def _stats(self, value_types, complex_types, dropped) -> dict: + return { + "value_types": len(value_types), + "value_kinds": dict(Counter(v.kind for v in value_types)), + "complex_types": len(complex_types), + "complex_kinds": dict(Counter(c.kind for c in complex_types)), + "groups": len(self.schema.groups), + "attribute_groups": len(self.schema.attribute_groups), + "synthesized_types": len(self.synth), + "dropped_dead_types": len(dropped), + } + + +# --------------------------------------------------------------------------- # +# Helpers +# --------------------------------------------------------------------------- # + + +def _iter_elements(particle): + """Yield element particles directly contained in a particle (not inside + group refs, which are scanned at the group definition).""" + if isinstance(particle, (xsd.Sequence, xsd.Choice)): + for 
item in particle.items: + yield from _iter_elements(item) + elif isinstance(particle, xsd.ElementParticle): + yield particle + + +def _merge_facets(into: xsd.Facets, src: xsd.Facets) -> None: + if src.enumerations: + into.enumerations = src.enumerations + into.patterns = into.patterns + src.patterns + for f in ("min_inclusive", "max_inclusive", "min_exclusive", "max_exclusive", + "min_length", "max_length", "length"): + v = getattr(src, f) + if v is not None: + setattr(into, f, v) diff --git a/gen/ir/dump.py b/gen/ir/dump.py new file mode 100644 index 000000000..d16c1c40b --- /dev/null +++ b/gen/ir/dump.py @@ -0,0 +1,56 @@ +"""Serialize the IR to JSON for inspection.""" + +from __future__ import annotations + +import json +from dataclasses import fields, is_dataclass + +# Discriminator fields are emitted first so each object announces what it is. +_FIRST = ("kind", "node", "name", "element") + + +def to_jsonable(obj): + """Convert IR dataclasses to plain JSON-able data, dropping None and empty + collections to keep the output readable.""" + if is_dataclass(obj): + names = [f.name for f in fields(obj)] + order = [n for n in _FIRST if n in names] + [n for n in names if n not in _FIRST] + result = {} + for name in order: + value = getattr(obj, name) + if value is None or (isinstance(value, (list, dict)) and not value): + continue + result[name] = to_jsonable(value) + return result + if isinstance(obj, list): + return [to_jsonable(x) for x in obj] + if isinstance(obj, dict): + return {k: to_jsonable(v) for k, v in obj.items()} + return obj + + +def to_json(obj) -> str: + return json.dumps(to_jsonable(obj), indent=2) + + +def resolved_view(resolver, ct) -> dict: + """A complex type as an emitter consumes it: attribute groups flattened into + one list, model-group refs spliced into the content. 
The collapsed form the + Resolver computes, shaped for inspection via `ir --resolve`.""" + view: dict = {"kind": ct.kind, "name": ct.name} + attrs = resolver.attributes(ct) + if attrs: + view["attributes"] = attrs + if ct.kind == "derived": + view["base"] = ct.base + view["all_attributes"] = resolver.all_attributes(ct) + if ct.value_type: + view["value_type"] = ct.value_type + content = resolver.content(ct) + if content is not None: + view["content"] = content + if ct.presence_only: + view["presence_only"] = True + if ct.doc: + view["doc"] = ct.doc + return view diff --git a/gen/ir/model.py b/gen/ir/model.py new file mode 100644 index 000000000..f57e39eb4 --- /dev/null +++ b/gen/ir/model.py @@ -0,0 +1,212 @@ +"""The intermediate representation: a resolved, language-agnostic model. + +The raw XSD model (gen.xsd.model) mirrors the schema 1:1 and still speaks in +XSD terms (restriction chains, attribute-group refs, anonymous inline types). +The IR is what the language emitters consume instead. It is a pure function of +the XSD with every cross-reference resolved: + + - all types are named (anonymous types are hoisted, with context-qualified + names for the partwise/timewise scaffolding); + - simple-type restriction chains are collapsed to one primitive plus merged + facets; + - element occurrence is normalized to a cardinality (required/optional/vector); + - dead types are dropped and complex types are emitted in dependency order. + +The IR deliberately *preserves* named structure (aliases keep their names, the +five inheritance edges stay as derivations, model groups and attribute groups +remain addressable) so emitters can choose how much to collapse. See the +analyze module's recommendations for the reasoning. +""" + +from __future__ import annotations + +from dataclasses import dataclass, field + +# Canonical maxOccurs="unbounded" marker in normalized particles. Kept as a +# string so it is self-describing in the JSON dump. 
+UNBOUNDED = "unbounded" + + +# --------------------------------------------------------------------------- # +# Type references +# --------------------------------------------------------------------------- # + + +@dataclass +class Ref: + """A reference to another type by name, tagged with where to resolve it.""" + + name: str + category: str # "complex" | "value" | "primitive" + + +# --------------------------------------------------------------------------- # +# Value types (lowered from simpleType and simpleContent bases) +# --------------------------------------------------------------------------- # + + +@dataclass +class EnumType: + name: str + base: str # primitive the tokens are drawn from (token/string) + values: list[str] + doc: str | None = None + kind: str = "enum" + + +@dataclass +class NumberType: + name: str + base: str # decimal/integer/positive_integer/non_negative_integer + min_inclusive: str | None = None + max_inclusive: str | None = None + min_exclusive: str | None = None + max_exclusive: str | None = None + doc: str | None = None + kind: str = "number" + + +@dataclass +class StringType: + name: str + base: str # string/token/nmtoken/date; a plain alias has no constraints + patterns: list[str] = field(default_factory=list) + min_length: str | None = None + max_length: str | None = None + length: str | None = None + doc: str | None = None + kind: str = "string" + + +@dataclass +class UnionMember: + # Exactly one is set: a Ref to another type, or inline enumeration literals. 
+ ref: Ref | None = None + literals: list[str] | None = None + + +@dataclass +class UnionType: + name: str + members: list[UnionMember] + doc: str | None = None + kind: str = "union" + + +ValueType = EnumType | NumberType | StringType | UnionType + + +# --------------------------------------------------------------------------- # +# Attributes +# --------------------------------------------------------------------------- # + + +@dataclass +class Attr: + name: str + type: Ref + required: bool = False + default: str | None = None + fixed: str | None = None + doc: str | None = None + + +@dataclass +class AttributeGroup: + name: str + attributes: list[Attr] = field(default_factory=list) + attribute_groups: list[str] = field(default_factory=list) # nested refs + doc: str | None = None + + +# --------------------------------------------------------------------------- # +# Content model (normalized particles) +# --------------------------------------------------------------------------- # + + +@dataclass +class Element: + name: str + type: Ref + card: str # "required" | "optional" | "vector" + min: int = 1 + max: int | str = 1 # int or UNBOUNDED + doc: str | None = None + node: str = "element" + + +@dataclass +class GroupRef: + name: str + min: int = 1 + max: int | str = 1 + node: str = "group" + + +@dataclass +class Sequence: + items: list + min: int = 1 + max: int | str = 1 + node: str = "sequence" + + +@dataclass +class Choice: + items: list + min: int = 1 + max: int | str = 1 + node: str = "choice" + + +Particle = Element | GroupRef | Sequence | Choice + + +@dataclass +class Group: + name: str + content: Particle + doc: str | None = None + + +# --------------------------------------------------------------------------- # +# Complex types +# --------------------------------------------------------------------------- # + + +@dataclass +class ComplexType: + name: str + kind: str # "value" | "composite" | "derived" | "empty" + attributes: list[Attr] = 
field(default_factory=list) + attribute_groups: list[str] = field(default_factory=list) + value_type: Ref | None = None # kind == "value" (text content type) + base: str | None = None # kind == "derived" (parent complex type) + content: Particle | None = None # composite/derived particle + presence_only: bool = False # empty element used as a boolean flag + deps: list[str] = field(default_factory=list) # complex types referenced + doc: str | None = None + + +# --------------------------------------------------------------------------- # +# Schema root +# --------------------------------------------------------------------------- # + + +@dataclass +class Root: + element: str + type: str + + +@dataclass +class Ir: + source: str + builtins: dict[str, str] + value_types: list[ValueType] + groups: list[Group] + attribute_groups: list[AttributeGroup] + complex_types: list[ComplexType] # dependency-ordered + roots: list[Root] + dropped_dead: list[str] + stats: dict diff --git a/gen/ir/resolve.py b/gen/ir/resolve.py new file mode 100644 index 000000000..674dc7ef1 --- /dev/null +++ b/gen/ir/resolve.py @@ -0,0 +1,223 @@ +"""Collapsed views over the named structure the IR preserves. + +The IR keeps the schema's reusable structure addressable: a complex type lists +its attribute groups by name and leaves group references in its content tree, so +an emitter that wants mixins or shared structs can mirror them. Most emitters +instead want the collapsed view -- the full ordered attribute list, the content +with groups spliced in. Producing it means expanding attribute-group and +model-group references, deduping, and guarding cycles. That is schema reasoning, +so it lives here, once, rather than re-derived in every target's templates. + +Resolver is a pure read over the IR; it never mutates it. It depends only on the +three reusable tables (groups, attribute groups, complex types), not the whole +Ir, so build can use it mid-construction to compute dependencies. 
+""" + +from __future__ import annotations + +from gen.ir import model as ir + + +class Resolver: + """Collapsed views over an IR's preserved named structure.""" + + def __init__( + self, + groups: list[ir.Group], + attribute_groups: list[ir.AttributeGroup], + complex_types: list[ir.ComplexType], + ): + self._groups = {g.name: g for g in groups} + self._agroups = {a.name: a for a in attribute_groups} + self._complex = {c.name: c for c in complex_types} + + @classmethod + def from_ir(cls, m: ir.Ir) -> "Resolver": + return cls(m.groups, m.attribute_groups, m.complex_types) + + # ----- attributes ------------------------------------------------------ # + + def attributes(self, ct: ir.ComplexType) -> list[ir.Attr]: + """ct's own attributes with its attribute groups expanded inline, in + declaration order, deduped by name (first wins). Excludes the base.""" + out: list[ir.Attr] = [] + self._add_attrs(ct.attributes, ct.attribute_groups, out, set(), set()) + return out + + def all_attributes(self, ct: ir.ComplexType) -> list[ir.Attr]: + """attributes() plus the base chain's attributes (base-most first), for + the flattened set an emitter needs when the target has no inheritance.""" + out: list[ir.Attr] = [] + seen: set[str] = set() + for c in self.base_chain(ct): + self._add_attrs(c.attributes, c.attribute_groups, out, seen, set()) + return out + + def _add_attrs(self, attrs, group_names, out, seen, seen_groups) -> None: + for a in attrs: + if a.name not in seen: + seen.add(a.name) + out.append(a) + for name in group_names: + ag = self._agroups.get(name) + if ag is not None and name not in seen_groups: + seen_groups.add(name) + self._add_attrs(ag.attributes, ag.attribute_groups, out, seen, seen_groups) + + # ----- content --------------------------------------------------------- # + + def content(self, ct: ir.ComplexType) -> ir.Particle | None: + """ct.content with every group reference spliced in: a self-contained + tree of elements/sequences/choices with no GroupRef 
nodes. Nesting and + all min/max bounds are preserved. None for types with no content.""" + return None if ct.content is None else self._inline(ct.content, ()) + + def _inline(self, p: ir.Particle, path: tuple[str, ...]) -> ir.Particle: + if isinstance(p, ir.Sequence): + return ir.Sequence([self._inline(i, path) for i in p.items], p.min, p.max) + if isinstance(p, ir.Choice): + return ir.Choice([self._inline(i, path) for i in p.items], p.min, p.max) + if isinstance(p, ir.GroupRef): + g = self._groups.get(p.name) + if g is None or p.name in path: # unknown or cyclic: leave the leaf + return p + body = self._inline(g.content, path + (p.name,)) + # The ref's occurrence wraps the group body's own. Drop the wrapper + # when the ref is exactly-one and so contributes nothing. + if p.min == 1 and p.max == 1: + return body + return ir.Sequence([body], p.min, p.max) + return p # Element: a leaf with an already-resolved Ref + + # ----- elements -------------------------------------------------------- # + + def elements(self, ct: ir.ComplexType) -> list[ir.Element]: + """Every element occurrence in ct's resolved content, in document order, + flattened across sequences/choices/groups. 
Drops the choice/sequence + grouping and keeps each occurrence's LOCAL cardinality; use content() + when the structure matters and flat_elements() for the effective, + deduplicated field view an emitter wants.""" + out: list[ir.Element] = [] + self._collect_elements(self.content(ct), out) + return out + + def _collect_elements(self, p, out) -> None: + if isinstance(p, (ir.Sequence, ir.Choice)): + for i in p.items: + self._collect_elements(i, out) + elif isinstance(p, ir.Element): + out.append(p) + + def flat_elements(self, ct: ir.ComplexType) -> list[tuple[ir.Element, str]]: + """Each distinct element name in ct's resolved content, in document + order of first occurrence, with its EFFECTIVE cardinality for a flat + one-field-per-name view: + + - an element under any repeated particle (max != 1) is a vector; + - an element under a choice, or under an optional wrapper, is at + most optional; + - only an element required along a spine of exactly-once sequences + stays required. + + Occurrences of the same name merge by co-occurrence analysis: if two + occurrences sit in different branches of one choice they are mutually + exclusive (at most one per instance: optional), but otherwise both + can appear in a single instance and the merged field must be a vector + (e.g. 
metronome's beat-unit, which appears on a branch's spine and + again inside that same branch's inner choice).""" + merged: dict[str, int] = {} # name -> index into out + paths: dict[str, list[tuple]] = {} # name -> choice paths seen + out: list[tuple[ir.Element, str]] = [] + rank = {"required": 0, "optional": 1, "vector": 2} + + def exclusive(a: tuple, b: tuple) -> bool: + """True when the two occurrence paths diverge at two different + branches of one choice node, so they can never co-occur.""" + i = 0 + while i < len(a) and i < len(b) and a[i] == b[i]: + i += 1 + return ( + i < len(a) + and i < len(b) + and a[i][0] == b[i][0] # same choice node + and a[i][1] != b[i][1] # different branches + ) + + def walk(node, forced: bool, repeated: bool, path: tuple) -> None: + if node is None: + return + if isinstance(node, ir.Element): + if repeated or node.card == "vector": + card = "vector" + elif forced and node.card == "required": + card = "required" + else: + card = "optional" + if node.name not in merged: + merged[node.name] = len(out) + paths[node.name] = [path] + out.append((node, card)) + return + i = merged[node.name] + prev_el, prev_card = out[i] + if all(exclusive(path, seen) for seen in paths[node.name]): + # Alternative branches: at most one occurs, but none is + # statically guaranteed. + card = max(card, prev_card, key=lambda c: rank[c]) + if card == "required": + card = "optional" + else: + # The occurrences can co-occur in one instance. 
+ card = "vector" + paths[node.name].append(path) + out[i] = (prev_el, card) + return + if node.max == 0: + return # a never-occurring particle contributes nothing + once = node.min >= 1 and node.max == 1 + again = repeated or node.max != 1 + if isinstance(node, ir.Sequence): + for item in node.items: + walk(item, forced and once, again, path) + elif isinstance(node, ir.Choice): + for branch, item in enumerate(node.items): + walk(item, False, again, path + ((id(node), branch),)) + # GroupRef leaves cannot appear: content() spliced them. + + walk(self.content(ct), True, False, ()) + return out + + def all_flat_elements(self, ct: ir.ComplexType) -> list[tuple[ir.Element, str]]: + """flat_elements() merged across the base chain (base-most first, + first occurrence of a name wins), mirroring all_attributes, for the + flattened view a target without inheritance emits.""" + out: list[tuple[ir.Element, str]] = [] + seen: set[str] = set() + for c in self.base_chain(ct): + for element, card in self.flat_elements(c): + if element.name not in seen: + seen.add(element.name) + out.append((element, card)) + return out + + # ----- derivation ------------------------------------------------------ # + + def base_chain(self, ct: ir.ComplexType) -> list[ir.ComplexType]: + """ct's derivation chain, base-most first, ending with ct itself.""" + chain: list[ir.ComplexType] = [] + cur: ir.ComplexType | None = ct + while cur is not None: + chain.append(cur) + cur = self._complex.get(cur.base) if cur.base else None + chain.reverse() + return chain + + # ----- dependencies ---------------------------------------------------- # + + def deps(self, ct: ir.ComplexType) -> set[str]: + """Complex types ct structurally depends on: its child element types + (groups resolved) plus its base. 
Drives the topological emit order.""" + d = {e.type.name for e in self.elements(ct) if e.type.category == "complex"} + if ct.base: + d.add(ct.base) + return d diff --git a/gen/ir/sounds.py b/gen/ir/sounds.py new file mode 100644 index 000000000..8b3a32bb4 --- /dev/null +++ b/gen/ir/sounds.py @@ -0,0 +1,101 @@ +"""Companion patch: fold sounds.xml into the IR as an open sound enum. + +The MusicXML XSD types the instrument-sound element as a bare xs:string. The +standard timbre identifiers it expects ("brass.alphorn", ...) live only in the +separately versioned sounds.xml companion file, not the schema. This patch +reads that file and rewrites the IR so instrument-sound resolves to a sound-id +enumeration unioned with an open string: the standard identifiers become typed +values, while any other string stays valid exactly as the schema allows. + +This is the one place the IR depends on an input beyond the XSD, and it runs +only when a target's config names a sounds file (see gen.config). The result +introduces no new IR shape -- it is an ordinary enum plus an ordinary union, +the same shape as font-size (the css-font-size enum unioned with decimal). +""" + +from __future__ import annotations + +import xml.etree.ElementTree as ET +from pathlib import Path + +from gen.ir import model as ir + +# The element keeps its name; its new type takes the element's name (the +# MusicXML convention, e.g. element note has type note), and the enumeration of +# identifiers gets a sub-name -- mirroring font-size over css-font-size. +ELEMENT = "instrument-sound" +UNION = "instrument-sound" +ENUM = "sound-id" + + +def read_sound_ids(path) -> list[str]: + """The id of every sound element in a sounds.xml companion file, in document + order.
The file's DOCTYPE points at an external DTD; ElementTree ignores it, + so this stays offline.""" + root = ET.parse(Path(path)).getroot() + return [s.get("id") for s in root.findall("sound") if s.get("id")] + + +def patch_sounds(m: ir.Ir, sound_ids: list[str]) -> int: + """Fold sound_ids into m in place: add the sound-id enum and instrument-sound + union, then retype every instrument-sound element from string to that union. + Returns the number of element occurrences retyped, which must be >= 1.""" + enum = ir.EnumType( + name=ENUM, + base="token", + values=list(sound_ids), + doc=( + "Standard MusicXML instrument sound identifiers. The XSD types " + "instrument-sound as xs:string and lists these values only in the " + "sounds.xml companion file; the generator injects them here." + ), + ) + union = ir.UnionType( + name=UNION, + members=[ + ir.UnionMember(ref=ir.Ref(ENUM, "value")), + ir.UnionMember(ref=ir.Ref("string", "primitive")), + ], + doc=( + "The instrument-sound value: one of the standard sound-id " + "identifiers, or any other string. The schema leaves the content " + "open (xs:string), so the string member is intrinsic, not a fallback." + ), + ) + # Deps-first invariant: the enum (no value deps) precedes the union that + # references it, and nothing already in the list references either, so + # appending the pair keeps value_types topologically ordered. + m.value_types.append(enum) + m.value_types.append(union) + + # instrument-sound is declared inside the virtual-instrument-data group, not + # a complex type's content, so retype across groups as well as complex types. 
+ new_type = ir.Ref(UNION, "value") + retyped = sum( + _retype(ct.content, new_type) for ct in m.complex_types if ct.content is not None + ) + retyped += sum(_retype(g.content, new_type) for g in m.groups) + if retyped == 0: + raise ValueError(f"no {ELEMENT!r} element found to patch; schema changed?") + + _bump_stats(m.stats, len(sound_ids)) + return retyped + + +def _retype(particle: ir.Particle, new_type: ir.Ref) -> int: + """Retype every ELEMENT occurrence reachable in particle. GroupRef leaves are + left alone -- their target group is retyped where it is defined.""" + if isinstance(particle, (ir.Sequence, ir.Choice)): + return sum(_retype(i, new_type) for i in particle.items) + if isinstance(particle, ir.Element) and particle.name == ELEMENT: + particle.type = new_type + return 1 + return 0 + + +def _bump_stats(stats: dict, n_ids: int) -> None: + stats["value_types"] = stats.get("value_types", 0) + 2 + kinds = stats.setdefault("value_kinds", {}) + kinds["enum"] = kinds.get("enum", 0) + 1 + kinds["union"] = kinds.get("union", 0) + 1 + stats["companion_sound_ids"] = n_ids diff --git a/gen/names.py b/gen/names.py new file mode 100644 index 000000000..40b716c80 --- /dev/null +++ b/gen/names.py @@ -0,0 +1,182 @@ +"""Name expansion: tokenize wire names, recase per convention, sanitize. + +A fundamental (wire) name is split into an ordered word vector of lowercase +words, then recased by each registered convention. The wire form is preserved +untouched alongside the casings -- tokenization feeds only the cased +identifiers, never serialization (design R3). + +Conventions live in a registry keyed by name, so adding one later is +registering one function; every Name simply grows a key (design R1). + +This module is deliberately a leaf: it is shared vocabulary for the config +loader (which validates convention names and rename-entry keys) and for the +Plates projection, so it sits below both and imports neither. 
+""" + +from __future__ import annotations + +from dataclasses import dataclass + +# Word separators, split on and consumed. Hyphen covers ordinary kebab names; +# dot covers sound ids like `brass.alphorn`; whitespace covers space-separated +# enum values like `up down`; colon covers external refs like `xml:lang`. +_SEPARATORS = set("-._: \t\n\r\v\f") + +# Words uppercased whole by capitalizing conventions (Pascal, non-leading +# camel). Config-extensible via [naming] acronyms. +DEFAULT_ACRONYMS = ("midi", "id", "xml", "css", "smufl", "uri", "url") + +# Fallback word vector for wire names that tokenize to nothing (the empty enum +# value of positive-integer-or-empty and a few *-value enums). The wire form +# stays ""; only the identifier gets a name. A target wanting a different word +# for a particular enum renames it: [rename.enum-value.] "" = "none". +EMPTY_WORD = "empty" + + +@dataclass +class Name: + """The neutral/bound name bundle. `wire` is the immutable on-the-wire + string (never a code identifier); `words` is the tokenized vector the + casings expand from; `cased` maps convention name -> identifier, filled + by iterating the convention registry.""" + + wire: str + words: tuple[str, ...] 
+ cased: dict[str, str] + + @property + def pascal(self) -> str: + return self.cased["pascal"] + + @property + def camel(self) -> str: + return self.cased["camel"] + + @property + def snake(self) -> str: + return self.cased["snake"] + + @property + def kebab(self) -> str: + return self.cased["kebab"] + + @property + def screaming(self) -> str: + return self.cased["screaming"] + + +def tokenize(wire: str, empty_word: str = EMPTY_WORD) -> tuple[str, ...]: + """Split a wire name into its canonical lowercase word vector.""" + tokens: list[str] = [] + current: list[str] = [] + for ch in wire: + if ch in _SEPARATORS: + if current: + tokens.append("".join(current)) + current = [] + else: + current.append(ch) + if current: + tokens.append("".join(current)) + + words: list[str] = [] + for token in tokens: + words.extend(w.lower() for w in _split_case_transitions(token)) + return tuple(words) if words else (empty_word,) + + +def _split_case_transitions(token: str) -> list[str]: + """Split an already-mixed-case token at a lower-to-upper boundary + (`fooBar` -> foo, Bar) and at an acronym boundary, where an uppercase run + is followed by uppercase+lowercase (`MIDIChannel` -> MIDI, Channel): the + last capital of the run begins the next word. 
Letter-digit boundaries do + not split; digits ride with their adjacent letters.""" + if not token: + return [] + starts = [0] + for i in range(1, len(token)): + prev, cur = token[i - 1], token[i] + if cur.isupper() and prev.islower(): + starts.append(i) + elif ( + cur.isupper() + and prev.isupper() + and i + 1 < len(token) + and token[i + 1].islower() + ): + starts.append(i) + return [token[a:b] for a, b in zip(starts, starts[1:] + [len(token)])] + + +def _capitalize(word: str, acronyms: frozenset[str]) -> str: + if word in acronyms: + return word.upper() + if word and word[0].isalpha(): + return word[0].upper() + word[1:] + return word # digit-led words like `1024th` stay lowercase + + +# Each convention maps (word vector, acronym set) -> identifier string. The +# camelCase leading word is always fully lowercased, so a leading acronym +# yields `midiChannel`, never `MIDIChannel`. snake/kebab/screaming are +# case-uniform and ignore the acronym set. +CONVENTIONS = { + "pascal": lambda ws, ac: "".join(_capitalize(w, ac) for w in ws), + "camel": lambda ws, ac: ws[0] + "".join(_capitalize(w, ac) for w in ws[1:]), + "snake": lambda ws, ac: "_".join(ws), + "kebab": lambda ws, ac: "-".join(ws), + "screaming": lambda ws, ac: "_".join(w.upper() for w in ws), +} + +# How a convention joins two already-cased parts when an identifier is +# composed from a scope plus a member (a type name plus a variant name, for +# targets whose enum constants share one namespace). Concatenating +# conventions join with nothing; delimited conventions reuse their delimiter. +JOINERS = { + "pascal": "", + "camel": "", + "snake": "_", + "kebab": "-", + "screaming": "_", +} + + +def sanitize_identifier(ident: str, reserved: frozenset[str], invalid_prefix: str = "_") -> str: + """Make a recased identifier legal for a code target: non-identifier + characters become underscores, a leading digit or empty result gets the + configured prefix, and reserved words get a trailing underscore. 
The + pre-sanitized casing stays available on the Name; collision detection + runs on the sanitized result.""" + out = "".join(ch if ch.isalnum() or ch == "_" else "_" for ch in ident) + if not out or out[0].isdigit(): + out = invalid_prefix + out + if out in reserved: + out += "_" + return out + + +class NameFactory: + """Builds Name bundles: tokenize once, expand every registered convention, + honoring a fundamental rename (re-expands all casings from the new root) + and per-convention overrides (pin one flavor, leave the rest expanded).""" + + def __init__(self, acronyms=DEFAULT_ACRONYMS): + # The acronym set matches against already-lowercased words, so it is + # normalized here: acronyms = ["MIDI"] must behave like ["midi"]. + self.acronyms = frozenset(a.lower() for a in acronyms) + + def make( + self, + wire: str, + fundamental: str | None = None, + overrides: dict[str, str] | None = None, + pluralize: bool = False, + ) -> Name: + words = tokenize(fundamental if fundamental is not None else wire) + if pluralize: + words = words[:-1] + (words[-1] + "s",) + cased = {conv: fn(words, self.acronyms) for conv, fn in CONVENTIONS.items()} + if overrides: + for conv, value in overrides.items(): + cased[conv] = value + return Name(wire=wire, words=words, cased=cased) diff --git a/gen/naming.base.toml b/gen/naming.base.toml new file mode 100644 index 000000000..88137c7b0 --- /dev/null +++ b/gen/naming.base.toml @@ -0,0 +1,12 @@ +# Shared naming base for every target ([naming] extends in each config.toml). +# Holds renames that are forced by the schema itself rather than by any one +# language, so they stay identical across targets. A target's own entries win +# over these on any conflict. + +# `barline` carries both child ELEMENTS segno/coda (the visible symbols) and +# ATTRIBUTES segno/coda (sound/playback jump markers). Any field casing +# collapses each pair to one identifier, so every code target collides. 
The +# elements keep their wire-true names; the attributes are qualified. +[rename.attribute.barline] +segno = "segno-sound" +coda = "coda-sound" diff --git a/gen/plates/__init__.py b/gen/plates/__init__.py new file mode 100644 index 000000000..a3fd8d2ae --- /dev/null +++ b/gen/plates/__init__.py @@ -0,0 +1,29 @@ +"""The Plates: the template-facing, per-target projection of the IR. + +See gen.plates.model for the data shape, gen.plates.build for the projection, +and docs/ai/design/plates.md for the design. +""" + +from gen.plates.build import PlatesError, build_plates +from gen.plates.model import Plates + +__all__ = ["Plates", "PlatesError", "build_plates", "build_for_config"] + + +def build_for_config(config_path): + """The whole pipeline for one target, shared by the CLI and tests: load + the config, lower its pinned XSD to the IR, apply companion patches, and + project. Returns (plates, config).""" + from gen.config import load as load_config + from gen.ir.build import build_ir + from gen.xsd.parser import parse + + cfg = load_config(config_path) + if cfg.xsd is None: + raise FileNotFoundError(f"config has no [input] xsd: {cfg.path}") + m = build_ir(parse(cfg.xsd), source=cfg.xsd.stem) + if cfg.sounds_xml is not None: + from gen.ir.sounds import patch_sounds, read_sound_ids + + patch_sounds(m, read_sound_ids(cfg.sounds_xml)) + return build_plates(m, cfg), cfg diff --git a/gen/plates/build.py b/gen/plates/build.py new file mode 100644 index 000000000..111bbda0b --- /dev/null +++ b/gen/plates/build.py @@ -0,0 +1,648 @@ +"""Project the IR onto one target: build the Plates. + +The build consumes the IR and its Resolver (it never re-derives a schema +fact: splicing, base-chain merging, and effective cardinality all come from +gen.ir.resolve) plus a Config, and produces the materialized Plates tree. +Three phases, each failing loud: + + 1. 
Config-against-IR validation: every [rename.*] key must name something + the IR actually contains, and every [types] key a real primitive (a + stale or misspelled key is a build error). + 2. Projection: names are tokenized and recased, renames and overrides + applied, identifiers composed per the target's scoping and sanitized, + types mapped, strategies and files assigned. + 3. Collision detection (gen.plates.check): distinct wire names that + collapsed to one identifier under the projection are reported together. +""" + +from __future__ import annotations + +import re + +from gen.config import Config +from gen.ir import model as ir +from gen.ir.build import PRIMITIVES +from gen.ir.resolve import Resolver +from gen.names import DEFAULT_ACRONYMS, JOINERS, NameFactory, sanitize_identifier +from gen.plates.check import run_checks +from gen.plates.model import ( + ClampStep, + ComplexPlate, + EnumPlate, + Member, + Name, + NumberBounds, + NumberPlate, + PlateRef, + Plates, + StringPlate, + TargetInfo, + UnionPlate, + UnionPlateMember, + Variant, +) + + +# Primitive-implied lower bounds the schema leaves unstated; part of the +# uniform clamp policy (see model.ClampStep and data/README.md). +_IMPLIED_MIN = {"positive_integer": 1, "non_negative_integer": 0} + +# The epsilon an exclusive DECIMAL bound clamps past (an exclusive integer +# bound clamps to the next integer). Matches the corpus duration fixup. +_EPSILON = 1e-6 + +# The numeric IR primitives (see gen.ir.build.PRIMITIVES for the full set). +_PRIM_NUMERIC = {"decimal", "integer", "positive_integer", "non_negative_integer"} + + +def wrap_doc(doc: str | None, width: int) -> list[str]: + """Greedy word-wrap of raw doc text at `width` (the wrapped TEXT width; + templates add their own comment syntax).
The break points reproduce the + house comment style: a 3-character prefix plus width 97 is column 100.""" + if not doc: + return [] + words = doc.split() + lines: list[str] = [] + current = "" + for word in words: + if current and len(current) + 1 + len(word) > width: + lines.append(current) + current = word + else: + current = f"{current} {word}" if current else word + if current: + lines.append(current) + return lines + + +# ASCII subsets of the XML name-character classes XSD's \i and \c denote. +# The full classes add non-ASCII ranges whose spelling is engine-specific +# (\x{...} vs \uXXXX); every identifier vocabulary a MusicXML pattern +# describes (SMuFL canonical glyph names) is ASCII, and the strict parse is +# the only consumer, so the approximation can only under-accept. +_XSD_NAME_START = "[:A-Z_a-z]" +_XSD_NAME_CHAR = "[-.0-9:A-Z_a-z]" + + +def _translate_pattern(pattern: str) -> str: + """One XSD pattern, re-spelled in the portable dialect. Constructs with + no portable spelling (class subtraction, \\C/\\I complements, \\p + properties) fail loud: a new schema construct is a decision, not a + silent pass-through.""" + out: list[str] = [] + in_class = False + i, n = 0, len(pattern) + while i < n: + ch = pattern[i] + if ch == "\\": + if i + 1 >= n: + raise ValueError(f"trailing backslash in pattern {pattern!r}") + esc = pattern[i + 1] + if esc in "ci": + if in_class: + raise ValueError( + f"\\{esc} inside a character class has no portable " + f"expansion: {pattern!r}" + ) + out.append(_XSD_NAME_CHAR if esc == "c" else _XSD_NAME_START) + elif esc in "CIpP": + raise ValueError( + f"\\{esc} has no portable spelling: {pattern!r}" + ) + else: + out.append(ch + esc) + i += 2 + continue + if in_class: + if ch == "[": + raise ValueError( + f"character class subtraction is not portable: {pattern!r}" + ) + if ch == "]": + in_class = False + elif ch == "[": + in_class = True + elif ch in "^$": + # XSD has no anchors; ^ and $ are ordinary characters there and + # 
must be escaped to stay ordinary in the portable form. + out.append("\\") + out.append(ch) + i += 1 + return "".join(out) + + +def portable_pattern(patterns: list[str]) -> str | None: + """The type's pattern facets as one anchored portable regex, or None. + + XSD patterns match the whole value (implicit anchoring), so the portable + form is explicitly anchored. Multiple pattern facets on one restriction + step are alternatives (XSD OR semantics); MusicXML never re-restricts an + already-patterned type, so the IR's accumulated list is always a single + step and the facets OR-join.""" + if not patterns: + return None + translated = [f"(?:{_translate_pattern(p)})" for p in patterns] + if len(translated) == 1: + return f"^{translated[0]}$" + return "^(?:" + "|".join(translated) + ")$" + + +def _dep_refs(refs) -> list: + """The unique non-primitive references a plate's emitted code depends on, + sorted by wire name -- the data templates compose include/import lines + from. Primitive refs are excluded by CATEGORY (a primitive's name can + coincide with a type's wire name).""" + unique = {} + for ref in refs: + if ref.category != "primitive": + unique.setdefault(ref.wire, ref) + return [unique[wire] for wire in sorted(unique)] + + +def _number_family(base: str) -> str: + return "decimal" if base == "decimal" else "integer" + + +def _spell(value: float, family: str) -> str: + """A numeric literal valid in every current target language.""" + if family == "integer": + return str(int(value)) + return repr(float(value)) + + +def clamp_steps(base: str, bounds: NumberBounds) -> list[ClampStep]: + """Resolve facets plus primitive-implied bounds into the ordered clamp + rules a wrapper applies after parsing. 
The tightest lower bound wins (an + exclusive bound at v is tighter than an inclusive one at the same v).""" + family = _number_family(base) + steps: list[ClampStep] = [] + + lows: list[tuple[float, bool]] = [] # (value, exclusive) + if bounds.min_inclusive is not None: + lows.append((float(bounds.min_inclusive), False)) + if bounds.min_exclusive is not None: + lows.append((float(bounds.min_exclusive), True)) + if base in _IMPLIED_MIN: + lows.append((float(_IMPLIED_MIN[base]), False)) + if lows: + value, exclusive = max(lows) + if exclusive: + past = value + (1 if family == "integer" else _EPSILON) + steps.append(ClampStep("<=", _spell(value, family), _spell(past, family))) + else: + bound = _spell(value, family) + steps.append(ClampStep("<", bound, bound)) + + highs: list[tuple[float, bool]] = [] + if bounds.max_inclusive is not None: + highs.append((float(bounds.max_inclusive), False)) + if bounds.max_exclusive is not None: + highs.append((float(bounds.max_exclusive), True)) + if highs: + value, exclusive = min((v, not e) for v, e in highs) + exclusive = not exclusive + if exclusive: + past = value - (1 if family == "integer" else _EPSILON) + steps.append(ClampStep(">=", _spell(value, family), _spell(past, family))) + else: + bound = _spell(value, family) + steps.append(ClampStep(">", bound, bound)) + return steps + + +class PlatesError(Exception): + """One or more projection failures, collected so a run reports every + problem at once rather than the first.""" + + def __init__(self, errors: list[str]): + self.errors = errors + super().__init__("\n".join(errors)) + + +def build_plates(m: ir.Ir, config: Config) -> Plates: + plates = _Builder(m, config).build() + errors = run_checks(plates) + if errors: + raise PlatesError(errors) + return plates + + +class _Builder: + def __init__(self, m: ir.Ir, config: Config): + self.m = m + self.cfg = config + self.resolver = Resolver.from_ir(m) + self.values_by_name: dict[str, ir.ValueType] = {v.name: v for v in 
m.value_types} + self.complex_by_name = {c.name: c for c in m.complex_types} + + naming = config.naming + self.factory = NameFactory( + naming.acronyms if naming.acronyms is not None else DEFAULT_ACRONYMS + ) + # All of this is config data: the generator has no per-language + # defaults (the cardinal rule -- see generator-agnosticism.md). + self.reserved = frozenset(config.reserved.words) + self.invalid_prefix = config.reserved.invalid_prefix + self.type_map = dict(config.types) + self.variant_scope = config.target.variant_scope + + # Every type's Name and final identifier, computed up front so any + # reference can be resolved to its target spelling in one lookup. + self.type_names: dict[str, Name] = {} + self.type_idents: dict[str, str] = {} + for type_wire in list(self.values_by_name) + list(self.complex_by_name): + name = self._type_name(type_wire) + self.type_names[type_wire] = name + self.type_idents[type_wire] = self._sanitize( + config.target.symbol_prefix + name.cased[naming.type_convention] + ) + + # ----- entry ------------------------------------------------------------ # + + def build(self) -> Plates: + errors = self._validate_config_against_ir() + if errors: + raise PlatesError(errors) + + version = re.search(r"musicxml-(\d+\.\d+)", self.m.source) + plates = Plates( + source=self.m.source, + schema_version=version.group(1) if version else "", + target=self._target_info(), + value_types=[self._value_plate(v) for v in self.m.value_types], + complex_types=[self._complex_plate(c) for c in self.m.complex_types], + roots=[self._plate_ref(ir.Ref(r.type, "complex")) for r in self.m.roots], + ) + return plates + + def _target_info(self) -> TargetInfo: + t, n = self.cfg.target, self.cfg.naming + return TargetInfo( + symbol_prefix=t.symbol_prefix, + type_convention=n.type_convention, + field_convention=n.field_convention, + variant_convention=n.variant_convention, + inheritance=t.inheritance, + variant_scope=self.variant_scope, + doc_wrap=self.cfg.docs.wrap, + 
reserved=sorted(self.reserved), + reserved_members=sorted(self.cfg.reserved.members), + reserved_type_suffixes=sorted(self.cfg.reserved.type_suffixes), + vars=dict(self.cfg.vars), + ) + + # ----- names and references ---------------------------------------------- # + + def _sanitize(self, raw: str) -> str: + return sanitize_identifier(raw, self.reserved, self.invalid_prefix) + + def _type_name(self, wire: str) -> Name: + entry = self.cfg.renames.types.get(wire) + return self.factory.make( + wire, + fundamental=entry.fundamental if entry else None, + overrides=entry.cased if entry else None, + ) + + def _element_name(self, wire: str, pluralize: bool) -> Name: + entry = self.cfg.renames.elements.get(wire) + return self.factory.make( + wire, + fundamental=entry.fundamental if entry else None, + overrides=entry.cased if entry else None, + pluralize=pluralize, + ) + + def _attribute_name(self, owner: str, wire: str) -> Name: + # A scoped key (this attribute on this owner) wins over a global one. + entry = self.cfg.renames.scoped_attributes.get((owner, wire)) + if entry is None: + entry = self.cfg.renames.attributes.get(wire) + return self.factory.make( + wire, + fundamental=entry.fundamental if entry else None, + overrides=entry.cased if entry else None, + ) + + def _variant(self, scope_wire: str, value_wire: str) -> Variant: + """Project one enum value (or union literal). 
The final constant + identifier follows the target's variant scope: `bare` sanitizes the + variant casing alone; `composed` joins the owning type's casing (and + symbol prefix) in the variant convention's join style, because the + constant will live in a flat namespace.""" + entry = self.cfg.renames.enum_values.get((scope_wire, value_wire)) + name = self.factory.make( + value_wire, + fundamental=entry.fundamental if entry else None, + overrides=entry.cased if entry else None, + ) + conv = self.cfg.naming.variant_convention + if self.variant_scope == "composed": + joiner = JOINERS.get(conv, "_") + if joiner: + parts = [] + if self.cfg.target.symbol_prefix: + prefix_name = self.factory.make(self.cfg.target.symbol_prefix) + parts.append(prefix_name.cased[conv]) + parts.append(self.type_names[scope_wire].cased[conv]) + parts.append(name.cased[conv]) + raw = joiner.join(parts) + else: + # Concatenating conventions: the type identifier (which + # already carries the prefix) plus the variant casing. 
+ raw = self.type_idents[scope_wire] + name.cased[conv] + else: + raw = name.cased[conv] + return Variant(wire=value_wire, name=name, ident=self._sanitize(raw)) + + def _field_ident(self, name: Name) -> str: + raw = self.cfg.naming.field_prefix + name.cased[self.cfg.naming.field_convention] + return self._sanitize(raw) + + def _plate_ref(self, ref: ir.Ref) -> PlateRef: + """Resolve a reference with the referenced type's name bundle and kind + denormalized onto it, so templates never perform lookups.""" + if ref.category == "primitive": + return PlateRef( + wire=ref.name, + category="primitive", + ident=self.type_map.get(ref.name, ref.name), + name=self.factory.make(ref.name), + kind="primitive-" + _number_family(ref.name) + if ref.name in _PRIM_NUMERIC + else "primitive-string", + ) + if ref.category == "value": + kind = self.values_by_name[ref.name].kind + else: + kind = "complex" + return PlateRef( + wire=ref.name, + category=ref.category, + ident=self.type_idents[ref.name], + name=self.type_names[ref.name], + kind=kind, + ) + + # ----- value plates -------------------------------------------------------- # + + def _doc_lines(self, doc: str | None) -> list[str]: + return wrap_doc(doc, self.cfg.docs.wrap) + + def _value_plate(self, v: ir.ValueType): + name = self.type_names[v.name] + ident = self.type_idents[v.name] + if isinstance(v, ir.EnumType): + return EnumPlate( + name=name, + ident=ident, + base=v.base, + variants=[self._variant(v.name, value) for value in v.values], + doc=v.doc, + doc_lines=self._doc_lines(v.doc), + ) + if isinstance(v, ir.NumberType): + bounds = NumberBounds( + v.min_inclusive, v.max_inclusive, v.min_exclusive, v.max_exclusive + ) + return NumberPlate( + name=name, + ident=ident, + base=v.base, + bounds=bounds, + family=_number_family(v.base), + clamp=clamp_steps(v.base, bounds), + target_type=self.type_map.get(v.base, v.base), + doc=v.doc, + doc_lines=self._doc_lines(v.doc), + ) + if isinstance(v, ir.StringType): + return StringPlate( + 
name=name, + ident=ident, + base=v.base, + patterns=list(v.patterns), + pattern=portable_pattern(list(v.patterns)), + min_length=v.min_length, + max_length=v.max_length, + length=v.length, + target_type=self.type_map.get(v.base, v.base), + doc=v.doc, + doc_lines=self._doc_lines(v.doc), + ) + members = [] + for m in v.members: + if m.ref is not None: + member_name = self.type_names.get(m.ref.name) or self.factory.make(m.ref.name) + clamp = [] + if m.ref.category == "primitive" and m.ref.name in _IMPLIED_MIN: + # The primitive's implied bounds apply inside a union just + # as they would on a named number type. + clamp = clamp_steps(m.ref.name, NumberBounds()) + members.append( + UnionPlateMember( + ref=self._plate_ref(m.ref), + name=member_name, + # The member's discriminator constant: scoped, renamed, + # and collision-gated exactly like an enum variant. + tag=self._variant(v.name, m.ref.name), + clamp=clamp, + ) + ) + else: + # An inline literal set projects like a tiny anonymous enum; + # its variants are addressable for renames under the union's + # own type name and double as the discriminator constants. 
+ members.append( + UnionPlateMember( + literals=[self._variant(v.name, lit) for lit in m.literals or []] + ) + ) + plate = UnionPlate( + name=name, + ident=ident, + members=members, + open_ended=any( + m.ref is not None and m.ref.kind in ("primitive-string", "string") + for m in members + ), + doc=v.doc, + doc_lines=self._doc_lines(v.doc), + ) + plate.deps = _dep_refs( + m.ref for m in plate.members if m.ref is not None + ) + return plate + + # ----- complex plates ------------------------------------------------------ # + + def _complex_plate(self, ct: ir.ComplexType) -> ComplexPlate: + strategy = { + "value": "value-class", + "composite": "composite-class", + "empty": "flag" if ct.presence_only else "attrs-class", + "derived": "inherit" if self.cfg.target.inheritance else "flatten", + }[ct.kind] + + members = self._members(ct, flatten=False) + all_members = None + if ct.kind == "derived": + # Built under either strategy, so the collision gate covers the + # merged chain even for inheriting targets. + all_members = self._members(ct, flatten=True) + + plate = ComplexPlate( + name=self.type_names[ct.name], + ident=self.type_idents[ct.name], + shape=ct.kind, + strategy=strategy, + members=members, + content=self.resolver.content(ct), + base=self._plate_ref(ir.Ref(ct.base, "complex")) if ct.base else None, + all_members=all_members, + presence_only=ct.presence_only, + doc=ct.doc, + doc_lines=self._doc_lines(ct.doc), + ) + refs = [m.type_ref for m in plate.members] + refs += [m.type_ref for m in (plate.all_members or [])] + if plate.base is not None: + refs.append(plate.base) + plate.deps = _dep_refs(refs) + return plate + + def _members(self, ct: ir.ComplexType, flatten: bool) -> list[Member]: + """The flat field list: attributes first, then the text value body, + then child elements in document order. 
The flattened variant merges + the base chain (base-most first) via the Resolver's chain views.""" + if flatten: + attrs = self.resolver.all_attributes(ct) + elements = self.resolver.all_flat_elements(ct) + chain = self.resolver.base_chain(ct) + else: + attrs = self.resolver.attributes(ct) + elements = self.resolver.flat_elements(ct) + chain = [ct] + + members = [self._attr_member(ct.name, a) for a in attrs] + for c in chain: + if c.value_type is not None: + members.append(self._value_member(c.value_type)) + members += [self._element_member(e, card) for e, card in elements] + return members + + def _attr_member(self, owner_wire: str, a: ir.Attr) -> Member: + name = self._attribute_name(owner_wire, a.name) + literal = a.fixed if a.fixed is not None else a.default + return Member( + name=name, + ident=self._field_ident(name), + kind="attribute", + type_ref=self._plate_ref(a.type), + cardinality="required" if a.required else "optional", + default=a.default, + fixed=a.fixed, + default_variant=self._default_variant(a.type, literal), + doc=a.doc, + ) + + def _value_member(self, value_type: ir.Ref) -> Member: + # The text body of a value-shaped type has no wire name of its own; + # it is projected under the fixed root "value". 
+ name = self.factory.make("", fundamental="value") + return Member( + name=name, + ident=self._field_ident(name), + kind="value", + type_ref=self._plate_ref(value_type), + cardinality="required", + ) + + def _element_member(self, element: ir.Element, cardinality: str) -> Member: + pluralize = self.cfg.naming.pluralize_vectors and cardinality == "vector" + name = self._element_name(element.name, pluralize) + return Member( + name=name, + ident=self._field_ident(name), + kind="element", + type_ref=self._plate_ref(element.type), + cardinality=cardinality, + doc=element.doc, + ) + + def _default_variant(self, type_ref: ir.Ref, literal: str | None) -> str | None: + """When a default/fixed literal names a variant of the member's enum + type, resolve it to the variant's target identifier (the wire literal + stays in `default`/`fixed` for the serializer).""" + if literal is None or type_ref.category != "value": + return None + vt = self.values_by_name.get(type_ref.name) + if isinstance(vt, ir.EnumType) and literal in vt.values: + return self._variant(vt.name, literal).ident + return None + + # ----- config-against-IR validation ----------------------------------------- # + + def _validate_config_against_ir(self) -> list[str]: + """Every rename key must address something in the IR, and every + [types] key a real primitive (design 6.5): a typo or a key left stale + after a schema bump is a build error, not a silently ignored line.""" + r = self.cfg.renames + errors: list[str] = [] + + for primitive in self.cfg.types: + if primitive not in PRIMITIVES: + errors.append( + f"[types] {primitive}: not an IR primitive " + f"({', '.join(sorted(PRIMITIVES))})" + ) + + type_wires = set(self.values_by_name) | set(self.complex_by_name) + for wire in r.types: + if wire not in type_wires: + errors.append(f"rename.type.{wire}: no such type in the IR") + + element_wires: set[str] = set() + for ct in self.m.complex_types: + for e in self.resolver.elements(ct): + element_wires.add(e.name) + 
element_wires.update(root.element for root in self.m.roots) + for wire in r.elements: + if wire not in element_wires: + errors.append(f"rename.element.{wire}: no element by that name occurs") + + attribute_wires = { + a.name for ct in self.m.complex_types for a in self.resolver.attributes(ct) + } + for wire in r.attributes: + if wire not in attribute_wires: + errors.append(f"rename.attribute.{wire}: no attribute by that name occurs") + + for owner, attr in r.scoped_attributes: + ct = self.complex_by_name.get(owner) + if ct is None: + errors.append(f"rename.attribute.{owner}.{attr}: no such complex type") + elif attr not in {a.name for a in self.resolver.all_attributes(ct)}: + errors.append( + f"rename.attribute.{owner}.{attr}: type '{owner}' has no such attribute" + ) + + for enum, value in r.enum_values: + vt = self.values_by_name.get(enum) + if isinstance(vt, ir.EnumType): + if value not in vt.values: + errors.append( + f"rename.enum-value.{enum}.{value!r}: enum has no such value" + ) + elif isinstance(vt, ir.UnionType): + addressable = {lit for m in vt.members for lit in (m.literals or [])} + addressable |= {m.ref.name for m in vt.members if m.ref is not None} + if value not in addressable: + errors.append( + f"rename.enum-value.{enum}.{value!r}: union has no such " + f"literal or member" + ) + else: + errors.append(f"rename.enum-value.{enum}: no such enum type") + + return errors diff --git a/gen/plates/check.py b/gen/plates/check.py new file mode 100644 index 000000000..3b00129d6 --- /dev/null +++ b/gen/plates/check.py @@ -0,0 +1,170 @@ +"""Post-projection collision detection (design section 7). + +After tokenizing, recasing, renames, and reserved-word/validity mangling, two +distinct wire names can collapse to one identifier. The IR's "no element-name +collisions" invariant guarantees nothing here, because these collisions are +induced by the projection. 
Each scope is checked in the convention the target +actually uses (the identifiers were already produced in it); every report +names the scope, the colliding wire names, and the shared identifier -- +enough to write a targeted rename to resolve it. +""" + +from __future__ import annotations + +from gen.plates.model import ComplexPlate, EnumPlate, Plates, UnionPlate + + +def run_checks(plates: Plates) -> list[str]: + errors: list[str] = [] + if plates.target.variant_scope == "composed": + _check_flat_namespace(plates, errors) + else: + _check_type_idents(plates, errors) + _check_variants_per_type(plates, errors) + _check_members(plates, errors) + _check_template_reserved(plates, errors) + _check_union_member_order(plates, errors) + return errors + + +def _check_union_member_order(plates: Plates, errors: list[str]) -> None: + """An open string member matches ANY input, so every union parser that + tries members in schema order can never reach the members after it. A + fact about union semantics, not about any language, so it gates here + rather than in each target's templates.""" + for p in plates.value_types: + if not isinstance(p, UnionPlate): + continue + for i, m in enumerate(p.members): + open_member = m.ref is not None and m.ref.kind in ( + "primitive-string", "string" + ) + if open_member and i != len(p.members) - 1: + errors.append( + f"union '{p.name.wire}': member '{m.ref.wire}' matches any " + f"string, so the members after it are unreachable; it must " + f"be last" + ) + + +def _check_template_reserved(plates: Plates, errors: list[str]) -> None: + """Names the target's TEMPLATES synthesize cannot be gated structurally, + so the target declares them: [reserved] members (member identifiers its + templates claim on every struct) and [reserved] type-suffixes + (compositions appended to type identifiers, like a Child struct). 
A + schema name landing on either must fail here, not as a confusing compile + error in committed output.""" + reserved_members = set(plates.target.reserved_members) + if reserved_members: + for p in plates.complex_types: + for member_list in (p.members, p.all_members or []): + for m in member_list: + if m.ident in reserved_members: + errors.append( + f"member identifier '{m.ident}' in '{p.name.wire}' is " + f"reserved by the target's templates ([reserved] members); " + f"rename it" + ) + suffixes = plates.target.reserved_type_suffixes + if suffixes: + idents = { + p.ident: p.name.wire + for p in list(plates.value_types) + list(plates.complex_types) + } + for ident, wire in idents.items(): + for suffix in suffixes: + composed = ident + suffix + if composed in idents: + errors.append( + f"type identifier collision: '{idents[composed]}' is named " + f"'{composed}', which the target's templates compose from " + f"'{wire}' + reserved suffix '{suffix}'" + ) + + +def _variant_pairs(plate) -> list[tuple[str, str]]: + """(ident, claimant description) for every constant a value plate emits: + enum variants, union literal variants, and union member tags (the + discriminator constants) alike.""" + if isinstance(plate, EnumPlate): + return [(v.ident, f"{plate.name.wire}.{v.wire!r}") for v in plate.variants] + if isinstance(plate, UnionPlate): + pairs = [ + (v.ident, f"{plate.name.wire}.{v.wire!r}") + for m in plate.members + if m.literals + for v in m.literals + ] + pairs += [ + (m.tag.ident, f"{plate.name.wire} member {m.ref.wire!r}") + for m in plate.members + if m.tag is not None + ] + return pairs + return [] + + +def _check_flat_namespace(plates: Plates, errors: list[str]) -> None: + """For a composed variant scope, the target has one identifier namespace: + type identifiers and every (already composed) enum/literal constant must + be mutually unique -- this is the namespace the compiler actually sees.""" + pairs = [ + (p.ident, f"type {p.name.wire!r}") + for p in 
list(plates.value_types) + list(plates.complex_types) + ] + for p in plates.value_types: + pairs.extend(_variant_pairs(p)) + for ident, claimants in _collisions(pairs): + errors.append( + f"identifier collision: {sorted(set(claimants))} all project to '{ident}'" + ) + + +def _collisions(pairs: list[tuple[str, str]]) -> list[tuple[str, list[str]]]: + """Group (identifier, wire) pairs; return identifiers claimed by more + than one distinct wire name, with their claimants.""" + by_ident: dict[str, list[str]] = {} + for ident, wire in pairs: + by_ident.setdefault(ident, []).append(wire) + return [ + (ident, wires) + for ident, wires in by_ident.items() + if len(set(wires)) > 1 + ] + + +def _check_type_idents(plates: Plates, errors: list[str]) -> None: + pairs = [ + (p.ident, p.name.wire) + for p in list(plates.value_types) + list(plates.complex_types) + ] + for ident, wires in _collisions(pairs): + errors.append( + f"type identifier collision: {sorted(set(wires))} all project to '{ident}'" + ) + + +def _check_variants_per_type(plates: Plates, errors: list[str]) -> None: + """For a bare variant scope, constants live inside their type: uniqueness + is per enum (or per union's literal set).""" + for p in plates.value_types: + pairs = _variant_pairs(p) + for ident, wires in _collisions(pairs): + errors.append( + f"variant identifier collision in '{p.name.wire}': " + f"{sorted(set(wires))} all project to '{ident}'" + ) + + +def _check_members(plates: Plates, errors: list[str]) -> None: + for p in plates.complex_types: + for label, members in (("members", p.members), ("all_members", p.all_members)): + if not members: + continue + pairs = [(m.ident, f"{m.kind} {m.name.wire!r}") for m in members] + for ident, wires in _collisions(pairs): + errors.append( + f"member identifier collision in '{p.name.wire}' ({label}): " + f"{sorted(set(wires))} all project to '{ident}'" + ) + diff --git a/gen/plates/model.py b/gen/plates/model.py new file mode 100644 index 000000000..bd0ea6ef3 
--- /dev/null +++ b/gen/plates/model.py @@ -0,0 +1,353 @@ +"""The Plates: the template-facing, per-target projection of the IR. + +The IR (gen.ir) is a pure, language-agnostic function of the schema inputs. +The Plates are its opposite number: one plate per emitted type, carrying +everything a template needs to print code without thinking -- identifier +casings, resolved target types, emit strategy tags, file assignment. This is +where config.toml meets the IR; templates stay dumb renderers. + +Each plate is internally partitioned into two field groups: + + - a neutral core: wire-faithful, target-independent facts (wire name, shape, + resolved structure, value lists, facets, docs), mirrored from the IR and + its Resolver; and + - a target binding: the per-target overlay (casings, sanitized identifiers, + resolved target types, strategy tags, file assignment). + +A code target reads both groups. A neutral target (e.g. a JSON Schema +emitter) reads only the neutral core and renders once-per-target templates, +paying nothing for the binding it ignores. + +The Plates are materialized (built once per target, dumpable via +gen.ir.dump.to_jsonable) rather than computed on demand: collision detection +and rename validation are global build-then-check passes, and templates want +random access to fully resolved plates. Design: docs/ai/design/plates.md. +""" + +from __future__ import annotations + +from dataclasses import dataclass, field + +from gen.ir import model as ir +from gen.names import Name + +__all__ = ["Name"] # re-exported: templates reach all plate vocabulary here + + +@dataclass +class PlateRef: + """A reference to another type, resolved for the target: `wire` and + `category` mirror the IR Ref; `ident` is the spelling a template prints -- + the referenced plate's type identifier, or the mapped target type when the + category is `primitive`. For primitives, `wire` carries the IR's canonical + primitive name (e.g. 
`non_negative_integer`), not an XSD spelling: builtins + never appear on the wire themselves. + + `name` and `kind` are denormalized from the referenced plate so templates + never perform lookups: `name` is the referenced type's name bundle (for a + primitive, its tokenized canonical name), and `kind` is the referenced + plate's kind (enum/number/string/union/complex) or, for primitives, the + family-qualified `primitive-decimal` / `primitive-integer` / + `primitive-string`.""" + + wire: str + category: str # "complex" | "value" | "primitive" + ident: str + name: Name | None = None + kind: str = "" + + +@dataclass +class TargetInfo: + """The per-target facts that are global to the projection, not per-type. + Every field here is part of the projection contract: definable without + reference to any language. Anything language-flavored belongs in `vars`, + which passes through to templates verbatim and is never interpreted.""" + + symbol_prefix: str # prepended to type idents and composed constants + type_convention: str + field_convention: str + variant_convention: str + inheritance: bool # derived strategy: True -> inherit, False -> flatten + variant_scope: str # "bare" | "composed" (see Variant) + doc_wrap: int # width doc text is wrapped to (doc_lines), excluding comment syntax + reserved: list[str] = field(default_factory=list) # the target's reserved words, sorted + reserved_members: list[str] = field(default_factory=list) # template-reserved member idents + reserved_type_suffixes: list[str] = field(default_factory=list) # template compositions + vars: dict[str, str] = field(default_factory=dict) # freeform, for templates + + +# --------------------------------------------------------------------------- # +# Value plates (mirror the IR's 4 value shapes) +# --------------------------------------------------------------------------- # + + +@dataclass +class Variant: + """One enum value. 
`wire` is retained for serialization; `ident` is the + FINAL emitted constant identifier -- templates print it verbatim, and the + collision gate certifies it. Its shape follows the target's variant scope + ([target] variant-scope): `bare` when the target's constants live inside + the type (`_1024th`), `composed` when they share one flat namespace + (`NoteTypeValue1024th`, `MX_NOTE_TYPE_VALUE_1024TH`).""" + + wire: str + name: Name + ident: str + + +@dataclass +class NumberBounds: + """Numeric facets, verbatim from the schema (strings, not parsed).""" + + min_inclusive: str | None = None + max_inclusive: str | None = None + min_exclusive: str | None = None + max_exclusive: str | None = None + + +@dataclass +class ClampStep: + """One resolved clamping rule: `if v <op> <bound> then v = <replacement>`. + This is the corpus leniency POLICY (the thing the .fixup.xml sidecars + encode), decided once in the projection: facet bounds and the + primitive-implied lower bounds are merged, the tightest wins, and an + exclusive bound's replacement is the nearest representable in-range value + (next integer, or bound +/- 1e-6 for decimals). 
The literals are spelled + neutrally (valid in every current target language); templates print them + verbatim.""" + + op: str # "<" | "<=" | ">" | ">=" + bound: str + replacement: str + + +@dataclass +class EnumPlate: + name: Name + ident: str + base: str # IR primitive the tokens are drawn from + variants: list[Variant] + doc: str | None = None + doc_lines: list[str] = field(default_factory=list) # doc wrapped at doc_wrap + deps: list[PlateRef] = field(default_factory=list) # non-primitive types referenced + kind: str = "enum" + strategy: str = "enum-class" + + +@dataclass +class NumberPlate: + name: Name + ident: str + base: str # IR primitive: decimal/integer/positive_integer/non_negative_integer + bounds: NumberBounds = field(default_factory=NumberBounds) # neutral core: raw facets + family: str = "" # "decimal" | "integer": which parse/format family applies + clamp: list[ClampStep] = field(default_factory=list) # resolved policy (see ClampStep) + target_type: str = "" # type_map[base]: what the wrapper wraps + doc: str | None = None + doc_lines: list[str] = field(default_factory=list) + deps: list[PlateRef] = field(default_factory=list) + kind: str = "number" + strategy: str = "numeric-wrapper" + + +@dataclass +class StringPlate: + name: Name + ident: str + base: str # IR primitive: string/token/nmtoken/date + patterns: list[str] = field(default_factory=list) # neutral core: raw XSD facets + # The pattern facets as ONE anchored regex in the portable dialect + # (literals, character classes, quantifiers, alternation, grouping -- + # parses identically in RE2, PCRE, ECMAScript, Python). XSD's implicit + # whole-value anchoring is made explicit and its \i/\c name-class + # escapes are expanded; see build.portable_pattern. None when the type + # has no pattern facet. A target that enforces patterns compiles this; + # one that does not simply never mentions it. 
+ pattern: str | None = None + min_length: str | None = None + max_length: str | None = None + length: str | None = None + target_type: str = "" # type_map[base] + doc: str | None = None + doc_lines: list[str] = field(default_factory=list) + deps: list[PlateRef] = field(default_factory=list) + kind: str = "string" + strategy: str = "string-wrapper" + + +@dataclass +class UnionPlateMember: + """Exactly one of ref/literals is set: a resolved reference to a member + type, or an inline literal set projected like a tiny anonymous enum (each + literal carries its wire form and a variant identifier). A ref member also + carries `name`, the referenced type's name bundle, so a template can spell + the member's field without inventing a name (a primitive member like + `positive_integer` has no plate to look it up on), and `tag`, the final + discriminator-constant identifier for this member, scoped exactly like an + enum variant and covered by the same collision gate. A primitive numeric + member carries its `clamp` policy (the primitive-implied bounds), so the + union enforces the same leniency as a named number type would.""" + + ref: PlateRef | None = None + name: Name | None = None + tag: Variant | None = None + literals: list[Variant] | None = None + clamp: list[ClampStep] = field(default_factory=list) + + +@dataclass +class UnionPlate: + name: Name + ident: str + members: list[UnionPlateMember] = field(default_factory=list) + # True when a member accepts ANY string (a string-family member, last by + # the union-order gate): in-order parsers never fall through. 
+ open_ended: bool = False + doc: str | None = None + doc_lines: list[str] = field(default_factory=list) + deps: list[PlateRef] = field(default_factory=list) + kind: str = "union" + strategy: str = "tagged-variant" + + +ValuePlate = EnumPlate | NumberPlate | StringPlate | UnionPlate + + +# --------------------------------------------------------------------------- # +# Complex plates (mirror the IR's 4 complex shapes) +# --------------------------------------------------------------------------- # + + +@dataclass +class Member: + """One field of a complex plate: an attribute, a child element, or the + text value body of a `value`-shaped type. `cardinality` (required / + optional / vector) plus the target's type map fully determine the concrete + wrapper spelling (by-value, optional, collection); the template prints it. + + `default`/`fixed` keep the wire literal. When that literal names a variant + of the member's enum type, `default_variant` carries the variant's target + identifier so an emitter writes the enum member, not a raw string.""" + + name: Name + ident: str + kind: str # "attribute" | "element" | "value" + type_ref: PlateRef + cardinality: str # "required" | "optional" | "vector" + default: str | None = None + fixed: str | None = None + default_variant: str | None = None + doc: str | None = None + + +@dataclass +class ComplexPlate: + """One complex type, projected. `members` is the flat, deduped, ordered + field list a code target emits (attributes, then the value body, then + child elements in document order); `content` is the resolved + sequence/choice particle tree for a target that cares about order and + choice structure. + + `content` deliberately re-presents the IR's particle node types + (Sequence/Choice/Element from gen.ir.model, groups already spliced): the + neutral core IS the IR re-presented, and a parallel node hierarchy would + only drift. Those node types are therefore part of this layer's public + contract. 
A template joining a content occurrence back to the field it + populates uses `member(wire, kind="element")` rather than re-walking. + + A derived plate exposes both the `base` edge (for a target with + inheritance) and `all_members` (the base chain merged, for one without); + `strategy` says which one this target uses. Both views are always + populated for derived plates so the collision gate covers them under + either strategy.""" + + name: Name + ident: str + shape: str # "value" | "composite" | "empty" | "derived" + strategy: str # value-class | composite-class | flag | attrs-class | inherit | flatten + members: list[Member] = field(default_factory=list) + content: ir.Particle | None = None + base: PlateRef | None = None + all_members: list[Member] | None = None + presence_only: bool = False + doc: str | None = None + doc_lines: list[str] = field(default_factory=list) + deps: list[PlateRef] = field(default_factory=list) + kind: str = "complex" + + def member(self, wire: str, kind: str | None = None) -> Member: + """The member a content occurrence or attribute wire name populates. + `kind` disambiguates the rare wire name carried by both an attribute + and an element (e.g. barline's segno).""" + for m in self.members: + if m.name.wire == wire and (kind is None or m.kind == kind): + return m + raise KeyError(f"{self.name.wire}: no member {wire!r} (kind={kind})") + + def members_view(self) -> list[Member]: + """The member list this plate's strategy renders: the merged + base-chain view when flattening a derived type, own members + otherwise. 
Backends render this; they never re-derive it.""" + if self.strategy == "flatten" and self.all_members is not None: + return self.all_members + return self.members + + +# --------------------------------------------------------------------------- # +# The whole projected target +# --------------------------------------------------------------------------- # + + +def attribute_members(members: list[Member]) -> list[Member]: + """The shape queries backends partition a member list with. They live + here, beside the data, so every backend asks the same question the same + way instead of filtering inline.""" + return [m for m in members if m.kind == "attribute"] + + +def element_members(members: list[Member]) -> list[Member]: + return [m for m in members if m.kind == "element"] + + +def value_member(members: list[Member]) -> Member | None: + return next((m for m in members if m.kind == "value"), None) + + +@dataclass +class Plates: + """The complete projection of one target: every plate, in the IR's + deps-first order (value types never reference complex types, so + `value_types + complex_types` is a valid total emit order).""" + + source: str # provenance: the XSD stem the IR was lowered from + target: TargetInfo + schema_version: str = "" # the MusicXML version in the source stem ("3.1") + value_types: list[ValuePlate] = field(default_factory=list) + complex_types: list[ComplexPlate] = field(default_factory=list) + roots: list[PlateRef] = field(default_factory=list) + + def __post_init__(self): + # Random-access index for templates; a plain attribute (not a + # dataclass field) so JSON dumps stay free of the duplication. 
+ self._index = {p.name.wire: p for p in self.value_types} + self._index.update({p.name.wire: p for p in self.complex_types}) + + def plate(self, wire: str) -> ValuePlate | ComplexPlate: + """Look up any plate by its wire type name.""" + return self._index[wire] + + def has_plate(self, wire: str) -> bool: + return wire in self._index + + def children_owner(self, plate: ComplexPlate) -> ComplexPlate | None: + """For an inheriting target: the base-chain plate whose child struct + holds this type's children -- the nearest ancestor (or self) with + element members. Schema reasoning, so it lives here, not in a + template.""" + cur: ComplexPlate | None = plate + while cur is not None: + if element_members(cur.members): + return cur + cur = self.plate(cur.base.wire) if cur.base is not None else None + return None diff --git a/gen/press/__init__.py b/gen/press/__init__.py new file mode 100644 index 000000000..c762fe0fa --- /dev/null +++ b/gen/press/__init__.py @@ -0,0 +1,5 @@ +"""The press: renders the targets' templates. See gen.press.engine.""" + +from gen.press.engine import Press, PressError + +__all__ = ["Press", "PressError"] diff --git a/gen/press/context.py b/gen/press/context.py new file mode 100644 index 000000000..845492e36 --- /dev/null +++ b/gen/press/context.py @@ -0,0 +1,233 @@ +"""Build render contexts from the plates. + +The press is pure Mustache, so everything a template branches on or prints +must arrive as data. This module converts the plates into plain dicts with +three mechanical enrichments -- none of which makes a decision, language or +otherwise: + + 1. Discriminant expansion: every closed enumerated field (`kind`, + `category`, `cardinality`, `strategy`, `shape`, `node`, ...) gets a + boolean companion per vocabulary value (`kind: "enum"` -> `is_enum: + True`, `is_number: False`, ...). All flags are materialized so the + engine's strict mode never trips on a legitimate branch. + 2. 
Quoted companions: every string field gets `_q`, a double-quoted + backslash-escaped literal (JSON repertoire, non-ASCII as \\uXXXX -- + valid verbatim in C, C++, Go, Java, JavaScript; Rust needs \\u{XXXX}). + 3. Loop metadata: every list item gets `is_first` / `is_last` / `index0`; + items that are bare strings are lifted to `{value, value_q, ...}` so the + metadata has somewhere to live. + +Plus the pre-split member views templates iterate (attributes / elements / +value, own and merged), a `type` self-reference so inner scopes can reach +plate-level fields, and the generated-file banner text. +""" + +from __future__ import annotations + +import dataclasses +import json + +from gen.names import Name +from gen.plates.model import ( + ComplexPlate, + Plates, + UnionPlate, + attribute_members, + element_members, + value_member, +) +from gen.press.writer import banner + +# The closed vocabularies, by field name. A value outside its field's +# vocabulary is a build bug, so it fails loud here. +_DISCRIMINANTS: dict[str, tuple[str, ...]] = { + "kind": ( + "enum", "number", "string", "union", "complex", + "attribute", "element", "value", + "primitive-decimal", "primitive-integer", "primitive-string", + ), + "category": ("complex", "value", "primitive"), + "cardinality": ("required", "optional", "vector"), + "strategy": ( + "enum-class", "numeric-wrapper", "string-wrapper", "tagged-variant", + "value-class", "composite-class", "flag", "attrs-class", + "inherit", "flatten", + ), + "shape": ("value", "composite", "empty", "derived"), + "node": ("element", "sequence", "choice", "group"), + "variant_scope": ("bare", "composed"), + "family": ("decimal", "integer"), +} + + +def quoted(value: str) -> str: + return json.dumps(value, ensure_ascii=True) + + +def _flag(value: str) -> str: + return "is_" + value.replace("-", "_") + + +def _convert(obj): + """Dataclasses to dicts, recursively, with the enrichments applied.""" + if isinstance(obj, Name): + # Casings flatten onto the name so
templates say {{name.snake}}. + out = {"wire": obj.wire, "wire_q": quoted(obj.wire)} + for convention, ident in obj.cased.items(): + out[convention] = ident + out[convention + "_q"] = quoted(ident) + return out + if dataclasses.is_dataclass(obj): + out: dict = {} + for f in dataclasses.fields(obj): + name = f.name + value = getattr(obj, name) + out[name] = _convert(value) + if isinstance(value, str): + out[name + "_q"] = quoted(value) + vocab = _DISCRIMINANTS.get(name) + if vocab is not None: + if value not in vocab: + raise ValueError( + f"{type(obj).__name__}.{name} = {value!r} is outside " + f"its vocabulary {vocab}" + ) + for v in vocab: + # Vocabularies overlap across fields of one object + # (PlateRef has category "value" and kind "enum"; + # the kind vocabulary also contains "value"). + # Earlier fields win: category's is_value/is_complex + # must not be clobbered by the kind expansion, and + # the two always agree where they overlap. + out.setdefault(_flag(v), v == value) + elif isinstance(value, (list, tuple)): + # Iterating a section already gates emptiness; has_ + # serves the non-iterating tests (wrap-once framing). 
+ out["has_" + name] = bool(value) + return out + if isinstance(obj, (list, tuple)): + return _listify([_convert(item) for item in obj]) + if isinstance(obj, dict): + return {k: _convert(v) for k, v in obj.items()} + return obj + + +def _listify(items: list) -> list: + """Attach loop metadata; lift bare scalars so it has somewhere to live.""" + out = [] + last = len(items) - 1 + for i, item in enumerate(items): + if not isinstance(item, dict): + item = {"value": item} + if isinstance(item["value"], str): + item["value_q"] = quoted(item["value"]) + else: + item = dict(item) + item["is_first"] = i == 0 + item["is_last"] = i == last + item["index0"] = i + out.append(item) + return out + + +def _common(plates: Plates) -> dict: + return { + "target": _convert(plates.target), + "vars": dict(plates.target.vars), + "schema_version": plates.schema_version, + "source": plates.source, + "generated_banner": banner(plates.source), + } + + +def plate_context(plates: Plates, plate) -> dict: + """The context a per-type template renders against: the plate's fields, + the member views, the target facts, and a `type` self-reference so inner + scopes (a member loop, a variant loop) can still reach plate fields that + their own frame shadows.""" + ctx = _convert(plate) + if isinstance(plate, UnionPlate): + # The flattened case view: one entry per ref member and per literal, + # in schema order, each carrying its discriminator constant as + # `tag_ident` -- so loop metadata (ordinals, commas, first-member + # handling) works on the granularity the kind enum actually has. 
+ kind_flags = [_flag(v) for v in _DISCRIMINANTS["kind"]] + cases = [] + for m in plate.members: + if m.ref is not None: + ref = _convert(m.ref) + case = { + "is_literal": False, + "tag_ident": m.tag.ident, + "ref": ref, + "name": _convert(m.name), + "clamp": _convert(m.clamp), + "has_clamp": bool(m.clamp), + "wire": None, + "wire_q": None, + } + # The referenced kind's flags, flattened onto the case so + # templates branch without reaching through `ref`. + for flag in kind_flags: + case[flag] = ref[flag] + cases.append(case) + else: + for variant in m.literals or []: + case = { + "is_literal": True, + "tag_ident": variant.ident, + "ref": None, + "name": None, + "clamp": [], + "has_clamp": False, + "wire": variant.wire, + "wire_q": quoted(variant.wire), + } + for flag in kind_flags: + case[flag] = False + cases.append(case) + ctx["cases"] = _listify(cases) + if isinstance(plate, ComplexPlate): + ctx["attributes"] = _listify( + [_convert(m) for m in attribute_members(plate.members)] + ) + ctx["elements"] = _listify( + [_convert(m) for m in element_members(plate.members)] + ) + value = value_member(plate.members) + ctx["value"] = _convert(value) if value is not None else None + merged = plate.all_members if plate.all_members is not None else plate.members + ctx["merged_attributes"] = _listify( + [_convert(m) for m in attribute_members(merged)] + ) + ctx["merged_elements"] = _listify( + [_convert(m) for m in element_members(merged)] + ) + merged_value = value_member(merged) + ctx["merged_value"] = _convert(merged_value) if merged_value is not None else None + for key in ("attributes", "elements", "merged_attributes", "merged_elements"): + ctx["has_" + key] = bool(ctx[key]) + ctx.update(_common(plates)) + ctx["type"] = ctx + return ctx + + +def target_context(plates: Plates, outputs: list[str]) -> dict: + """The context a once-per-target template renders against: every plate, + the roots, and the full output manifest (`outputs`, plus `outputs_by_ext` + grouped by final 
extension so a build manifest can list just its + sources).""" + ctx = _common(plates) + ctx["value_types"] = _listify([plate_context(plates, p) for p in plates.value_types]) + ctx["complex_types"] = _listify( + [plate_context(plates, p) for p in plates.complex_types] + ) + ctx["roots"] = _convert(list(plates.roots)) + paths = sorted(outputs) + ctx["outputs"] = _listify([{"path": p, "path_q": quoted(p)} for p in paths]) + by_ext: dict[str, list] = {} + for p in paths: + ext = p.rsplit(".", 1)[-1] if "." in p else "" + by_ext.setdefault(ext, []).append({"path": p, "path_q": quoted(p)}) + ctx["outputs_by_ext"] = {ext: _listify(items) for ext, items in by_ext.items()} + return ctx diff --git a/gen/press/engine.py b/gen/press/engine.py new file mode 100644 index 000000000..b368336b9 --- /dev/null +++ b/gen/press/engine.py @@ -0,0 +1,399 @@ +"""The press: a Mustache template engine. + +The press renders the targets' templates. The template language is Mustache +-- the published spec's interpolation, sections, inverted sections, partials, +comments, and set-delimiter core, with spec whitespace semantics (standalone +lines, partial call-site indentation) -- and three deliberate deviations, +because code generation is not HTML (design: generator-agnosticism.md): + + 1. Missing keys are render errors (template:line in the message). The spec + mandates silent empty output, which is the worst failure mode a code + generator can have. A key that is PRESENT with a None/empty value + renders empty and is falsey in sections; only absence is an error. + 2. No HTML escaping: `{{x}}` interpolates verbatim ({{{x}}} and {{&x}} are + accepted synonyms). + 3. No lambdas (the spec's escape hatch into logic). A callable in the + context is an error. 
+ +Conformance to everything else is tested against the vendored official spec +suite (gen/tests/mustache_spec/); the constructor's `strict` and `escape` +parameters exist so that suite can exercise the spec's own semantics -- the +production pipeline never passes them. + +What the engine will never grow: expressions, comparisons, arithmetic, +filters, string manipulation, casing, assignment, or new syntax. Dispatch +data (booleans, loop metadata, quoted literals) is the context builder's +job; if a template cannot express something, the plates must carry it. +""" + +from __future__ import annotations + +from collections.abc import Callable, Mapping +from dataclasses import dataclass, field + + +class PressError(Exception): + """A template problem, always reported as `template:line: message`.""" + + def __init__(self, name: str, line: int, message: str): + self.template = name + self.line = line + super().__init__(f"{name}:{line}: {message}") + + +# --------------------------------------------------------------------------- # +# Parse tree +# --------------------------------------------------------------------------- # + + +@dataclass +class _Text: + text: str + + +@dataclass +class _Var: + path: tuple[str, ...] + raw: bool # {{{x}}} / {{&x}}: spec semantics; identical here by default + line: int + + +@dataclass +class _Section: + path: tuple[str, ...] + inverted: bool + line: int + children: list = field(default_factory=list) + + +@dataclass +class _Partial: + name: str + indent: str + line: int + + +# --------------------------------------------------------------------------- # +# Tokenizer (with set-delimiter support and the spec's standalone-line rules) +# --------------------------------------------------------------------------- # + +_STANDALONE_KINDS = {"open", "inv", "close", "comment", "delim", "partial"} + + +def _tokenize(template: str, name: str) -> list: + """Produce ('text', str) and ('tag', kind, key, line, indent) tokens. 
+ kind: var | raw | open | inv | close | partial | comment | delim.""" + tokens: list = [] + odelim, cdelim = "{{", "}}" + pos = 0 + line = 1 + while pos < len(template): + start = template.find(odelim, pos) + if start < 0: + tokens.append(("text", template[pos:])) + break + if start > pos: + text = template[pos:start] + tokens.append(("text", text)) + line += text.count("\n") + + # Triple mustache is only meaningful with the default delimiters. + if odelim == "{{" and template.startswith("{{{", start): + end = template.find("}}}", start + 3) + if end < 0: + raise PressError(name, line, "unclosed '{{{' tag") + key = template[start + 3 : end].strip() + tokens.append(("tag", "raw", key, line, "")) + pos = end + 3 + continue + + end = template.find(cdelim, start + len(odelim)) + if end < 0: + raise PressError(name, line, f"unclosed '{odelim}' tag") + content = template[start + len(odelim) : end] + pos = end + len(cdelim) + line += content.count("\n") + + sigil = content[:1] + if sigil == "#": + tokens.append(("tag", "open", content[1:].strip(), line, "")) + elif sigil == "^": + tokens.append(("tag", "inv", content[1:].strip(), line, "")) + elif sigil == "/": + tokens.append(("tag", "close", content[1:].strip(), line, "")) + elif sigil == ">": + tokens.append(("tag", "partial", content[1:].strip(), line, "")) + elif sigil == "!": + tokens.append(("tag", "comment", "", line, "")) + elif sigil == "&": + tokens.append(("tag", "raw", content[1:].strip(), line, "")) + elif sigil == "=": + inner = content[1:].rstrip() + if not inner.endswith("="): + raise PressError(name, line, "malformed set-delimiter tag") + parts = inner[:-1].split() + if len(parts) != 2: + raise PressError(name, line, "malformed set-delimiter tag") + tokens.append(("tag", "delim", "", line, "")) + odelim, cdelim = parts + else: + tokens.append(("tag", "var", content.strip(), line, "")) + return _strip_standalone(tokens) + + +def _strip_standalone(tokens: list) -> list: + """The spec's standalone-line 
rule: a line whose text is all whitespace + and which carries exactly one section/inverted/close/comment/partial/ + set-delimiter tag contributes no output for the line itself. A standalone + partial keeps the line's leading whitespace as the indentation applied to + its rendered content.""" + # Split text tokens so each line of the template is its own token run. + split: list = [] + for tok in tokens: + if tok[0] != "text": + split.append(tok) + continue + text = tok[1] + while True: + nl = text.find("\n") + if nl < 0: + if text: + split.append(("text", text)) + break + split.append(("text", text[: nl + 1])) + text = text[nl + 1 :] + + out: list = [] + line: list = [] + + def flush(line_tokens: list) -> None: + tags = [t for t in line_tokens if t[0] == "tag"] + texts = [t[1] for t in line_tokens if t[0] == "text"] + standalone = ( + len(tags) == 1 + and tags[0][1] in _STANDALONE_KINDS + and all(not t.strip() for t in texts) + ) + if not standalone: + out.extend(line_tokens) + return + tag = tags[0] + if tag[1] == "partial": + indent = "" + for t in line_tokens: + if t[0] == "tag": + break + indent += t[1] + tag = ("tag", "partial", tag[2], tag[3], indent) + out.append(tag) + + for tok in split: + line.append(tok) + if tok[0] == "text" and tok[1].endswith("\n"): + flush(line) + line = [] + if line: + flush(line) + return out + + +# --------------------------------------------------------------------------- # +# Parser +# --------------------------------------------------------------------------- # + + +def _path(key: str) -> tuple[str, ...]: + return (".",) if key == "." 
else tuple(key.split(".")) + + +def _parse(tokens: list, name: str) -> list: + root: list = [] + stack: list[tuple[_Section, str]] = [] + current = root + for tok in tokens: + if tok[0] == "text": + if tok[1]: + current.append(_Text(tok[1])) + continue + _, kind, key, line, indent = tok + if kind in ("var", "raw"): + current.append(_Var(_path(key), kind == "raw", line)) + elif kind in ("open", "inv"): + section = _Section(_path(key), kind == "inv", line) + current.append(section) + stack.append((section, key)) + current = section.children + elif kind == "close": + if not stack or stack[-1][1] != key: + raise PressError(name, line, f"unexpected section close '{key}'") + stack.pop() + current = stack[-1][0].children if stack else root + elif kind == "partial": + current.append(_Partial(key, indent, line)) + # comments and delim changes contribute nothing + if stack: + section, key = stack[-1] + raise PressError(name, section.line, f"unclosed section '{key}'") + return root + + +# --------------------------------------------------------------------------- # +# Renderer +# --------------------------------------------------------------------------- # + +_MISS = object() + + +class Press: + """Renders Mustache templates. `partials` maps a partial name to its + template text (a dict, or a callable for file-backed loading). The + `strict` and `escape` knobs exist for the spec conformance suite; the + production pipeline uses the defaults (strict, verbatim).""" + + def __init__( + self, + partials: Mapping[str, str] | Callable[[str], str] | None = None, + strict: bool = True, + escape: Callable[[str], str] | None = None, + max_partial_depth: int = 64, + ): + self._partials = partials or {} + self._strict = strict + self._escape = escape + self._max_depth = max_partial_depth + self._cache: dict[tuple[str, str, str], list] = {} + + def render(self, template: str, context, name: str = "