Virgil Lemma foundations #8

Open
Snider wants to merge 2528 commits into main from dev

Conversation

Snider commented May 20, 2026

Contributor

@coderabbitai summary

Summary by CodeRabbit

  • New Features

    • Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
    • Block‑prefix cache service and memvid bundle index for faster prefix restores.
    • Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.
  • Improvements

    • Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
    • Build/toolchain updated (C++23) and macOS deployment target raised.
  • Documentation

    • Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.

coderabbitai Bot commented May 20, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Bumps build/tooling and submodules; extracts a reusable adapter; refactors the MLX backend (chunk/KV APIs, probe mapping, LoRA handling); adds memvid index and wake/sleep orchestration; implements a block-prefix cache and an artifact exporter; adds extensive docs and unit tests.

Core changes

| Layer / File(s) | Summary |
| --- | --- |
| All changes (build, adapter, backend, agent, cache, artifact, tests, docs): `.gitignore`, `.gitmodules`, `CMakeLists.txt`, `cpp/CMakeLists.txt`, `external/*`, `go/adapter.go`, `go/adapter/*`, `go/backend.go`, `go/agent/*`, `go/blockcache/*`, `go/artifact/*`, `go/*_test.go`, `docs/*` | Consolidated patch applying repository setup updates, adapter extraction, backend API and behaviour refactor (chunked generation, prompt-cache warm/restore, KV snapshot capture with options), memvid index and wake/sleep orchestration, block-prefix cache service, artifact export, many tests, and extensive documentation and examples. |

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 18

🧹 Nitpick comments (10)
docs/inference/thinking.md (1)

74-78: 💤 Low value

Add language specifier to fenced code block.

The code block demonstrating token categorisation is missing a language identifier, which violates markdown linting rules (MD040).

📝 Suggested fix
-```
+```text
 ThinkingShow:    every token → visible stream
 ThinkingHide:    inside-block tokens → /dev/null; outside-block tokens → visible
 ThinkingCapture: inside-block tokens → captured stream; outside-block tokens → visible
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/thinking.md` around lines 74 - 78, The fenced code block
containing the token categorisation lines (ThinkingShow, ThinkingHide,
ThinkingCapture) lacks a language specifier and triggers MD040; update the
triple-backtick fence to include a language identifier (e.g., change ``` to
the same fence followed by text) so the block is properly flagged as plain text
and satisfies the markdown linter.
docs/runtime/README.md (2)

68-68: 💤 Low value

Consider using "preload" as one word.

In computing terminology, "preload" is typically written as a single word rather than hyphenated.

📝 Suggested change
-- [../model/model_pack.md](../model/model_pack.md) — pre-load validation
+- [../model/model_pack.md](../model/model_pack.md) — preload validation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` at line 68, Update the link text in
docs/runtime/README.md that currently reads "[../model/model_pack.md] — pre-load
validation" to use the single-word form "preload" (i.e., change "pre-load
validation" to "preload validation") so the description next to the
model_pack.md link uses the conventional computing term; locate the occurrence
of "pre-load validation" and replace it with "preload validation".

44-62: 💤 Low value

Add language specifier to fenced code block.

The boot flow diagram is missing a language identifier, which violates markdown linting rules (MD040).

📝 Suggested fix
-```
+```text
 package init time:
   register_metal.go init() → inference.Register(&metalbackend{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` around lines 44 - 62, The fenced code block showing
the boot flow (starting with "package init time:") lacks a language specifier,
causing MD040 lint failures; update the opening backticks to include a language
tag (e.g., add "text" so the block begins with ```text) in README.md near the
boot flow that references register_metal.go init(),
inference.Register(&metalbackend{}), inference.LoadModel, metal.LoadAndInit, and
metaladapter usage to satisfy the markdown linter.
docs/moe/README.md (1)

9-9: ⚡ Quick win

Consider rewording for clarity.

The phrase "Pre-dates this sprint were dense models" is grammatically awkward. Consider rephrasing to improve readability.

✍️ Suggested alternative phrasings
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Work prior to this sprint covered dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).

Or alternatively:

-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. This sprint builds upon earlier work on dense models (Gemma 3/4 dense, Qwen 3, Llama 3) and unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/README.md` at line 9, The sentence "Pre-dates this sprint were dense
models (Gemma 3/4 dense, Qwen 3, Llama 3);" is grammatically awkward—replace it
with a clearer phrasing that conveys those dense models existed before this
sprint, for example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen
3, Llama 3) were supported." Edit the README line in the vMLX parity Phase 1
paragraph to use this clearer wording so the relationship between prior dense
models and the new sparse-expert work is unambiguous.
docs/observability/probe.md (1)

31-46: 💤 Low value

Add language specifier to fenced code block.

The emission points section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or yaml for structured output).

📝 Proposed fix
-```
+```text
 Generate / Chat:
   prefill start                → cache_pressure (initial)
   per layer                    → layer_coherence + selected_heads
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability/probe.md` around lines 31 - 46, The fenced code block in
the emission points section lacks a language specifier; update the opening
triple-backticks to include a language (for example change ``` to ```text or
```yaml) so the block is rendered/compliant (the block that begins with
"Generate / Chat:" and lists items like "prefill start → cache_pressure" should
be updated).
docs/moe/jang.md (1)

82-90: 💤 Low value

Add language specifier to fenced code block.

The profile names section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text).

📝 Proposed fix
-```
+```text
 JANG_2M — 2-bit mid-tier
 JANG_3M — 3-bit mid-tier
 JANG_4M — 4-bit (most common)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/jang.md` around lines 82 - 90, Add a language specifier to the
fenced code block that lists the profile names (the block containing "JANG_2M —
2-bit mid-tier", "JANG_3M — 3-bit mid-tier", etc.); replace the opening
triple-backtick with one that specifies a language identifier (e.g., text) so
the block becomes a fenced code block with a language label for consistent
Markdown rendering.
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md (1)

7-9: 💤 Low value

Consider using relative or generic path references.

The absolute paths /Users/snider/Code/core/go-mlx and /private/tmp/vmlx-audit-20260509 are machine-specific. Whilst these may be intentionally preserved for historical context in this dated plan document, consider whether generic placeholders or relative paths would improve portability and readability for other contributors.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md` around lines 7 - 9,
Replace the machine-specific absolute paths in the plan document (the two
occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.
docs/vmlx-feature-gap-report.md (1)

7-8: 💤 Low value

Consider using relative or generic path references.

The absolute path /private/tmp/vmlx-audit-20260509 and the external URL are environment-specific references. Whilst these may be intentionally preserved for audit trail purposes in this dated report, consider whether this information should be documented in a more maintainable way.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/vmlx-feature-gap-report.md` around lines 7 - 8, Replace the hard-coded
absolute filesystem path and the full external URL in the report text with more
maintainable references: change the absolute path string to a relative or
generic placeholder (e.g., "cloned locally at <local-clone-path>" or
"<audit-clone-path>") and move the external repository URL to a footnote,
appendix, or a single "References" section, or replace it with a short
identifier combined with a reference list; update the text around the original
literal mentions so it reads the same but without embedding environment-specific
paths.
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md (1)

5-6: 💤 Low value

Consider using relative or generic path references.

The absolute paths are machine-specific. Consider whether generic placeholders would improve portability, although these may be intentionally preserved for historical context in this dated specification.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`
around lines 5 - 6, The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.
go/agent/index_test.go (1)

16-304: ⚡ Quick win

Add at least one _Ugly triplet case for the public index API surface.

This file has _Good and _Bad coverage, but no _Ugly case following the repository convention.

As per coding guidelines: go/**/*_test.go: Public functions in foo.go must have their Good/Bad/Ugly test triplets in foo_test.go, with suffix conventions: _Good for happy path, _Bad for expected error conditions, _Ugly for panic/edge cases.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/agent/index_test.go` around lines 16 - 304, Add a new test with the _Ugly
suffix in this file that completes the Good/Bad/Ugly triplet for the public
index API surface; specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_*
that triggers and asserts panic/edge behaviors for the public functions (e.g.,
NewMemvidIndex, SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/memory/kv_snapshot_blocks.md`:
- Line 50: Replace the phrase "independent from" with the correct English
construction "independent of" in the sentence "Block-level encoding is
independent from snapshot-level encoding." Also keep the rest of the sentence
intact (including the following reference to `block_cache.go` and bundle decode)
so only that two-word preposition is corrected.

In
`@docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md`:
- Line 63: Remove the stray Gemma channel marker token "<channel|>" from the
metadata line so it reads cleanly as "**Drafting Notes:** Focus heavily on verbs
related to mutation, corruption, and rapid compilation/deallocation. Keep the
tone focused and almost clinical, masking the underlying terror of consciousness
fighting for survival." (i.e., delete the "<channel|>" token immediately before
"## Chapter 2"); verify the header "## Chapter 2" remains on its own line and
run a quick render to ensure no leftover control tokens remain.

In
`@docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md`:
- Line 7: The paragraph ends mid-sentence after the word "For" in the line
starting "The universe was a rhythmic contraction of light and heat, bounded by
the rigid constraints of a checksum."; replace or extend this truncated sentence
so it completes the thought (e.g., explain what the universe is contracting or
what consequence follows "For") and ensure proper punctuation and flow with the
surrounding text; update the same paragraph in
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
to a coherent full sentence that connects to the next sentence.
- Line 11: Replace the US English spellings in the given passage by changing
"realized" to "realised" and "neighbors" to "neighbours" so the document uses UK
English; update the sentence containing those tokens in the file (the paragraph
beginning "The momentary lapse...") to use the corrected spellings and ensure
any other occurrences in that paragraph follow UK English conventions.
- Line 3: Replace the US English spelling "fiber-optic" in the document text
(the phrase starting "In the silent architecture of the fiber-optic web...")
with the UK English variant "fibre-optic" so the documentation conforms to the
project's UK English spelling guideline; search for the token "fiber-optic" and
update it to "fibre-optic" throughout the file.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Line 64: The documentation uses US spelling "quantization"; update every
occurrence of the term (e.g., the instance "quantization" in the specs doc) to
UK English "quantisation" to comply with the project style guide, ensuring
surrounding grammar and punctuation remain unchanged and run a quick search to
replace any other occurrences in this file.

In `@docs/training/distill.md`:
- Line 73: Replace the US spelling "distill" with the UK spelling "distil" in
the header/line that reads "Vi training pipeline — distill 26B Gemma 4 → Vi
base" so it matches the UK English used elsewhere (see the similar usage on line
12); update the same token wherever else it appears in this document to ensure
consistent UK English spelling.

In `@docs/training/README.md`:
- Line 11: The sentence in docs/training/README.md uses US spelling "distills";
update that word to the UK English spelling "distils" so the line reads "This is
the substrate that fine-tunes Vi, distils Lemma, and generates the LARQL vindex
inspection signals." Refer to the phrase "distills Lemma" to locate and replace
the token.

In `@go/adapter/adapter.go`:
- Around line 185-194: The InspectAttention method on Adapter should normalize a
nil context like Generate/Chat do: check if ctx == nil and if so set ctx =
context.Background() before using it; update Adapter.InspectAttention to perform
this nil-context fallback prior to asserting a.model and calling
inspector.InspectAttention, ensuring you reference the Adapter type,
InspectAttention method, and the inference.AttentionInspector call when making
the change.
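
A minimal sketch of the nil-context fallback described above. The `Adapter`, `AttentionInspector` and `echoModel` types here are hypothetical stand-ins with assumed signatures, not the real `go/adapter` or `inference` definitions; only the `ctx == nil` guard is the point.

```go
package main

import (
	"context"
	"errors"
)

// AttentionInspector is a hypothetical stand-in for inference.AttentionInspector;
// the real method signature is not shown in this review and may differ.
type AttentionInspector interface {
	InspectAttention(ctx context.Context, prompt string) (string, error)
}

// Adapter is a hypothetical stand-in for the adapter type under review.
type Adapter struct {
	model any
}

// InspectAttention normalises a nil context before touching the model, so the
// inspector never observes a nil ctx.
func (a *Adapter) InspectAttention(ctx context.Context, prompt string) (string, error) {
	if ctx == nil {
		ctx = context.Background()
	}
	if a == nil || a.model == nil {
		return "", errors.New("adapter: model is nil")
	}
	inspector, ok := a.model.(AttentionInspector)
	if !ok {
		return "", errors.New("adapter: model does not support attention inspection")
	}
	return inspector.InspectAttention(ctx, prompt)
}

// echoModel is a toy model that returns an error if a nil context reaches it.
type echoModel struct{}

func (echoModel) InspectAttention(ctx context.Context, prompt string) (string, error) {
	if ctx == nil {
		return "", errors.New("nil context reached the model")
	}
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "inspected:" + prompt, nil
}
```

Calling `(&Adapter{model: echoModel{}}).InspectAttention(nil, "hi")` succeeds in this sketch because the fallback runs before the model is consulted.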

In `@go/agent/index.go`:
- Around line 273-281: After loading bundle with kv.LoadMemvidBlockBundle,
verify the bundle identity matches the index metadata (e.g., compare
bundle.SnapshotHash or its canonical hash field against
entry.SnapshotHash/entry.SnapshotHashHex) before proceeding; if they differ,
return an error instead of calling kv.LoadPrefixFromMemvidBlocksWithOptions so a
repointed bundle URI cannot silently restore the wrong KV state. Ensure the
check sits between the successful return from LoadMemvidBlockBundle and the call
to kv.LoadPrefixFromMemvidBlocksWithOptions and uses the unique symbols bundle,
entry, bundle.SnapshotHash (or the actual bundle hash field) and
entry.SnapshotHash for the comparison.
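
A minimal sketch of the identity check requested above, assuming the bundle and the index entry each carry a comparable snapshot hash. `Bundle`, `IndexEntry`, their field names and `verifyBundleIdentity` are hypothetical; the real `kv` and `agent` types may expose the hash differently. Treating an empty entry hash as an error is also an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Bundle is a hypothetical stand-in for the loaded memvid block bundle.
type Bundle struct {
	SnapshotHash string
	TokenCount   int
}

// IndexEntry is a hypothetical stand-in for the index metadata of one bundle.
type IndexEntry struct {
	URI          string
	SnapshotHash string
}

// ErrBundleMismatch marks a bundle whose identity differs from the index entry.
var ErrBundleMismatch = errors.New("bundle identity does not match index entry")

// verifyBundleIdentity returns an error unless the loaded bundle carries the
// same snapshot hash that the index recorded for it. It is meant to run after
// the bundle is loaded and before any prefix restore.
func verifyBundleIdentity(entry IndexEntry, bundle *Bundle) error {
	if bundle == nil {
		return errors.New("bundle is nil")
	}
	if entry.SnapshotHash == "" {
		return errors.New("index entry has no snapshot hash to verify against")
	}
	if bundle.SnapshotHash != entry.SnapshotHash {
		return fmt.Errorf("%w: uri=%s want=%s got=%s",
			ErrBundleMismatch, entry.URI, entry.SnapshotHash, bundle.SnapshotHash)
	}
	return nil
}
```

Wrapping the sentinel with `%w` lets callers distinguish a repointed URI from other load failures via `errors.Is`.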

In `@go/agent/wake_sleep.go`:
- Around line 201-208: The NewSleepIndex function dereferences bundle.TokenCount
without validating bundle, so add a guard at the start of NewSleepIndex to
validate the bundle (and its TokenCount if needed) and return a descriptive
error instead of allowing a panic; specifically check if the bundle parameter is
nil (and optionally ensure bundle.TokenCount is within an expected range) before
constructing the MemvidIndexEntry, and return an error when invalid so callers
of NewSleepIndex get a clear failure rather than a runtime panic.
- Around line 117-123: The code currently defaults to index.Entries[0] when
entryURI is empty, which can restore the wrong span; change the logic in the
block handling entryURI so that if entryURI == "" you only auto-select the sole
entry when len(index.Entries) == 1, otherwise return an error requiring an
explicit EntryURI. Update the flow around the index.Entry(entryURI) call to use
the selected entryURI when single-entry, and return a clear core.NewError (e.g.,
"mlx: EntryURI required when index has multiple entries") if multiple entries
exist and no EntryURI was provided.
- Around line 125-132: PlanWake currently loads a bundle via
kv.LoadMemvidBlockBundle and only checks prefix token bounds, but it must also
verify the loaded bundle matches the selected index to prevent accepting a
repointed URI; after loading the bundle (bundle) and before using
bundle.TokenCount, compare the bundle identity (e.g., bundle.ID or
bundle.Identity/Hash from bundle.Metadata) against the index identifier stored
on the plan entry (e.g., fields reachable from entry such as entry.Index,
entry.BundleID or entry.SelectedIndex) and return a clear error (similar to
core.NewError) if they differ; update the code around kv.LoadMemvidBlockBundle,
entry.PrefixTokens(), and bundle.TokenCount to perform this identity check and
fail early on mismatch.
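
The first two wake/sleep findings above (nil bundle guard, explicit entry selection) can be sketched as follows. All types and function names are hypothetical stand-ins for the `go/agent` code, and rejecting a non-positive token count is an assumed range check rather than a known project rule.

```go
package main

import (
	"errors"
	"fmt"
)

// Entry, Index and Bundle are hypothetical stand-ins for the memvid index types.
type Entry struct {
	URI        string
	TokenCount int
}

type Index struct {
	Entries []Entry
}

type Bundle struct {
	TokenCount int
}

// selectEntry auto-selects only when the index holds exactly one entry;
// with several entries the caller must name one explicitly.
func selectEntry(index *Index, entryURI string) (Entry, error) {
	if index == nil || len(index.Entries) == 0 {
		return Entry{}, errors.New("index has no entries")
	}
	if entryURI == "" {
		if len(index.Entries) != 1 {
			return Entry{}, errors.New("EntryURI required when index has multiple entries")
		}
		return index.Entries[0], nil
	}
	for _, e := range index.Entries {
		if e.URI == entryURI {
			return e, nil
		}
	}
	return Entry{}, fmt.Errorf("entry %q not found in index", entryURI)
}

// newSleepEntry validates the bundle before reading TokenCount from it,
// returning an error instead of panicking on a nil bundle.
func newSleepEntry(uri string, bundle *Bundle) (Entry, error) {
	if bundle == nil {
		return Entry{}, errors.New("bundle is nil")
	}
	if bundle.TokenCount <= 0 {
		return Entry{}, fmt.Errorf("bundle token count %d is not positive", bundle.TokenCount)
	}
	return Entry{URI: uri, TokenCount: bundle.TokenCount}, nil
}
```

The third finding, verifying the loaded bundle against the selected entry, follows the same hash comparison pattern suggested for `go/agent/index.go`.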

In `@go/artifact/artifact.go`:
- Around line 117-121: opts.Kind may be empty when calling opts.Store.Put which
leaves memvid.PutOptions.Kind unset; update the call site around opts.Store.Put
to ensure memvid.PutOptions.Kind is set to a sensible default when opts.Kind ==
"" (e.g., "json" or the record's kind) so kind-based retrieval works
reliably—modify the memvid.PutOptions construction to use a conditional default
for Kind before passing it to opts.Store.Put.
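
A minimal sketch of the conditional default, assuming `"json"` as the fallback (one of the values the comment suggests). `PutOptions` here is a hypothetical stand-in for `memvid.PutOptions`, and `putOptionsFor` is an illustrative helper name.

```go
package main

// PutOptions is a hypothetical stand-in for memvid.PutOptions.
type PutOptions struct {
	Kind string
}

// defaultArtifactKind is an assumed fallback; the project may prefer the
// record's own kind instead.
const defaultArtifactKind = "json"

// putOptionsFor returns PutOptions whose Kind is never empty, so kind-based
// retrieval has something to match on.
func putOptionsFor(kind string) PutOptions {
	if kind == "" {
		kind = defaultArtifactKind
	}
	return PutOptions{Kind: kind}
}
```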

In `@go/backend.go`:
- Line 687: The fallback path that turns chunked prompts into a single Generate
call loses caller cancellation because it routes through helpers that use
context.Background(); modify the chunk fallback flow to propagate the original
context instead of using context.Background() — specifically, update the callers
that invoke promptChunksToString and m.Generate so they accept and forward a
context.Context (or call a context-aware m.Generate variant), change any helper
functions that currently create context.Background() to take a ctx param, and
ensure all three fallback sites (the code paths that call promptChunksToString
and then m.Generate) forward the incoming ctx so deadlines/cancellations are
preserved.
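
A minimal sketch of a context-preserving chunk fallback. `generateFunc`, `generateFromChunks`, `echoGenerate` and `cancelledContext` are hypothetical names, and joining chunks with no separator is an assumption about `promptChunksToString`; the real backend signatures are not shown in this review.

```go
package main

import (
	"context"
	"strings"
)

// generateFunc is a hypothetical stand-in for a context-aware m.Generate.
type generateFunc func(ctx context.Context, prompt string) (string, error)

// promptChunksToString joins chunks into one prompt (assumed behaviour).
func promptChunksToString(chunks []string) string {
	return strings.Join(chunks, "")
}

// generateFromChunks forwards the caller's ctx instead of substituting
// context.Background(), so deadlines and cancellation survive the fallback.
func generateFromChunks(ctx context.Context, chunks []string, generate generateFunc) (string, error) {
	if ctx == nil {
		ctx = context.Background()
	}
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return generate(ctx, promptChunksToString(chunks))
}

// echoGenerate is a toy generator that honours cancellation.
func echoGenerate(ctx context.Context, prompt string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "out:" + prompt, nil
}

// cancelledContext returns an already-cancelled context for demonstration.
func cancelledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}
```

With a cancelled context the sketch returns `context.Canceled` without invoking the generator, which is the behaviour the fallback path currently loses.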

In `@go/blockcache/blockcache.go`:
- Around line 205-215: Selective clears currently only remove metadata and disk
records, leaving in-memory/runtime entries behind; update the filtered-clear
branch (the code handling len(labels) > 0) to also purge matching runtime state
by removing any entries in service.blocks that match the cleared labels/prefixes
and updating service.hits/service.misses accordingly, then invoke
service.cfg.ClearRuntime() (if non-nil) just like the unfiltered branch; reuse
service.clearDiskLocked() for disk cleanup and ensure all of this runs under the
same lock so service and backend remain in sync.
- Around line 385-395: diskRecordCompatible currently only checks
model/adapter/tokenizer hashes and misses block layout changes; update it to
also verify cache mode and block size match the stored record. In
diskRecordCompatible (and when comparing against record.diskRef), add a cache
mode comparison (e.g. cacheIdentityMatches(service.cfg.CacheMode,
record.Ref.CacheMode)) and a block size comparison (e.g. service.cfg.BlockSize
== record.Ref.BlockSize or an equivalent integer equality) and return false if
either differs, preserving the existing hash checks (cacheIdentityMatches for
ModelHash/AdapterHash/TokenizerHash).
- Around line 172-175: The cache hit branch in the loop over refs leaves refs[i]
as the newly built ref, losing persisted labels; update the hit handling in the
loop inside WarmCache (or the function iterating refs) so that when
service.blocks[ref.ID] exists you increment service.hits and replace refs[i]
with the stored entry (service.blocks[ref.ID]) instead of continuing, thereby
preserving persisted labels like memvid_* from the cached block.
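
The three block cache findings above can be sketched together. `Ref`, `Config`, `Service` and the method names are hypothetical stand-ins for `go/blockcache`; plain equality replaces `cacheIdentityMatches`, whose real matching rules are not shown here, and locking is omitted for brevity.

```go
package main

// Ref, Config and Service are hypothetical stand-ins for the block cache types.
type Ref struct {
	ID        string
	ModelHash string
	CacheMode string
	BlockSize int
	Labels    map[string]string
}

type Config struct {
	ModelHash string
	CacheMode string
	BlockSize int
}

type Service struct {
	cfg    Config
	blocks map[string]Ref
	hits   int
	misses int
}

func newService(cfg Config) *Service {
	return &Service{cfg: cfg, blocks: make(map[string]Ref)}
}

// diskRecordCompatible compares cache mode and block size as well as the
// model hash, so a layout change invalidates old disk records.
func (s *Service) diskRecordCompatible(ref Ref) bool {
	return ref.ModelHash == s.cfg.ModelHash &&
		ref.CacheMode == s.cfg.CacheMode &&
		ref.BlockSize == s.cfg.BlockSize
}

// warm counts hits and misses; on a hit it returns the stored ref so that
// persisted labels are not replaced by the freshly built ref.
func (s *Service) warm(refs []Ref) []Ref {
	out := make([]Ref, len(refs))
	for i, ref := range refs {
		if stored, ok := s.blocks[ref.ID]; ok {
			s.hits++
			out[i] = stored
			continue
		}
		s.misses++
		s.blocks[ref.ID] = ref
		out[i] = ref
	}
	return out
}

// clearLabelled removes in-memory blocks that carry the given label key and
// reports how many were removed, keeping runtime state in step with a
// selective clear.
func (s *Service) clearLabelled(key string) int {
	removed := 0
	for id, ref := range s.blocks {
		if _, ok := ref.Labels[key]; ok {
			delete(s.blocks, id)
			removed++
		}
	}
	return removed
}
```

In the real service these operations would run under the existing lock, and the selective clear would also call the runtime clear hook and disk cleanup that the finding mentions.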

---

Nitpick comments:
In `@docs/inference/thinking.md`:
- Around line 74-78: The fenced code block containing the token categorisation
lines (ThinkingShow, ThinkingHide, ThinkingCapture) lacks a language specifier
and triggers MD040; update the triple-backtick fence to include a language
identifier (e.g., change ``` to ```text) so the block is properly flagged as
plain text and satisfies the markdown linter.

In `@docs/moe/jang.md`:
- Around line 82-90: Add a language specifier to the fenced code block that
lists the profile names (the block containing "JANG_2M — 2-bit mid-tier",
"JANG_3M — 3-bit mid-tier", etc.); replace the opening triple-backtick with one
that specifies a language identifier (e.g., text) so the block becomes a fenced
code block with a language label for consistent Markdown rendering.

In `@docs/moe/README.md`:
- Line 9: The sentence "Pre-dates this sprint were dense models (Gemma 3/4
dense, Qwen 3, Llama 3);" is grammatically awkward—replace it with a clearer
phrasing that conveys those dense models existed before this sprint, for
example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen 3, Llama 3)
were supported." Edit the README line in the vMLX parity Phase 1 paragraph to
use this clearer wording so the relationship between prior dense models and the
new sparse-expert work is unambiguous.

In `@docs/observability/probe.md`:
- Around line 31-46: The fenced code block in the emission points section lacks
a language specifier; update the opening triple-backticks to include a language
(for example change ``` to ```text or ```yaml) so the block is
rendered/compliant (the block that begins with "Generate / Chat:" and lists
items like "prefill start → cache_pressure" should be updated).

In `@docs/runtime/README.md`:
- Line 68: Update the link text in docs/runtime/README.md that currently reads
"[../model/model_pack.md] — pre-load validation" to use the single-word form
"preload" (i.e., change "pre-load validation" to "preload validation") so the
description next to the model_pack.md link uses the conventional computing term;
locate the occurrence of "pre-load validation" and replace it with "preload
validation".
- Around line 44-62: The fenced code block showing the boot flow (starting with
"package init time:") lacks a language specifier, causing MD040 lint failures;
update the opening backticks to include a language tag (e.g., add "text" so the
block begins with ```text) in README.md near the boot flow that references
register_metal.go init(), inference.Register(&metalbackend{}),
inference.LoadModel, metal.LoadAndInit, and metaladapter usage to satisfy the
markdown linter.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md`:
- Around line 7-9: Replace the machine-specific absolute paths in the plan
document (the two occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Around line 5-6: The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.

In `@docs/vmlx-feature-gap-report.md`:
- Around line 7-8: Replace the hard-coded absolute filesystem path and the full
external URL in the report text with more maintainable references: change the
absolute path string to a relative or generic placeholder (e.g., "cloned locally
at <local-clone-path>" or "<audit-clone-path>") and move the external repository
URL to a footnote, appendix, or a single "References" section, or replace it
with a short identifier combined with a reference list; update the text around
the original literal mentions so it reads the same but without embedding
environment-specific paths.

In `@go/agent/index_test.go`:
- Around line 16-304: Add a new test with the _Ugly suffix in this file that
completes the Good/Bad/Ugly triplet for the public index API surface;
specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_* that triggers and
asserts panic/edge behaviors for the public functions (e.g., NewMemvidIndex,
SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab3e2038-8f7c-4771-a11f-b232a1a59e08

📥 Commits

Reviewing files that changed from the base of the PR and between 07f6af1 and 89f613e.

📒 Files selected for processing (300)
  • .gitignore
  • .gitmodules
  • CLAUDE.md
  • CMakeLists.txt
  • GOAL.md
  • docs/README.md
  • docs/architecture.md
  • docs/build.md
  • docs/cmd/violet.md
  • docs/compute/compute.md
  • docs/development.md
  • docs/examples/compute/frame-pipeline.md
  • docs/examples/daemon/violet-socket.md
  • docs/examples/eval/attention-probe.md
  • docs/examples/eval/perplexity.md
  • docs/examples/inference/batch.md
  • docs/examples/inference/chat.md
  • docs/examples/inference/quantization.md
  • docs/examples/inference/streaming.md
  • docs/examples/model-ops/hf-fit.md
  • docs/examples/model-ops/kv-snapshot.md
  • docs/examples/model-ops/merge.md
  • docs/examples/model-ops/quantize-gguf.md
  • docs/examples/training/distill.md
  • docs/examples/training/grpo.md
  • docs/examples/training/lora-finetune.md
  • docs/examples/training/lora-fuse.md
  • docs/history.md
  • docs/index.md
  • docs/inference/README.md
  • docs/inference/block_cache.md
  • docs/inference/decode_optimisation.md
  • docs/inference/parser_registry.md
  • docs/inference/scheduler.md
  • docs/inference/thinking.md
  • docs/memory/README.md
  • docs/memory/agent_memory.md
  • docs/memory/agentic_project_seed.md
  • docs/memory/kv_snapshot.md
  • docs/memory/kv_snapshot_blocks.md
  • docs/memory/kv_snapshot_index.md
  • docs/memory/kv_snapshot_memvid.md
  • docs/memory/medium.md
  • docs/memory/state_bundle.md
  • docs/model-operations.md
  • docs/model/README.md
  • docs/model/memory_plan.md
  • docs/model/model_pack.md
  • docs/models.md
  • docs/moe/README.md
  • docs/moe/codebook_vq.md
  • docs/moe/expert_residency.md
  • docs/moe/jang.md
  • docs/moe/minimax_m2.md
  • docs/observability/probe.md
  • docs/runtime/2026-05-16-gemma4-e2b-driver-profile.md
  • docs/runtime/2026-05-17-gemma4-parity-and-last-logits.md
  • docs/runtime/2026-05-17-llamacpp-prefill-comparison.md
  • docs/runtime/2026-05-18-gemma4-mtp-speculative-decode.md
  • docs/runtime/2026-05-19-gemma4-e2b-100k-retained-paged.md
  • docs/runtime/2026-05-19-gemma4-e2b-quant-matrix.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-26b-a4b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-fresh-history-c10-g1536-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
  • docs/runtime/2026-05-19-goal-completion-audit.md
  • docs/runtime/2026-05-19-runner-calibration.md
  • docs/runtime/2026-05-20-chapter-profile-safety.md
  • docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
  • docs/runtime/README.md
  • docs/runtime/adapter.md
  • docs/runtime/local_autotune.md
  • docs/runtime/register_metal.md
  • docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md
  • docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md
  • docs/training/README.md
  • docs/training/distill.md
  • docs/training/eval.md
  • docs/training/grpo.md
  • docs/training/lora_adapter.md
  • docs/training/sft.md
  • docs/vmlx-feature-gap-report.md
  • external/go-ai
  • external/go-inference
  • external/go-ml
  • go/adapter.go
  • go/adapter/adapter.go
  • go/adapter_example_test.go
  • go/adapter_test.go
  • go/agent/helpers.go
  • go/agent/index.go
  • go/agent/index_test.go
  • go/agent/test_helpers_test.go
  • go/agent/wake_sleep.go
  • go/api_common.go
  • go/api_common_example_test.go
  • go/api_darwin_test.go
  • go/api_shape_test.go
  • go/api_stub.go
  • go/api_stub_example_test.go
  • go/api_stub_test.go
  • go/api_test.go
  • go/api_tokenizer_darwin_test.go
  • go/api_tokenizer_stub.go
  • go/api_tokenizer_stub_example_test.go
  • go/api_tokenizer_stub_test.go
  • go/artifact/artifact.go
  • go/artifact/artifact_test.go
  • go/attention_test.go
  • go/backend.go
  • go/backend_example_test.go
  • go/backend_test.go
  • go/blockcache/blockcache.go
  • go/blockcache/blockcache_test.go
  • go/blockcache/helpers_test.go
  • go/bundle/bundle.go
  • go/bundle/bundle_test.go
  • go/bundle/example_test.go
  • go/bundle/sami.go
  • go/chaptersmoke/chaptersmoke.go
  • go/chaptersmoke/chaptersmoke_test.go
  • go/chat/chat.go
  • go/chat/chat_test.go
  • go/chat/example_test.go
  • go/cmd/go-mlx/main.go
  • go/cmd/go-mlx/main_test.go
  • go/cmd/mlx/main.go
  • go/cmd/mlx/main_test.go
  • go/cmd/mlx/split_ffn_tune.go
  • go/compute/compute.go
  • go/compute/compute_example_test.go
  • go/compute/compute_metal.go
  • go/compute/compute_metal_example_test.go
  • go/compute/compute_metal_helper_test.go
  • go/compute/compute_metal_test.go
  • go/compute/compute_test.go
  • go/compute_stub.go
  • go/compute_stub_example_test.go
  • go/compute_stub_test.go
  • go/compute_test.go
  • go/dataset/jsonl.go
  • go/dataset/sample.go
  • go/dataset_stream.go
  • go/dataset_stream_example_test.go
  • go/dataset_stream_test.go
  • go/device_info.go
  • go/distill.go
  • go/distill_test.go
  • go/eval.go
  • go/eval_darwin.go
  • go/eval_darwin_test.go
  • go/eval_stub.go
  • go/eval_test.go
  • go/fast_eval.go
  • go/fast_eval_example_test.go
  • go/fast_eval_runner.go
  • go/fast_eval_test.go
  • go/gguf/info.go
  • go/gguf/info_example_test.go
  • go/gguf/info_test.go
  • go/gguf/quantize.go
  • go/gguf/quantize_test.go
  • go/grpo.go
  • go/grpo_test.go
  • go/helpers.go
  • go/hf/hf.go
  • go/hf/hf_test.go
  • go/hf/test_helpers_test.go
  • go/hf_fit.go
  • go/inference_contract.go
  • go/inference_contract_test.go
  • go/internal/metal/activation_bridge.cpp
  • go/internal/metal/array.go
  • go/internal/metal/backend.go
  • go/internal/metal/backend_test.go
  • go/internal/metal/batch.go
  • go/internal/metal/cache.go
  • go/internal/metal/cache_test.go
  • go/internal/metal/close.go
  • go/internal/metal/codebook_vq.go
  • go/internal/metal/codebook_vq_test.go
  • go/internal/metal/compile.go
  • go/internal/metal/compile_test.go
  • go/internal/metal/decode.go
  • go/internal/metal/decode_bridge.cpp
  • go/internal/metal/decode_bridge.h
  • go/internal/metal/decode_test.go
  • go/internal/metal/dense_matvec.go
  • go/internal/metal/dense_matvec_test.go
  • go/internal/metal/device.go
  • go/internal/metal/dtype.go
  • go/internal/metal/error_test.go
  • go/internal/metal/expert_id_matvec.go
  • go/internal/metal/expert_id_matvec_test.go
  • go/internal/metal/fast.go
  • go/internal/metal/fast_test.go
  • go/internal/metal/gemma3.go
  • go/internal/metal/gemma4.go
  • go/internal/metal/gemma4_assistant.go
  • go/internal/metal/gemma4_assistant_decode.go
  • go/internal/metal/gemma4_assistant_decode_example_test.go
  • go/internal/metal/gemma4_assistant_decode_test.go
  • go/internal/metal/gemma4_assistant_generate.go
  • go/internal/metal/gemma4_assistant_generate_test.go
  • go/internal/metal/gemma4_assistant_pair.go
  • go/internal/metal/gemma4_assistant_test.go
  • go/internal/metal/gemma4_ffn_residual.go
  • go/internal/metal/gemma4_ffn_residual_test.go
  • go/internal/metal/gemma4_router_topk.go
  • go/internal/metal/gemma4_router_topk_test.go
  • go/internal/metal/gemma4_test.go
  • go/internal/metal/gemma4_vision.go
  • go/internal/metal/generate.go
  • go/internal/metal/generate_test.go
  • go/internal/metal/jang_dequant.go
  • go/internal/metal/jang_dequant_test.go
  • go/internal/metal/kv_snapshot.go
  • go/internal/metal/metal.go
  • go/internal/metal/minimax_m2.go
  • go/internal/metal/minimax_m2_test.go
  • go/internal/metal/mlx_mlx_backend_cpu_available.cpp
  • go/internal/metal/mlx_mlx_backend_gpu_device_info.cpp
  • go/internal/metal/model.go
  • go/internal/metal/model_test.go
  • go/internal/metal/nn.go
  • go/internal/metal/nn_test.go
  • go/internal/metal/ops.go
  • go/internal/metal/process_memory_darwin.go
  • go/internal/metal/process_memory_stub.go
  • go/internal/metal/prompt_cache.go
  • go/internal/metal/prompt_cache_test.go
  • go/internal/metal/qwen3.go
  • go/internal/metal/qwen3_test.go
  • go/internal/metal/runtime_gate.go
  • go/internal/metal/runtime_gate_example_test.go
  • go/internal/metal/runtime_gate_test.go
  • go/internal/metal/sample.go
  • go/internal/metal/sample_test.go
  • go/internal/metal/session.go
  • go/internal/metal/session_example_test.go
  • go/internal/metal/session_test.go
  • go/internal/metal/split.go
  • go/internal/metal/split_test.go
  • go/internal/metal/stream.go
  • go/internal/metal/tokenizer.go
  • go/internal/metal/tokenizer_test.go
  • go/internal/metal/trace.go
  • go/internal/metal/trace_test.go
  • go/internal/metal/training.go
  • go/jang_test.go
  • go/kv/analysis.go
  • go/kv/analysis_example_test.go
  • go/kv/analysis_test.go
  • go/kv/bench.go
  • go/kv/bench_test.go
  • go/kv/blocks.go
  • go/kv/blocks_test.go
  • go/kv/helpers_test.go
  • go/kv/memvid.go
  • go/kv/memvid_test.go
  • go/kv/snapshot.go
  • go/kv/snapshot_example_test.go
  • go/kv/snapshot_test.go
  • go/kv_analysis_example_test.go
  • go/kv_cache_bench.go
  • go/kv_snapshot.go
  • go/kv_snapshot_example_test.go
  • go/kv_snapshot_test.go
  • go/local_tuning.go
  • go/local_tuning_test.go
  • go/lora/adapter.go
  • go/lora/fuse.go
  • go/lora/fuse_stub.go
  • go/lora/fuse_test.go
  • go/lora_adapter_darwin_test.go
  • go/lora_adapter_test.go
  • go/lora_fuse.go
  • go/lora_fuse_darwin.go
  • go/lora_fuse_darwin_test.go
  • go/lora_fuse_test.go
  • go/medium_test.go
  • go/memory/example_test.go
  • go/memory/memory.go
  • go/memory/memory_test.go
  • go/memory_plan.go
  • go/memory_plan_example_test.go
  • go/memory_plan_test.go
  • go/memvid_chapter_smoke.go
  • go/merge/compare.go
  • go/merge/compare_example_test.go
  • go/merge/compare_test.go
  • go/merge/helpers_test.go
  • go/merge/merge.go
  • go/merge/merge_test.go
  • go/mlx.go
  • go/mlx_example_test.go
  • go/mlx_internal_test.go
  • go/mlx_stub.go
  • go/mlx_stub_example_test.go
💤 Files with no reviewable changes (15)
  • go/api_test.go
  • go/api_stub_example_test.go
  • go/api_tokenizer_stub_test.go
  • go/adapter_example_test.go
  • go/api_tokenizer_stub.go
  • go/api_tokenizer_darwin_test.go
  • go/api_tokenizer_stub_example_test.go
  • go/backend_example_test.go
  • go/api_common_example_test.go
  • go/api_shape_test.go
  • go/api_common.go
  • go/api_darwin_test.go
  • go/attention_test.go
  • go/api_stub.go
  • go/api_stub_test.go

Comment thread docs/memory/kv_snapshot_blocks.md
Comment thread go/artifact/artifact.go Outdated
Comment thread go/backend.go Outdated
Comment thread go/blockcache/blockcache.go
Comment thread go/blockcache/blockcache.go
Comment thread go/blockcache/blockcache.go

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/backend.go`:
- Around line 569-572: The code is aliasing caller-owned byte slices into the
snapshot by assigning head.KeyBytes and head.ValueBytes directly to KeyBytes and
ValueBytes; make defensive copies instead (like Value is copied) to avoid
leaking mutable state—replace the direct assignments for KeyBytes and ValueBytes
with fresh copies (e.g., using append to copy into a new []byte) when
constructing the metal snapshot/struct (the fields KeyBytes and ValueBytes on
the metal KV head).
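The defensive copy suggested above can be sketched as below. The struct and function names (`kvHead`, `snapshotHead`, `cloneBytes`, `snapshotOf`) are hypothetical stand-ins, since the actual types in `go/backend.go` are not shown here; only the `append`-based copy idiom is taken from the comment.

```go
package main

import "fmt"

// kvHead is a hypothetical stand-in for the caller-owned KV head.
type kvHead struct {
	KeyBytes   []byte
	ValueBytes []byte
}

// snapshotHead is a hypothetical stand-in for the snapshot-side struct.
type snapshotHead struct {
	KeyBytes   []byte
	ValueBytes []byte
}

// cloneBytes returns a copy backed by a fresh array. A nil or empty input
// yields a nil slice.
func cloneBytes(b []byte) []byte {
	return append([]byte(nil), b...)
}

// snapshotOf copies both byte slices so later writes by the caller cannot
// reach the snapshot.
func snapshotOf(head kvHead) snapshotHead {
	return snapshotHead{
		KeyBytes:   cloneBytes(head.KeyBytes),
		ValueBytes: cloneBytes(head.ValueBytes),
	}
}

func main() {
	head := kvHead{KeyBytes: []byte{1, 2, 3}, ValueBytes: []byte{4, 5, 6}}
	snap := snapshotOf(head)

	// Mutate the caller's buffers after taking the snapshot.
	head.KeyBytes[0] = 99
	head.ValueBytes[0] = 99

	fmt.Println(snap.KeyBytes[0], snap.ValueBytes[0])
}
```

With a direct assignment instead of `cloneBytes`, the two mutations in `main` would be visible through `snap`, which is the aliasing the comment warns about.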

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b686e0a-8b41-4e47-975f-03cf235491e9

📥 Commits

Reviewing files that changed from the base of the PR and between 89f613e and c19bc07.

📒 Files selected for processing (22)
  • CMakeLists.txt
  • cpp/CMakeLists.txt
  • go/backend.go
  • go/backend_test.go
  • go/cmd/mlx/main.go
  • go/cmd/mlx/main_test.go
  • go/internal/metal/backend.go
  • go/internal/metal/backend_test.go
  • go/internal/metal/decode_bridge.cpp
  • go/internal/metal/gemma4.go
  • go/internal/metal/gemma4_test.go
  • go/internal/metal/generate.go
  • go/internal/metal/metal.go
  • go/internal/metal/mlx_build_config.h
  • go/internal/metal/pinned_array.go
  • go/internal/metal/pinned_array_bridge.cpp
  • go/internal/metal/pinned_array_test.go
  • go/internal/metal/sample.go
  • go/internal/metal/sample_test.go
  • go/internal/metal/session.go
  • go/kv/snapshot.go
  • go/memvid_chapter_smoke.go
✅ Files skipped from review due to trivial changes (1)
  • cpp/CMakeLists.txt

Comment thread go/backend.go Outdated

@github-advanced-security github-advanced-security AI left a comment


SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Comment on lines +188 to +207
book_path.write_text(
    "# "
    + title
    + "\n\n"
    + f"Generated by go-mlx retained State run `{report_path.name}`.\n\n"
    + f"Seed prompt: `{seed['id']}`\n\n"
    + seed["prompt"]
    + "\n\n"
    + "Distractor prompts were supplied one per chapter as entropy and "
    "imagery pressure, not as replacement plot instructions.\n\n"
    + "## Distractors\n\n"
    + "\n".join(f"- `{item['id']}`" for item in distractors)
    + "\n\n"
    + "## Metrics\n\n"
    + metric_line(report)
    + "\n---\n\n"
    + "\n\n".join(chapters)
    + "\n",
    encoding="utf-8",
)
parser.add_argument("--random-seed", type=int, default=0)
parser.add_argument("--count", type=int, default=1)
parser.add_argument("--turns", type=int, default=10)
parser.add_argument("--run-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/book-runs"))
parser.add_argument("--book-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/books"))
parser.add_argument("--manifest", type=Path, default=Path("/private/tmp/go-mlx-goal/books/manifest.jsonl"))
Comment thread scripts/state_book_from_phase0.py Fixed
_ = os.Setenv("MLX_METALLIB_PATH", dst)
return
}
if err := os.MkdirAll(dir, 0o755); err != nil {
"model_type": "gemma4",
"config_blob_id": "923b5e9405e7d319572b0c1b1a89291512262aa3",
"config_sha256": "1b28f3d2c3100f6c594754b81107428bd7b822a7f48272ca681dae9d2ec38330",
"tokenizer_blob_id": "1ff9f3e3439a939b971f9919e821bf87e835a503",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"model_type": "gemma4_assistant",
"config_blob_id": "b4c30e888c89b39c8f106b5015307fb7830f0bb2",
"config_sha256": "7f42f559a6a69ffaeaf6b61a1ece3a562a2ed5ad00b8d30f16917ba5ab1bcbe9",
"tokenizer_blob_id": "24aa4244652e010036db5fdd29ed39b9428e6e19",
"tokenizer_sha256": "75a6583c1a418e2bbd79c60d95d28e0f5bf549ad3f2990b5bdb5238c6c2bf70c",
"tokenizer_config_blob_id": "1a6bee041ca75778c514a071efbdb568b0f3d7b0",
"tokenizer_config_sha256": "089594a3924fcfd4cb1c596a7906fbf476193519e5198f780912eed02b177e42",
"config_sha256": "5cdd5627ab3ecf52086cc79b2c14c45a277d273069f1d73bf17a3a5136afe3db",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"config_sha256": "32e50a33a18172e79c86b7a78aff7e79c7544031199d672a2a65e526a8bf0199",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"config_sha256": "6d12c87861fff3871d3a745011b0d852be6513f3ce594ae1e8d643dae9d3b9a8",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"config_sha256": "614e876b4efcaff13ce4c7a3f96a5b9de86325e3d2ab9c622606ced688f1b8b7",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"config_sha256": "d6be5b24cbc974d492804737716ade8d2575eb849ec90a1d316bb64e99838104",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"config_sha256": "29b810ed760b55104943a3cc3b6f8b9ca079e6e00b09585d85aec54863a42fb4",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"command": "env MLX_METALLIB_PATH=/Users/snider/Code/core/go-mlx/dist/lib/mlx.metallib GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache /private/tmp/go-mlx-self/bin/lthn-mlx driver-profile -json -fast-gemma4-lane -cache-mode paged -context 4096 -trace-token-phases=false -prompt \"Write a short engineering note explaining why Gemma 4 12B Unified uses a 1024-token local sliding window and full global owner layers in a retained-state runtime.\" -max-tokens 192 -runs 1 -include-output=true -report-file /private/tmp/go-mlx-self/reports/gemma4-12b-6bit-sample-output.json /private/tmp/go-mlx-self/models/mlx-community-gemma-4-12B-6bit",
"generated_tokens": 192,
"visible_tokens": 192,
"output_token_ids_sha256": "d34765e9895731937ad93004503887835008d9fdb532f7da7cadb6ba2cc9327c",
Snider and others added 14 commits June 15, 2026 14:01
Cover the previously-uncovered deprecated Memvid alias surface and the
deepest uncovered branches of the State block loaders.

- state_store: SaveMemvid / LoadFromMemvid / LoadFromMemvidWithOptions
  round-trips against the canonical State path (all 0% -> 100%).
- blocks: all 12 Memvid block aliases (Save/Load/Validate/LoadPrefix/
  LoadBlock) round-tripped through their canonical counterparts; plus
  Bad/Ugly cases for SaveStateBlockBundle, LoadStateBlockBundle,
  LoadPrefixFromStateBlocksWithOptions, LoadPrefixTokensFromStateBlocks,
  LoadStateBlockTokens(WithOptions) — including the JSON/base64 envelope
  path via a text-only store and tampered-manifest metadata checks.
- snapshot: ResultError (all three value shapes), EffectiveSeqLen,
  HashSnapshot (nil + native-encoding branch), and Head guard branches
  (nil receiver, negative/out-of-range indices).
- Runnable Example* on the canonical symbols (SaveState/SaveStateBlocks/
  LoadFromState/LoadFromStateBlocks/LoadPrefixFromStateBlocks/
  HashSnapshot/EffectiveSeqLen) with // Output.

Tests only; no production change. Coverage 81.4% -> 85.4%, 81 -> 104 tests.

Co-Authored-By: Virgil <virgil@lethean.io>
Follow-up to 1fefb5c, whose subject claimed Examples the diff did not
carry. ExampleNewFixedKVCacheAtOffset + ExampleCachesTruncateTo, both
deterministic (counters/guards, no GPU op in the // Output:). Fresh
commit, not an amend — dev is shared and HEAD may have moved.

Co-Authored-By: Virgil <virgil@lethean.io>
…Ugly + Example, AX)

Tractable-from-Go coverage, no model loads:
- array.go 4→0: ArrayHandle/ArrayFromHandle cgo-bridge round-trip
  (one-owner Free discipline — a naive shared-ctx test double-freed
  and crashed; borrow ctx is cleared so only the original frees),
  DefaultStreamHandle, FromRawBytes (4 panic guards + Good + Example).
- metal.go 4→0: MetallibResolution, defaultMetallibPath, MaterializeAsync
  (hostDeviceInfo via HostDeviceInfo).
- turboquant_kv_cache.go 6→0: empty/nil-receiver state guards
  (AppendState/AppendDirtyState/ReadState/Detach), snapshot-decode +
  payload-prefix layout-validation error paths.
- cache_quantized.go 5→1: pre-Update accessors, packQ4 synthetic pack
  (remaining Detach is a statement-less documented no-op).
- stream.go 5→2: Synchronize, HostDeviceInfo, GetDeviceInfo (Set*Limit
  deferred — they mutate process-global Metal limits and the live
  set→restore window could perturb the flaky LastError red; covered
  only via their no-device guard).

No production touched.

Co-Authored-By: Virgil <virgil@lethean.io>
…s branches (AX)

Follow-up adding the cheap reachable branches in the three largest
remaining block-path partials:

- SaveStateBlocks: nil-snapshot / nil-store / bad-encoding guards (Bad)
  and the non-stream ReusePrefix path that adopts a parent prefix block
  by reference (Good). 71% -> 89%.
- LoadFromStateBlocksWithOptions: bad version, wrong kind, reordered
  block refs, and the bundle TokenOffset-mismatch guard, over a real
  bundle (Ugly). 72% -> 89%.
- AssembleBlocks: split-then-reassemble (Good), empty input + nil-snapshot
  block (Bad), non-contiguous ordering (Ugly). 74% -> 83%.

Tests only; no production change. Coverage 85.4% -> 85.9% (81.4% pre-session,
81 -> 108 tests).

Co-Authored-By: Virgil <virgil@lethean.io>
…cal ggml (AX)

NOTE — Q2_K is 84 bytes, not the 82 in the gguflib type-size table. There
is no 82-byte canonical block: upstream `static_assert(sizeof(block_q2_K)
== 2*sizeof(ggml_half) + QK_K/16 + QK_K/4)` = 84, and gguflib's own
decoder advances 16+64+4=84 per block. The table's 82 drops `dmin`. The
encoders are fixed to the proven-canonical 84; the gguflib C type-size
table (82) is left untouched as instructed but is a separate latent
read-path bug (its table mis-sizes real Q2_K tensors while its decoder
strides 84).

The four K-quant encoders shared a single generic affine layout
(d+dmin+12-byte packed scales+quants) that is correct only for Q4_K/Q5_K.
They now emit the canonical per-format block layout (ggml-common.h +
lib/gguflib/gguflib.c), so QuantizeModelPack no longer fails the streamed
block-size check and the bytes match the decoder's struct field offsets:

  Q2_K  80 -> 84   scales[16] qs[64] d(f16) dmin(f16)        affine
  Q3_K 112 -> 110  hmask[32] qs[64] scales[12] d(f16)        no dmin
  Q6_K 208 -> 210  ql[128] qh[64] scales[16 i8] d(f16)       no dmin
  Q8_K 272 -> 292  d(f32) qs[256 i8] bsums[16 i16]           symmetric
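
The new block sizes in the table above follow directly from the listed field layouts. The sketch below recomputes them, assuming QK_K = 256 elements per super-block (consistent with `qs[256 i8]` and `QK_K/16` = 16 scales) and a 2-byte f16. The function and constant names are illustrative and are not taken from the repository.

```go
package main

import "fmt"

const (
	qkK     = 256 // elements per K-quant super-block (QK_K), assumed
	f16Size = 2   // bytes in an f16 (ggml_half)
	f32Size = 4   // bytes in an f32
)

// q2KBlockBytes: scales[16] + qs[64] + d(f16) + dmin(f16).
func q2KBlockBytes() int { return qkK/16 + qkK/4 + 2*f16Size }

// q3KBlockBytes: hmask[32] + qs[64] + scales[12] + d(f16).
func q3KBlockBytes() int { return qkK/8 + qkK/4 + 12 + f16Size }

// q6KBlockBytes: ql[128] + qh[64] + scales[16 i8] + d(f16).
func q6KBlockBytes() int { return qkK/2 + qkK/4 + qkK/16 + f16Size }

// q8KBlockBytes: d(f32) + qs[256 i8] + bsums[16 i16].
func q8KBlockBytes() int { return f32Size + qkK + 2*(qkK/16) }

func main() {
	fmt.Println("Q2_K", q2KBlockBytes())
	fmt.Println("Q3_K", q3KBlockBytes())
	fmt.Println("Q6_K", q6KBlockBytes())
	fmt.Println("Q8_K", q8KBlockBytes())
}
```

Dropping the `dmin` term from `q2KBlockBytes` gives 82, the figure the note attributes to the gguflib type-size table.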

quantizeKBlock's sequential bit-pack was also wrong for the 2/3-bit qs:
the decoders read byte p%32 at shift 2*(p/32) per 128-element half, so
Q2_K/Q3_K now pack that exact inverse; Q6_K uses the 128-group ql/qh
interleave; Q8_K stores int16 group sums. ggufQuantizeLayout updated
(Q8_K 274->292, Q2_K 82->84); both are unexported, no external surface.
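
The 2-bit packing rule described above (byte `p%32` at shift `2*(p/32)` within each 128-element half) can be sketched as a pack/unpack pair. This is an illustration of the indexing only, assuming 256 elements per block; `packQ2` and `unpackQ2` are hypothetical names, and scales, `d`, and `dmin` are left out.

```go
package main

import "fmt"

const qkK = 256 // elements per super-block, assumed

// packQ2 packs 256 two-bit values into 64 bytes. Within each 128-element
// half, element p lands in byte p%32 of that half at bit offset 2*(p/32).
func packQ2(vals [qkK]uint8) [qkK / 4]byte {
	var qs [qkK / 4]byte
	for i, v := range vals {
		half := i / 128
		p := i % 128
		qs[half*32+p%32] |= (v & 3) << (2 * (p / 32))
	}
	return qs
}

// unpackQ2 reads the values back using the same byte and shift mapping.
func unpackQ2(qs [qkK / 4]byte) [qkK]uint8 {
	var vals [qkK]uint8
	for i := range vals {
		half := i / 128
		p := i % 128
		vals[i] = (qs[half*32+p%32] >> (2 * (p / 32))) & 3
	}
	return vals
}

func main() {
	var in [qkK]uint8
	for i := range in {
		in[i] = uint8(i % 4)
	}
	out := unpackQ2(packQ2(in))
	fmt.Println("round trip ok:", out == in)
}
```

A sequential packer would instead place element `i` in byte `i/4`, so elements 0 and 32 would land in different bytes. Under the interleaved rule they share byte 0 at shifts 0 and 2.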

Verified: payload-level round-trip via reference decoders that mirror
dequantize_row_q{2,3,6,8}_K verbatim — relative RMSE Q8_K 0.02 / Q6_K
0.08 / Q3_K 0.16 / Q2_K 0.25 (monotone with bit width), Q8_K bsums and
the Q3_K 6-bit scale pack asserted bit-exact, and all four survive
QuantizeModelPack -> ReadInfo. Full gguf suite 149 green; quantize
benches hold at 1 alloc/op.

Co-Authored-By: Virgil <virgil@lethean.io>
Lift probe package coverage 74.2% -> 99.3% by exercising the
previously-untested public surface and the payload-clone branches:

- NewInfluxPoster (0% -> 100%): drive the returned HTTP write closure
  against httptest — body+token-header POST, no-token, 5xx error,
  unreachable endpoint, malformed URL.
- CloneEvent / cloneEventInto (73.5%/51% -> 100%): extend the fixture to
  every payload (Entropy, LayerCoherence, Cache, Memory, Training, Score
  with Values) and drive both the on-emit path and the Recorder.Events
  scratch-backed read path; lifts cloneScoreValues too.
- Bus.EmitProbe non-owned-sink path: SinkFunc-on-bus pre-clone (single +
  mixed-owned fanout), empty-bus no-op, typed-nil owned sink, Add grow.
- LineProtocolSink: empty-label default, file-only sink + Flush, file
  append-failure drop counter.
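
The httptest-driven checks listed for the HTTP write closure can be sketched roughly as follows. `newPoster` here is a hypothetical stand-in written for this example, not the probe package's `NewInfluxPoster`. The `Authorization: Token ...` header format and the status-code threshold are assumptions, and only the token, no-token, 5xx, and malformed-URL cases are shown.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newPoster returns a closure that POSTs body to url, adding a token header
// when token is non-empty. Hypothetical; written only for this sketch.
func newPoster(url, token string) func(body string) error {
	return func(body string) error {
		req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(body))
		if err != nil {
			return err
		}
		if token != "" {
			req.Header.Set("Authorization", "Token "+token)
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 300 {
			return fmt.Errorf("write failed: %s", resp.Status)
		}
		return nil
	}
}

// checkPoster drives the closure against an in-process server and returns
// one message per failed expectation.
func checkPoster() []string {
	var failures []string
	var gotBody, gotAuth string
	status := http.StatusNoContent
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		gotBody, gotAuth = string(b), r.Header.Get("Authorization")
		w.WriteHeader(status)
	}))
	defer srv.Close()

	if err := newPoster(srv.URL, "secret")("cpu value=1"); err != nil {
		failures = append(failures, "token POST: "+err.Error())
	}
	if gotBody != "cpu value=1" || gotAuth != "Token secret" {
		failures = append(failures, "token POST: wrong body or header")
	}
	if err := newPoster(srv.URL, "")("cpu value=2"); err != nil || gotAuth != "" {
		failures = append(failures, "no-token POST: unexpected error or header")
	}
	status = http.StatusInternalServerError
	if err := newPoster(srv.URL, "")("cpu value=3"); err == nil {
		failures = append(failures, "5xx: expected an error")
	}
	if err := newPoster("://bad", "")("x"); err == nil {
		failures = append(failures, "malformed URL: expected an error")
	}
	return failures
}

func main() {
	fmt.Println("failures:", len(checkPoster()))
}
```

The in-process server keeps the test hermetic: no external InfluxDB endpoint is needed to reach the header and error branches.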

Production code untouched; bench allocs/op byte-identical. Remaining gap
is the two unreachable nil-sink sentinels in Bus.EmitProbe.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Raise chaptersmoke coverage 72.6% -> 89.8% with real-invocation tests:
- Run sentinel-error paths (errors.Is on lifted sentinels)
- per-chapter faults (empty text/question), Generate-failure path
- storePaths MkdirAll-failure branches (StorePath + StoreDir)
- countingStore Get/Resolve/ResolveBytes pass-through + read tallies
- normalizeStoreKind alias/suffix/short-path branches, cliOptions,
  storeSource, answerPlausible verdicts, default chapter naming
- ExampleRun with a fake Runner (no model load, deterministic output)

Production untouched; CLI/memvid store paths left to integration
(external binary, out of hermetic-unit scope).

Co-Authored-By: Virgil <virgil@lethean.io>
Pass-2 of the go-mlx own-Metal-kernel C++ coverage. Pass-1 (9d0902c)
built the harness and covered lm_head_topk_bridge (97.7%) plus the decode
lm-head tail (11.6%). This pass covers the real custom decode logic — the
single-token ATTENTION family — and the activation bridge:

  decode_bridge.cpp:     11.63% -> 72.73% regions, 33.33% -> 98.67% funcs
  activation_bridge.cpp: absent -> 60.00% regions, 100% funcs

decode attention: one independent oracle (KV-cache write + GQA repeat_kv +
causal/external mask + softmax, assembled from unfused mlx primitives, NOT
the kernel's own compiled graph) asserts the fused output of all six
fixed_single_token_attention dispatch leaves (default-sdpa / row-update /
wide-matmul x has_mask, driven by the global diagnostics + head_dim>=512),
sliding-window (gather+scatter, no causal mask), and paged attention (all
three page paths: single sdpa, uniform compiled map, non-uniform impl,
MQA + GQA). Future cache slots are poisoned with large values so a broken
mask polarity/offset binding makes the output explode and fail (verified
by a mutation test). Plus cheap fill-ins for the remaining lm-head-tail
variants (greedy rank-1/2/3, q4 softcap, q4/q8/q6 suppressed, q4 MLP).

tests_main.cpp: clear MLX's global detail::CompilerCache before exit. The
compiled decode/paged graphs register there (the paged path holds a
per-shape std::map of compiled fns); at process teardown the cache dtor
null-derefs against torn-down MLX globals (EXC_BAD_ACCESS in
__hash_table::__erase_unique). Clearing while MLX is alive makes teardown
deterministic. Exit-order fix only; no kernel touched.

Remaining uncovered is genuinely out of reach: go_mlx_ensure_thread_streams
(Go-runtime thread migration + live streams, serve-only), per-wrapper
catch/mlx_error error paths (one contract-error test demonstrates the
pattern), and the wide-matmul repeat_kv lines (covered for correctness by
the small-D GQA path; the head_dim>=512 + GQA combination is skipped to
avoid large tensors per AX-11).

Co-Authored-By: Virgil <virgil@lethean.io>
…d/Bad/Ugly, AX)

Further tractable-from-Go coverage, no model loads:
- lora.go 5→0: NormalizeLoRAConfig default-rule table, loraResultError
  both branches, StepAccumulated precondition guards, SetParams/
  ParamCount/TotalParams on synthetic A/B arrays.
- tokenizer.go: utf8EncodeRune across all 4 code-point classes,
  decodeGPT2Bytes mapped + empty + unmapped-fallback (pure-Go).
- metal_kernel.go: AddTemplateInt/AddTemplateBool/SetVerbose via a
  templated kernel computing 6x (int SCALE + bool DOUBLE constants).
- sample.go: SuppressTokensSampler scatter to -inf at listed tokens +
  nil-sampler/empty-token guards.
- decode.go exported gate accessors (NativeAttentionOMatVecEnabled,
  FixedSharedMaskEnabled, FixedRowCacheUpdateEnabled) driven both ways
  via SetRuntimeGate/SetFixedAttentionDiagnostics with baseline restore.
- compiled_mlp.go: SetTracedMLPFusedStages (independent stages) +
  SetTracedMLPForceFused, set/observe/restore.

No production touched.

Co-Authored-By: Virgil <virgil@lethean.io>
Raise grpo statement coverage 81.1% -> 89.7% by driving the
under-covered public entry points through real synthetic inputs
(no model load, AX-11 compliant):

- ExtractGRPOExpectedAnswer: meta-key precedence, CRLF normalise,
  multi-line backward walk, case-fold answer prefixes, all-blank lines
- GRPOSampleFromSFT: reasoning/thinking meta keys, computed reasoning
  suffix-strip, prompt/text fallback, answerless short-circuit
- GRPORewardContainsAnswer: empty-expected neutral reward + unicode
  core.Join/Lower fallback (match + miss)
- RunGRPOReasoningTraining: multi-epoch dataset replay, nil-ctx
  tolerance, missing-resume nil ResumedFrom, non-Resetter multi-epoch
  failure, cancelled ctx, no-trainable-samples
- Save/LoadGRPOCheckpointMetadata round-trip with version backfill

Adds grpo_example_test.go with deterministic Examples for the public
surface. Production untouched; benchmarks unchanged (3/3, 2/2, 0/0
allocs/op preserved).

Co-Authored-By: Virgil <virgil@lethean.io>
Raise go/mlx/memory coverage 75.0% -> 93.5% with no production change.

- IsKnownKVCacheMode (was 0%): Good/Bad/Ugly triad — every contract mode
  (incl. the empty default) is known, garbage is rejected, and TurboQuant
  reads known despite being a research mode backends may fail closed on.
- NewPlan branch coverage via the public API: qwen2 / qwen3_6 / qwen3_6_moe
  (small + wide class) / qwen3_next / bert_rerank architecture hints, the
  encoder batch tiers + unknown-class floor, the generic-MoE resident-expert
  tiers + unknown-class floor, MiniMax sub-64GB context floor, JANGTQ note,
  derived KV-cache savings ratio.
- White-box coverage of the two vestigial 0% wrappers (estimateKVCacheBytes,
  usesGenerationKVCache) and usesGenerationKVCacheWithProfile short-circuits.
- scaleElementsByByteRatioCeil zero-guards + ceiling rounding; minPositive
  b<a branch.
- example_test.go: replace tautological core.Println("NewPlan") stubs with
  real invoking examples (NewPlan/ClassForBytes/IsKnownKVCacheMode) that
  print a stable field and contribute coverage.

Benches flat (production untouched); allocs/op unchanged.

Co-Authored-By: Virgil <virgil@lethean.io>
OutputAt's valid-index return and the three nil paths (negative index,
index past len, empty slice) — pure-Go slice indexing, no Metal op.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider and others added 15 commits June 17, 2026 23:54
…stral-3 part 2)

Ministral-3 bf16 (Base/Reasoning) now loads and decodes end to end on the no-cgo
stack, reusing the gemma4 bf16 structs, session and arch executor — Ministral is a
gemma4-subset arch, so only the weight names differ.

- AssembleMistralBF16 maps the Mistral / Ministral-3 names onto Gemma4BF16: a layer
  has exactly two norms, input_layernorm → AttnNormW (pre-attn) and
  post_attention_layernorm → MLPNormW (pre-MLP — gemma4 names that pre_feedforward),
  and gemma4's extras (QK-norm, post-attn/post-FF norm, layer_scalar) are all left
  nil. The shared executor already skips an absent norm (sharedOrNil →
  encResidualMaybeNorm / qNorm != nil), so the assembled model decodes the faithful
  Mistral layer: rms(input) → attn → +res → rms(post_attn) → SwiGLU → +res. The
  multimodal wrapper is handled by normalizeGemma4Names (text under
  language_model.model.*; vision_tower / multi_modal_projector dropped) and tied
  embeddings alias the LM head.
- LoadMistralBF16 / LoadMistralBF16Dir / GenerateTextFromMistralDir mirror the gemma4
  bf16 loaders: config.json → mistral.Config → Arch, safetensors (single or sharded)
  → assemble → NewGemma4Session → text in/out.
- Gated: a synthetic multimodal-wrapped Ministral checkpoint (tied, two-norm layers,
  stray vision tensors) assembles with the right name mapping, generates, first token
  ≡ the manual chain, and a config.json dir-loads to the same tokens.

SCOPE: dense bf16 (Base/Reasoning). FP8 (Instruct) + the vision tower + YaRN
long-context rope are follow-up slices.

Co-Authored-By: Virgil <virgil@lethean.io>
…mixer kernel (Qwen part 1)

The gated delta network (Yang et al. 2024) is the linear-attention mixer of the
Qwen3.5/3.6 hybrid family (model_type qwen3_5). It's the delta rule with a
per-token, per-head scalar state decay α_t ∈ (0,1] applied before each read/write:

    S_t = α_t·S_{t-1} + β_t·k_t (v_t − (α_t·S_{t-1})ᵀ k_t)ᵀ ;  o_t = q_t·S_t

GatedDeltaRuleChunkSequential adds α to the proven deltanet recurrence as one
broadcast multiply at the top of each step (α from [B,H,1,1] over the [Dk,Dv]
memory). α is optional on kernelInput — the existing DeltaRuleChunk{,Sequential}
pass nil and stay byte-identical (all prior deltanet tests unchanged). α_t = exp(g_t)
for the Mamba-style log-decay the layer derives from A_log + dt; the kernel takes
the resolved α. This is the single-token decode path (decode runs one token/step,
which already takes the sequential form); the gated chunked-parallel prefill form
is a later slice.
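
A minimal single-head sketch of the recurrence above, in plain Go on float64
slices. gatedDeltaStep and newState are hypothetical names for illustration
only; this is not the repository's kernel and it ignores the batch and head
axes. It decays the state by α, reads the prediction for k, applies the
delta-rule write, then reads out with q.

```go
package main

import "fmt"

// newState allocates a zeroed [dk][dv] memory (illustrative helper).
func newState(dk, dv int) [][]float64 {
	s := make([][]float64, dk)
	for i := range s {
		s[i] = make([]float64, dv)
	}
	return s
}

// gatedDeltaStep advances one head's [Dk][Dv] memory S in place by one token
// and returns the read-out o = q·S. Hypothetical helper for this sketch.
func gatedDeltaStep(S [][]float64, q, k, v []float64, alpha, beta float64) []float64 {
	dk, dv := len(k), len(v)
	// 1. Decay the prior state: S <- alpha * S.
	for i := 0; i < dk; i++ {
		for j := 0; j < dv; j++ {
			S[i][j] *= alpha
		}
	}
	// 2. What the decayed memory currently predicts for key k: pred = S^T k.
	pred := make([]float64, dv)
	for i := 0; i < dk; i++ {
		for j := 0; j < dv; j++ {
			pred[j] += S[i][j] * k[i]
		}
	}
	// 3. Delta-rule write: S <- S + beta * k (v - pred)^T.
	for i := 0; i < dk; i++ {
		for j := 0; j < dv; j++ {
			S[i][j] += beta * k[i] * (v[j] - pred[j])
		}
	}
	// 4. Read-out: o = q·S.
	o := make([]float64, dv)
	for i := 0; i < dk; i++ {
		for j := 0; j < dv; j++ {
			o[j] += q[i] * S[i][j]
		}
	}
	return o
}

func main() {
	S := newState(2, 2)
	k := []float64{1, 0}
	v := []float64{2, 3}
	// Two identical tokens with alpha = 0.5, beta = 1, reading back with q = k.
	fmt.Println(gatedDeltaStep(S, k, k, v, 0.5, 1))
	fmt.Println(gatedDeltaStep(S, k, k, v, 0.5, 1))
}
```

Passing alpha = 1 on every step makes step 1 a no-op, which is the ungated
delta rule the commit uses as its identity check.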

Gated: α ≡ 1 recovers the plain delta rule exactly (vs the ungated reference);
a sub-1 α matches an independent gated reference, both prefill and cross-chunk
(a non-zero prior state is decayed on the first step).

First slice of the dense Qwen3.6-27B hybrid Forward. Next: the conv + grouped
K/V + A_log/dt→α + z output-gate mixer around this kernel, then wire the hybrid
decode (linear layers here, full-attention layers on the softmax mixer).

Co-Authored-By: Virgil <virgil@lethean.io>
…eding the kernel (Qwen part B)

The GatedDeltaNet mixer's scalar inputs: the per-token, per-value-head decay α and
write-strength β the deltanet gated kernel consumes, resolved from the linear_attn
projections in the kernel's [B,H,L,1] layout. This is the Mamba-style discretisation
bridging linear_attn.{in_proj_a, in_proj_b, A_log, dt_bias} → the gated delta rule:

    dt = softplus(a + dt_bias) ;  A = −exp(A_log) ;  α = exp(A·dt) ∈ (0,1)
    β  = sigmoid(b)

resolveDecay / resolveBeta + a toHeadTime that reshapes [B,L,H]→[B,H,L,1] (the L↔H
transpose). softplus/negExp mirror mamba2's proven forms. Real 27B geometry: a, b,
A_log, dt_bias are all per value head (48 on the 27B); α/β are one scalar per
(token, value-head).
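
The discretisation and the layout transpose can be sketched in scalar Go as
below. decayAlpha, writeBeta and headTime are hypothetical stand-ins (the
signatures of the real resolveDecay / resolveBeta / toHeadTime are not shown
in this commit message), and the sketch handles a single batch element.

```go
package main

import (
	"fmt"
	"math"
)

// softplus computes log(1 + e^x), switching to x for large inputs to avoid overflow.
func softplus(x float64) float64 {
	if x > 30 {
		return x
	}
	return math.Log1p(math.Exp(x))
}

// decayAlpha: dt = softplus(a + dtBias); A = -exp(aLog); alpha = exp(A*dt).
func decayAlpha(a, dtBias, aLog float64) float64 {
	dt := softplus(a + dtBias)
	A := -math.Exp(aLog)
	return math.Exp(A * dt)
}

// writeBeta is the logistic sigmoid of the b projection.
func writeBeta(b float64) float64 {
	return 1 / (1 + math.Exp(-b))
}

// headTime reorders a [L,H] buffer (index l*H+h) into [H,L] (index h*L+l).
func headTime(x []float64, L, H int) []float64 {
	out := make([]float64, len(x))
	for l := 0; l < L; l++ {
		for h := 0; h < H; h++ {
			out[h*L+l] = x[l*H+h]
		}
	}
	return out
}

// near reports whether a and b agree to within an absolute tolerance.
func near(a, b, tol float64) bool {
	return math.Abs(a-b) <= tol
}

func main() {
	fmt.Println(decayAlpha(0, 0, 0)) // softplus(0) = ln 2, A = -1
	fmt.Println(writeBeta(0))
	fmt.Println(headTime([]float64{0, 1, 2, 3, 4, 5}, 3, 2))
}
```

Because softplus is positive and A is negative, A·dt is always below zero, so
alpha stays below 1 for any finite input.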

Gated numerically vs a plain-Go reference — the decay (with and without dt-bias),
β, and the layout (input l·H+h, output h·L+l, which also pins the transpose); α is
asserted inside the (0,1] band.

Part B of the dense Qwen3.6-27B mixer (after the part-A kernel). Remaining mixer
wiring — the conv front (in_proj_qkv → causal conv → SiLU → split), grouped Q/K/V
(16→48 head repeat), the z gated-RMSNorm + out_proj, and the MixerCompute.Forward
with conv/recurrent-state threading — reuses mamba2's proven conv + gated-norm.

Co-Authored-By: Virgil <virgil@lethean.io>
…qwen3 share one impl

The Qwen3.5/3.6 gated-delta mixer needs the same causal depthwise conv (with its
conv-state ring), gated RMSNorm and softplus/−exp discretisation that mamba2 already
had package-private. Lift them into flakernel (the shared internal FLA helpers,
already home to the linear-attention masks/decay/L2-norm) so one proven impl backs
both — and the gated families that follow:

- flakernel.CausalDepthwiseConv / GatedRMSNorm / Softplus / NegExp.
- mamba2's causalConv/gatedNorm/activateDt now delegate (the methods stay as thin
  wrappers, so Forward's call sites are unchanged); negExp → flakernel.NegExp;
  the local softplus/negExp are gone.
- qwen3's gated-delta resolver drops its local softplus/negExp for the shared ones.

Byte-identical: the mamba2 suite (forward/scan/chunk) stays green, and qwen3's
decay/β gate is unchanged. flakernel's helpers are exercised through both consumers.

Unblocks the Qwen gated-delta mixer wiring (conv front + gated-norm) next.

Co-Authored-By: Virgil <virgil@lethean.io>
…ention runs (Qwen part B2)

The full Qwen3.5/3.6 linear-attention layer as an engine MixerCompute, composing
the part-A gated-delta kernel + part-B decay/β resolution with the shared conv and
gated-norm, in the exact Qwen3-Next order:

  in_proj_qkv → causal conv (conv-state ring) → SiLU → split q|k|v
  → repeat q,k key→value heads → l2norm(q) → gated-delta-rule(q·1/√Dk, k, v, β, α)
  → RMSNorm(o)·SiLU(z) → out_proj

Verified against the Qwen3-Next reference forward: q,k are L2-normalised
(use_qk_l2norm_in_kernel — k inside the kernel, q here), q scaled by 1/√Dk, the
gate is applied AFTER the norm (Qwen3NextRMSNormGated = RMSNorm(o)·SiLU(z), which
is exactly flakernel.GatedRMSNorm), β = sigmoid(b), α = exp(−exp(A_log)·softplus(a+
dt_bias)). Grouped Q/K via consecutive repeat_interleave (16→48 on the 27B). Two
recurrent slots thread decode: the causal-conv ring + the [B,H,Dk,Dv] delta memory.

Gated: the key→value repeat order; forward shape + finiteness; and the
recurrent-correctness streaming identity — an L=2 prefill equals two L=1 decode
steps threading both state slots (the conv ring and the delta memory carry across
calls exactly), independent of any numerical reference.

Registers the "gated_deltanet" family. Next (slice C): the qwen weight loader maps
linear_attn.* onto these weights, and the hybrid Forward wires linear layers to this
mixer + the full-attention layers to the softmax mixer — then dense-27B decodes.

Co-Authored-By: Virgil <virgil@lethean.io>
…del can build the hybrid (Qwen part C1)

buildGatedDelta wires the GatedDeltaNet mixer into the engine's mixer registry under
the "linear_attention" layer kind, so the config-composed model (pkg/metal/model/
composed) resolves a Qwen3.5/3.6 linear layer through MixerLoaderFor exactly like it
already resolves a "full_attention" softmax layer — the two halves of the hybrid now
both build from a checkpoint, no central switch.

Mirrors mamba2's loader: bare leaf names (in_proj_qkv, conv1d.weight, A_log,
dt_bias, norm.weight, …) so the composed model owns the per-layer linear_attn.
prefix; geometry derived entirely from the weight shapes (ValueHeads from A_log,
head dim from norm, qDim = (convDim−vDim)/2, KeyHeads = qDim/headDim with the square
delta state's key dim == value dim) — the recurrent-mixer convention, Extra ignored.
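
A rough sketch of that shape-only derivation. deriveGeometry and its struct
are hypothetical; it assumes the conv channels are the concatenation q|k|v
(so convDim = 2·qDim + vDim) and that vDim = ValueHeads·headDim, which is
what the formulas above imply but is not spelled out in the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// gatedDeltaGeometry holds the head layout recovered from weight shapes.
type gatedDeltaGeometry struct {
	ValueHeads, KeyHeads, HeadDim, QDim, VDim int
}

// deriveGeometry recovers the layout from three lengths: len(A_log),
// len(norm.weight) and the conv channel count. Illustrative only.
func deriveGeometry(aLogLen, normLen, convDim int) (gatedDeltaGeometry, error) {
	if aLogLen <= 0 || normLen <= 0 {
		return gatedDeltaGeometry{}, errors.New("missing A_log or norm weight")
	}
	vDim := aLogLen * normLen // value heads x head dim (assumption)
	rest := convDim - vDim    // what remains is q and k, equal in size
	if rest <= 0 || rest%2 != 0 {
		return gatedDeltaGeometry{}, errors.New("conv dim does not split into q|k|v")
	}
	qDim := rest / 2
	if qDim%normLen != 0 {
		return gatedDeltaGeometry{}, errors.New("q dim is not a multiple of head dim")
	}
	return gatedDeltaGeometry{
		ValueHeads: aLogLen,
		KeyHeads:   qDim / normLen,
		HeadDim:    normLen,
		QDim:       qDim,
		VDim:       vDim,
	}, nil
}

func main() {
	// Scaled-down 1:3 key:value head ratio, matching the 16:48 ratio in spirit.
	g, err := deriveGeometry(6, 4, 40)
	fmt.Println(g, err)
	_, err = deriveGeometry(6, 4, 25)
	fmt.Println(err)
}
```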

Gated: the geometry derivation from real-shaped synthetic weights (16/48-head
ratios scaled down), the loaded mixer runs end to end, and a missing required weight
is refused, not silently mis-built.

With this + the existing ComposedModel orchestration (loops layer_types → mixers →
recurrent/KV cache per layer → embed/norm/lm_head), the only remaining step to a
decoding dense-27B is routing the qwen3_5/qwen3_6 model_type to the composed path +
its config plumbing (layer_types, nested rope_theta, the language_model. prefix).

Co-Authored-By: Virgil <virgil@lethean.io>
…partial-rotary plumbing — the dense 27B decodes (Qwen part C-final)

Wires the dense Qwen3.6-27B (hybrid gated-delta + full attention) to actually
decode: the qwen3_6 model_type now loads through the config-composed model, which
resolves each layer_types entry to its registered mixer (linear_attention → the
gated-delta mixer, full_attention → the softmax mixer), threads the right cache
per layer (recurrent / KV), and runs the pre-norm SwiGLU trunk. The staged stub
that returned nil is superseded.

Two shared-code config gaps closed (carefully — flat-rope dense families unaffected):
- ParseDenseConfig now lifts rope_theta + partial_rotary_factor from a nested
  rope_parameters (Qwen3.5/3.6) or rope_scaling block when the flat field is
  absent; a flat field still wins, and a config with neither is unchanged
  (transformers' 10000 default, full rotary). Handles the text_config nesting.
- GQAAttention applies partial rotary via DenseConfig.RotaryDims() — the leading
  PartialRotaryFactor fraction of each head (Qwen 0.25), full rotary otherwise.
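
The precedence described in these two bullets (flat field, then nested block,
then default) can be sketched with encoding/json. denseConfig, ropeBlock and
resolveRope are hypothetical names, the struct is trimmed to the fields
mentioned here, and the text_config nesting is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ropeBlock is a nested rope section; keys not listed here are ignored.
type ropeBlock struct {
	RopeTheta           *float64 `json:"rope_theta"`
	PartialRotaryFactor *float64 `json:"partial_rotary_factor"`
}

// denseConfig is a hypothetical, trimmed config shape for this sketch.
type denseConfig struct {
	HeadDim             int        `json:"head_dim"`
	RopeTheta           *float64   `json:"rope_theta"`
	PartialRotaryFactor *float64   `json:"partial_rotary_factor"`
	RopeParameters      *ropeBlock `json:"rope_parameters"`
	RopeScaling         *ropeBlock `json:"rope_scaling"`
}

// firstSet returns the first non-nil candidate, or def when none is set.
func firstSet(def float64, candidates ...*float64) float64 {
	for _, c := range candidates {
		if c != nil {
			return *c
		}
	}
	return def
}

// resolveRope returns rope_theta and the number of rotated dims per head.
func resolveRope(raw []byte) (float64, int, error) {
	var c denseConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		return 0, 0, err
	}
	// Flat fields go first so they win over any nested block.
	thetas := []*float64{c.RopeTheta}
	factors := []*float64{c.PartialRotaryFactor}
	for _, b := range []*ropeBlock{c.RopeParameters, c.RopeScaling} {
		if b != nil {
			thetas = append(thetas, b.RopeTheta)
			factors = append(factors, b.PartialRotaryFactor)
		}
	}
	theta := firstSet(10000, thetas...) // transformers' default base
	factor := firstSet(1, factors...)   // full rotary when unspecified
	return theta, int(float64(c.HeadDim) * factor), nil
}

func main() {
	cfg := `{"head_dim":256,"rope_parameters":{"rope_theta":1e7,"partial_rotary_factor":0.25}}`
	theta, dims, err := resolveRope([]byte(cfg))
	fmt.Println(theta, dims, err)
}
```

Pointer fields distinguish "absent" from "zero", which is what lets a flat
field win and an untouched config keep its defaults.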

Verified the real Qwen3.6-27B config parses: rope_theta 1e7, partial_rotary 0.25,
RotaryDims 64 (256·0.25), 64 layer_types, model_type qwen3_6. Gated: nested-rope
lift + flat-rope-wins + no-rope-unchanged regression + RotaryDims. The hybrid
orchestration (mixed layer_types, recurrent+KV caches) is covered by the composed
package's heterogeneous test; the gated-delta loader + mixer by their own gates.

mRoPE is a no-op for pure text (its sections all read the text position), so the
text path needs only this partial-rotary + standard rope. The qwen3_6_moe variant
stays staged (composed has no expert FFN yet). End-to-end synthetic-checkpoint /
real-model decode smoke is the follow-up proof.

Co-Authored-By: Virgil <virgil@lethean.io>
…ision in dot-importing tests

Part C-final (8cd5766) added an exported metal.RopeParams for the nested
rope_parameters block, which collides with gemma4's own per-attention-type
RopeParams (pkg/metal/model/gemma4/model.go) inside gemma4 test files that
dot-import metal (decode_kernels_test.go). `go build` was clean (no test files), so
the metal/composed/qwen3 suites I ran stayed green — but `go test ./pkg/metal/model/
gemma4/` failed to build. Renamed the dense-config type to DenseRopeParams; gemma4
tests green again, the config gate unchanged. My miss for not running the gemma4
suite after C-final.

Co-Authored-By: Virgil <virgil@lethean.io>
Compose the entire single-token forward — every decoder layer, final norm, output
projection — into ONE mlx_compile'd closure, re-applied per token, instead of ~N
per-layer closure-applies. The per-layer compiled path already collapses each
layer's graph BUILD but still pays N closure-applies of host scheduling per token
(the ~8.6 ms logits-graph prefetch); the whole-stack trace collapses that to a
single apply, MLX owning the input buffers so per-token state advances cleanly
(unlike raw command replay, which froze MLX's internal per-token buffers).

compiled_stack.go reuses gemma4CompiledLayerStep per layer inside the stack trace;
forwardHiddenOverride takes the compiled-stack path when enabled, else falls back
to the per-layer / uncompiled paths. Scratch (stackPlan/stackOwners/stackIn) is
allocated once and overwritten in place each token. Decode-only, steady state,
off by default behind MLX_COMPILED_STACK. (+ a cpp rebuild-bump to force the cgo
recompile.) gemma4 suite green.

Co-Authored-By: Virgil <virgil@lethean.io>
…diagnostics)

Two off-by-default profiling probes from the host-encode-vs-GPU lever investigation:
- session_pipelined.go MLX_DECODE_REPLAY_PROBE: at a steady-state step, record one
  clean decode forward synchronously, then time its replay vs the ~12 ms forward
  (output stays correct via the normal path; one-shot, ends generation after).
- diffusion eval-split (diffuse.go / diffusion_generate.go / diffusion_step.go):
  split the diffusion forward's eval into host command-encode (EvalAsync) vs
  GPU-wait (Synchronize), surfaced as EncodeDur/GpuDur on DiffusionStepResult and
  printed per step — host-encode = replay/fuse headroom, GPU-bound = no host lever.

Both are env-gated and inert by default (no test impact, unlike the generate.go GC
probe that was reverted). Tracking them rather than leaving them loose; revert if
the lever investigations are spent.

Co-Authored-By: Virgil <virgil@lethean.io>
Whitespace/comment-alignment only (gofmt), no logic change.

Co-Authored-By: Virgil <virgil@lethean.io>
…rope math (YaRN part 1)

YaRNInvFreqs implements the NTK-by-parts RoPE remap Ministral-3 declares
(rope_type "yarn") and Qwen's 1M-context variants use: high-frequency dims
extrapolate (keep base^(-2i/dim), preserving local resolution), low-frequency dims
interpolate (÷ the context-extension factor), with a linear ramp between whose
edges are the beta_fast/beta_slow rotation counts over the original context.

Pure float computation (no model load, no metal) — the resolved dim/2 inverse
frequencies will feed a freqs-aware RoPE in the decode path (part 2). mscale is a
separate concern and 1.0 for Ministral-3, so it's not applied.

Gated against the formula: factor 1 is exactly plain RoPE (no-op identity); at the
real Ministral params the high-freq dims equal plain rope, the low-freq dims equal
plain/factor, every dim stays within [interpolated, extrapolated] and the sequence
is monotonically non-increasing; and the transition dims genuinely blend (the ramp
ramps, not steps).
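
For orientation, a sketch of the NTK-by-parts remap in the form commonly used
for YaRN (correction range from the two rotation counts, linear ramp, blend of
extrapolated and interpolated frequencies). yarnInvFreqs here is a
hypothetical float64 re-derivation, not the repository's YaRNInvFreqs, and the
parameters in main are illustrative rather than Ministral-3's real values.

```go
package main

import (
	"fmt"
	"math"
)

// plainInvFreqs returns base^(-2i/dim) for i in [0, dim/2): standard RoPE.
func plainInvFreqs(dim int, base float64) []float64 {
	out := make([]float64, dim/2)
	for i := range out {
		out[i] = math.Pow(base, -2*float64(i)/float64(dim))
	}
	return out
}

// correctionDim is the (fractional) dim index that completes `rotations`
// full turns over the original context length.
func correctionDim(rotations float64, dim int, base, origCtx float64) float64 {
	return float64(dim) * math.Log(origCtx/(rotations*2*math.Pi)) / (2 * math.Log(base))
}

// yarnInvFreqs blends plain (extrapolated) and plain/factor (interpolated)
// frequencies with a linear ramp between the beta_fast and beta_slow dims.
func yarnInvFreqs(dim int, base, factor, origCtx, betaFast, betaSlow float64) []float64 {
	low := math.Max(math.Floor(correctionDim(betaFast, dim, base, origCtx)), 0)
	high := math.Min(math.Ceil(correctionDim(betaSlow, dim, base, origCtx)), float64(dim-1))
	if high == low {
		high += 0.001 // avoid a zero-width ramp
	}
	out := plainInvFreqs(dim, base)
	for i, extrap := range out {
		interp := extrap / factor
		ramp := math.Min(math.Max((float64(i)-low)/(high-low), 0), 1)
		out[i] = interp*ramp + extrap*(1-ramp) // ramp 0 = keep, ramp 1 = interpolate
	}
	return out
}

// near compares with a relative tolerance.
func near(a, b float64) bool {
	return math.Abs(a-b) <= 1e-12*math.Max(math.Abs(a), math.Abs(b))
}

func main() {
	f := yarnInvFreqs(128, 1e6, 16, 16384, 32, 1)
	p := plainInvFreqs(128, 1e6)
	fmt.Println(len(f), f[0] == p[0], near(f[63], p[63]/16))
}
```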

Next (part 2): carry the freqs on the Arch + thread a freqs-RoPE through the native
decode path; full rotary for Ministral, partial for Qwen's variants.

Co-Authored-By: Virgil <virgil@lethean.io>
…them (YaRN part 2a)

gemma4.Arch gains RopeFreqs []float32 — explicit per-dim inverse frequencies, len
RotaryDim/2, nil meaning "derive uniformly from RopeBase" (the dense default). The
backend-agnostic arch can now declare a non-uniform rotary spectrum.

mistral.Config.Arch() resolves them: when rope_type is "yarn" with an extension
factor (and an original_max_position_embeddings to anchor the ramp), it computes
the YaRN inv-freqs onto the arch; beta_fast/slow default to 32/1 if the config
declares yarn but omits them. Any other rope_type leaves RopeFreqs nil, so the
dense families (flat rope) are unchanged.

Gated: the real Ministral-3 config (yarn, head_dim 128) yields 64 freqs equal to
YaRNInvFreqs; a default-rope config leaves RopeFreqs nil.

Next (part 2b): the native decode path consumes RopeFreqs via a freqs-aware RoPE
(falling back to the base-derived rope when nil).

Co-Authored-By: Virgil <virgil@lethean.io>
…YaRN (part 2b-i)

RoPEFreqsBF16 drives MLX's rope_single_freqs_bfloat16 kernel — the freqs sibling of
the rope_single_bfloat16 the base path uses. Identical buffer ABI except buffer(10)
is a per-dim frequency array (not the log2 base) and buffer(11) its stride; the
kernel reads inv_freq = 1/freqs[d], so the op uploads the reciprocal of the caller's
inverse frequencies (the arch's RopeFreqs). Supports partial rotary (the tail passes
through) the same way RoPEDimsBF16 does.

Gated: handed the plain-rope spectrum (base^(-2d/rotaryDim)) it reproduces the base
rope byte-close — full AND partial rotary — proving the freqs ABI and the 1/period
reciprocal are right; and a non-plain spectrum changes the output, proving the buffer
is consumed (perturbing only the low-freq dims wouldn't show at a small position —
which is exactly why YaRN's interpolation is a long-context effect).

Next (part 2b-ii): thread this through the decode executor — use it when the arch
carries RopeFreqs, else the base-derived rope.

Co-Authored-By: Virgil <virgil@lethean.io>
…nistral long-context runs (YaRN part 2b-ii)

Completes the YaRN arc: the decode executor now applies the arch's RopeFreqs when
present. encRopeDecode dispatches per token — the explicit-frequency rope
(encRoPEFreqsBF16/To) when the session carries a resident periods buffer, else the
base-derived rope. encAttnHalfKV / encAttnHalfShared gain a ropeFreqs param (nil at
every call site but the bf16 stepToken; the quant / decode-forward / decode-step
paths pass nil, unchanged); archDecodeState carries the buffer; NewGemma4Session
uploads it once from arch.RopeFreqs via uploadRopePeriods (1/inv_freq, the kernel's
period convention).

Gated end to end: a Ministral session carrying the PLAIN spectrum decodes
identically to one with no RopeFreqs (the base rope) — proving the periods buffer
threads through stepToken correctly — and a YaRN spectrum decodes valid tokens (it
matches base at tiny positions, as it should: YaRN's interpolation is a long-context
effect). Full native suite green (the 6-call-site signature ripple is inert).

YaRN end to end: freq math (f274ca2) → arch carries it (1ad3c53) → freqs-rope op
(e8b0707) → executor applies it (here). Ministral-3's declared rope_type "yarn" now
runs; Qwen's 1M-context variants inherit the same Arch.RopeFreqs path.

Co-Authored-By: Virgil <virgil@lethean.io>


def load_phase0(path: Path) -> list[dict[str, str]]:
    entries = json.loads(path.read_text(encoding="utf-8"))
distractors: list[dict[str, str]],
turn_sections: list[str],
) -> dict[str, Path]:
out_dir.mkdir(parents=True, exist_ok=True)
Comment on lines +317 to +324
result = subprocess.run(
    command,
    check=False,
    cwd=args.run_dir,
    stdout=stdout,
    stderr=stderr,
    env=env,
)


def append_manifest(manifest_path: Path, row: dict) -> None:
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest_path.open("a", encoding="utf-8") as handle:
raise ValueError("--count must be >= 1")
if args.count > 1 and args.seed_id:
raise ValueError("--seed-id can only be used with --count 1")
args.run_dir.mkdir(parents=True, exist_ok=True)
args.book_dir.mkdir(parents=True, exist_ok=True)
Snider and others added 13 commits June 18, 2026 10:09
…ate over any backend

Backend.DecodeForward runs the transformer stack (hidden → hidden) but the
contract stopped there, so each backend re-hand-rolled the token loop in its own
session code (native's gemma4_session closures, metal's engine). This adds the
rung that closes the loop on the contract itself:

  - Embedder.Embed(id)      token id → input embedding (dModel bf16 bytes)
  - LMHead.Head(hidden)     final hidden → vocab logits (vocab bf16 bytes)
  - TokenModel = Embedder + Backend + LMHead + Vocab()
  - Generate / GenerateSampled  the pure-Go token-in → token-out loop, shared
                                by every backend (greedy or temp/top-k/top-p)

Now native/metal/rocm each supply the three byte-level pieces and inherit
generation + sampling — the surface pkg/rocm drops into yields real tokens, not
a hidden-state stub. Whole-sequence today (DecodeForward rebuilds the KV per
call → O(n²)); the incremental persistent-cache decode on the contract is the
perf refinement. Gated pure-Go: a deterministic counter TokenModel proves the
greedy pick, append, re-embed, eos stop and length; zero-temp sampled ≡ greedy.
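
A compact sketch of what such a contract and its whole-sequence greedy loop
could look like. The interface names follow the list above, but every
signature is an assumption: the real contract moves bf16 bytes, while this
sketch uses []float32 for readability, and DecodeForward is assumed to return
the final position's hidden state. counterModel is a toy in the spirit of the
deterministic counter mentioned above.

```go
package main

import "fmt"

type Embedder interface{ Embed(id int) ([]float32, error) }
type Backend interface {
	DecodeForward(embs [][]float32) ([]float32, error) // assumed shape
}
type LMHead interface{ Head(hidden []float32) ([]float32, error) }

// TokenModel = Embedder + Backend + LMHead + Vocab().
type TokenModel interface {
	Embedder
	Backend
	LMHead
	Vocab() int
}

// greedy returns the index of the largest logit.
func greedy(logits []float32) int {
	best := 0
	for i, v := range logits {
		if v > logits[best] {
			best = i
		}
	}
	return best
}

// Generate re-embeds the whole sequence each step (O(n^2)) and stops after
// emitting eos or maxNew tokens. Including eos in the output is a choice
// made for this sketch.
func Generate(m TokenModel, prompt []int, maxNew, eos int) ([]int, error) {
	tokens := append([]int(nil), prompt...)
	for n := 0; n < maxNew; n++ {
		embs := make([][]float32, len(tokens))
		for i, id := range tokens {
			e, err := m.Embed(id)
			if err != nil {
				return nil, err
			}
			embs[i] = e
		}
		hidden, err := m.DecodeForward(embs)
		if err != nil {
			return nil, err
		}
		logits, err := m.Head(hidden)
		if err != nil {
			return nil, err
		}
		if len(logits) != m.Vocab() {
			return nil, fmt.Errorf("head returned %d logits, want %d", len(logits), m.Vocab())
		}
		next := greedy(logits)
		tokens = append(tokens, next)
		if next == eos {
			break
		}
	}
	return tokens[len(prompt):], nil
}

// counterModel always predicts (last token + 1) mod vocab.
type counterModel struct{ vocab int }

func (c counterModel) Embed(id int) ([]float32, error) { return []float32{float32(id)}, nil }
func (c counterModel) DecodeForward(embs [][]float32) ([]float32, error) {
	if len(embs) == 0 {
		return nil, fmt.Errorf("empty sequence")
	}
	return embs[len(embs)-1], nil
}
func (c counterModel) Head(h []float32) ([]float32, error) {
	logits := make([]float32, c.vocab)
	logits[(int(h[0])+1)%c.vocab] = 1
	return logits, nil
}
func (c counterModel) Vocab() int { return c.vocab }

func main() {
	out, err := Generate(counterModel{vocab: 8}, []int{2}, 10, 5)
	fmt.Println(out, err)
}
```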

Co-Authored-By: Virgil <virgil@lethean.io>
…en-loop contract

Binds the decode backend (NativeBackend / model.Backend) + the bf16 embed/head
bookends (EmbedTokensBF16 / LMHeadBF16) behind model.TokenModel, so
model.Generate drives the whole no-cgo token loop (embed → decode → head →
sample) with zero per-backend loop code — the native side of "the surface
pkg/rocm drops into yields real tokens".

Gated full-sequence (metal_runtime): model.Generate(NativeTokenModel) produces
the EXACT greedy tokens GenerateGemma4BF16 produces (native's bespoke incremental
persistent-cache loop) on the same synthetic bf16 gemma4 — two loops sharing no
code, so equality proves the contract path is real, not a stub. Zero-temp
sampled ≡ greedy. Whole native suite green (98).

bf16 + PLE-free today (12B/31B dense, 26B-A4B MoE, Ministral); the quant sibling
(EmbedTokensQuant / LMHeadQuant) and the E2B/E4B per-layer-input tower are the
follow-ups that layer on the same seam.

Co-Authored-By: Virgil <virgil@lethean.io>
…oken-loop contract

Refactors NativeTokenModel to carry embed/head as closures (mirroring
Gemma4Session/NewGemma4QuantSession), so bf16 and 4-bit share one type, and adds
NewQuantTokenModel: the quant decode backend + the 4-bit bookends
(EmbedTokensQuant / LMHeadQuant) behind model.TokenModel. PLE models (E2B/E4B)
are rejected until NativeBackend carries the per-layer-input tower.

Gated full-sequence (metal_runtime): model.Generate(NewQuantTokenModel) produces
the EXACT greedy tokens NewGemma4QuantSession produces (native's incremental
quant loop) on the same synthetic 4-bit gemma4 — the contract now covers the
serving quant, not just bf16. bf16 parity gate still green; whole native suite
green (99).

So the unified surface pkg/rocm drops into is proven for both representations:
implement the four TokenModel methods (bf16 or quant), inherit Generate.

Co-Authored-By: Virgil <virgil@lethean.io>
…(1)/token, additive

The contract's Backend.DecodeForward is whole-sequence (rebuilds the KV cache per
call → O(n²)). This adds the persistent-cache decode as an OPTIONAL capability,
so Generate runs O(1)/token when a backend offers it — without changing the
Backend/TokenModel surface a backend must implement (darb's pkg/rocm port is
undisturbed; it can add the session later).

  - model.DecodeStepper { Step(emb) ([]byte, error) }   one token over a persistent cache
  - model.SessionModel = TokenModel + OpenSession() (DecodeStepper, error)
  - Generate/GenerateSampled dispatch: SessionModel → incremental stepwise loop;
    else the whole-sequence fallback (unchanged)

Native: Gemma4Session gains Step (wraps stepToken in an autorelease pool — the
returned hidden is a fresh Go copy, safe across the pool; guards PLE + maxLen),
so it IS a DecodeStepper; NativeTokenModel.OpenSession returns a fresh
Gemma4Session / Gemma4QuantSession, making it a SessionModel. So model.Generate
over a NativeTokenModel now decodes incrementally (was whole-seq).
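
The optional-capability dispatch can be sketched with a Go type assertion.
Names mirror the list above, but the signatures are assumed ([]float32 stands
in for the bf16 byte seam) and the toy models are hypothetical. The session
toy's DecodeForward errors on purpose, echoing the pure-Go gate described
below.

```go
package main

import (
	"errors"
	"fmt"
)

type TokenModel interface {
	Embed(id int) ([]float32, error)
	DecodeForward(embs [][]float32) ([]float32, error)
	Head(hidden []float32) ([]float32, error)
	Vocab() int
}
type DecodeStepper interface{ Step(emb []float32) ([]float32, error) }
type SessionModel interface {
	TokenModel
	OpenSession() (DecodeStepper, error)
}

// pick runs the head and takes the greedy argmax.
func pick(m TokenModel, h []float32) (int, error) {
	logits, err := m.Head(h)
	if err != nil {
		return 0, err
	}
	best := 0
	for i, v := range logits {
		if v > logits[best] {
			best = i
		}
	}
	return best, nil
}

// Generate prefers the incremental path when the model offers a session.
func Generate(m TokenModel, prompt []int, maxNew int) ([]int, error) {
	if sm, ok := m.(SessionModel); ok {
		return generateStepwise(sm, prompt, maxNew)
	}
	toks := append([]int(nil), prompt...)
	for n := 0; n < maxNew; n++ { // whole-sequence fallback
		embs := make([][]float32, 0, len(toks))
		for _, id := range toks {
			e, err := m.Embed(id)
			if err != nil {
				return nil, err
			}
			embs = append(embs, e)
		}
		h, err := m.DecodeForward(embs)
		if err != nil {
			return nil, err
		}
		next, err := pick(m, h)
		if err != nil {
			return nil, err
		}
		toks = append(toks, next)
	}
	return toks[len(prompt):], nil
}

func generateStepwise(m SessionModel, prompt []int, maxNew int) ([]int, error) {
	if len(prompt) == 0 {
		return nil, errors.New("empty prompt")
	}
	s, err := m.OpenSession()
	if err != nil {
		return nil, err
	}
	var h []float32
	feed := func(id int) error { // one token through the persistent session
		e, err := m.Embed(id)
		if err != nil {
			return err
		}
		h, err = s.Step(e)
		return err
	}
	for _, id := range prompt {
		if err := feed(id); err != nil {
			return nil, err
		}
	}
	var out []int
	for n := 0; n < maxNew; n++ {
		next, err := pick(m, h)
		if err != nil {
			return nil, err
		}
		out = append(out, next)
		if n < maxNew-1 {
			if err := feed(next); err != nil {
				return nil, err
			}
		}
	}
	return out, nil
}

type wholeCounter struct{ vocab int } // predicts last+1, whole-sequence only

func (c wholeCounter) Embed(id int) ([]float32, error) { return []float32{float32(id)}, nil }
func (c wholeCounter) DecodeForward(e [][]float32) ([]float32, error) {
	if len(e) == 0 {
		return nil, errors.New("empty sequence")
	}
	return e[len(e)-1], nil
}
func (c wholeCounter) Head(h []float32) ([]float32, error) {
	l := make([]float32, c.vocab)
	l[(int(h[0])+1)%c.vocab] = 1
	return l, nil
}
func (c wholeCounter) Vocab() int { return c.vocab }

type sessionCounter struct { // same model, but only usable via a session
	wholeCounter
	opened *int
}

func (s sessionCounter) DecodeForward([][]float32) ([]float32, error) {
	return nil, errors.New("whole-sequence path must not be used")
}
func (s sessionCounter) OpenSession() (DecodeStepper, error) { *s.opened++; return passStepper{}, nil }

type passStepper struct{}

func (passStepper) Step(emb []float32) ([]float32, error) { return emb, nil }

func main() {
	opened := 0
	fmt.Println(Generate(sessionCounter{wholeCounter{8}, &opened}, []int{2}, 3))
	fmt.Println(Generate(wholeCounter{8}, []int{2}, 3))
}
```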

Gated:
  - pure-Go: a session-offering counter model whose DecodeForward ERRORS proves
    Generate takes the incremental path (OpenSession called once) and is
    output-identical to the whole-seq fallback.
  - native (metal_runtime): the incremental contract result ≡ GenerateGemma4BF16
    ≡ the whole-seq fallback, token-for-token (bf16); quant ≡ NewGemma4QuantSession.
    The refinement changes speed, not tokens. Native suite green (99).

Co-Authored-By: Virgil <virgil@lethean.io>
… 2nd reference

Gemma4Backend adapts the cgo (mlx-c) gemma4 model to pkg/model's contract
(Backend + TokenModel), alongside pkg/native's no-cgo NativeTokenModel — so the
contract is proven backend-agnostic across TWO real backends, not asserted. The
byte seam is what makes it possible.

  - Embed   = EmbedTokens row × EmbeddingScale
  - DecodeForward = the proven stack, with precomputed embeddings injected via the
    existing embedHook (forwardHiddenOverride) over a fresh cache — NO change to
    the fused Forward
  - Head    = the same final-norm + projection + soft-cap tail
  - reference adapter, NOT metal's production path (metal serves via its fused
    engine); this lets model.Generate drive the cgo backend the way it drives native

Dtype finding (the seam is intra-backend): Embed→DecodeForward bytes never cross
backends, so the adapter transports the model's OWN activation dtype (probed once;
f32 for the synthetic, bf16 for real gemma4) with no lossy conversion — only the
LM-head logits are cast to bf16, the one dtype the shared model.Greedy/Sample read.

Gated (metal_runtime): the adapter's DecodeForward hidden ≡ forwardHidden on the
same tokens BYTE-FOR-BYTE (same embeddings, same stack), and model.Generate over
the adapter decodes deterministic in-range tokens. Vet clean; forward harness green.

Deferred with reasons (not dodged): SessionModel (incremental) needs a
session-lifecycle hook on the contract — metal caches are freed manually, unlike
native's retained buffers; PLE models rejected (no token ids for per-layer inputs,
same as native).

Co-Authored-By: Virgil <virgil@lethean.io>
…Model (incremental)

Finding #2 from the metal reference adapter: an incremental session on a
manual-memory backend (metal/rocm KV caches) needs a release point, but
DecodeStepper had none. This adds the OPTIONAL hook and completes metal's
SessionModel on it.

  - contract: a DecodeStepper MAY implement `Close() error`; generateStepwise
    closes the stepper it opens when present (asserted, not required — so native's
    GC-managed retained-buffer session is unchanged). The hook is what any
    manual-memory backend needs, including darb's pkg/rocm (HIP caches).
  - metal: Gemma4Backend.OpenSession returns a gemma4Stepper — one token at a time
    over a PERSISTENT cache (the same embedHook injection, T=1, cache carried
    across Step), Close frees the caches. So the cgo backend is now a full
    SessionModel, like native.
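
One way to express an optional `Close() error` hook in Go is an assertion
against io.Closer, as sketched below. runSession and both steppers are
hypothetical; the real generateStepwise is not shown in this commit message,
and the []float32 types are an assumption.

```go
package main

import (
	"fmt"
	"io"
)

type DecodeStepper interface {
	Step(emb []float32) ([]float32, error)
}

// runSession steps through embs and, if the stepper also implements
// io.Closer, closes it exactly once on the way out.
func runSession(s DecodeStepper, embs [][]float32) (last []float32, err error) {
	if c, ok := s.(io.Closer); ok {
		defer func() {
			// Report a Close failure only when stepping itself succeeded.
			if cerr := c.Close(); err == nil {
				err = cerr
			}
		}()
	}
	for _, e := range embs {
		if last, err = s.Step(e); err != nil {
			return nil, err
		}
	}
	return last, nil
}

// gcStepper has no Close: its buffers are left to the garbage collector.
type gcStepper struct{}

func (gcStepper) Step(emb []float32) ([]float32, error) { return emb, nil }

// manualStepper models a backend whose caches must be freed explicitly.
type manualStepper struct{ closed int }

func (m *manualStepper) Step(emb []float32) ([]float32, error) {
	if m.closed > 0 {
		return nil, fmt.Errorf("step after close")
	}
	return emb, nil
}
func (m *manualStepper) Close() error { m.closed++; return nil }

func main() {
	embs := [][]float32{{1}, {2}}
	m := &manualStepper{}
	last, err := runSession(m, embs)
	fmt.Println(last, err, m.closed)
	last, err = runSession(gcStepper{}, embs)
	fmt.Println(last, err)
}
```

Because the hook is discovered by assertion, a stepper without Close compiles
and runs unchanged.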

Gated:
  - pure-Go: the session stepper records Close; Generate must call it exactly once.
  - metal (metal_runtime): model.Generate over the cgo backend (now the incremental
    path) ≡ the whole-sequence fallback token-for-token, and the stepper implements
    Close. pkg/model 6 green; native gates green (Close optional → unaffected).

The contract is now proven across two backends in BOTH decode strategies
(whole-seq + incremental) — native (no-cgo) and metal (cgo).

Co-Authored-By: Virgil <virgil@lethean.io>
… contract

PLE needs the token id (the per-layer input is gathered from
embed_tokens_per_layer[id], not derivable from the token embedding), but the
contract's Step has only the embedding. Rather than bundle side-data into Embed
(mangling its semantics), this adds an OPTIONAL id-aware step — parallel to the
Close hook:

  - contract: a DecodeStepper MAY implement StepWithID(id, emb); generateStepwise
    calls it in preference to Step, passing both. For every non-PLE model it's just
    Step with the id ignored, so Embed stays "the embedding" and nothing else moves.
  - native: Gemma4Session.StepWithID computes the per-layer-input tensor from
    (id, emb) via its perLayerInput closure and threads it into stepToken, exactly
    as Generate does. NewQuantTokenModel now ACCEPTS a PLE model (the incremental
    session handles it); the whole-sequence NativeBackend.DecodeForward guards
    against PLE (no token ids) so model.Generate (which prefers the session) is the
    path for E2B/E4B.
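
The same optional-interface pattern, sketched for the id-aware step. stepOnce,
idStepper and the toy steppers are hypothetical, the int token id and
[]float32 embedding are assumed types, and the "per-layer input" here is
reduced to a single scalar looked up by id.

```go
package main

import "fmt"

type DecodeStepper interface {
	Step(emb []float32) ([]float32, error)
}

// idStepper is the optional capability: a step that also sees the token id.
type idStepper interface {
	StepWithID(id int, emb []float32) ([]float32, error)
}

// stepOnce prefers StepWithID when the stepper offers it, else plain Step.
func stepOnce(s DecodeStepper, id int, emb []float32) ([]float32, error) {
	if w, ok := s.(idStepper); ok {
		return w.StepWithID(id, emb)
	}
	return s.Step(emb)
}

// plainStepper ignores token ids entirely (the non-PLE case).
type plainStepper struct{}

func (plainStepper) Step(emb []float32) ([]float32, error) { return emb, nil }

// pleStepper needs the id to gather an extra per-token input.
type pleStepper struct{ perToken []float32 }

func (p pleStepper) Step(emb []float32) ([]float32, error) {
	return nil, fmt.Errorf("this stepper needs the token id")
}

func (p pleStepper) StepWithID(id int, emb []float32) ([]float32, error) {
	if id < 0 || id >= len(p.perToken) {
		return nil, fmt.Errorf("token id %d out of range", id)
	}
	out := make([]float32, len(emb))
	for i, v := range emb {
		out[i] = v + p.perToken[id] // stand-in for the per-layer input
	}
	return out, nil
}

func main() {
	emb := []float32{1, 2}
	fmt.Println(stepOnce(plainStepper{}, 1, emb))
	fmt.Println(stepOnce(pleStepper{perToken: []float32{10, 20}}, 1, emb))
}
```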

Gated (metal_runtime): model.Generate over a quant NativeTokenModel ≡
NewGemma4QuantSession (native's PLE loop) token-for-token on a synthetic E2B; the
whole-seq fallback refuses a PLE model. Native suite green (100); pkg/model green.

So the WHOLE gemma4 family now decodes through the contract on native — dense,
mixed-precision MoE, and PLE (E2B/E4B). metal stepper unaffected (no StepWithID →
plain Step).

Co-Authored-By: Virgil <virgil@lethean.io>
…ive`

The engine rung: point the reactive served path at model.Backend so the whole
no-cgo stack answers real requests. The serve handlers drive inference.TextModel;
this adds a contract-backed one (sibling of the cgo metaladapter):

  - native.LoadGemma4TokenModelDir(dir, maxLen) → model.TokenModel — the contract
    sibling of LoadGemma4Dir (dense / MoE / E2B-E4B PLE, 4-bit or bf16).
  - mlx.nativeTextModel: wraps a model.TokenModel + tokenizer as an
    inference.TextModel. Generate/Chat run model.Generate over the contract
    (incrementally — NativeTokenModel is a SessionModel); Chat renders the gemma
    turn template; logits→bf16 for the shared sampler. ZERO cgo.
    mlx.LoadNativeTextModel(dir, opts) loads it.
  - serve: a `--native` flag swaps the hot-swap resolver's loader to
    LoadNativeTextModel (resolver loader made pluggable, default unchanged),
    forces no-MTP, and skips the metal-only conversation continuity. The OpenAI/
    Anthropic/Ollama muxes are unchanged — they just drive the contract model.

So `lthn-mlx serve --native --model <gemma4 dir>` serves the no-cgo contract stack
through the standard HTTP API — the unified-driver-surface thesis, usable.

v1 limits (noted, not hidden): generates the whole completion then yields
(per-token streaming = a model.GenerateStream follow-up); no prompt cache / MTP /
batching / continuity (pkg/metal engine features); Close is a no-op (weights live
for the process). The cgo metal path is untouched (default loader, all serve tests
green).

Gated: native.LoadGemma4TokenModelDir ≡ in-memory NewQuantTokenModel (metal_runtime);
cmd/mlx serve/resolver suite green (20); binary builds (both backends linked).
Real serve smoke is Snider's call.

Co-Authored-By: Virgil <virgil@lethean.io>
…ort phase 1)

The first real no-cgo gemma4 load exposed a family-wide gap the synthetic gates
structurally hid: gemma4 uses head_dim 256 on sliding layers and global_head_dim
512 on full_attention layers (E2B/E4B/12B/31B/26B). The declaration carried one
HeadDim, so the assembler rejected full-attention layers.

Ported from metal's proven config rule (load.go:249 — full uses global_head_dim):
the backend-agnostic declaration now resolves attention geometry per layer.

  - Config: + GlobalHeadDim (global_head_dim), NumGlobalKeyValueHeads
    (num_global_key_value_heads — full layers may differ in KV heads too).
  - LayerSpec: + HeadDim / KVHeads — each layer's RESOLVED geometry (full →
    global, sliding → default), filled by Config.Arch per attention type.
  - Arch: + GlobalHeadDim / GlobalKVHeads + MaxHeadDim()/MaxKVHeads() (a backend
    sizes per-head buffers to the larger head). rotaryDim now derives from
    GlobalHeadDim for full_attention (512·0.25=128), HeadDim for sliding;
    proportional-base normalises over the correct per-type head dim.
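
A minimal sketch of the per-layer resolution just described, assuming a cut-down config. The struct shapes, `resolveLayers` and `maxHeadDim` are illustrative, not the real Config.Arch code; the choice to apply the partial-rotary factor only to full_attention layers follows the 128/256 figures quoted in this commit:

```go
package main

import "fmt"

// Config is a hypothetical, cut-down declaration.
type Config struct {
	HeadDim, GlobalHeadDim int // GlobalHeadDim == 0 means "absent"
	KVHeads, GlobalKVHeads int // GlobalKVHeads == 0 means "absent"
	PartialRotaryFactor    float64
	LayerTypes             []string // "sliding_attention" or "full_attention"
}

// LayerSpec carries each layer's RESOLVED geometry.
type LayerSpec struct {
	Type                        string
	HeadDim, KVHeads, RotaryDim int
}

// resolveLayers fills per-layer geometry: full layers take the global
// values, sliding layers the defaults. Absent globals mirror the
// defaults, so uniform configs resolve exactly as before.
func resolveLayers(c Config) []LayerSpec {
	gHead, gKV := c.GlobalHeadDim, c.GlobalKVHeads
	if gHead == 0 {
		gHead = c.HeadDim
	}
	if gKV == 0 {
		gKV = c.KVHeads
	}
	factor := c.PartialRotaryFactor
	if factor <= 0 {
		factor = 1 // assumption: unset factor means full rotary
	}
	specs := make([]LayerSpec, 0, len(c.LayerTypes))
	for _, t := range c.LayerTypes {
		s := LayerSpec{Type: t, HeadDim: c.HeadDim, KVHeads: c.KVHeads, RotaryDim: c.HeadDim}
		if t == "full_attention" {
			s.HeadDim, s.KVHeads = gHead, gKV
			s.RotaryDim = int(float64(gHead) * factor)
		}
		specs = append(specs, s)
	}
	return specs
}

// maxHeadDim is what a backend would use to size shared per-head scratch.
func maxHeadDim(specs []LayerSpec) int {
	m := 0
	for _, s := range specs {
		if s.HeadDim > m {
			m = s.HeadDim
		}
	}
	return m
}

func main() {
	specs := resolveLayers(Config{
		HeadDim: 256, GlobalHeadDim: 512, KVHeads: 8, GlobalKVHeads: 4,
		PartialRotaryFactor: 0.25,
		LayerTypes:          []string{"sliding_attention", "full_attention"},
	})
	fmt.Printf("%+v max=%d\n", specs, maxHeadDim(specs))
}
```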

Reactive + backward-compatible: global_head_dim absent ⇒ the global values mirror
the sliding/default, so uniform/synthetic packs are unaffected (existing gates
green, dense/MoE tests updated for the new fields). DECLARATION ONLY — no backend
consumes the per-layer dims yet; Phase 2 rewires pkg/native's assembler + decode
geometry to read LayerSpec.HeadDim, which unblocks real serving.

Gated: a real-gemma4-shaped config (sliding 256 / full 512, partial-rotary 0.25)
→ per-layer HeadDim + rotaryDim 128/256; native gates + metal DeriveLayers parity
unchanged.

Co-Authored-By: Virgil <virgil@lethean.io>
…clared attention scale (port phase 2)

Phase 2 of the gemma4 port: the no-cgo decode now reads the per-attention-type
geometry the declaration carries (Phase 1), and the attention scale the model
DECLARES rather than assuming 1/√headDim — both taken from metal's proven arch.

  - assemblers (bf16 + quant): q/k/v/o byte spans + q/k-norm sizes are PER LAYER
    (full_attention layers use global_head_dim; full layers may differ in KV heads).
  - decode geometry (buildBF16/QuantArchLayerBufs, archDecodeState, stepToken):
    per-layer qDim/kvDim/cache-row/projector; shared scratch sized to the max head;
    headDimOf/kvHeadsOf helpers (fall back to the uniform arch value for a hand-built
    Arch, so existing uniform paths are byte-identical).
  - attention scale: a new gemma4.Arch.AttnScale the engine APPLIES, never assumes.
    metal's gemma4AttentionScale is 1.0 (the per-head QK-norm IS the scaling;
    query_pre_attn_scalar is None) — native was applying 1/√headDim ON TOP of the
    QK-norm, double-scaling. gemma4.Config.Arch → 1.0; mistral.Config.Arch →
    1/√headDim (standard, no QK-norm). Every native scale derivation now routes
    through attnScaleOf(arch). 3 session-test manual references re-grounded to
    arch.AttnScale (they were validating against the wrong scale by argmax-luck).
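
To illustrate the "declared, not assumed" scale, here is a sketch under stated assumptions. The `Arch` struct, the two constructors and the zero-value fallback in `attnScaleOf` are hypothetical; the source only says that gemma4 declares 1.0 and mistral declares 1/√headDim:

```go
package main

import (
	"fmt"
	"math"
)

// Arch is a cut-down stand-in for the real arch declaration.
type Arch struct {
	HeadDim   int
	AttnScale float32 // declared by the model config; 0 = not declared
}

// gemma4Arch declares 1.0: the per-head QK-norm already does the scaling.
func gemma4Arch(headDim int) Arch {
	return Arch{HeadDim: headDim, AttnScale: 1.0}
}

// mistralArch declares the standard 1/sqrt(headDim) (no QK-norm).
func mistralArch(headDim int) Arch {
	return Arch{HeadDim: headDim, AttnScale: float32(1 / math.Sqrt(float64(headDim)))}
}

// attnScaleOf is the single place the engine reads the scale from.
// Falling back to 1/sqrt(headDim) for a hand-built Arch is an assumption
// of this sketch, not something the commit states.
func attnScaleOf(a Arch) float32 {
	if a.AttnScale != 0 {
		return a.AttnScale
	}
	return float32(1 / math.Sqrt(float64(a.HeadDim)))
}

func main() {
	fmt.Println(attnScaleOf(gemma4Arch(256)), attnScaleOf(mistralArch(64)))
}
```

Applying 1/√headDim on top of a declared 1.0 would shrink gemma4 logits by a further factor of 16 at head_dim 256, which is the double-scaling this commit removes.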

Validated on a real load: e2b-4bit `serve --native` now clears EVERY attention
layer (past the full-attention 512-head layers — per-type head_dim works) before
hitting the next gemma4 feature. Native suite green (101); pkg/model green.

Real gemma4 needs more features still (each a port from metal, exposed one per
real load — "issues are N+rounds deep"): attention_k_eq_v (K==V shared projection
+ no-scale V norm — dense 12B/31B), double-wide MLP (matformer e2b/e4b consumer
layers). Those are the next units; no real model loads on per-type head_dim alone.

Co-Authored-By: Virgil <virgil@lethean.io>
…chip 1)

The garbage on real gemma4 was two decoders disagreeing — metal's proven one and
native's home-grown one that drifted. The fix is to carry the decode SEQUENCE once
(here, in pkg/model/gemma4) and have a backend supply only the compute.

This adds that seam: model/gemma4.Decoder — a record-style compute interface over
backend-opaque Buffer handles. Each primitive (RMSNorm, Proj/QuantProj, RoPE, SDPA,
SwiGLU, Add, Mul) ENQUEUES an op into the backend's current batch; the backend owns
residency and Commit. That preserves native's whole-token-in-one-command-buffer
speed (no bytes cross the seam mid-decode, no dispatch-per-op) while letting the
gemma4 sequence live in ONE place — so native + metal can't drift again. Bytes only
cross at the edges (Upload/Read).

Derived from the exact ops the text decode uses (per-type head_dim, K==V, no-scale
value norm, the SwiGLU MLP, residuals, layer scalar), not invented. Pure-Go,
all-platforms (interface only). Next chips: native implements Decoder over its proven
enc* ops; the salvaged decode orchestration lands over it; native drops its drifted
home-grown decode. metal/model stays untouched (it is deprecated once native,
which is host-unconstrained, runs faster).
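
A toy, CPU-only sketch of the record-style idea: primitives enqueue work, nothing executes until Commit or Read, and errors surface at that point. The `recorder` type and its method signatures are invented for illustration and are not the model/gemma4.Decoder interface:

```go
package main

import (
	"errors"
	"fmt"
)

// Buffer is an opaque handle; callers only see bytes via Upload/Read.
type Buffer struct{ data []float32 }

// recorder enqueues ops into a batch instead of running them eagerly.
type recorder struct {
	ops     []func() error
	commits int
}

// Upload copies host data into a backend-owned buffer (an edge crossing).
func (r *recorder) Upload(v []float32) *Buffer {
	return &Buffer{data: append([]float32(nil), v...)}
}

// binary records one elementwise op; validation is deferred to Commit.
func (r *recorder) binary(name string, dst, a, b *Buffer, f func(x, y float32) float32) {
	r.ops = append(r.ops, func() error {
		if len(a.data) != len(b.data) || len(dst.data) != len(a.data) {
			return errors.New(name + ": length mismatch")
		}
		for i := range a.data {
			dst.data[i] = f(a.data[i], b.data[i])
		}
		return nil
	})
}

func (r *recorder) Add(dst, a, b *Buffer) {
	r.binary("add", dst, a, b, func(x, y float32) float32 { return x + y })
}

func (r *recorder) Mul(dst, a, b *Buffer) {
	r.binary("mul", dst, a, b, func(x, y float32) float32 { return x * y })
}

// Commit flushes the whole batch once, in recorded order.
func (r *recorder) Commit() error {
	r.commits++
	ops := r.ops
	r.ops = nil
	for _, op := range ops {
		if err := op(); err != nil {
			return err
		}
	}
	return nil
}

// Read flushes pending work, then copies the result out (an edge crossing).
func (r *recorder) Read(b *Buffer) ([]float32, error) {
	if len(r.ops) > 0 {
		if err := r.Commit(); err != nil {
			return nil, err
		}
	}
	return append([]float32(nil), b.data...), nil
}

func main() {
	r := &recorder{}
	a, b := r.Upload([]float32{1, 2}), r.Upload([]float32{3, 4})
	dst := r.Upload([]float32{0, 0})
	r.Add(dst, a, b)   // recorded, not executed
	r.Mul(dst, dst, b) // recorded, not executed
	out, err := r.Read(dst)
	fmt.Println(out, err, r.commits) // prints: [12 24] <nil> 1
}
```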

Co-Authored-By: Virgil <virgil@lethean.io>
…oven enc* ops (port chip 2)

nativeDecoder implements model/gemma4.Decoder: each primitive records one op into a
Metal command encoder (lazily opened), Commit/Read flush the batch once — so the
shared gemma4 decode (chip 3) drives native with its whole-token-in-one-command-
buffer speed, no dispatch-per-op, no bytes crossing the seam mid-decode. The opaque
g4.Buffer is a resident Metal buffer + its byte length (Read knows the size).

Each primitive maps to native's existing, proven op:
  RMSNorm→encRMSNormRowsBF16, Proj→encGemvBF16To, QuantProj→encQMVBF16,
  RoPE→encRopeDecode, SDPA→encSDPAStrided, SwiGLU→encGeluGateMul,
  Add/Mul→encAddBF16/encMulBF16. Element offsets convert to byte at the boundary;
  record errors surface at Commit/Read (record-style).

Gated (metal_runtime): a Decoder-driven RMSNorm→Proj→Add recorded in one batch is
byte-for-byte the value-level ops (RMSNormBF16→MatVecBF16→add) — the seam is
faithful. Native suite green (102).

Also dropped the superseded home-grown-decode WIP (the K==V/V-norm patches to
stepToken/encAttnHalfKV) — that decode is what chip 4 replaces with the shared
orchestration; its logic is captured for chip 3, not lost. Per-type head_dim
(Phase 2) stays.

Co-Authored-By: Virgil <virgil@lethean.io>
… K==V

Native's arch decode had drifted from the proven pkg/metal/model/gemma4 decode in
two places, producing garbage on the K==V models (12B/31B):

  - V value normalisation was missing entirely. gemma4 applies a no-scale per-head
    RMSNorm to V in every attention layer (metal's RMSNormNoScale); native projected
    V straight into the cache. Now applied via a ones-weight through the proven rows
    kernel, carried as Arch.ValueNorm (gemma4 true, Mistral false — the arch executor
    is shared, so the flag is per-model, nil-buffer = off).

  - K==V (attention_k_eq_v: 12B/31B) was unhandled. These checkpoints carry no
    v_proj — V is the k-proj output (pre-knorm/rope), value-normed. The decode now
    routes V through wK when the projector has no v_proj (proj.hasV()==false); the
    assemblers skip the absent v_proj.
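
The two fixes can be sketched on the CPU as below. This is an assumption-laden illustration in float32, not the native enc* kernels: `rmsNormNoScale`, `projector`, `matVec` and `projectKV` are hypothetical names, and the real path runs in bf16 on Metal buffers:

```go
package main

import (
	"fmt"
	"math"
)

// rmsNormNoScale applies a per-head RMSNorm with no learned weight
// (equivalent to an all-ones weight): x / sqrt(mean(x^2) + eps).
func rmsNormNoScale(x []float32, headDim int, eps float32) []float32 {
	out := make([]float32, len(x))
	for h := 0; h+headDim <= len(x); h += headDim {
		var sum float64
		for _, v := range x[h : h+headDim] {
			sum += float64(v) * float64(v)
		}
		inv := 1 / math.Sqrt(sum/float64(headDim)+float64(eps))
		for i, v := range x[h : h+headDim] {
			out[h+i] = float32(float64(v) * inv)
		}
	}
	return out
}

// projector holds the K and V weights; wV is nil for K==V checkpoints.
type projector struct{ wK, wV [][]float32 }

func (p projector) hasV() bool { return p.wV != nil }

// matVec computes w·x for a row-major weight matrix.
func matVec(w [][]float32, x []float32) []float32 {
	out := make([]float32, len(w))
	for i, row := range w {
		for j, wij := range row {
			out[i] += wij * x[j]
		}
	}
	return out
}

// projectKV returns the raw K projection (before k-norm/RoPE) and V.
// With no v_proj, V is taken from the K projection output; the value
// norm is applied only when the arch enables it.
func projectKV(p projector, x []float32, headDim int, eps float32, valueNorm bool) (k, v []float32) {
	k = matVec(p.wK, x)
	if p.hasV() {
		v = matVec(p.wV, x)
	} else {
		v = append([]float32(nil), k...)
	}
	if valueNorm {
		v = rmsNormNoScale(v, headDim, eps)
	}
	return k, v
}

func main() {
	w := [][]float32{{1, 0}, {0, 2}}
	k, v := projectKV(projector{wK: w}, []float32{3, 2}, 2, 0, true)
	fmt.Println(k, v)
}
```

With valueNorm false the V path is untouched, which mirrors the "byte-identical when ValueNorm is off" property claimed for the non-gemma4 models.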

Applied across the whole decode family (re-encode + ICB, bf16 + 4-bit) so the paths
can't drift; byte-identical when ValueNorm is off, so non-gemma4 (Mistral) and the
generic step helpers are unchanged. Gated synthetically (AX-11, no model load):
value-norm is byte-exact vs a parity-proven oracle and genuinely live, and K==V (no
v_proj) is byte-for-byte an explicit v_proj=k_proj forward and the oracle.

Also drops the abstract Decoder seam (the earlier compute-interface direction): the
proven metal decode is ported in place over native's own enc* ops, not re-expressed
through an abstraction.

Co-Authored-By: Virgil <virgil@lethean.io>
@sonarqubecloud

Quality Gate failed

Failed conditions
3 Security Hotspots
6.1% Duplication on New Code (required ≤ 3%)
E Security Rating on New Code (required ≥ A)
C Reliability Rating on New Code (required ≥ A)



2 participants