Merged
16 commits
f06264a
feat(mac-launcher): long-answer-safe defaults + full-mode validation …
cursoragent Jun 17, 2026
88743e5
debug(mlx-fused): instrument codegen markdown-loop degeneration + nat…
cursoragent Jun 17, 2026
f636370
debug(mlx-fused): add multi-turn prefill-state probe (ring-wrap-at-pr…
cursoragent Jun 17, 2026
12fda60
debug(probe): light single-turn long-prompt repro (ring pre-wrapped a…
cursoragent Jun 17, 2026
c6a699c
fix(mlx-fused): runaway-loop guard stops greedy markdown-marker collapse
cursoragent Jun 17, 2026
d10aac9
fix(probe): drop env KAKEYA_KDBG prefix (broke venv python3 -> no mlx…
cursoragent Jun 18, 2026
f8a7a9a
debug(probe): long single-decode A/B (drop native-ref for memory, bud…
cursoragent Jun 18, 2026
85abe81
debug(probe): multi-turn (explanation->code) guard-off/on A/B, no nat…
cursoragent Jun 18, 2026
772c8df
cleanup(mlx-fused): strip inert KDBG probe instrumentation; finalize …
cursoragent Jun 18, 2026
51ff901
docs(skill): add reusable 'pin self-hosted runner Python env' skill +…
cursoragent Jun 18, 2026
16440ff
feat(mac-bridge): pin workload interpreter (Layer B) + import self-ch…
cursoragent Jun 18, 2026
cff05ac
fix(mac-launcher): bash 3.2-safe empty-array expansion (EXTRA[@]: unb…
cursoragent Jun 18, 2026
6b8320e
Merge remote-tracking branch 'origin/AgentMemory/runner-python-env-pi…
cursoragent Jun 18, 2026
5bece7b
Merge remote-tracking branch 'origin/AgentMemory/mac-launcher-empty-a…
cursoragent Jun 18, 2026
bc74bf9
Merge remote-tracking branch 'origin/AgentMemory/update-mac-full-engi…
cursoragent Jun 18, 2026
5c1bc29
Merge remote-tracking branch 'origin/AgentMemory/fused-codegen-degene…
cursoragent Jun 18, 2026
1 change: 1 addition & 0 deletions docs/kakeyainferenceenginebuildskill.md
@@ -351,5 +351,6 @@ If any answer is "no", write the weaker, true claim.
- v0.5-cuda scorecard (+ honest §5): `docs/reports/kakeya-inference-engine-v0.5-cuda.md`
- Engine vs vLLM long-context journey: `docs/reports/kakeya-engine-vs-vllm-h200.md`, `docs/reports/kakeya-vs-vllm-longcontext-h200.md`
- MLX port lessons: `docs/mlx-port-lessons.md`
- Self-hosted runner Python pinning (reboot-proof mlx_lm/torch/transformers): `docs/skills/pin-selfhosted-runner-python-env-skill.md`
- f_θ training pipeline: `docs/design/k3-f-theta-training-pipeline.md`
- Session capacity / cross-host: `docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md`
193 changes: 193 additions & 0 deletions docs/skills/pin-selfhosted-runner-python-env-skill.md
@@ -0,0 +1,193 @@
# Skill: Pin a self-hosted runner's Python env (survive reboots, reproducible heavy ML deps)

**Reusable across agents (Claude / Codex / Cursor).** Copy this file or paste the
prompt in the appendix. It is written to be repo-agnostic; the concrete examples
use a GitHub Actions self-hosted Mac runner driving MLX (`mlx_lm`/`torch`/
`transformers`), but the pattern applies to any self-hosted runner (Mac or Linux)
that runs heavy ML/native deps from a virtualenv.

---

## 1. When to use this skill

Trigger it when **a self-hosted runner job fails on a missing module that "used to
work"**, especially after a host **reboot / OS or Python upgrade / runner
re-register**. Classic signatures:

- `ModuleNotFoundError: No module named 'mlx_lm'` (or `torch`, `transformers`, …)
in a job that previously passed.
- The failure is **fast** (seconds) — it dies at `import`, before any real work.
- A **lightweight probe** (one that only needs stdlib + a base package) still
passes, proving the runner is *online* but pointing at the **wrong interpreter**.
- The interpreter version changed (e.g. `python=3.14.3` where it used to be
`3.13.x`), or `pkg=None` for a package that should be installed.

Root cause is almost always: the workflow invokes a **bare `python3`**, and after
the reboot the default `python3` on `PATH` is no longer the venv that has the
deps. The venv still exists; nothing points at it.
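
The failure mode can be reproduced without touching a real runner. The sketch below is a hypothetical illustration, not part of the skill: the directory layout and the `fake_python` helper are made up, and it assumes a POSIX host. It builds two stub `python3` executables and shows that `shutil.which` returns whichever one sits in the first matching directory of the search path, so losing the venv's `bin` from `PATH` silently changes what a bare `python3` means.

```python
import os
import shutil
import tempfile


def fake_python(bin_dir: str) -> str:
    """Create a stub executable named python3 in bin_dir (illustration only)."""
    os.makedirs(bin_dir, exist_ok=True)
    path = os.path.join(bin_dir, "python3")
    with open(path, "w") as fh:
        fh.write("#!/bin/sh\nexit 0\n")
    os.chmod(path, 0o755)
    return path


def resolve_bare_python3(search_path: str):
    """Resolve `python3` the way a PATH lookup would, for the given PATH string."""
    return shutil.which("python3", path=search_path)


with tempfile.TemporaryDirectory() as root:
    venv_py = fake_python(os.path.join(root, "venv", "bin"))
    system_py = fake_python(os.path.join(root, "system", "bin"))
    venv_bin = os.path.dirname(venv_py)
    system_bin = os.path.dirname(system_py)

    # Before the reboot: the venv bin is first on PATH, so python3 is the venv.
    before = resolve_bare_python3(os.pathsep.join([venv_bin, system_bin]))
    # After the reboot: the venv bin fell off PATH, but the venv is still on disk.
    after = resolve_bare_python3(system_bin)
    venv_still_exists = os.path.exists(venv_py)

print("before is venv:", before == venv_py)
print("after is system:", after == system_py)
print("venv still on disk:", venv_still_exists)
```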

---

## 2. Diagnose first (don't guess)

Run the **cheapest possible probe** through the same runner path to read the
interpreter + module state, instead of assuming. Example (adapt the import list):

```bash
python3 - <<'PY'
import sys

def v(m):
    # Report the module's version, or why the import failed.
    try:
        mod = __import__(m)
        return getattr(mod, "__version__", "ok")
    except Exception as e:
        return f"MISSING ({e.__class__.__name__})"

print("python =", sys.version.split()[0], "| exe =", sys.executable)
for m in ("mlx", "mlx_lm", "torch", "transformers"):
    print(f"{m} = {v(m)}")
PY
```

Decision rule:
- **Runner online + probe shows wrong `python`/`exe` or `MISSING` deps** → this skill (interpreter pinning).
- **Probe itself never starts (job stuck `queued`/`pending`)** → the runner *agent*
is down; restart the agent first (different problem).

> In CI-driven runners, route the probe through the same executor the real jobs
> use (so `PATH`/env match). A one-liner like the above, committed as a tiny
> "env-probe" job/preset, is worth keeping permanently.
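
The decision rule above can be written down as a small function. This is a minimal sketch under stated assumptions: `ProbeResult`, `diagnose`, and the returned labels are hypothetical names invented for the example, and the module values mirror the strings the probe prints (a version, `ok`, or `MISSING (...)`).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProbeResult:
    """Hypothetical container for what the env-probe reported."""
    started: bool                      # False if the job never left queued/pending
    exe: Optional[str] = None          # sys.executable as printed by the probe
    modules: Dict[str, str] = field(default_factory=dict)  # name -> version or "MISSING (...)"


def diagnose(probe: ProbeResult, expected_venv_prefix: str) -> str:
    """Map a probe result to the next action from the decision rule."""
    if not probe.started:
        # The runner agent itself is down; interpreter pinning will not help yet.
        return "restart-runner-agent"
    wrong_exe = not (probe.exe or "").startswith(expected_venv_prefix)
    missing = any(v.startswith("MISSING") for v in probe.modules.values())
    if wrong_exe or missing:
        return "pin-interpreter"
    return "ok"


example = ProbeResult(
    started=True,
    exe="/opt/homebrew/bin/python3",
    modules={"mlx_lm": "MISSING (ModuleNotFoundError)", "torch": "ok"},
)
print(diagnose(example, "/Users/me/kakeya-venv/"))
```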

---

## 3. Fix — three layers (do all three; they are defense-in-depth)

### Layer A — Pin the interpreter the runner *agent* sees (host side, durable)

Make the venv's `bin` the first thing on the **runner agent's** `PATH`, so a bare
`python3` resolves to the venv even across reboots. Pick the mechanism for how the
agent is launched:

- **GitHub Actions runner as a service (recommended).** The runner reads a
  `.env` and a `.path` file in its install dir at start:
  ```bash
  cd ~/actions-runner
  echo "$HOME/kakeya-venv/bin:$PATH" > .path   # venv bin first, then the existing PATH
  echo "VIRTUAL_ENV=$HOME/kakeya-venv" >> .env
  ./svc.sh stop && ./svc.sh start               # reload
  ```
  (The runner loads `.path` as the `PATH` for every job, so write the venv `bin`
  first and keep the system directories after it; `.env` injects process env.
  Both persist across reboots because the service re-reads them.)
- **launchd plist (macOS), if not using `svc.sh`.** In the runner's
  `~/Library/LaunchAgents/<runner>.plist`, set:
  ```xml
  <key>EnvironmentVariables</key>
  <dict>
    <key>PATH</key><string>/Users/&lt;you&gt;/kakeya-venv/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
  </dict>
  ```
  then `launchctl unload/load` the plist.
- **systemd (Linux self-hosted).** In the runner unit:
  `Environment="PATH=/opt/kakeya-venv/bin:%h/.local/bin:/usr/bin:/bin"`, then
  `systemctl daemon-reload && systemctl restart <runner>`.

Verify: `python3 -c "import mlx_lm, torch, transformers; print('ok')"` from a job.

### Layer B — Make the workflow/executor resolve a *pinned* interpreter (repo side, robust)

Never call a bare `python3` for the heavy job. Resolve an explicit interpreter so
the repo is robust even if Layer A drifts:

1. Add a repo/runner variable, e.g. `KAKEYA_MAC_PYTHON`, pointing at the venv
   python (`/Users/<you>/kakeya-venv/bin/python`). Default-discover if unset:
   ```bash
   # Try the pinned variable first, then the fallbacks. Keep the first candidate
   # that is executable and can import mlx_lm; PYBIN stays empty if none works.
   PYBIN=""
   for c in "${KAKEYA_MAC_PYTHON:-}" "$HOME/kakeya-venv/bin/python" "$(command -v python3.13)" "$(command -v python3)"; do
     [ -n "$c" ] && [ -x "$c" ] && "$c" -c 'import mlx_lm' 2>/dev/null && { PYBIN="$c"; break; }
   done
   ```
2. Use `$PYBIN` (or substitute a `${PYTHON}` token in your command templates)
   instead of `python3` for the actual workload. If your executor spawns argv
   lists (no shell), resolve the token to `$PYBIN` before `subprocess.run`.
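
For an executor that spawns argv lists, the same discovery and token substitution can be done in Python. The following is a minimal sketch, assuming the `${PYTHON}` token convention described above; `can_import`, `resolve_python`, `default_candidates`, and `substitute_python` are hypothetical helper names, not functions from the repo.

```python
import os
import shutil
import subprocess
from typing import Iterable, List, Optional


def can_import(python: str, module: str) -> bool:
    """True if `<python> -c 'import <module>'` exits with status 0."""
    try:
        proc = subprocess.run(
            [python, "-c", f"import {module}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=120,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def resolve_python(module: str, candidates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first candidate that is executable and can import `module`."""
    for cand in candidates:
        if cand and os.path.isfile(cand) and os.access(cand, os.X_OK):
            if can_import(cand, module):
                return cand
    return None


def default_candidates() -> List[Optional[str]]:
    """Same order as the shell loop: pinned variable first, then fallbacks."""
    return [
        os.environ.get("KAKEYA_MAC_PYTHON"),
        os.path.expanduser("~/kakeya-venv/bin/python"),
        shutil.which("python3.13"),
        shutil.which("python3"),
    ]


def substitute_python(argv: List[str], pybin: str, token: str = "${PYTHON}") -> List[str]:
    """Replace every exact `${PYTHON}` element of an argv list with pybin."""
    return [pybin if arg == token else arg for arg in argv]
```

A caller would resolve once, for example `pybin = resolve_python("mlx_lm", default_candidates())`, treat `None` as a hard failure, and pass `substitute_python(template, pybin)` to `subprocess.run`. Resolving once and reusing the result keeps the interpreter path out of individual command templates.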

### Layer C — Fail fast with a clear message (repo side, observability)

Before the expensive step, assert the deps and **print a fix hint** so the next
failure is self-explanatory instead of a deep `ModuleNotFoundError`:

```bash
"$PYBIN" - <<'PY' || { echo "::error::runner python missing ML deps — see pin-selfhosted-runner-python-env-skill.md (Layer A)"; exit 90; }
import mlx_lm, torch, transformers # noqa
PY
```
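
If the preflight lives in a Python executor rather than a shell step, a comparable check might look like the sketch below. It is an illustration under assumptions: `missing_modules` and `preflight` are hypothetical names, and `importlib.util.find_spec` only confirms that a module can be located, which is a lighter check than the real `import` the shell version performs.

```python
import importlib.util
import sys
from typing import List, Sequence


def missing_modules(names: Sequence[str]) -> List[str]:
    """Return the subset of top-level module names that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


def preflight(names: Sequence[str], exit_code: int = 90) -> int:
    """Print a GitHub Actions error annotation and return exit_code if deps are missing."""
    missing = missing_modules(names)
    if missing:
        print(
            f"::error::runner python {sys.executable} is missing {', '.join(missing)}; "
            "see pin-selfhosted-runner-python-env-skill.md (Layer A)"
        )
        return exit_code
    return 0
```

A job entry point could call `sys.exit(preflight(["mlx_lm", "torch", "transformers"]))` only when the return value is non-zero, so the expensive step never starts against the wrong interpreter.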

---

## 4. Verify the fix

1. Re-run the lightweight env-probe → correct `python`/`exe`, all deps present.
2. Re-run one **real** (heavy) job → no `ModuleNotFoundError`, completes.
3. **Reboot the host and re-run** (the actual regression you are fixing) → still
green. This step is the whole point; do not skip it.

---

## 5. Generalizing to a *Cloud Agent* VM env setup (different machine!)

Do **not** confuse the self-hosted runner with the Cloud Agent VM:
- The **Cloud Agent VM** is typically Linux; it runs the *client* that dispatches
jobs and the unit-test gate. **Mac-only deps (MLX) do not belong there.** Put
only what the client/tests need into the Cloud Agent env setup (base image +
startup script), and pin versions.
- The **self-hosted runner** is where the heavy/native/Mac deps live. Pin them
there (Layers A–C above), not in the Cloud VM env setup.

For the Cloud Agent VM specifically: bake stable deps into the **base image**, do
slow-changing installs in the **startup script**, and pin versions so a new VM is
reproducible. (In Cursor, this is the "env setup agent" config.)

---

## 6. Anti-patterns

- ❌ `pip install` the missing dep into whatever `python3` happens to be active
(often a too-new system Python with no wheels for `torch`/`mlx_lm`). Pin to the
known-good venv instead.
- ❌ Hardcoding an absolute interpreter path in many places. Resolve once
(variable + discovery) and reuse.
- ❌ "It works now" without a reboot test — the regression is reboot-triggered.
- ❌ Relying on an interactive shell's `source venv/bin/activate`; CI jobs and
services don't run your `.zshrc`.

---

## Appendix — ready-to-paste prompt for a setup agent

> **Task: make our self-hosted CI runner's Python environment reboot-proof.**
>
> Symptom: jobs on our self-hosted runner fail fast with
> `ModuleNotFoundError: No module named 'mlx_lm'` after the host rebooted; a
> lightweight env-probe shows the runner's default `python3` switched to a newer
> interpreter that lacks our ML stack (`mlx_lm`/`torch`/`transformers`), while the
> known-good venv still exists but is no longer on `PATH`.
>
> Do all of the following, smallest-diff first, and verify each:
> 1. **Diagnose:** run a tiny probe that prints `sys.version`, `sys.executable`,
> and import status of `mlx_lm, torch, transformers` through the same path the
> real jobs use. Confirm the wrong interpreter / missing modules.
> 2. **Host (runner agent):** pin the venv's `bin` ahead of system `PATH` for the
> runner service so a bare `python3` resolves to the venv across reboots — via
> the runner's `.path`/`.env` files (GitHub Actions `svc.sh`), or the
> launchd/systemd unit's `PATH` env. Reload the service.
> 3. **Repo (workflow/executor):** stop calling bare `python3` for the heavy job.
> Resolve a pinned interpreter from a `*_PYTHON` repo/runner variable, with a
> discovery fallback that picks the first candidate where `import mlx_lm`
> succeeds; use it for the workload commands.
> 4. **Repo (fail-fast):** before the expensive step, assert
> `import mlx_lm, torch, transformers` and emit a clear `::error::` with a link
> to this skill if missing (exit non-zero).
> 5. **Verify, including a reboot:** env-probe green, one real heavy job green,
> then reboot the host and re-run the same job — must still be green.
> 6. **Pin versions** in the venv (freeze a lockfile) and document the venv path +
> rebuild steps so the environment is reproducible, not just patched.
>
> Keep the heavy/native deps on the self-hosted runner only; do NOT add Mac-only
> deps to the Cloud Agent (Linux) VM env setup.
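
Step 6 of the prompt (freeze a lockfile) can be sketched with the standard library alone. This is an illustrative sketch, not the repo's tooling: `freeze_lines` is a hypothetical helper, and the distribution names passed to it (for example `mlx-lm`) are assumptions that should be checked against what is actually installed in the venv.

```python
from importlib import metadata
from typing import List, Sequence


def freeze_lines(distributions: Sequence[str]) -> List[str]:
    """Return `name==version` pins for installed distributions.

    Distributions that are not installed are emitted as comments so the gap is
    visible in the lockfile instead of being silently skipped.
    """
    lines = []
    for name in distributions:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return lines


if __name__ == "__main__":
    # Run this with the venv's interpreter so the pins describe the venv.
    for line in freeze_lines(["mlx-lm", "torch", "transformers"]):
        print(line)
```

Writing the output to a file kept next to the rebuild steps gives a new host something concrete to install from.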
73 changes: 72 additions & 1 deletion inference_engine/backends/mlx/fused_specdecode.py
@@ -35,7 +35,6 @@
    restored_prefill_cache,
)


# --------------------------------------------------------------------------- #
# Component A: capture verifier aux-layer hidden states (no transformers
# `output_hidden_states` on MLX → patch the decoder-layer __call__).
@@ -387,6 +386,7 @@ def fused_specdecode_generate_mlx_trim(
    eos_ids: Sequence[int] = (),
    single_fused: bool = False,
    on_commit: Optional[Callable[[List[int]], None]] = None,
    stop_on_runaway: bool = True,
) -> Dict[str, Any]:
    """CUDA-parity fused spec decode: KEEP accepted K/V, TRIM only the rejected
    tail (no rollback, no carry re-forward). Requires the adapter to be
@@ -412,6 +412,7 @@ def fused_specdecode_generate_mlx_trim(
    generated: List[int] = []
    accepts: List[int] = []
    block_evals: List[float] = []
    stopped_on_runaway = False
    ctx_len = C
    try:
        while len(generated) < gen_tokens:
@@ -474,6 +475,12 @@ def fused_specdecode_generate_mlx_trim(
            timing["extend_s"] += time.perf_counter() - t_extend
            if any(t in eos for t in commit):
                break
            if stop_on_runaway:
                drop = _trailing_runaway_drop(generated)
                if drop > 0:
                    del generated[len(generated) - drop:]
                    stopped_on_runaway = True
                    break
    finally:
        adapter._capture_aux = False
    generated = generated[:gen_tokens]
@@ -483,6 +490,7 @@ def fused_specdecode_generate_mlx_trim(
        "mean_accept_len": (round(sum(accepts) / len(accepts), 3)
                            if accepts else 0.0),
        "decode_tokens": len(generated),
        "stopped_on_runaway": stopped_on_runaway,
        "loop": ("mlx_trim_single_fused_probe" if single_fused
                 else "mlx_trim_keep_accepted_cuda_parity"),
        "single_fused": bool(single_fused),
@@ -505,6 +513,7 @@ def fused_specdecode_generate_mlx(
    block_size: int,
    eos_ids: Sequence[int] = (),
    on_commit: Optional[Callable[[List[int]], None]] = None,
    stop_on_runaway: bool = True,
) -> Dict[str, Any]:
    """All-MLX fused spec decode with ONE host sync per block.

@@ -546,6 +555,7 @@ def fused_specdecode_generate_mlx(

    generated: List[int] = []
    accepts: List[int] = []
    stopped_on_runaway = False
    # Rollback-carry state: rejected blocks roll the WHOLE forward back
    # (rollback_block — see its docstring for why trim is unsound on the
    # wrapped sliding ring) and carry the stream-committed-but-not-cached
@@ -630,6 +640,12 @@ def fused_specdecode_generate_mlx(
            timing["extend_s"] += time.perf_counter() - t_extend
            if any(t in eos for t in commit):
                break
            if stop_on_runaway:
                drop = _trailing_runaway_drop(generated)
                if drop > 0:
                    del generated[len(generated) - drop:]
                    stopped_on_runaway = True
                    break
    finally:
        adapter._capture_aux = False
    generated = generated[:gen_tokens]
@@ -639,6 +655,7 @@ def fused_specdecode_generate_mlx(
        "mean_accept_len": (round(sum(accepts) / len(accepts), 3)
                            if accepts else 0.0),
        "decode_tokens": len(generated),
        "stopped_on_runaway": stopped_on_runaway,
        "loop": "mlx_rollback_carry_v3",
        "time_breakdown_s": {k: round(v, 3) for k, v in timing.items()},
    }
@@ -671,6 +688,40 @@ def _sliding_ring_would_wrap(cache: Any, n_new: int) -> bool:
    return False


def _trailing_runaway_drop(
    ids: Sequence[int],
    *,
    max_period: int = 8,
    min_reps: int = 12,
    keep_reps: int = 3,
) -> int:
    """Return how many TRAILING tokens to drop if ``ids`` ends in a runaway
    short-period loop, else 0.

    A runaway loop is a unit of ``1..max_period`` tokens repeated ``>= min_reps``
    times back-to-back at the tail (e.g. the ``**``/``.2``/``*`` markdown-marker
    collapse greedy decoding falls into on code prompts). When found, we keep
    ``keep_reps`` instances and drop the rest, so callers can stop generation
    with a clean tail instead of emitting an unbounded wall of repeats.

    Deliberately CONSERVATIVE (>= 12 back-to-back repeats of a <= 8-token unit)
    so legitimately repetitive text — numbered lists, ``矿工 A/B/C`` enumerations,
    structured code — is never trimmed. Returns 0 when no runaway is present."""
    n = len(ids)
    for p in range(1, max_period + 1):
        if n < p * min_reps:
            continue
        unit = list(ids[n - p:])
        reps = 0
        i = n
        while i - p >= 0 and list(ids[i - p:i]) == unit:
            reps += 1
            i -= p
        if reps >= min_reps:
            return max((reps - keep_reps) * p, 0)
    return 0


# --------------------------------------------------------------------------- #
# The fused spec-decode loop (control flow; MLX/torch ops via injected fns).
# --------------------------------------------------------------------------- #
@@ -689,6 +740,7 @@ def fused_specdecode_generate(
    cat_aux_fn: Callable[[Sequence[Any]], Any],
    allow_greedy_fallback: bool = True,
    on_commit: Optional[Callable[[List[int]], None]] = None,
    stop_on_runaway: bool = True,
) -> Dict[str, Any]:
    """Run the fused engine. ``adapter`` must already be prefilled. Per block:
    draft from the cached drafter context (B), verify+capture-aux incrementally
@@ -717,6 +769,7 @@ def fused_specdecode_generate(
    generated: List[int] = []
    accepts: List[int] = []
    fallback_to_greedy = False
    stopped_on_runaway = False
    try:
        while len(generated) < gen_tokens:
            L = min(block_size, gen_tokens - len(generated))
@@ -792,6 +845,17 @@ def fused_specdecode_generate(
            _emit(on_commit, generated)
            if any(t in eos for t in commit):
                break
            # Greedy decoding can collapse into a runaway short-period loop (e.g.
            # the **/.2/* markdown-marker wall on code prompts); the drafter then
            # trivially predicts the repeats and the greedy verifier accepts them,
            # so acceptance stays HIGH while the output is garbage. Stop on it
            # instead of emitting an unbounded wall (keeps a short clean tail).
            if stop_on_runaway:
                drop = _trailing_runaway_drop(generated)
                if drop > 0:
                    del generated[len(generated) - drop:]
                    stopped_on_runaway = True
                    break
            if (allow_greedy_fallback and len(accepts) >= 2
                    and (sum(accepts) / len(accepts)) < 1.5):
                fallback_to_greedy = True
@@ -810,6 +874,12 @@ def fused_specdecode_generate(
                _emit(on_commit, generated)
                if tok in eos:
                    break
                if stop_on_runaway:
                    drop = _trailing_runaway_drop(generated)
                    if drop > 0:
                        del generated[len(generated) - drop:]
                        stopped_on_runaway = True
                        break
            timing["fallback_greedy_s"] += time.perf_counter() - t_fb
    finally:
        adapter._capture_aux = False
@@ -820,5 +890,6 @@ def fused_specdecode_generate(
        "mean_accept_len": (round(sum(accepts) / len(accepts), 3)
                            if accepts else 0.0),
        "decode_tokens": len(generated),
        "stopped_on_runaway": stopped_on_runaway,
        "time_breakdown_s": {k: round(v, 3) for k, v in timing.items()},
    }
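
To see what the runaway guard does on concrete token lists, the sketch below carries a standalone copy of the `_trailing_runaway_drop` logic from this diff (renamed `trailing_runaway_drop`) and a hypothetical `apply_guard` wrapper that mirrors the `if stop_on_runaway:` block. The token ids are made-up integers chosen for illustration, not real tokenizer output.

```python
from typing import List, Sequence


def trailing_runaway_drop(
    ids: Sequence[int],
    *,
    max_period: int = 8,
    min_reps: int = 12,
    keep_reps: int = 3,
) -> int:
    """Standalone copy of the guard: tokens to drop from the tail, else 0."""
    n = len(ids)
    for p in range(1, max_period + 1):
        if n < p * min_reps:
            continue  # too short to hold min_reps copies of a p-token unit
        unit = list(ids[n - p:])
        reps = 0
        i = n
        # Walk backwards while each preceding p-token window equals the unit.
        while i - p >= 0 and list(ids[i - p:i]) == unit:
            reps += 1
            i -= p
        if reps >= min_reps:
            return max((reps - keep_reps) * p, 0)
    return 0


def apply_guard(generated: List[int]) -> bool:
    """Trim a runaway tail in place; True means the decode loop should stop."""
    drop = trailing_runaway_drop(generated)
    if drop > 0:
        del generated[len(generated) - drop:]
        return True
    return False


period1 = [1, 2, 3] + [7] * 20        # one token repeated 20 times
period2 = [5] + [8, 9] * 12           # a two-token unit repeated 12 times
short_repeat = [1, 2] * 5             # repetitive, but below the threshold
counting = list(range(100))           # no repetition at all

drop1 = trailing_runaway_drop(period1)
drop2 = trailing_runaway_drop(period2)
stopped1 = apply_guard(period1)
stopped2 = apply_guard(period2)
print(drop1, period1)
print(drop2, period2)
print(trailing_runaway_drop(short_repeat), trailing_runaway_drop(counting))
```

With the defaults, only the two long-repeat lists are trimmed, each down to three copies of its repeating unit, while the short repeat and the counting sequence pass through untouched.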