Merged
1 change: 1 addition & 0 deletions docs/kakeyainferenceenginebuildskill.md
@@ -351,5 +351,6 @@ If any answer is "no", write the weaker, true claim.
- v0.5-cuda scorecard (+ honest §5): `docs/reports/kakeya-inference-engine-v0.5-cuda.md`
- Engine vs vLLM long-context journey: `docs/reports/kakeya-engine-vs-vllm-h200.md`, `docs/reports/kakeya-vs-vllm-longcontext-h200.md`
- MLX port lessons: `docs/mlx-port-lessons.md`
- Self-hosted runner Python pinning (reboot-proof mlx_lm/torch/transformers): `docs/skills/pin-selfhosted-runner-python-env-skill.md`
- f_θ training pipeline: `docs/design/k3-f-theta-training-pipeline.md`
- Session capacity / cross-host: `docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md`
193 changes: 193 additions & 0 deletions docs/skills/pin-selfhosted-runner-python-env-skill.md
@@ -0,0 +1,193 @@
# Skill: Pin a self-hosted runner's Python env (survive reboots, reproducible heavy ML deps)

**Reusable across agents (Claude / Codex / Cursor).** Copy this file or paste the
prompt in the appendix. It is written to be repo-agnostic; the concrete examples
use a GitHub Actions self-hosted Mac runner driving MLX (`mlx_lm`/`torch`/
`transformers`), but the pattern applies to any self-hosted runner (Mac or Linux)
that runs heavy ML/native deps from a virtualenv.

---

## 1. When to use this skill

Trigger it when **a self-hosted runner job fails on a missing module that "used to
work"**, especially after a host **reboot / OS or Python upgrade / runner
re-register**. Classic signatures:

- `ModuleNotFoundError: No module named 'mlx_lm'` (or `torch`, `transformers`, …)
in a job that previously passed.
- The failure is **fast** (seconds) — it dies at `import`, before any real work.
- A **lightweight probe** (one that only needs stdlib + a base package) still
passes, proving the runner is *online* but pointing at the **wrong interpreter**.
- The interpreter version changed (e.g. `python=3.14.3` where it used to be
`3.13.x`), or `pkg=None` for a package that should be installed.

Root cause is almost always: the workflow invokes a **bare `python3`**, and after
the reboot the default `python3` on `PATH` is no longer the venv that has the
deps. The venv still exists; nothing points at it.

---

## 2. Diagnose first (don't guess)

Run the **cheapest possible probe** through the same runner path to read the
interpreter + module state, instead of assuming. Example (adapt the import list):

```bash
python3 - <<'PY'
import sys
def v(m):
    try:
        mod = __import__(m)
        return getattr(mod, "__version__", "ok")
    except Exception as e:
        return f"MISSING ({e.__class__.__name__})"
print("python =", sys.version.split()[0], "| exe =", sys.executable)
for m in ("mlx", "mlx_lm", "torch", "transformers"):
    print(f"{m} = {v(m)}")
PY
```
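If the probe is kept as a committed script rather than a heredoc, an importable form is easier to test on a machine without the ML stack. A minimal sketch, assuming hypothetical helper names `probe_modules` and `env_report` (neither comes from the repo):

```python
import importlib
import json
import sys


def probe_modules(names):
    """Return {module: version-or-status}; a missing module never raises."""
    report = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
            report[name] = getattr(mod, "__version__", "ok")
        except Exception as exc:  # any import-time failure counts as missing
            report[name] = f"MISSING ({exc.__class__.__name__})"
    return report


def env_report(names):
    """Interpreter identity plus per-module import status, as one dict."""
    return {
        "python": sys.version.split()[0],
        "exe": sys.executable,
        "modules": probe_modules(names),
    }


if __name__ == "__main__":
    # Same module list as the heredoc probe above.
    print(json.dumps(env_report(["mlx", "mlx_lm", "torch", "transformers"]), indent=2))
```

Emitting JSON keeps the output easy to diff between a passing run and a post-reboot run.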

Decision rule:
- **Runner online + probe shows wrong `python`/`exe` or `MISSING` deps** → this skill (interpreter pinning).
- **Probe itself never starts (job stuck `queued`/`pending`)** → the runner *agent*
is down; restart the agent first (different problem).

> In CI-driven runners, route the probe through the same executor the real jobs
> use (so `PATH`/env match). A one-liner like the above, committed as a tiny
> "env-probe" job/preset, is worth keeping permanently.

---

## 3. Fix — three layers (do all three; they are defense-in-depth)

### Layer A — Pin the interpreter the runner *agent* sees (host side, durable)

Make the venv's `bin` the first thing on the **runner agent's** `PATH`, so a bare
`python3` resolves to the venv even across reboots. Pick the mechanism for how the
agent is launched:

- **GitHub Actions runner as a service (recommended).** The runner reads a
`.env` and a `.path` file in its install dir at start:
```bash
cd ~/actions-runner
echo "$HOME/kakeya-venv/bin:$PATH" > .path   # venv first, then the current PATH
echo "VIRTUAL_ENV=$HOME/kakeya-venv" >> .env
./svc.sh stop && ./svc.sh start              # reload
```
(The runner sets `PATH` from the contents of `.path`, so the file must keep the
system directories after the venv entry; `.env` injects process env. Both persist
across reboots because the service re-reads them.)
- **launchd plist (macOS), if not using `svc.sh`.** In the runner's
`~/Library/LaunchAgents/<runner>.plist`, set:
```xml
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key><string>/Users/&lt;you&gt;/kakeya-venv/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
</dict>
```
then `launchctl unload/load` the plist.
- **systemd (Linux self-hosted).** In the runner unit:
`Environment="PATH=/opt/kakeya-venv/bin:%h/.local/bin:/usr/bin:/bin"`, then
`systemctl daemon-reload && systemctl restart <runner>`.

Verify: `python3 -c "import mlx_lm, torch, transformers; print('ok')"` from a job.

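The same check can be made against a `PATH` value directly, which is handy when inspecting what the runner agent actually exports. A minimal sketch, assuming a hypothetical helper `python3_resolves_to_venv`; it relies only on `shutil.which` with its `path=` argument:

```python
import os
import shutil


def python3_resolves_to_venv(path_value, venv_bin, which=shutil.which):
    """True iff a bare ``python3`` looked up on ``path_value`` lives in ``venv_bin``."""
    found = which("python3", path=path_value)
    if found is None:
        return False
    # Compare directories without resolving symlinks: a venv's python is usually
    # a symlink to the base interpreter, and following it would hide the venv.
    return os.path.dirname(os.path.abspath(found)) == os.path.abspath(venv_bin)
```

From a job, `python3_resolves_to_venv(os.environ["PATH"], os.path.expanduser("~/kakeya-venv/bin"))` should return `True` once Layer A is in place.
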
### Layer B — Make the workflow/executor resolve a *pinned* interpreter (repo side, robust)

Never call a bare `python3` for the heavy job. Resolve an explicit interpreter so
the repo is robust even if Layer A drifts:

1. Add a repo/runner variable, e.g. `KAKEYA_MAC_PYTHON`, pointing at the venv
python (`/Users/<you>/kakeya-venv/bin/python`). Default-discover if unset:
```bash
PYBIN="${KAKEYA_MAC_PYTHON:-}"
for c in "$PYBIN" "$HOME/kakeya-venv/bin/python" "$(command -v python3.13)" "$(command -v python3)"; do
  [ -n "$c" ] && [ -x "$c" ] && "$c" -c 'import mlx_lm' 2>/dev/null && { PYBIN="$c"; break; }
done
```
2. Use `$PYBIN` (or substitute a `${PYTHON}` token in your command templates)
instead of `python3` for the actual workload. If your executor spawns argv
lists (no shell), resolve the token to `$PYBIN` before `subprocess.run`.

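For an executor that spawns argv lists, the token substitution in step 2 might look like the following. This is a sketch under assumptions: the placeholder spelling `${PYTHON}` follows the text above, while `resolve_python_token` and `run_with_pinned_python` are hypothetical names.

```python
import subprocess
import sys

PYTHON_TOKEN = "${PYTHON}"  # placeholder used in command templates (assumed spelling)


def resolve_python_token(argv, pybin):
    """Return a copy of ``argv`` with every exact ``${PYTHON}`` element replaced."""
    return [pybin if arg == PYTHON_TOKEN else arg for arg in argv]


def run_with_pinned_python(argv, pybin):
    """Resolve the token, then spawn the argv list directly (no shell involved)."""
    return subprocess.run(resolve_python_token(argv, pybin), check=False)
```

Only whole-element matches are rewritten, so an argument that merely contains the token text (for example inside a longer string) is left alone.
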
### Layer C — Fail fast with a clear message (repo side, observability)

Before the expensive step, assert the deps and **print a fix hint** so the next
failure is self-explanatory instead of a deep `ModuleNotFoundError`:

```bash
"$PYBIN" - <<'PY' || { echo "::error::runner python missing ML deps — see pin-selfhosted-runner-python-env-skill.md (Layer A)"; exit 90; }
import mlx_lm, torch, transformers # noqa
PY
```
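When the executor is Python rather than shell, the same gate can run in-process before the workload is spawned. A minimal sketch, assuming a hypothetical helper `assert_imports`; the exit status 90 mirrors the shell example above:

```python
import subprocess
import sys

GATE_EXIT_CODE = 90  # same status the shell gate above exits with


def assert_imports(pybin, modules):
    """Return 0 if ``pybin`` can import every module, else print a hint and return 90."""
    code = "import " + ", ".join(modules)
    try:
        ok = subprocess.run(
            [pybin, "-c", code],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode == 0
    except OSError:  # interpreter path missing or not executable
        ok = False
    if ok:
        return 0
    print(
        f"::error::runner python {pybin!r} cannot import {', '.join(modules)}; "
        "see pin-selfhosted-runner-python-env-skill.md (Layer A)",
        file=sys.stderr,
    )
    return GATE_EXIT_CODE
```

A caller would typically do `rc = assert_imports(pybin, ["mlx_lm", "torch", "transformers"])` and return `rc` immediately when it is non-zero.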

---

## 4. Verify the fix

1. Re-run the lightweight env-probe → correct `python`/`exe`, all deps present.
2. Re-run one **real** (heavy) job → no `ModuleNotFoundError`, completes.
3. **Reboot the host and re-run** (the actual regression you are fixing) → still
green. This step is the whole point; do not skip it.

---

## 5. Generalizing to a *Cloud Agent* VM env setup (different machine!)

Do **not** confuse the self-hosted runner with the Cloud Agent VM:
- The **Cloud Agent VM** is typically Linux; it runs the *client* that dispatches
jobs and the unit-test gate. **Mac-only deps (MLX) do not belong there.** Put
only what the client/tests need into the Cloud Agent env setup (base image +
startup script), and pin versions.
- The **self-hosted runner** is where the heavy/native/Mac deps live. Pin them
there (Layers A–C above), not in the Cloud VM env setup.

For the Cloud Agent VM specifically: bake stable deps into the **base image**, do
slow-changing installs in the **startup script**, and pin versions so a new VM is
reproducible. (In Cursor, this is the "env setup agent" config.)

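Pinning is only useful if drift gets noticed. One way to check a live environment against its pins is sketched below, assuming a hypothetical helper `version_drift` and a pin table supplied by the caller; it uses `importlib.metadata.version` from the standard library.

```python
from importlib import metadata


def version_drift(pins, installed_version=metadata.version):
    """Return {package: (pinned, installed-or-None)} for every pin that does not match."""
    drift = {}
    for package, pinned in pins.items():
        try:
            found = installed_version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != pinned:
            drift[package] = (pinned, found)
    return drift
```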
---

## 6. Anti-patterns

- ❌ `pip install` the missing dep into whatever `python3` happens to be active
(often a too-new system Python with no wheels for `torch`/`mlx_lm`). Pin to the
known-good venv instead.
- ❌ Hardcoding an absolute interpreter path in many places. Resolve once
(variable + discovery) and reuse.
- ❌ "It works now" without a reboot test — the regression is reboot-triggered.
- ❌ Relying on an interactive shell's `source venv/bin/activate`; CI jobs and
services don't run your `.zshrc`.

---

## Appendix — ready-to-paste prompt for a setup agent

> **Task: make our self-hosted CI runner's Python environment reboot-proof.**
>
> Symptom: jobs on our self-hosted runner fail fast with
> `ModuleNotFoundError: No module named 'mlx_lm'` after the host rebooted; a
> lightweight env-probe shows the runner's default `python3` switched to a newer
> interpreter that lacks our ML stack (`mlx_lm`/`torch`/`transformers`), while the
> known-good venv still exists but is no longer on `PATH`.
>
> Do all of the following, smallest-diff first, and verify each:
> 1. **Diagnose:** run a tiny probe that prints `sys.version`, `sys.executable`,
> and import status of `mlx_lm, torch, transformers` through the same path the
> real jobs use. Confirm the wrong interpreter / missing modules.
> 2. **Host (runner agent):** pin the venv's `bin` ahead of system `PATH` for the
> runner service so a bare `python3` resolves to the venv across reboots — via
> the runner's `.path`/`.env` files (GitHub Actions `svc.sh`), or the
> launchd/systemd unit's `PATH` env. Reload the service.
> 3. **Repo (workflow/executor):** stop calling bare `python3` for the heavy job.
> Resolve a pinned interpreter from a `*_PYTHON` repo/runner variable, with a
> discovery fallback that picks the first candidate where `import mlx_lm`
> succeeds; use it for the workload commands.
> 4. **Repo (fail-fast):** before the expensive step, assert
> `import mlx_lm, torch, transformers` and emit a clear `::error::` with a link
> to this skill if missing (exit non-zero).
> 5. **Verify, including a reboot:** env-probe green, one real heavy job green,
> then reboot the host and re-run the same job — must still be green.
> 6. **Pin versions** in the venv (freeze a lockfile) and document the venv path +
> rebuild steps so the environment is reproducible, not just patched.
>
> Keep the heavy/native deps on the self-hosted runner only; do NOT add Mac-only
> deps to the Cloud Agent (Linux) VM env setup.
123 changes: 123 additions & 0 deletions inference_engine/bridge/runner_python.py
@@ -0,0 +1,123 @@
"""Pin the Mac-bridge workload interpreter (Layer B) + import self-check (Layer C).

A self-hosted runner's default ``python3`` can silently change across reboots /
OS upgrades (observed 2026-06-18: it flipped to a Python 3.14 without ``mlx_lm``,
breaking every full-engine preset with a deep ``ModuleNotFoundError``). The
mac-bridge executor used to invoke a bare ``python3`` for the workload, so it
inherited whatever interpreter happened to be first on ``PATH``.

This module makes the workload interpreter **explicit and verified**:

* **Layer B — resolution.** Build an ordered candidate list (a pinned
  ``KAKEYA_MAC_PYTHON``, common venv paths, then ``PATH`` pythons) and pick the
  first one that can import the gate module (``mlx_lm``); otherwise fall back to
  the first candidate, flagged as unable to import it.
* **Layer C — gate.** For presets whose workload needs ``mlx_lm`` (the ``mlx-`` /
  ``k3-`` engine families, minus the env-probe / upgrade tools that exist to
  diagnose/repair the env), fail fast with a clear message instead of a deep
  import error when no capable interpreter exists.

All functions here are pure / dependency-injected so they are unit-tested on the
Linux gate (the CLI ``scripts/mac_bridge/run_preset.py`` is a thin caller). See
``docs/skills/pin-selfhosted-runner-python-env-skill.md``.
"""

from __future__ import annotations

import os
import shutil
from dataclasses import dataclass
from typing import Callable, List, Mapping, Optional, Sequence

# The single module whose absence broke the runner; importing it implies the
# full MLX-LM stack is wired for the interpreter.
GATE_MODULE = "mlx_lm"

# ``mlx-``/``k3-`` presets that must NOT be import-gated: these exist precisely
# to probe or repair the environment, so they must run even when mlx_lm is gone.
_IMPORT_GATE_SKIP = frozenset({"mlx-env-probe", "mlx-upgrade"})

SKILL_DOC = "docs/skills/pin-selfhosted-runner-python-env-skill.md"


def workload_python_candidates(
    environ: Mapping[str, str],
    *,
    which: Callable[[str], Optional[str]] = shutil.which,
    expanduser: Callable[[str], str] = os.path.expanduser,
) -> List[str]:
    """Ordered, de-duplicated interpreter candidates for the heavy workload.

    Priority: the explicit pin (``KAKEYA_MAC_PYTHON``), then conventional venv
    locations, then ``PATH`` pythons (a pinned minor version before the bare
    ``python3`` that a reboot may have repointed)."""
    raw = [
        environ.get("KAKEYA_MAC_PYTHON"),
        expanduser("~/kakeya-venv/bin/python"),
        expanduser("~/.venv/bin/python"),
        which("python3.13"),
        which("python3"),
    ]
    out: List[str] = []
    for c in raw:
        if c and c not in out:
            out.append(c)
    return out


@dataclass(frozen=True)
class ResolvedPython:
    """The interpreter chosen for the workload."""

    path: str
    gate_module_ok: bool  # whether ``path`` can import GATE_MODULE
    from_pin: bool  # whether it came from ``KAKEYA_MAC_PYTHON``


def resolve_workload_python(
    candidates: Sequence[str],
    can_import: Callable[[str], bool],
    *,
    pinned: Optional[str] = None,
) -> Optional[ResolvedPython]:
    """Pick the first candidate that can import :data:`GATE_MODULE`; otherwise
    the first candidate (a fallback whose ``gate_module_ok`` is ``False``).
    Returns ``None`` only when there are no candidates at all."""
    first: Optional[str] = None
    for c in candidates:
        if first is None:
            first = c
        if can_import(c):
            return ResolvedPython(c, True, c == pinned)
    if first is None:
        return None
    return ResolvedPython(first, False, first == pinned)


def preset_requires_gate(preset_name: str) -> bool:
    """True iff a preset's workload needs :data:`GATE_MODULE` (so a missing
    import must fail fast). The ``mlx-`` / ``k3-`` engine presets do; the
    env-probe and upgrade tools (which diagnose/repair the env) are exempt."""
    if preset_name in _IMPORT_GATE_SKIP:
        return False
    return preset_name.startswith(("mlx-", "k3-"))


def substitute_python(argv: Sequence[str], pybin: str) -> List[str]:
    """Rewrite a leading bare ``python3`` to the resolved interpreter ``pybin``.
    Non-``python3`` argv (e.g. ``bash run_kakeya_mac.sh``, which reads
    ``KAKEYA_MAC_PYTHON`` itself) is returned unchanged."""
    a = list(argv)
    if a and a[0] == "python3":
        a[0] = pybin
    return a


def gate_error_message(preset_name: str, pybin: str) -> str:
    """The fail-fast message when a gated preset has no mlx_lm-capable python."""
    return (
        f"runner python '{pybin}' cannot import {GATE_MODULE!r}, which preset "
        f"'{preset_name}' requires. The runner's default python likely changed "
        f"(e.g. after a reboot). Pin the venv via KAKEYA_MAC_PYTHON or the runner "
        f"agent PATH and reinstall the ML stack — see {SKILL_DOC}."
    )
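Because the resolver takes `can_import` as a parameter, its selection rule can be exercised on a host with no MLX installed. The sketch below illustrates such a test under stated assumptions: `pick_python` is a hypothetical, condensed stand-in for `resolve_workload_python` that returns a plain tuple instead of `ResolvedPython`, and the interpreter paths are made up.

```python
from typing import Callable, Optional, Sequence, Tuple


def pick_python(
    candidates: Sequence[str],
    can_import: Callable[[str], bool],
) -> Optional[Tuple[str, bool]]:
    """Condensed stand-in: first importing candidate, else the first candidate."""
    for c in candidates:
        if can_import(c):
            return (c, True)
    return (candidates[0], False) if candidates else None


# Fake probe: only the venv interpreter "has" the gate module.
has_gate = {"/venv/bin/python"}.__contains__

# A capable interpreter later in the list wins over an incapable earlier one.
assert pick_python(["/usr/bin/python3", "/venv/bin/python"], has_gate) == ("/venv/bin/python", True)
# With no capable interpreter, the first candidate is returned and flagged.
assert pick_python(["/usr/bin/python3"], has_gate) == ("/usr/bin/python3", False)
# An empty candidate list yields None.
assert pick_python([], has_gate) is None
```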
44 changes: 43 additions & 1 deletion scripts/mac_bridge/run_preset.py
@@ -32,10 +32,29 @@
    build_commands,
    parse_manifest_text,
)
from inference_engine.bridge.runner_python import (
    GATE_MODULE,
    gate_error_message,
    preset_requires_gate,
    resolve_workload_python,
    substitute_python,
    workload_python_candidates,
)

LOG_DIR = Path(".mac-bridge/logs")


def _can_import_gate_module(pybin: str) -> bool:
    """True iff interpreter ``pybin`` can import the gate module (mlx_lm)."""
    try:
        return subprocess.run(
            [pybin, "-c", f"import {GATE_MODULE}"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0
    except OSError:
        return False


def main() -> int:
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("--manifest", default=".mac-bridge/request.json")
@@ -59,20 +78,43 @@ def main() -> int:
print(json.dumps(argv))
return 0

    # Layer B — resolve a PINNED workload interpreter instead of trusting the
    # bare ``python3`` on PATH (which a reboot can repoint to a python without
    # mlx_lm). Layer C — gate: mlx-/k3- engine presets fail fast with a clear
    # message when no mlx_lm-capable interpreter exists.
    pinned = os.environ.get("KAKEYA_MAC_PYTHON")
    candidates = workload_python_candidates(os.environ)
    resolved = resolve_workload_python(
        candidates, _can_import_gate_module, pinned=pinned)
    pybin = resolved.path if resolved else "python3"
    gate_ok = bool(resolved and resolved.gate_module_ok)
    print(f"[mac-bridge] workload python={pybin} {GATE_MODULE}_ok={gate_ok} "
          f"pinned={pinned!r} candidates={candidates}", file=sys.stderr)
    if preset_requires_gate(request.preset.name) and not gate_ok:
        print(f"::error::{gate_error_message(request.preset.name, pybin)}",
              file=sys.stderr)
        return 90

    LOG_DIR.mkdir(parents=True, exist_ok=True)
    summary = {
        "preset": request.preset.name,
        "params": dict(request.params),
        "nonce": request.nonce,
        "commands": [],
    }
    # Make the resolved interpreter authoritative for BOTH bare-``python3``
    # commands (rewritten here) and the launcher (which reads KAKEYA_MAC_PYTHON).
    sub_env = dict(os.environ)
    sub_env["KAKEYA_MAC_PYTHON"] = pybin
    rc = 0
    for idx, argv in enumerate(commands):
        argv = substitute_python(argv, pybin)
        log_path = LOG_DIR / f"{request.preset.name}-{idx}.log"
        print(f"[mac-bridge] exec[{idx}]: {argv}", file=sys.stderr)
        t0 = time.perf_counter()
        with log_path.open("wb") as log:
-           proc = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT)
+           proc = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT,
+                                 env=sub_env)
        elapsed = time.perf_counter() - t0
        summary["commands"].append({
            "argv": argv,