Feat/validation evals #32

Merged: gitsad merged 26 commits into main from feat/validation-evals on May 21, 2026

gitsad commented May 21, 2026

What does this PR do?

Adds a new Preview page to the demo that showcases a real end-to-end MDMA product flow (insurance-claim intake) with live validation + LLM auto-fix on every assistant turn. Reworks the underlying agent into a two-agent architecture: a conversation agent that only talks (no MDMA in its visible text), plus an author sub-agent — same model + provider — that produces the MDMA via the generate_mdma tool. The change is opt-in via useAgent({ useAuthorSubAgent: true }) and now powers both the Preview and the Agent Chat views.
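The split described above can be pictured with a minimal sketch. Every name in it except `generate_mdma` (`AgentReply`, `runTurn`, the stub agents) is a hypothetical stand-in rather than the package's API, and the calls are synchronous only to keep the example short; real model calls would be asynchronous.

```typescript
// Hypothetical shapes for illustration; they are not the MDMA package API.
interface AgentReply {
  text: string; // plain prose shown in the chat
  toolCall?: { name: "generate_mdma"; brief: string }; // request to author a document
}
type ConversationAgent = (userMessage: string) => AgentReply;
type AuthorSubAgent = (brief: string) => string;

interface TurnResult {
  chatText: string;
  mdma: string | null;
}

// One assistant turn: the conversation agent only talks, and any document
// is produced by the author sub-agent behind the tool call.
function runTurn(
  userMessage: string,
  converse: ConversationAgent,
  author: AuthorSubAgent,
): TurnResult {
  const reply = converse(userMessage);
  const call = reply.toolCall;
  const mdma = call ? author(call.brief) : null;
  return { chatText: reply.text, mdma };
}

// Stub agents standing in for two calls to the same model and provider
// with different system prompts.
const stubConverse: ConversationAgent = () => ({
  text: "Hi! Let's start your claim.",
  toolCall: { name: "generate_mdma", brief: "Step 1: personal-info-form" },
});
const stubAuthor: AuthorSubAgent = (brief) => `<document for: ${brief}>`;

const turn = runTurn("Hi", stubConverse, stubAuthor);
console.log(turn.chatText);
console.log(turn.mdma);
```

Keeping the document out of `chatText` is the point of the design: the chat pane receives prose only, and the preview pane receives whatever the author sub-agent returned.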

Along the way, the validator API is split into per-block validate() and multi-message validateConversation(); form.onSubmit is now required; the action-references rule is dropped; the conversation-judge prompt is promoted out of mdma-fixer/; and many new model-specific fixer, author, and agent-tool prompt variants land (gpt-5.x family, Claude Opus/Sonnet/Haiku, Gemini 2.5/3.x, Grok 4.x).

The fixer prompt now ships with model-tailored variants across every major family — OpenAI's full gpt-5.x lineup (5, 5-mini, 5-nano, 5.1, 5.2, 5.4, 5.4-mini, 5.4-nano, 5.5) plus gpt-4.1/-mini/-nano, all four Anthropic Claude tiers (Opus 4.6/4.7, Sonnet, Haiku), the Gemini 2.5 + 3.x families (Pro, Flash, Flash-Lite, plus the customtools Pro variant), and xAI Grok 4.20/4.3. Each variant composes from a shared MDMA_FIXER_* base plus vendor-local guards we discovered during eval runs (no-leading-separator, preserve-input-structure, table-key-direction, replace-all-placeholders, etc.), so the same validate() → LLM fixer → re-validate loop hits ≥14/15 single-block fix tests on every supported model. Reasoning-leak suppression for Gemini Pro and Grok 4.3 is handled at the provider layer via an OpenRouter reasoning.exclude passthrough in the eval config rather than per-prompt, keeping the fixer prompts themselves clean.
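The validate, fix, re-validate loop mentioned above can be sketched roughly as follows. The `Issue` shape, the `fixUntilValid` helper, the attempt limit, and the stub validator and fixer are all assumptions made for illustration; the real validator and fixer signatures may differ.

```typescript
// Hypothetical types; the real validator and fixer signatures may differ.
interface Issue {
  rule: string;
  message: string;
}
type Validate = (block: string) => Issue[];
type Fixer = (block: string, issues: Issue[]) => string;

interface FixOutcome {
  block: string;
  attempts: number;
  remaining: Issue[];
}

// Validate, hand any issues to the fixer, then re-validate, up to a limit.
function fixUntilValid(
  block: string,
  validate: Validate,
  fixer: Fixer,
  maxAttempts = 2,
): FixOutcome {
  let current = block;
  let issues = validate(current);
  let attempts = 0;
  while (issues.length > 0 && attempts < maxAttempts) {
    current = fixer(current, issues);
    attempts += 1;
    issues = validate(current);
  }
  return { block: current, attempts, remaining: issues };
}

// Stub validator: flags a block that lacks an onSubmit line.
const stubValidate: Validate = (block) =>
  block.includes("onSubmit:")
    ? []
    : [{ rule: "onsubmit-required", message: "form needs onSubmit" }];

// Stub fixer standing in for the LLM call: appends the missing line.
const stubFixer: Fixer = (block) => `${block}\nonSubmit: submit-personal-info`;

const outcome = fixUntilValid("type: form\nid: personal-info-form", stubValidate, stubFixer);
console.log(outcome.attempts, outcome.remaining.length);
```

Returning the remaining issues instead of throwing lets a caller decide whether to render a partially fixed block or fall back.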

The repo now ships a ## Best Practices section covering custom-prompt design lessons learned across the eval matrix — concrete advice on when a flow needs explicit step boundaries, how to scope action labels as opaque handlers (don't reference back into the document), and why "one interactive component per assistant turn" is enforced. It also documents the two-agent architecture used by the Preview view as the recommended pattern for real product flows: keep MDMA generation strictly behind the generate_mdma tool and let a sub-agent (same model + provider, author prompt as system) own the document, so the conversation agent's visible text stays plain prose. Both surfaces — the README and the demo's docs view (CustomPromptBestPractices.tsx) — render the same guidance so external readers and in-app explorers see consistent recommendations.

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing behavior to change)
  • Refactor (no functional change)
  • CI / tooling

Breaking notes:

  • @mobile-reality/mdma-validator: validateFlow → validateConversation (rename); action-references rule removed.
  • @mobile-reality/mdma-spec: form.onSubmit is now required.
  • @mobile-reality/mdma-prompt-pack: MDMA_FIXER_CONVERSATION_JUDGE → MDMA_CONVERSATION_JUDGE (rename + relocated).
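To illustrate the difference between per-block and multi-message validation, here is a self-contained sketch. It deliberately uses local names (`validateBlock`, `validateMessages`) and invented rule names instead of the package's validate() and validateConversation(), whose signatures are not shown in this PR; the cross-message duplicate-id check is only an example of a rule that needs the whole conversation.

```typescript
// Hypothetical data shapes; not the @mobile-reality/mdma-validator API.
interface Block {
  type: string;
  id: string;
  onSubmit?: string;
}
interface Message {
  role: "user" | "assistant";
  blocks: Block[];
}
interface Issue {
  rule: string;
  where: string;
}

// Per-block check: only looks at one block in isolation.
function validateBlock(block: Block): Issue[] {
  const issues: Issue[] = [];
  if (block.type === "form" && !block.onSubmit) {
    issues.push({ rule: "form-onsubmit-required", where: block.id });
  }
  return issues;
}

// Multi-message check: runs the per-block checks, then adds a rule that
// only makes sense across the whole conversation (illustrative rule).
function validateMessages(messages: Message[]): Issue[] {
  const issues: Issue[] = [];
  const seenIds = new Set<string>();
  messages.forEach((message, index) => {
    for (const block of message.blocks) {
      issues.push(...validateBlock(block));
      if (seenIds.has(block.id)) {
        issues.push({ rule: "duplicate-id-across-messages", where: `message ${index}: ${block.id}` });
      }
      seenIds.add(block.id);
    }
  });
  return issues;
}

const conversation: Message[] = [
  { role: "assistant", blocks: [{ type: "form", id: "personal-info-form", onSubmit: "submit-info" }] },
  { role: "assistant", blocks: [{ type: "form", id: "personal-info-form" }] },
];
console.log(validateMessages(conversation).map((issue) => issue.rule));
```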

Packages Affected

  • @mobile-reality/mdma-spec: form.onSubmit required; action-label fields documented as opaque labels.
  • @mobile-reality/mdma-parser — fixtures only.
  • @mobile-reality/mdma-runtime
  • @mobile-reality/mdma-attachables-core — test fixture update for required onSubmit.
  • @mobile-reality/mdma-renderer-react
  • @mobile-reality/mdma-validator: validateFlow → validateConversation; action-references rule removed; per-block vs multi-message split.
  • @mobile-reality/mdma-prompt-pack — many model variants; conversation-judge promoted; new sub-agent-friendly composition.
  • @mobile-reality/mdma-cli — test fixture update for required onSubmit.
  • Demo (private) — new Preview page, sub-agent architecture wiring, backend log drawer, callout polish.
  • Evals (private) — new conversation-flow eval (LLM judge + deterministic cross-check), single-block fixer eval, reasoning-passthrough JS config for OpenRouter reasoning models.
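For readers unfamiliar with the reasoning passthrough, the sketch below shows the general idea: a request body for OpenRouter's chat completions endpoint that carries a `reasoning` object with `exclude: true`, which asks the provider to leave reasoning tokens out of the response. The helper name `buildBody` and the model id are hypothetical, and this is not the eval config from the repo.

```typescript
// Hypothetical helper; the actual eval config in the repo is not shown here.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}
interface OpenRouterBody {
  model: string;
  messages: ChatMessage[];
  reasoning?: { exclude: boolean };
}

// Builds a chat completions body, adding the reasoning passthrough only
// for models whose reasoning output should be suppressed.
function buildBody(
  model: string,
  messages: ChatMessage[],
  hideReasoning: boolean,
): OpenRouterBody {
  const body: OpenRouterBody = { model, messages };
  if (hideReasoning) {
    body.reasoning = { exclude: true };
  }
  return body;
}

// "vendor/reasoning-model" is a placeholder id, not a real model name.
const body = buildBody(
  "vendor/reasoning-model",
  [{ role: "user", content: "Fix this block." }],
  true,
);
console.log(JSON.stringify(body));
```

Handling this at the request layer keeps the prompt text identical across reasoning and non-reasoning models.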

Checklist

  • I have read the CONTRIBUTING guide.
  • My code follows the existing code style (pnpm format and pnpm lint pass).
  • I have added or updated tests that cover my changes.
  • All tests pass (pnpm test).
  • Type-checking passes (pnpm typecheck).
  • I have added a changeset (pnpm changeset) for the three affected published packages.
  • New or changed MDMA schemas are backwards-compatible (or marked as breaking).
  • Sensitive fields are marked with sensitive: true where appropriate (IBAN in the insurance flow).

How to Test

  1. pnpm install && pnpm build
  2. pnpm demo and open /preview in the browser.
  3. In Agent Settings, enter an API key (Anthropic, OpenAI, or OpenRouter) and pick a model.
  4. Type "Hi" — the agent should respond with one short conversational sentence in the chat and render Step 1 (personal-info-form) in the live preview pane on the right.
  5. Submit the form. A backend "200 OK" entry should land in the Backend log drawer (bottom-right) with the IBAN/PII masked; the agent advances to Step 2.
  6. Repeat through Step 4 — the final claim-submitted-callout should render with the polished Preview-specific styling.
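The masking seen in step 5 could work along these lines. This is a sketch under assumptions: the `FieldDef` shape, the `maskForLog` helper, and the keep-last-four-characters policy are invented for illustration; only the `sensitive: true` flag comes from the PR.

```typescript
// Hypothetical field definition; only the `sensitive` flag is taken from the PR.
interface FieldDef {
  name: string;
  sensitive?: boolean;
}

// Replaces all but the last four characters with asterisks.
function maskValue(value: string): string {
  if (value.length <= 4) {
    return "*".repeat(value.length);
  }
  return "*".repeat(value.length - 4) + value.slice(-4);
}

// Returns a copy of the submitted values that is safe to show in a log:
// fields flagged as sensitive are masked, everything else is untouched.
function maskForLog(
  values: Record<string, string>,
  fields: FieldDef[],
): Record<string, string> {
  const sensitiveNames = new Set(
    fields.filter((field) => field.sensitive).map((field) => field.name),
  );
  const masked: Record<string, string> = {};
  for (const [name, value] of Object.entries(values)) {
    masked[name] = sensitiveNames.has(name) ? maskValue(value) : value;
  }
  return masked;
}

const fields: FieldDef[] = [
  { name: "fullName" },
  { name: "iban", sensitive: true },
];
const logged = maskForLog(
  { fullName: "Jane Doe", iban: "DE89370400440532013000" },
  fields,
);
console.log(logged);
```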

Screenshots / Examples

Example flow definition driving the Preview (excerpt from demo/src/preview/insurance-flow-prompt.ts):

[Screenshot 2026-05-21 at 13:44:32] [Screenshot 2026-05-21 at 13:45:19] [Screenshot 2026-05-21 at 13:44:59] [Screenshot 2026-05-21 at 13:44:44]

gitsad merged commit c50c2ef into main on May 21, 2026
1 check passed