Feat/validation evals by gitsad · Pull Request #32 · MobileReality/mdma

gitsad · 2026-05-21T11:53:49Z

What does this PR do?

Adds a new Preview page to the demo that showcases a real end-to-end MDMA product flow (insurance-claim intake) with live validation + LLM auto-fix on every assistant turn. Reworks the underlying agent into a two-agent architecture: a conversation agent that only talks (no MDMA in its visible text), plus an author sub-agent — same model + provider — that produces the MDMA via the generate_mdma tool. The change is opt-in via useAgent({ useAuthorSubAgent: true }) and now powers both the Preview and the Agent Chat views.
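The split between the two agents can be pictured with a minimal sketch. Everything below is an assumption made for illustration (the types, the stub model, the system prompts and `runTurn`); only the `generate_mdma` tool name comes from this PR, and the real `useAgent` internals may look quite different.

```typescript
// Hypothetical sketch of the two-agent split. None of these names are the
// mdma packages' API, except the tool name generate_mdma.
type ToolCall = { name: "generate_mdma"; brief: string };
type ConversationTurn = { text: string; toolCall?: ToolCall };
type Model = (system: string, user: string) => string;

const CONVERSATION_SYSTEM =
  "Reply in plain prose. Request UI only via the generate_mdma tool.";
const AUTHOR_SYSTEM = "Return a single MDMA document for the brief you are given.";

// Stand-in for one model + provider pair shared by both agents.
const stubModel: Model = (system, user) =>
  system === AUTHOR_SYSTEM
    ? `MDMA document for: ${user}`
    : "Sure, let's start with your personal details.";

function runTurn(
  model: Model,
  userMessage: string,
): { chatText: string; mdma: string | undefined } {
  // 1. The conversation agent produces the visible text and, optionally, a tool call.
  //    The tool call is hard-coded here because the stub model cannot emit one.
  const turn: ConversationTurn = {
    text: model(CONVERSATION_SYSTEM, userMessage),
    toolCall: { name: "generate_mdma", brief: "step 1: personal-info-form" },
  };
  // 2. The author sub-agent (same model, author prompt as system) owns the document.
  const mdma = turn.toolCall ? model(AUTHOR_SYSTEM, turn.toolCall.brief) : undefined;
  return { chatText: turn.text, mdma };
}

const turnResult = runTurn(stubModel, "Hi");
```

In this sketch the chat pane would render `turnResult.chatText` and the preview pane `turnResult.mdma`, so the document never appears in the conversation agent's visible text.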

Along the way, the validator API is split into per-block validate() and multi-message validateConversation(); form.onSubmit is now required; the action-references rule is dropped; the conversation-judge prompt moves out of mdma-fixer/; and many new model-specific fixer/author/agent-tool prompt variants land (gpt-5.x family, Claude Opus/Sonnet/Haiku, Gemini 2.5/3.x, Grok 4.x).
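As a rough illustration of the per-block versus multi-message split, the sketch below uses hypothetical names and shapes (`Block`, `Issue`, `validateBlockSketch`, `validateConversationSketch`, and the rule id). It does not reproduce the real @mobile-reality/mdma-validator signatures; the only rule it models is the now-required form.onSubmit.

```typescript
// Hypothetical shapes; the real validator's types and rule ids may differ.
type Block = { type: string; id: string; onSubmit?: string };
type Issue = { messageIndex: number; blockId: string; rule: string };

// Per-block check: looks at a single block in isolation.
function validateBlockSketch(block: Block, messageIndex = 0): Issue[] {
  const issues: Issue[] = [];
  if (block.type === "form" && !block.onSubmit) {
    issues.push({ messageIndex, blockId: block.id, rule: "form-onsubmit-required" });
  }
  return issues;
}

// Multi-message check: runs the per-block check over every assistant message
// and records which message each issue came from.
function validateConversationSketch(messages: Block[][]): Issue[] {
  const issues: Issue[] = [];
  messages.forEach((blocks, messageIndex) => {
    for (const block of blocks) {
      issues.push(...validateBlockSketch(block, messageIndex));
    }
  });
  return issues;
}

const conversationIssues = validateConversationSketch([
  [{ type: "form", id: "personal-info-form" }], // missing onSubmit
  [{ type: "callout", id: "claim-submitted-callout" }],
]);
```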

The fixer prompt now ships with model-tailored variants across every major family: OpenAI's full gpt-5.x lineup (5, 5-mini, 5-nano, 5.1, 5.2, 5.4, 5.4-mini, 5.4-nano, 5.5) plus gpt-4.1/-mini/-nano, all four Anthropic Claude tiers (Opus 4.6/4.7, Sonnet, Haiku), the Gemini 2.5 + 3.x families (Pro, Flash, Flash-Lite, plus the customtools Pro variant), and xAI Grok 4.20/4.3. Each variant composes a shared MDMA_FIXER_* base with vendor-local guards we discovered during eval runs (no-leading-separator, preserve-input-structure, table-key-direction, replace-all-placeholders, etc.), so the same validate() → LLM fixer → re-validate loop passes at least 14 of 15 single-block fix tests on every supported model. Reasoning-leak suppression for Gemini Pro and Grok 4.3 is handled at the provider layer via an OpenRouter reasoning.exclude passthrough in the eval config rather than per-prompt, which keeps the fixer prompts themselves clean.
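The validate → fix → re-validate loop can be sketched as below. This is a minimal illustration under stated assumptions: `fixUntilValid`, the `Validator` and `Fixer` types, the attempt cap, and both stubs are hypothetical, and a real fixer would be an asynchronous LLM call rather than the synchronous stand-in used to keep the sketch self-contained.

```typescript
// Hypothetical loop; not the demo's actual implementation.
type Validator = (doc: string) => string[]; // returns a list of issue messages
type Fixer = (doc: string, issues: string[]) => string; // stand-in for the LLM fixer

function fixUntilValid(
  doc: string,
  validateDoc: Validator,
  fix: Fixer,
  maxAttempts = 2, // assumed cap so a stubborn document cannot loop forever
): { doc: string; issues: string[]; attempts: number } {
  let current = doc;
  let issues = validateDoc(current);
  let attempts = 0;
  while (issues.length > 0 && attempts < maxAttempts) {
    current = fix(current, issues); // ask the fixer to repair the reported issues
    issues = validateDoc(current); // re-validate the fixer's output
    attempts += 1;
  }
  return { doc: current, issues, attempts };
}

// Stub validator: flags a form that lacks an onSubmit line.
const stubValidator: Validator = (doc) =>
  doc.includes("onSubmit:") ? [] : ["form.onSubmit is required"];

// Stub fixer: appends the missing line instead of calling a model.
const stubFixer: Fixer = (doc) => `${doc}\nonSubmit: submit-claim`;

const fixResult = fixUntilValid("type: form\nid: personal-info-form", stubValidator, stubFixer);
```

Returning the remaining issues alongside the document lets the caller decide what to do when the attempt cap is reached without a clean validation.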

The repo now ships a ## Best Practices section covering custom-prompt design lessons learned across the eval matrix: concrete advice on when a flow needs explicit step boundaries, how to scope action labels as opaque handlers (don't reference back into the document), and why "one interactive component per assistant turn" is enforced. It also documents the two-agent architecture used by the Preview view as the recommended pattern for real product flows: keep MDMA generation strictly behind the generate_mdma tool and let a sub-agent (same model + provider, author prompt as system) own the document, so the conversation agent's visible text stays plain prose. Both surfaces, the README and the demo's docs view (CustomPromptBestPractices.tsx), render the same guidance so external readers and in-app explorers see consistent recommendations.

Type of Change

New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing behavior to change)
Refactor (no functional change)
CI / tooling

Breaking notes:

@mobile-reality/mdma-validator: validateFlow → validateConversation (rename); action-references rule removed.
@mobile-reality/mdma-spec: form.onSubmit is now required.
@mobile-reality/mdma-prompt-pack: MDMA_FIXER_CONVERSATION_JUDGE → MDMA_CONVERSATION_JUDGE (rename + relocated).

Packages Affected

Checklist

I have read the CONTRIBUTING guide.
My code follows the existing code style (pnpm format and pnpm lint pass).
I have added or updated tests that cover my changes.
All tests pass (pnpm test).
Type-checking passes (pnpm typecheck).
I have added a changeset (pnpm changeset) for the three affected published packages.
New or changed MDMA schemas are backwards-compatible (or marked as breaking).
Sensitive fields are marked with sensitive: true where appropriate (IBAN in the insurance flow).

How to Test

pnpm install && pnpm build
pnpm demo and open /preview in the browser.
In Agent Settings, enter an API key (Anthropic, OpenAI, or OpenRouter) and pick a model.
Type "Hi" — the agent should respond with one short conversational sentence in the chat and render Step 1 (personal-info-form) in the live preview pane on the right.
Submit the form. A backend "200 OK" entry should land in the Backend log drawer (bottom-right) with the IBAN/PII masked; the agent advances to Step 2.
Repeat through Step 4 — the final claim-submitted-callout should render with the polished Preview-specific styling.
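For reviewers checking the masked Backend log entry, the sketch below shows one way fields marked sensitive: true could be masked before logging. The `Field` shape, `maskValue`, `maskForLog`, and the "keep the last four characters" format are all assumptions for illustration; the demo's actual masking format is not described in this PR.

```typescript
// Hypothetical masking helper; the demo's real log format may differ.
type Field = { name: string; sensitive?: boolean };

// Replace everything except the last four characters with asterisks.
function maskValue(value: string): string {
  if (value.length <= 4) return "*".repeat(value.length);
  return "*".repeat(value.length - 4) + value.slice(-4);
}

// Build a log-safe copy of submitted values, masking only sensitive fields.
function maskForLog(
  fields: Field[],
  values: Record<string, string>,
): Record<string, string> {
  const masked: Record<string, string> = {};
  for (const field of fields) {
    const value = values[field.name];
    if (value === undefined) continue; // field was not submitted
    masked[field.name] = field.sensitive ? maskValue(value) : value;
  }
  return masked;
}

const loggedPayload = maskForLog(
  [{ name: "fullName" }, { name: "iban", sensitive: true }],
  { fullName: "Jane Doe", iban: "DE89370400440532013000" },
);
```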

Screenshots / Examples

Example flow definition driving the Preview (excerpt from demo/src/preview/insurance-flow-prompt.ts):

gitsad added 26 commits May 14, 2026 11:44

feat: added more validation tests, and passed gpt-5.5

88c79c6

feat: make onSubmit required and adjust fixer tests

2738d9d

chore: WIP 5.5, 5.4 and 5.4-mini

94a0100

chore: WIP gpt-5.4-min

f2d0ad0

feat: added best practices and wip in next gpt models

363e178

fix: fixed made up components

28eddb6

chore: revised all gpt models

bf5b5fa

chore: evaluated claude models

3656f15

chore: gemini WIP

5869d0f

chore: gemini 2.5 wip

9994c18

chore: finished gemini

b5b27f1

fix: updated schema

4526b3b

chore: clenaup, change validation to one block, changed validationFlow to validationConversation

c9ac353

feat: added variants for fixer prompts

3eed97d

feat: added preview

dc60a02

feat: working preview with fixer

6b53674

feat: added backend log

9c42a2b

chore: changed naming

bd64460

fix: improved callout for preview

7a998ce

feat: working preview on all models

6173a7f

chore: switched places

dae22b0

chore: changeset

5bb8529

chore: update tests

019778a

chore: lint adn format

dfc8c86

chore: updated Readme and Docs with fixer prompt matrix

5aaaef8

chore: updated readme

094f046

ssmrmmk approved these changes May 21, 2026

gitsad merged commit c50c2ef into main May 21, 2026
1 check passed

gitsad merged 26 commits into main from feat/validation-evals
