
fix(producer): retry probe navigation timeouts #1713

Merged: miguel-heygen merged 1 commit into main from fix/beginframe-browser-crash on Jun 25, 2026

@miguel-heygen miguel-heygen commented Jun 25, 2026

Collaborator

Summary

Treat Chromium navigation timeouts during browser probe/init as transient browser failures so runProbeStage retries once with a fresh browser session.

This preserves the preferred Linux BeginFrame path. It does not force screenshot mode.
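The retry-once behavior described above can be sketched roughly as follows. This is a minimal illustration, not the producer's actual runProbeStage: `ProbeSession`, `probeWithOneRetry`, `openSession`, `probe`, and `isTransient` are all hypothetical names introduced here. The sketch assumes each attempt opens its own browser session and that capture options are passed through unchanged, so a retry does not switch capture modes.

```typescript
// Hypothetical sketch of "retry once with a fresh browser session".
// None of these names come from the real producer code.
interface ProbeSession {
  close(): Promise<void>;
}

async function probeWithOneRetry<S extends ProbeSession, R>(
  openSession: () => Promise<S>,
  probe: (session: S) => Promise<R>,
  isTransient: (error: unknown) => boolean,
): Promise<R> {
  const maxAttempts = 2; // the first attempt plus a single retry

  for (let attempt = 1; ; attempt++) {
    // Every attempt gets a brand-new session; nothing is reused after a failure.
    const session = await openSession();
    try {
      return await probe(session);
    } catch (error) {
      // Rethrow when the retry budget is spent or the error is not transient.
      if (attempt >= maxAttempts || !isTransient(error)) throw error;
    } finally {
      // Always release the session; a failing close must not mask the probe result.
      await session.close().catch(() => undefined);
    }
  }
}
```

In this sketch a failure in `openSession` itself is not retried, and a second transient failure propagates to the caller, which would then be free to emit its terminal render error.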

Why

Render-streaming probe/init failures can surface as browser lifecycle errors such as:

  • Navigating frame was detached
  • Protocol error (Runtime.evaluate): Target closed
  • Navigation timeout of 60000 ms exceeded

Main already retries the first two classes through isTransientBrowserError, but navigation timeout was explicitly excluded. This PR closes that gap so probe-stage navigation timeouts use the same fresh-browser retry path before the producer emits a terminal render error.
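For readers unfamiliar with the classifier, here is a minimal sketch of what message-based classification covering all three signatures could look like. It is an assumption about the shape of the check, not the actual isTransientBrowserError in frameCapture.ts: the function name `isTransientBrowserErrorSketch`, the pattern list, and the regular expressions are hypothetical.

```typescript
// Hypothetical sketch; the real classifier in frameCapture.ts may be structured differently.
const TRANSIENT_BROWSER_ERROR_PATTERNS: readonly RegExp[] = [
  /Navigating frame was detached/i,
  /Target closed/i,
  // The class this PR starts treating as transient: navigation timeouts during probe/init.
  /Navigation timeout of \d+ ms exceeded/i,
];

function isTransientBrowserErrorSketch(error: unknown): boolean {
  // Accept both Error instances and plain values such as strings.
  const message = error instanceof Error ? error.message : String(error);
  return TRANSIENT_BROWSER_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}
```

Matching on `\d+ ms` rather than the literal `60000` is a design choice of this sketch, so that a different timeout setting would still classify the same way; whether the real classifier does this is not stated in the PR.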

Validation

  • Adds classifier coverage for Navigation timeout of 60000 ms exceeded.
  • Adds producer probe-stage coverage showing a transient navigation timeout retries with a fresh browser session and succeeds.
  • Operational prod-path validation was performed separately and is tracked in the internal incident notes; workflow/run identifiers are intentionally omitted from this public PR description.

Verification

  • bun test packages/engine/src/services/frameCapture-transientErrors.test.ts packages/producer/src/services/render/stages/probeStage.test.ts — 30 pass, 0 fail
  • bunx oxfmt --check packages/engine/src/services/frameCapture.ts packages/engine/src/services/frameCapture-transientErrors.test.ts packages/producer/src/services/render/stages/probeStage.test.ts
  • bunx oxlint packages/engine/src/services/frameCapture.ts packages/engine/src/services/frameCapture-transientErrors.test.ts packages/producer/src/services/render/stages/probeStage.test.ts
  • git diff --check HEAD~1..HEAD
  • pre-commit hooks passed: lint, format, fallow audit, typecheck, commitlint

Note: I also tried the broader render-stage test bundle including captureStreamingStage.test.ts; that test file has unrelated mock drift in isolation (getCapturePerfSummary export/session options mocks), so I did not treat it as verification for this PR.

@miguel-heygen

Collaborator Author

Closing this for now because it is retry classification / mitigation, not the root-cause fix for the Chromium renderer crash. We will reopen or replace with a root fix once the crash trigger is proven.

@miguel-heygen

Collaborator Author

Reopening based on the deeper prod-path repro. We now have exact-image A/B evidence: the deployed prod sidecar image (0.7.2, Chrome 150) fails when replaying the exact streamed-preview source that previously failed under /v1/render-stream, with the observed browser_probe signatures (6/8 failures: Navigating frame was detached / Target closed), while the 0.7.6 candidate image passes the identical replay 8/8.

The candidate classifier still returns false for Navigation timeout of 60000 ms exceeded, so this follow-up remains required to cover all observed production signatures. This should be treated as part of the root fix, not just a Temporal retry band-aid: it lets the sidecar recover inside probe by relaunching a fresh browser session before emitting an SSE error.

@miguel-heygen miguel-heygen reopened this Jun 25, 2026
@miguel-heygen miguel-heygen merged commit 0558b87 into main Jun 25, 2026
105 checks passed
@miguel-heygen miguel-heygen deleted the fix/beginframe-browser-crash branch June 25, 2026 14:52