fix(engine): honor producer BeginFrame disable env#1711

Closed
miguel-heygen wants to merge 1 commit into main from fix/producer-browser-crash-root

@miguel-heygen

Collaborator

Summary

Fixes the producer/browser-crash root cause by making the engine honor the existing production env knob:

  • PRODUCER_ENABLE_BEGIN_FRAME=false|off|0 now resolves to forceScreenshot=true
  • PRODUCER_FORCE_SCREENSHOT=true remains the direct screenshot-mode knob
  • explicit resolveConfig({ forceScreenshot }) overrides still win over env defaults
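
The precedence described above can be sketched roughly as follows. This is a minimal illustration, not the actual `packages/engine/src/config.ts` code: `resolveForceScreenshot`, the `Env` type, and the `EngineConfig` shape are hypothetical names, and the case-insensitive, whitespace-trimmed matching is an assumption. The relative order of the two env knobs when they disagree is also assumed here, since the summary does not specify it.

```typescript
// Hypothetical sketch of the env resolution this PR proposes.
// Names and parsing details are assumptions, not the real engine code.
type Env = Record<string, string | undefined>;

interface EngineConfig {
  forceScreenshot: boolean;
}

// Values of PRODUCER_ENABLE_BEGIN_FRAME that disable BeginFrame (from the PR summary).
const BEGIN_FRAME_DISABLED_VALUES = new Set(["false", "off", "0"]);

function resolveForceScreenshot(
  env: Env,
  overrides: Partial<EngineConfig> = {},
): boolean {
  // 1. An explicit override passed by the caller always wins over env defaults.
  if (overrides.forceScreenshot !== undefined) {
    return overrides.forceScreenshot;
  }

  // 2. PRODUCER_FORCE_SCREENSHOT=true is the direct screenshot-mode knob.
  if (env.PRODUCER_FORCE_SCREENSHOT?.trim().toLowerCase() === "true") {
    return true;
  }

  // 3. PRODUCER_ENABLE_BEGIN_FRAME=false|off|0 also resolves to screenshot mode.
  const beginFrame = env.PRODUCER_ENABLE_BEGIN_FRAME?.trim().toLowerCase();
  if (beginFrame !== undefined && BEGIN_FRAME_DISABLED_VALUES.has(beginFrame)) {
    return true;
  }

  // Default: BeginFrame capture stays enabled.
  return false;
}
```

In this sketch, `resolveForceScreenshot({ PRODUCER_ENABLE_BEGIN_FRAME: "false" }, { forceScreenshot: false })` returns `false`, because the explicit override is checked before either env variable.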

Production evidence

The failing Temporal activities surfaced as browser/session failures:

  • Navigation timeout of 60000 ms exceeded
  • Navigating frame was detached
  • Protocol error (Runtime.evaluate): Target closed
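
For triage, the three messages above can be grouped under one browser/session failure check. The sketch below is only an illustration of that grouping: `isBrowserSessionFailure` and the pattern list are hypothetical, and the regular expressions are derived solely from the three messages quoted in this PR, not from the engine's actual transient-error handling.

```typescript
// Hypothetical classifier for the browser/session failures listed above.
// Patterns are based only on the messages quoted in this PR.
const BROWSER_SESSION_FAILURE_PATTERNS: RegExp[] = [
  /Navigation timeout of \d+ ms exceeded/,
  /Navigating frame was detached/,
  /Protocol error \([\w.]+\): Target closed/,
];

function isBrowserSessionFailure(error: unknown): boolean {
  // Accept both Error instances and raw strings pulled from activity logs.
  const message = error instanceof Error ? error.message : String(error);
  return BROWSER_SESSION_FAILURE_PATTERNS.some((pattern) => pattern.test(message));
}
```

A check like this only labels the symptom. It does not say whether the underlying cause is the BeginFrame path or the browser lifecycle.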

Root-cause findings from prod:

  • temporal-hyperframes-producer-worker-sidecar-configmap already sets PRODUCER_ENABLE_BEGIN_FRAME=false
  • every inspected producer sidecar process had PRODUCER_ENABLE_BEGIN_FRAME=false in PID 1 env
  • the shipped sidecar bundle only read PRODUCER_FORCE_SCREENSHOT; it did not read PRODUCER_ENABLE_BEGIN_FRAME
  • prod logs still showed captureMode:"beginframe" and forceScreenshot:false
  • the failed pod had fresh Chrome core files at the exact failure minute, plus defunct chrome-headless children

So prod was configured to avoid BeginFrame, but the engine ignored that compatibility env and kept taking the BeginFrame path.

Verification

  • bun test packages/engine/src/config.test.ts packages/engine/src/services/frameCapture-transientErrors.test.ts packages/producer/src/services/render/stages/probeStage.test.ts — 58 pass, 0 fail
  • bunx oxfmt --check packages/engine/src/config.ts packages/engine/src/config.test.ts
  • bunx oxlint packages/engine/src/config.ts packages/engine/src/config.test.ts
  • git diff --check
  • pre-commit hooks passed: lint, format, fallow audit, typecheck, commitlint
  • direct probe: PRODUCER_ENABLE_BEGIN_FRAME=false now returns resolveConfig().forceScreenshot === true
  • direct probe: PRODUCER_ENABLE_BEGIN_FRAME=true keeps forceScreenshot === false

I also pulled the exact failed production artifacts and rendered all three locally through the built CLI with PRODUCER_ENABLE_BEGIN_FRAME=false; all selected screenshot mode and completed:

  • nav-timeout-23c1 — rendered successfully
  • frame-detached-1a80 — rendered successfully
  • target-closed-40c8 — rendered successfully

Deploy / rerun plan

After a producer sidecar image with this patch is deployed, reset or rerun the failed Temporal workflows to verify they no longer hit the browser crash path. I did not reset them before this deploy because current prod would still run the old sidecar image and old config behavior.

@miguel-heygen

Collaborator Author

Closing this as the wrong root-fix direction. The investigation evidence showed the prod compatibility env is stale/dead, but James clarified BeginFrame is the preferred Linux path and this env flag is intentionally not the approach we want to use. Continuing investigation at the BeginFrame/browser lifecycle layer instead.
