Skip to content

fix(replication): key logical replication leader lock on slot name#4151

Merged
d-cs merged 5 commits into
mainfrom
fix/runs-replication-leader-lock-per-slot
Jul 4, 2026
Merged

fix(replication): key logical replication leader lock on slot name#4151
d-cs merged 5 commits into
mainfrom
fix/runs-replication-leader-lock-per-slot

Conversation

@d-cs

@d-cs d-cs commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator

Problem

LogicalReplicationClient uses a Redlock leader lock to guarantee a single active consumer per Postgres logical replication slot. The lock resource was keyed on the client name:

logical-replication-client:${this.options.name}

A slot permits exactly one consumer, so the lock's job is to serialize consumers of a given slot. Keying it on name breaks that whenever two clients target the same slot with different names — most notably across a rolling deploy where the client name changes but slotName does not. Both acquire distinct locks, both consider themselves leader, and the second to reach START_REPLICATION hits replication slot "<slot>" is active for PID <n>. Because that query was fire-and-forget and its failure was only logged (no retry), the consumer stopped and replication stalled until the process was restarted.

Fix

1. Key the leader lock on slotName — the actual single-consumer resource:

logical-replication-client:${this.options.slotName}

Consumers of the same slot now contend on the same lock and hand off cleanly across restarts/deploys; different slots stay independent. name is kept for logging and the pg application_name.

2. Self-healing resubscribe (resubscribeOnFailure, opt-in) — instead of logging-and-dying, a client re-subscribes with exponential backoff after a lost election or a failed START_REPLICATION, so a rolling deploy self-heals: the incoming pod retries until the draining pod releases the slot, then takes over. Safety:

  • #cleanupAttempt() unconditionally ends the pg client (freeing the walsender) and releases the leader lock before rescheduling — retries never leak connections/locks.
  • shutdown() sets an intentional-stop latch re-checked after every await in subscribe() (and aborts the lock-acquire spin), so a resubscribe can never race or outlive an intentional shutdown.
  • Backoff resets only on genuine stream start, so a permanently stuck slot backs off to the ceiling and logs loudly rather than tight-looping; an epoch guard neutralises stale START_REPLICATION catches.

Runs- and sessions-replication opt in and use shutdown() for all intentional stops.

3. Observability — the admin runs-replication status route probed the old name-keyed Redis key (would report leader:false for every source after fix #1); now probes the slot-keyed key.

Tests

internal-packages/replication/src/client.test.ts (real Postgres + Redis containers):

  • same-slot/different-name → second client must not double-lead or race into "slot is active" (the regression)
  • a failing START_REPLICATION retry loop must not leak connections or locks
  • shutdown() during an in-flight subscribe() must not leave a zombie leader
  • subscribe() after shutdown() re-arms resubscribeOnFailure
  • self-heals once the leader releases the slot

Plus the multi-source wiring test updated to the slot-keyed lock keys.

Rollout

With the self-healing resubscribe, this ships as a plain rolling deploy — the incoming pods retry across the one-time lock-key transition and take over once the old pods drain (a brief replication stall that the durable slot replays on reconnect — no data loss). No stop-before-start required.

The LogicalReplicationClient leader lock (Redlock) was keyed on the client
name, but a Postgres logical replication slot allows exactly one consumer, so
the lock must serialize consumers of a given slot. When two clients target the
same slot with different names (e.g. across a rolling deploy where the name
changed but the slot did not), each acquired a distinct lock, both became
leader, and the second to run START_REPLICATION failed with 'replication slot
is active'. Since START_REPLICATION is fire-and-forget and only logged on error,
that consumer stopped and replication stalled until restarted.

Key the lock on slotName instead. Adds a regression test (two clients, same
slot, different name) verified to fail before and pass after.
@changeset-bot

changeset-bot Bot commented Jul 4, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 4ab8c7d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6f230aeb-d588-44f1-a9d8-619cd84fc831

📥 Commits

Reviewing files that changed from the base of the PR and between 5e13412 and 4ab8c7d.

📒 Files selected for processing (1)
  • internal-packages/replication/src/client.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal-packages/replication/src/client.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout. (24)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (1, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (12, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (2, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (7, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: typecheck / typecheck
⚠️ CI failures not shown inline (4)

GitHub Actions: 📝 Agent Instructions Audit / audit: fix(replication): key logical replication leader lock on slot name

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 25
--model claude-opus-4-8
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are reviewing a PR to check whether any agent instruction files need updating.
In this repo:
- Root shared agent guidance lives in `AGENTS.md`.
- Root `CLAUDE.md` is only a Claude Code adapter that imports `AGENTS.md`.
- Subdirectories may still have scoped `CLAUDE.md` files.
- `.claude/rules/` contains additional Claude Code guidance.
## Your task
1. Run `git diff origin/main...HEAD --name-only` to see which files changed in this PR.
2. For each changed directory, check the applicable instruction files: root `AGENTS.md`, any `CLAUDE.md` in that directory or a parent directory, and relevant `.claude/rules/` files.
3. Determine if any instruction file should be updated based on the changes. Consider:
   - New files/directories that aren't covered by existing documentation
   - Changed architecture or patterns that contradict current agent guidance
   - New dependencies, services, or infrastructure that agents should know about
   - Renamed or moved files that are referenced in an instruction file
   - Changes to build commands, test patterns, or development workflows
## Response format
If NO updates are needed, respond with exactly:
✅ Agent instruction files look current for this PR.
If updates ARE needed, respond with a short list:
📝 **Agent instruction updates suggested:**
- `AGENTS.md`: [what should be added/changed]
- `path/to/CLAUDE.md`: [what should be added/changed]
- `.claude/rules/file.md`: [what should be added/changed]
Keep suggestions specific and brief. Only flag things that would actually mislead agents in future sessions.
Do NOT suggest updates for trivial changes (bug fixes, small refactors within existing patterns).
Do NOT suggest creating new...

GitHub Actions: 📝 Agent Instructions Audit / 0_audit.txt: fix(replication): key logical replication leader lock on slot name

Conclusion: failure

View job details

     build-batching-rc.1         -> build-batching-rc.1
  * [new tag]             build-batching-rc.2         -> build-batching-rc.2
  * [new tag]             build-billing-0.0.1         -> build-billing-0.0.1
  * [new tag]             build-billing-0.0.2         -> build-billing-0.0.2
  * [new tag]             build-billing-0.0.3         -> build-billing-0.0.3
  * [new tag]             build-buildinfo-rc.0        -> build-buildinfo-rc.0
  * [new tag]             build-buildinfo-rc.1        -> build-buildinfo-rc.1
  * [new tag]             build-checkpoint-failover-rc.1 -> build-checkpoint-failover-rc.1
  * [new tag]             build-checkpoint-race-condition-1 -> build-checkpoint-race-condition-1
  * [new tag]             build-checkpoint-race-condition-2 -> build-checkpoint-race-condition-2
  * [new tag]             build-checkpoint-race-condition-3 -> build-checkpoint-race-condition-3
  * [new tag]             build-chris-test-blacksmith -> build-chris-test-blacksmith
  * [new tag]             build-chris-test-blacksmith-2 -> build-chris-test-blacksmith-2
  * [new tag]             build-cli-build-upgrade-rc.1 -> build-cli-build-upgrade-rc.1
  * [new tag]             build-clickhouse-reads-rc0  -> build-clickhouse-reads-rc0
  * [new tag]             build-clickhouse-reads-rc1  -> build-clickhouse-reads-rc1
  * [new tag]             build-compute.rc0           -> build-compute.rc0
  * [new tag]             build-compute.rc1           -> build-compute.rc1
  * [new tag]             build-compute.rc2           -> build-compute.rc2
  * [new tag]             build-compute.rc3           -> build-compute.rc3
  * [new tag]             build-compute.rc4           -> build-compute.rc4
  * [new tag]             build-compute.rc5           -> build-compute.rc5
  * [new tag]             build-compute.rc6           -> build-compute.rc6
  * [new tag]             build-corepack-offline-rc.0 -> build-corepack-offline-rc.0
  * [new tag]             build-current-deployment-rc.0 ...

GitHub Actions: 🔎 REVIEW.md Drift Audit / audit: fix(replication): key logical replication leader lock on slot name

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 30
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are auditing this PR for drift against `.claude/REVIEW.md`.
## Context
`.claude/REVIEW.md` is the repo's source of truth for what AI / agent code reviewers should treat as critical findings (rolling-deploy safety, hot-table indexes, recovery-path queries, testcontainers usage, Lua versioning, etc.). It is consumed by review agents to calibrate severity. If REVIEW.md goes stale, every future agent review degrades.
## Strategy — read this first
You have a hard turn budget. Spend it on signal, not coverage. The audit is allowed to miss things; it is NOT allowed to time out.
1. Read `.claude/REVIEW.md` once, in full.
2. Run `git diff origin/main...HEAD --name-only` to get the list of changed files. Do NOT read the diff content yet.
3. Scan the file-list for relevance to REVIEW.md scope. Relevance signals: changes to Prisma schema, Redis / queue / Lua code, hot tables, recovery / restart loops, new packages, deletions of paths REVIEW.md cites. Skim everything else.
4. Open at most **5 files** total — only the ones most likely to surface a real signal. If nothing in the file-list looks relevant to any REVIEW.md rule, do NOT read any files; go straight to the verdict.
5. Form a verdict and stop. Do not exhaust the turn budget exploring.
Large PRs (>50 files changed) are a strong signal to be MORE selective, not more thorough. Pick 3-5 files at most.
## What to look for
- **Stale references** — does any REVIEW.md rule cite a file, directory, function, table, Prisma model, or package name that has been removed or renamed in this PR (or is already gone from `main`)?
- **Contradictions** — does code in this PR clearly violate a current REVIEW.md rule? (Don't re-review the PR. Only flag if REVIE...

GitHub Actions: 🔎 REVIEW.md Drift Audit / 0_audit.txt: fix(replication): key logical replication leader lock on slot name

Conclusion: failure

View job details

y-run-engine.fix3
  * [new tag]             build-manual-checkpoints.rc1 -> build-manual-checkpoints.rc1
  * [new tag]             build-metadata-upgrade-logging.rc1 -> build-metadata-upgrade-logging.rc1
  * [new tag]             build-metadata-upgrade-logging.rc2 -> build-metadata-upgrade-logging.rc2
  * [new tag]             build-metadata-upgrade-logging.rc3 -> build-metadata-upgrade-logging.rc3
  * [new tag]             build-new-build-system.rc.1 -> build-new-build-system.rc.1
  * [new tag]             build-otel-upgrade-rc.0     -> build-otel-upgrade-rc.0
  * [new tag]             build-otel-upgrade-rc.1     -> build-otel-upgrade-rc.1
  * [new tag]             build-pre-pull-deployments-rc.1 -> build-pre-pull-deployments-rc.1
  * [new tag]             build-prod-rescue-rc.1      -> build-prod-rescue-rc.1
  * [new tag]             build-rate-limiter-fix-rc.1 -> build-rate-limiter-fix-rc.1
  * [new tag]             build-re2.rc0               -> build-re2.rc0
  * [new tag]             build-realtime-v2-stream-fix -> build-realtime-v2-stream-fix
  * [new tag]             build-realtime-v2-stream-fix-2 -> build-realtime-v2-stream-fix-2
  * [new tag]             build-realtime-v2-stream-fix-3 -> build-realtime-v2-stream-fix-3
  * [new tag]             build-realtime-v2-stream-fix-4 -> build-realtime-v2-stream-fix-4
  * [new tag]             build-realtime-v2-stream-fix-5 -> build-realtime-v2-stream-fix-5
  * [new tag]             build-realtimestreams-dedupe -> build-realtimestreams-dedupe
  * [new tag]             build-registry-maintenance-rc.1 -> build-registry-maintenance-rc.1
  * [new tag]             build-registry-maintenance-rc.2 -> build-registry-maintenance-rc.2
  * [new tag]             build-remote-ecr-rc.0       -> build-remote-ecr-rc.0
  * [new tag]             build-reschedule-hotfix.rc1 -> build-reschedule-hotfix.rc1
  * [new tag]             build-resume-fixes.rc1      -> build-resume-fixes.rc1
  * [new tag]             build-resume-fixes.rc2   ...

Walkthrough

This change updates logical replication leadership to use the replication slot name as the Redis/Redlock key, adds resubscribe-on-failure behavior and a new shutdown() client lifecycle path, and rewires webapp replication services to call the new lifecycle methods. The runs-replication status probe now checks leadership per (id, slotName) pair. New integration tests cover same-slot contention, resubscribe recovery, retry cleanup, shutdown races, and re-arming after shutdown.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers the change well, but it misses required template sections like Closes #, checklist, Testing, Changelog, and Screenshots. Add the missing template sections, including the issue reference, checklist items, testing steps, changelog, and screenshots or a clear note if none.
Linked Issues check ❓ Inconclusive The linked issue is a placeholder with no functional requirements, so compliance can't be meaningfully verified. Provide a real linked issue with acceptance criteria or implementation requirements to validate the PR against.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main change: moving the replication leader lock to the slot name.
Out of Scope Changes check ✅ Passed The resubscribe, shutdown, status, and test changes all support the stated lock-keying fix and are in scope.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/runs-replication-leader-lock-per-slot

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@d-cs d-cs self-assigned this Jul 4, 2026
d-cs added 2 commits July 4, 2026 14:34
Follow-on to keying the leader lock on slotName: the admin runs-replication
status route probed the old name-keyed Redis key (would report leader:false
for every source), and the multi-source wiring test asserted the old
name-keyed lock keys (CI-red). Both updated to the slot-keyed format.
… deploys)

Replication clients now optionally re-subscribe (with exponential backoff) after
a lost leader election or a failed START_REPLICATION, instead of logging once and
staying dead. This makes a rolling deploy self-heal: the incoming pod retries
until the draining pod releases the slot, then takes over — no stop-before-start.

Safety:
- #cleanupAttempt() unconditionally ends the pg client (freeing the walsender)
  and releases the leader lock before rescheduling, so retries never leak
  connections/locks (the plain stop() early-returns while _isStopped is set).
- shutdown() (used for intentional stop) sets an intentional-stop latch that is
  re-checked after every await in subscribe() and aborts #acquireLeaderLock's
  spin, so a resubscribe can never race or outlive a shutdown.
- backoff resets only on genuine stream start (replicationStart), so a
  permanently stuck slot backs off to the ceiling and logs loudly rather than
  tight-looping; a subscribeEpoch neutralises stale START_REPLICATION catches.

Runs- and sessions-replication opt in and use shutdown() for all intentional
stops. Adds container tests for the leak, shutdown-race, reset, and self-heal.
@d-cs d-cs marked this pull request as ready for review July 4, 2026 14:22

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

coderabbitai[bot]

This comment was marked as resolved.

@d-cs

d-cs commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal-packages/replication/src/client.ts (1)

334-356: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

teardown() leaks the leader lock and pg client if it throws mid-flight.

After #acquireLeaderLock(false) succeeds, this.client.connect() (or a throw before end()) will bypass both this.client.end() and #releaseLeaderLock(). Because the lock is now slot-keyed, a leaked lock blocks any new leader for the slot until leaderLockTimeoutMs expires, and the pg backend is held open too. Wrap the acquire-to-release body in try/finally.

🛡️ Proposed fix
     const leaderLockAcquired = await this.#acquireLeaderLock(false);

     if (!leaderLockAcquired) {
       return false;
     }

-    this.client = new Client({
-      ...this.options.pgConfig,
-      // `@ts-expect-error`
-      replication: "database",
-      application_name: this.options.name,
-    });
-    await this.client.connect();
-
-    // Drop the slot
-    const slotDropped = await this.#dropSlot();
-
-    await this.client.end();
-    this.client = null;
-
-    await this.#releaseLeaderLock();
-
-    return slotDropped;
+    try {
+      this.client = new Client({
+        ...this.options.pgConfig,
+        // `@ts-expect-error`
+        replication: "database",
+        application_name: this.options.name,
+      });
+      await this.client.connect();
+
+      // Drop the slot
+      return await this.#dropSlot();
+    } finally {
+      if (this.client) {
+        await tryCatch(this.client.end());
+        this.client = null;
+      }
+      await this.#releaseLeaderLock();
+    }

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e6800410-ed74-4f05-a387-e43019f21832

📥 Commits

Reviewing files that changed from the base of the PR and between 70bca82 and 1ae9413.

📒 Files selected for processing (7)
  • .server-changes/fix-replication-leader-lock-per-slot.md
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/test/runsReplicationInstance.test.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📜 Review details
⏰ Context from checks skipped due to timeout. (1)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
⚠️ CI failures not shown inline (4)

GitHub Actions: 📝 Agent Instructions Audit / 0_audit.txt: fix(replication): key logical replication leader lock on slot name

Conclusion: failure

View job details

-batching-rc.1
  * [new tag]             build-batching-rc.2         -> build-batching-rc.2
  * [new tag]             build-billing-0.0.1         -> build-billing-0.0.1
  * [new tag]             build-billing-0.0.2         -> build-billing-0.0.2
  * [new tag]             build-billing-0.0.3         -> build-billing-0.0.3
  * [new tag]             build-buildinfo-rc.0        -> build-buildinfo-rc.0
  * [new tag]             build-buildinfo-rc.1        -> build-buildinfo-rc.1
  * [new tag]             build-checkpoint-failover-rc.1 -> build-checkpoint-failover-rc.1
  * [new tag]             build-checkpoint-race-condition-1 -> build-checkpoint-race-condition-1
  * [new tag]             build-checkpoint-race-condition-2 -> build-checkpoint-race-condition-2
  * [new tag]             build-checkpoint-race-condition-3 -> build-checkpoint-race-condition-3
  * [new tag]             build-chris-test-blacksmith -> build-chris-test-blacksmith
  * [new tag]             build-chris-test-blacksmith-2 -> build-chris-test-blacksmith-2
  * [new tag]             build-cli-build-upgrade-rc.1 -> build-cli-build-upgrade-rc.1
  * [new tag]             build-clickhouse-reads-rc0  -> build-clickhouse-reads-rc0
  * [new tag]             build-clickhouse-reads-rc1  -> build-clickhouse-reads-rc1
  * [new tag]             build-compute.rc0           -> build-compute.rc0
  * [new tag]             build-compute.rc1           -> build-compute.rc1
  * [new tag]             build-compute.rc2           -> build-compute.rc2
  * [new tag]             build-compute.rc3           -> build-compute.rc3
  * [new tag]             build-compute.rc4           -> build-compute.rc4
  * [new tag]             build-compute.rc5           -> build-compute.rc5
  * [new tag]             build-compute.rc6           -> build-compute.rc6
  * [new tag]             build-corepack-offline-rc.0 -> build-corepack-offline-rc.0
  * [new tag]             build-current-deployment-rc.0 -> build-current-deployment-rc.0
  * [new...

GitHub Actions: 📝 Agent Instructions Audit / audit: fix(replication): key logical replication leader lock on slot name

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 25
--model claude-opus-4-8
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are reviewing a PR to check whether any agent instruction files need updating.
In this repo:
- Root shared agent guidance lives in `AGENTS.md`.
- Root `CLAUDE.md` is only a Claude Code adapter that imports `AGENTS.md`.
- Subdirectories may still have scoped `CLAUDE.md` files.
- `.claude/rules/` contains additional Claude Code guidance.
## Your task
1. Run `git diff origin/main...HEAD --name-only` to see which files changed in this PR.
2. For each changed directory, check the applicable instruction files: root `AGENTS.md`, any `CLAUDE.md` in that directory or a parent directory, and relevant `.claude/rules/` files.
3. Determine if any instruction file should be updated based on the changes. Consider:
   - New files/directories that aren't covered by existing documentation
   - Changed architecture or patterns that contradict current agent guidance
   - New dependencies, services, or infrastructure that agents should know about
   - Renamed or moved files that are referenced in an instruction file
   - Changes to build commands, test patterns, or development workflows
## Response format
If NO updates are needed, respond with exactly:
✅ Agent instruction files look current for this PR.
If updates ARE needed, respond with a short list:
📝 **Agent instruction updates suggested:**
- `AGENTS.md`: [what should be added/changed]
- `path/to/CLAUDE.md`: [what should be added/changed]
- `.claude/rules/file.md`: [what should be added/changed]
Keep suggestions specific and brief. Only flag things that would actually mislead agents in future sessions.
Do NOT suggest updates for trivial changes (bug fixes, small refactors within existing patterns).
Do NOT suggest creating new...

GitHub Actions: 🔎 REVIEW.md Drift Audit / 0_audit.txt: fix(replication): key logical replication leader lock on slot name

Conclusion: failure

View job details

uild-metadata-upgrade-logging.rc1
  * [new tag]             build-metadata-upgrade-logging.rc2 -> build-metadata-upgrade-logging.rc2
  * [new tag]             build-metadata-upgrade-logging.rc3 -> build-metadata-upgrade-logging.rc3
  * [new tag]             build-new-build-system.rc.1 -> build-new-build-system.rc.1
  * [new tag]             build-otel-upgrade-rc.0     -> build-otel-upgrade-rc.0
  * [new tag]             build-otel-upgrade-rc.1     -> build-otel-upgrade-rc.1
  * [new tag]             build-pre-pull-deployments-rc.1 -> build-pre-pull-deployments-rc.1
  * [new tag]             build-prod-rescue-rc.1      -> build-prod-rescue-rc.1
  * [new tag]             build-rate-limiter-fix-rc.1 -> build-rate-limiter-fix-rc.1
  * [new tag]             build-re2.rc0               -> build-re2.rc0
  * [new tag]             build-realtime-v2-stream-fix -> build-realtime-v2-stream-fix
  * [new tag]             build-realtime-v2-stream-fix-2 -> build-realtime-v2-stream-fix-2
  * [new tag]             build-realtime-v2-stream-fix-3 -> build-realtime-v2-stream-fix-3
  * [new tag]             build-realtime-v2-stream-fix-4 -> build-realtime-v2-stream-fix-4
  * [new tag]             build-realtime-v2-stream-fix-5 -> build-realtime-v2-stream-fix-5
  * [new tag]             build-realtimestreams-dedupe -> build-realtimestreams-dedupe
  * [new tag]             build-registry-maintenance-rc.1 -> build-registry-maintenance-rc.1
  * [new tag]             build-registry-maintenance-rc.2 -> build-registry-maintenance-rc.2
  * [new tag]             build-remote-ecr-rc.0       -> build-remote-ecr-rc.0
  * [new tag]             build-reschedule-hotfix.rc1 -> build-reschedule-hotfix.rc1
  * [new tag]             build-resume-fixes.rc1      -> build-resume-fixes.rc1
  * [new tag]             build-resume-fixes.rc2      -> build-resume-fixes.rc2
  * [new tag]             build-resume-fixes.rc3      -> build-resume-fixes.rc3
  * [new tag]             build-resume-large-batches.rc1 -> b...

GitHub Actions: 🔎 REVIEW.md Drift Audit / audit: fix(replication): key logical replication leader lock on slot name

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 30
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are auditing this PR for drift against `.claude/REVIEW.md`.
## Context
`.claude/REVIEW.md` is the repo's source of truth for what AI / agent code reviewers should treat as critical findings (rolling-deploy safety, hot-table indexes, recovery-path queries, testcontainers usage, Lua versioning, etc.). It is consumed by review agents to calibrate severity. If REVIEW.md goes stale, every future agent review degrades.
## Strategy — read this first
You have a hard turn budget. Spend it on signal, not coverage. The audit is allowed to miss things; it is NOT allowed to time out.
1. Read `.claude/REVIEW.md` once, in full.
2. Run `git diff origin/main...HEAD --name-only` to get the list of changed files. Do NOT read the diff content yet.
3. Scan the file-list for relevance to REVIEW.md scope. Relevance signals: changes to Prisma schema, Redis / queue / Lua code, hot tables, recovery / restart loops, new packages, deletions of paths REVIEW.md cites. Skim everything else.
4. Open at most **5 files** total — only the ones most likely to surface a real signal. If nothing in the file-list looks relevant to any REVIEW.md rule, do NOT read any files; go straight to the verdict.
5. Form a verdict and stop. Do not exhaust the turn budget exploring.
Large PRs (>50 files changed) are a strong signal to be MORE selective, not more thorough. Pick 3-5 files at most.
## What to look for
- **Stale references** — does any REVIEW.md rule cite a file, directory, function, table, Prisma model, or package name that has been removed or renamed in this PR (or is already gone from `main`)?
- **Contradictions** — does code in this PR clearly violate a current REVIEW.md rule? (Don't re-review the PR. Only flag if REVIE...
🧰 Additional context used
📓 Path-based instructions (13)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • internal-packages/replication/src/client.test.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: Access environment variables through the env export of env.server.ts instead of directly accessing process.env
Use subpath exports from @trigger.dev/core package instead of importing from the root @trigger.dev/core path

Always use findFirst instead of findUnique for Prisma queries.

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Do not import env.server.ts directly or indirectly into test files; instead pass environment-dependent values through options/parameters to make code testable

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
**/*.{ts,tsx,js,jsx,mts,cts,mjs,cjs}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx,js,jsx,mts,cts,mjs,cjs}: Use pnpm run typecheck for changes in apps (apps/*) and internal packages (internal-packages/*), and never use build to verify those changes.
Use Vitest for tests, and never mock anything; use testcontainers instead.
Prefer static imports over dynamic import(), and only use dynamic imports for unresolved circular dependencies, genuine code-splitting needs, or conditional runtime loading.

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
**/*.{test,spec}.{ts,tsx,js,jsx,mts,cts,mjs,cjs}

📄 CodeRabbit inference engine (CLAUDE.md)

Place test files next to the source files they cover (for example, MyService.ts -> MyService.test.ts).

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • internal-packages/replication/src/client.test.ts
**/*.{ts,tsx,js,jsx,mts,cts,mjs,cjs,md,mdx}

📄 CodeRabbit inference engine (CLAUDE.md)

Always import from @trigger.dev/sdk when writing Trigger.dev tasks; never use @trigger.dev/sdk/v3 or deprecated client.defineJob.

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
**/*.{test,spec}.{js,jsx,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.{test,spec}.{js,jsx,ts,tsx}: Keep test files beside the files under test and write tests with descriptive describe and it blocks.
Avoid mocks or stubs in tests, and use helpers from @internal/testcontainers when Redis or Postgres are needed.

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • internal-packages/replication/src/client.test.ts
apps/webapp/**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

In test files, never import env.server.ts; pass configuration as options instead.

Files:

  • apps/webapp/test/runsReplicationInstance.test.ts
apps/webapp/app/routes/**/*.ts

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

Use Remix flat-file route naming with dot-separated segments in app/routes/ (for example, api.v1.tasks.$taskId.trigger.ts maps to /api/v1/tasks/:taskId/trigger).

Files:

  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
🧠 Learnings (21)
📚 Learning: 2026-05-14T14:54:39.095Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3545
File: .server-changes/agent-view-sessions.md:10-10
Timestamp: 2026-05-14T14:54:39.095Z
Learning: In the `trigger.dev` repository, do not flag inconsistent dot vs slash notation in route/path strings inside `.server-changes/*.md` files. These markdown files are consumed verbatim into the changelog, so the mixed notation (e.g., `resources.orgs.../runs.$runParam/...`) is intentional and should be preserved as-is.

Applied to files:

  • .server-changes/fix-replication-leader-lock-per-slot.md
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-06-13T19:53:13.759Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3937
File: packages/trigger-sdk/skills/realtime-and-frontend/SKILL.md:258-260
Timestamp: 2026-06-13T19:53:13.759Z
Learning: When reviewing code that uses `trigger.dev/react-hooks`’s `useRealtimeRun`, preserve the call signature where the first argument is the full realtime handle object (not `handle.id`). This is intentional to maintain type-safety and is consistent with the official docs; do not suggest changing the first argument from the handle object to `handle.id`.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-06-17T17:13:49.929Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3948
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.bulk-actions.$bulkActionParam/route.tsx:48-62
Timestamp: 2026-06-17T17:13:49.929Z
Learning: In triggerdotdev/trigger.dev, within `dashboardLoader`/`dashboardAction` (or similar context resolver code) whenever you resolve an organization ID from an organization slug for RBAC/enterprise authorization scope, always read from the primary Prisma client (`prisma`), not `$replica`. Using `$replica` can hit replica-lag and cause the RBAC lookup/authorization to run without the correct org scope (bypassing intended role enforcement). Implement the slug→org lookup with `prisma.organization.findFirst(...)` (or equivalent primary-client query) and add an inline comment documenting why the primary client is required (replica lag could lead to unscoped RBAC checks).

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-06-23T13:04:21.413Z
Learnt from: carderne
Repo: triggerdotdev/trigger.dev PR: 4023
File: apps/webapp/app/services/upsertBranch.server.ts:14-18
Timestamp: 2026-06-23T13:04:21.413Z
Learning: In TypeScript, it’s valid to `import { type X }` and then use `typeof X` in a type-only position, e.g. `type Alias = z.infer<typeof X>`. The `type` modifier suppresses the runtime import, but the type checker still has the full exported type so `z.infer<typeof X>` can resolve correctly. In code reviews, don’t flag this as a TypeScript compile error as long as `typeof X` is used in a type context (e.g., with `z.infer`, `type` aliases, generics), not as a runtime value.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-05-07T12:25:18.271Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3531
File: apps/webapp/test/sentryTraceContext.server.test.ts:9-47
Timestamp: 2026-05-07T12:25:18.271Z
Learning: In the triggerdotdev/trigger.dev webapp test suite, it is acceptable to leave `createInMemoryTracing()` calls that register a global `NodeTracerProvider` without `afterEach`/`afterAll` teardown. Do not flag this as a test-ordering risk when the code follows the established pattern used across webapp tests (e.g., replication service/benchmark/backfiller tests). This is considered safe because `trace.getActiveSpan()` when called outside a `context.with(...)` block reads `AsyncLocalStorage.getStore()` (undefined when no `run()` scope exists), so it falls back to `ROOT_CONTEXT` with no attached span—regardless of which provider is registered.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
📚 Learning: 2026-05-28T20:02:10.647Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3772
File: apps/webapp/test/findOrCreateBackgroundWorker.test.ts:1-1
Timestamp: 2026-05-28T20:02:10.647Z
Learning: In the triggerdotdev/trigger.dev monorepo, for the `apps/webapp` package use the established convention of storing Vitest tests (unit, integration, and e2e) under `apps/webapp/test/` rather than colocating them next to source files. Do not flag files located in `apps/webapp/test/` as violating any rule that says to colocate tests with source.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
📚 Learning: 2026-05-12T21:04:05.815Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3542
File: apps/webapp/app/components/sessions/v1/SessionStatus.tsx:1-3
Timestamp: 2026-05-12T21:04:05.815Z
Learning: In this Remix + TypeScript codebase, do not flag a server/client boundary violation when a file imports only types from a module matching `*.server`.

Specifically, it’s safe to import types using `import type { Foo } from "*.server"` or `import { type Foo } from "*.server"` because TypeScript erases type-only imports at compile time and they emit no JavaScript, so they won’t cross the Remix server/client bundle boundary.

Only raise the boundary concern for value imports (e.g., `import { Foo }` without `type`, or `import Foo`), since those produce JavaScript output.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-06-25T18:21:51.905Z
Learnt from: carderne
Repo: triggerdotdev/trigger.dev PR: 4039
File: apps/webapp/app/routes/invite-revoke.tsx:0-0
Timestamp: 2026-06-25T18:21:51.905Z
Learning: During the Zod v4 migration in the triggerdotdev/trigger.dev webapp, ensure any imports from `conform-to/zod` use the Zod-4 subpath: `conform-to/zod/v4` (e.g., `import { parseWithZod } from "conform-to/zod/v4"`). Do not import from the package root `conform-to/zod`, because it is the Zod 3 implementation and may load Zod-3-only symbols (e.g., `ZodBranded`, `ZodEffects`), which can throw at module load (notably with `zod4.4.3`). This should be enforced across `apps/webapp/**/*` where helpers like `parseWithZod` and `conformZodMessage` are used.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-07-03T17:10:21.498Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 4148
File: apps/webapp/app/models/orgMember.server.ts:149-168
Timestamp: 2026-07-03T17:10:21.498Z
Learning: In triggerdotdev/trigger.dev, `User.email` (Prisma schema: `internal-packages/database/prisma/schema.prisma`) currently does NOT use `citext` and does NOT have a `lower(email)` functional unique index. Therefore, do not introduce Prisma queries like `where: { email: { equals: <value>, mode: "insensitive" } }` (or any case-insensitive lookup) against `User.email`, because it can force sequential scans of the `users` table under load. During review, ensure email is normalized (e.g., lowercased/trimmed) before both writes and subsequent lookups, and if true case-insensitive behavior/uniqueness is required, implement it via a separate app-wide migration (e.g., switch to `citext` and/or add a functional unique index with backfill) rather than bolting it onto individual feature PRs.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-05-18T14:40:02.173Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In the triggerdotdev/trigger.dev repo, the policy “Never mock anything — use testcontainers instead” should only be enforced for integration tests that interact with real external services (e.g., Redis, Postgres) via actual infrastructure. For unit tests that exercise pure in-memory logic (e.g., cache semantics) it is OK to stub collaborators such as `ApiClient` using Vitest (`vi.fn()`) to assert call counts or control behavior. Do not flag `vi.fn()`-based `ApiClient` stubs in unit tests as violations of the testcontainers policy.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • internal-packages/replication/src/client.test.ts
📚 Learning: 2026-06-04T18:16:35.386Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-06-09T17:58:04.699Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about ~300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • internal-packages/replication/src/client.test.ts
  • internal-packages/replication/src/client.ts
📚 Learning: 2026-06-16T09:19:47.637Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3960
File: apps/webapp/test/prismaInfrastructureErrorCapture.test.ts:0-0
Timestamp: 2026-06-16T09:19:47.637Z
Learning: In this repo’s Vitest setup, `vitest.config.ts` uses `globals: true`, so identifiers like `vi`, `describe`, `it`, and `expect` are available as globals in Vitest test files. During code review, do not flag missing `vi`/`describe`/`it`/`expect` imports as a runtime error or correctness issue when they’re used in `*.test.ts/tsx` or `*.spec.ts/tsx` files. Explicit imports are still preferred for consistency, but they’re not required for runtime behavior.

Applied to files:

  • apps/webapp/test/runsReplicationInstance.test.ts
  • internal-packages/replication/src/client.test.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.

Applied to files:

  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-04-20T14:50:16.440Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:224-231
Timestamp: 2026-04-20T14:50:16.440Z
Learning: In Trigger.dev’s replication services (e.g., sessionsReplicationService.server.ts and runsReplicationService.server.ts), the “acknowledge-before-flush” behavior is intentional. The `_latestCommitEndLsn` should be updated at Postgres commit time and acknowledged on a periodic interval (via methods like `#acknowledgeLatestTransaction`) without waiting for ClickHouse batch flush to complete. Reviewers should not flag this as a durability/ordering bug; it is an established project-wide at-least-once delivery trade-off used across both runs and sessions replication services.

Applied to files:

  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-04-20T15:08:49.959Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:204-215
Timestamp: 2026-04-20T15:08:49.959Z
Learning: For replication services in `apps/webapp/app/services/*ReplicationService.server.ts`, keep the `ConcurrentFlushScheduler` deduplication key shape consistent across the related services (e.g., sessions vs runs) by using the same `${item.event}_${item.session.id}` / `${item.event}_${item.run.id}` pattern. If the key format ever needs to change (such as keying only by session/run id), make the update in all related replication services together—never in just one—so deduplication behavior stays aligned across services.

Applied to files:

  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-05-05T09:38:02.512Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.

Applied to files:

  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
🔇 Additional comments (8)
apps/webapp/app/services/sessionsReplicationService.server.ts (2)

434-437: 🩺 Stability & Availability | ⚡ Quick win

Fire-and-forget shutdown() still lacks a .catch().

This promise isn't awaited, so a rejection surfaces as an unhandled rejection. The sibling runsReplicationService.#handleTransaction swallows these in the equivalent path; mirror that here.

🛡️ Proposed fix
-      this._replicationClient.shutdown().finally(() => {
-        this._isSubscribed = false;
-        this._isShutDownComplete = true;
-      });
+      this._replicationClient
+        .shutdown()
+        .catch((error) => {
+          this.logger.error("Error stopping replication client during shutdown", { error });
+        })
+        .finally(() => {
+          this._isSubscribed = false;
+          this._isShutDownComplete = true;
+        });

190-190: LGTM!

Also applies to: 269-269, 298-298

internal-packages/replication/src/client.ts (1)

22-23: LGTM!

Also applies to: 69-79, 123-129, 152-154, 179-183, 246-325, 359-577, 860-889

.server-changes/fix-replication-leader-lock-per-slot.md (1)

1-6: LGTM!

apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts (1)

10-18: LGTM!

Also applies to: 32-34, 52-52

apps/webapp/app/services/runsReplicationService.server.ts (1)

289-289: LGTM!

Also applies to: 432-434, 464-464, 642-642

apps/webapp/test/runsReplicationInstance.test.ts (1)

411-416: LGTM!

internal-packages/replication/src/client.test.ts (1)

185-237: LGTM!

Also applies to: 239-295, 297-351, 353-407, 409-467

d-cs added 2 commits July 4, 2026 15:36
…nsaction

The fire-and-forget shutdown() in the shutdown branch used .finally() with no
.catch(), so a rejection (e.g. leader-lock release throwing) would surface as an
unhandled rejection. Mirror the runs service, which swallows and logs it.
…ally)

teardown() acquired the (slot-keyed) leader lock and opened a pg client, then
dropped the slot and released both. A throw from connect()/#dropSlot() bypassed
the release, stranding the lock (blocking the slot's next leader until TTL) and
leaking the pg backend. Wrap the acquire->release body in try/finally so both
are always released.
@d-cs

d-cs commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai good catch on teardown() — fixed in 4ab8c7d by wrapping the acquire→release body in try/finally, so the pg client and the (now slot-keyed) leader lock are released on both success and a mid-teardown throw.

@d-cs d-cs enabled auto-merge (squash) July 4, 2026 14:43
@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
git log --oneline -5 2>/dev/null
git show 4ab8c7d6a --stat 2>/dev/null | head -50
ast-grep outline internal-packages/replication/src/client.ts --view expanded 2>/dev/null | head -80

Length of output: 4673


🏁 Script executed:

#!/bin/bash
sed -n '327,362p' internal-packages/replication/src/client.ts

Length of output: 1259


Confirmed — commit 4ab8c7d wraps the client.connect()#dropSlot() body in a try/finally inside teardown() (Lines 341–358), so both this.client.end() and #releaseLeaderLock() now run on success or if #dropSlot()/connect() throws. That closes the leak/stranded-lock path I flagged. Thanks for the fix, @d-cs!

[resolve]

@d-cs d-cs merged commit c9f427e into main Jul 4, 2026
42 checks passed
@d-cs d-cs deleted the fix/runs-replication-leader-lock-per-slot branch July 4, 2026 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Testing creating a new issue

2 participants