feat(storage): storage-agnostic BlobStore layer + push-latency quick wins#79
feat(storage): storage-agnostic BlobStore layer + push-latency quick wins#79kevincodex1 wants to merge 5 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR replaces the hard-coded Tigris storage client with a ChangesPluggable Storage Backend and Etag-Driven Caching
Sequence Diagram(s)sequenceDiagram
participant Client as Git Client
participant Handler as git_receive_pack
participant Store as RepoStore
participant Archive as RepoArchive
participant Blob as BlobStore (S3/FS/IPFS)
Client->>Handler: HTTP push
Handler->>Store: acquire_write(key)
Store->>Store: pg_try_advisory_lock (with retry)
Store->>Store: sync_down_if_stale(require_fresh=true)
Store->>Archive: head_etag(owner, repo)
Archive->>Blob: head(key)
Blob-->>Archive: ObjectMeta {size, etag}
alt etag matches cached versions
Archive-->>Store: etag found, skip download
else stale or missing
Archive->>Blob: get(key)
Blob-->>Archive: tar.zst bytes
Archive->>Archive: decompress_repo to local_path
Archive-->>Store: versions.insert(key, etag)
end
Store-->>Handler: RepoWriteGuard {conn, archive, versions}
Handler->>Handler: smart_http::receive_pack
Handler-->>Client: HTTP response (push acked)
alt push_ok AND async_upload
Handler->>Store: tokio::spawn guard.release(true)
Store->>Archive: upload(owner, repo, local_path)
Archive->>Blob: put(key, tar.zst)
Blob-->>Archive: ObjectMeta {etag}
Store->>Store: versions.insert + pg_advisory_unlock
else push failed OR sync upload
Handler->>Store: guard.release(push_ok).await
Store->>Archive: upload → store.put
Store->>Store: pg_advisory_unlock
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Thanks for the contribution. A couple of things will help us review this faster:
See CONTRIBUTING.md. Update the PR and these notes will clear automatically. |
…wins
Replace the hardcoded Tigris client with a backend-agnostic object store.
BlobStore trait (get/put/head/delete/list) with backends:
- s3: any S3-compatible service (Tigris, Cloudflare R2, AWS S3, MinIO, B2)
- fs: local filesystem (atomic writes; unit-tested)
- ipfs: Kubo node via MFS (etag = content CID)
RepoArchive composes a bare repo into one repos/v1/{slug}/{repo}.tar.zst
object on top of any backend (tar+zstd moved here; git/tigris.rs removed).
RepoStore rewired from Option<TigrisClient> to Option<RepoArchive>.
Config (backward compatible): GITLAWB_STORAGE_BACKEND, GITLAWB_S3_BUCKET,
GITLAWB_S3_ENDPOINT, GITLAWB_S3_FORCE_PATH_STYLE, GITLAWB_STORAGE_FS_DIR,
GITLAWB_ASYNC_UPLOAD. GITLAWB_TIGRIS_BUCKET kept as a legacy alias.
Push-latency quick wins:
- Skip the redundant pre-write download when the cached etag already matches
storage (previously up to two full-repo downloads per push).
- Write-back ack: ack the client before the durable upload completes; the
advisory lock is held until upload finishes so cross-node consistency holds.
cargo check clean; storage::fs + repo_store unit tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fa8a9f3 to
7b7e690
Compare
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/gitlawb-node/src/storage/archive.rs (1)
162-184: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick winClean up the temp dir when the swap phase fails.
The unpack-error path removes
tmp_dir, but the swap phase does not: ifremove_dir_all(local_path)orrenamereturnsErr, the freshly-extracted.{file_name}.tmp-extract.<uuid>dir is orphaned. Because each extraction uses a fresh UUID and nothing reclaims leftovers, repeated failures (e.g. permission errors, open handles) leak temp dirs indefinitely.Secondary concern:
remove_dir_all(local_path)beforerenamedestroys the only valid local copy first, so a subsequentrenamefailure leaves the repo missing — underminingsync_down_if_stale'slocal_path.exists()self-heal fallback.🧹 Proposed fix: scope cleanup over the whole swap
- let lock = publish_lock(local_path); - let _publish = lock.lock().expect("publish lock poisoned"); - if local_path.exists() { - std::fs::remove_dir_all(local_path).context("removing stale repo dir")?; - } - std::fs::rename(&tmp_dir, local_path).context("swapping extracted repo into place")?; - Ok(()) + let swap = (|| -> Result<()> { + let lock = publish_lock(local_path); + let _publish = lock.lock().expect("publish lock poisoned"); + if local_path.exists() { + std::fs::remove_dir_all(local_path).context("removing stale repo dir")?; + } + std::fs::rename(&tmp_dir, local_path).context("swapping extracted repo into place")?; + Ok(()) + })(); + if let Err(e) = swap { + let _ = std::fs::remove_dir_all(&tmp_dir); + return Err(e); + } + Ok(())🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-node/src/storage/archive.rs` around lines 162 - 184, The swap phase in the archive extraction does not clean up tmp_dir if either the remove_dir_all(local_path) call or the rename call fails, leaving orphaned temporary directories. Additionally, removing local_path before confirming the rename succeeds risks losing the only valid copy if rename fails. Ensure that tmp_dir is cleaned up in all failure paths during the swap phase by wrapping the entire lock-acquisition and swap logic so that if any error occurs after acquiring the publish lock, the tmp_dir is removed before returning the error. Consider the order of operations to minimize data loss risk by only removing the old local_path after confirming the rename will succeed.
🧹 Nitpick comments (2)
crates/gitlawb-node/src/storage/ipfs.rs (1)
72-73: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winAvoid
body.to_vec()to prevent doubling upload memory.Line [72] copies the full blob before upload. For large repo archives this spikes memory usage and allocator pressure; use a streaming multipart part instead.
Suggested fix
- let part = reqwest::multipart::Part::bytes(body.to_vec()).file_name("blob"); + let part = reqwest::multipart::Part::stream(reqwest::Body::from(body)).file_name("blob");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-node/src/storage/ipfs.rs` around lines 72 - 73, The multipart Part creation uses body.to_vec() which copies the entire blob into memory before uploading, doubling memory usage for large files. Replace the reqwest::multipart::Part::bytes() call with a streaming approach such as reqwest::multipart::Part::stream() to avoid the unnecessary copy and pass the body content directly as a stream instead of converting it to a vector first.crates/gitlawb-node/src/storage/fs.rs (1)
114-117: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winDon’t silently skip directory read failures in
list().Lines [114]-[117] convert I/O failures into partial results. Returning
Ok(keys)after hidden errors can mislead future admin/GC/migration callers into acting on incomplete inventories.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-node/src/storage/fs.rs` around lines 114 - 117, The list() function silently ignores directory read failures by using an Err(_) => continue pattern in the std::fs::read_dir(&dir) match statement, which causes it to return Ok(keys) with incomplete results. Replace this silent failure handling with proper error propagation or handling that ensures callers are informed of failures rather than acting on incomplete inventory data. Instead of continuing on read_dir errors, return an Err result or log and return the appropriate error indication so that admin, GC, and migration processes receive accurate information about the success or failure of listing all keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/api/repos.rs`:
- Around line 555-561: The `tokio::spawn(guard.release(true))` call has two
issues: the advisory lock release at the end of the `release()` method silently
discards errors with `let _ =`, making lock-release failures invisible to
operators, and the spawned task is not integrated with graceful shutdown,
potentially leaving locks acquired indefinitely. Fix by adding proper error
logging (via warn! or error!) for the advisory lock release query failure
instead of using `let _ =` to discard the error, and integrate the spawned task
with axum's graceful shutdown mechanism using a task tracker or shutdown signal
to ensure the task completes or is cancelled gracefully during shutdown.
In `@crates/gitlawb-node/src/git/repo_store.rs`:
- Around line 178-192: The cache-miss path currently uses unwrap_or(None) on the
head_etag call, which swallows any transient storage errors and returns None,
causing the function to fall through and return a non-existent local_path that
makes the caller see a misleading "repo not found" error. Replace the
unwrap_or(None) on the head_etag(...).await call with the ? operator to
propagate storage errors instead of silently converting them to None, so real
errors surface to the caller rather than being hidden as "not found" responses.
- Around line 79-116: The sync_down_if_stale function uses permissive self-heal
fallbacks for both read and write paths, but this causes a lost-update issue for
writes: when the remote etag differs from the cached etag (remote is newer) and
download fails, the function falls back to the stale local copy, which then gets
uploaded and overwrites the newer remote archive. Add a freshness-requirement
parameter to sync_down_if_stale that gates the self-heal behavior: when
freshness is required (write path), propagate errors from the HEAD check (around
lines 82-89) and download operation (around lines 105-115) instead of falling
back to the local copy; when freshness is not required (read path), keep the
current permissive fallback. Update calls from acquire_fresh to pass false
(freshness not required) and acquire_write to pass true (freshness required).
In `@crates/gitlawb-node/src/storage/fs.rs`:
- Around line 75-82: The fixed temporary file extension `tmp-put` created by
`path.with_extension("tmp-put")` on line 75 causes a race condition when
concurrent `put` operations occur on the same key, as both writes will use the
same temp file path and corrupt each other's data. Generate a unique temporary
filename for each write operation instead of using a fixed extension, such as by
appending a UUID, random suffix, or timestamp to the temporary filename to
ensure each concurrent write has its own isolated temp file before the atomic
rename into place.
In `@crates/gitlawb-node/src/storage/ipfs.rs`:
- Around line 166-168: The list() function currently skips non-success responses
from the files/ls endpoint by using continue, which silently treats
authentication, network, and server errors as empty successful listings. Instead
of continuing when !resp.status().is_success() is true, you should properly
handle the error by returning a Result error that propagates the failure to the
caller, ensuring that list() accurately reflects whether the operation succeeded
or failed rather than returning partial or empty results on error conditions.
- Around line 25-29: The reqwest::Client::new() call in the new method does not
configure any timeout values, which allows requests to the IPFS API to hang
indefinitely if the service becomes slow or unreachable. Replace the default
Client::new() call with a Client::builder() pattern that explicitly sets timeout
values using methods like timeout() to define a reasonable connection and
request timeout duration, then call build() to create the configured client.
This ensures that storage operations have bounded execution times and prevents
indefinite hangs.
In `@crates/gitlawb-node/src/storage/mod.rs`:
- Around line 67-70: The selection order documentation comment in the storage
module starting with "Selection order:" is incomplete and does not reflect the
actual implementation. Update the comment block to include the auto-detection of
the ipfs backend when the GITLAWB_IPFS_API environment variable is set, which
occurs in the actual implementation around lines 84-86. Add ipfs as an
additional auto-selection option to the documented order after the fs fallback
so the documentation accurately describes all auto-detection paths in the
storage backend selection logic.
- Around line 91-145: The function currently downgrades to local-only mode
(returning None) when explicitly configured backends fail to initialize, which
can silently accept writes without the intended durable backend and cause data
persistence issues. In the match statement, change the error handling for the
s3, fs, and ipfs branches so that when backend creation fails (in the Err
branches of S3BlobStore::new, FsBlobStore::new, and IpfsBlobStore::new), return
the error to propagate it up the call stack rather than logging a warning and
returning None. This ensures that if a backend is explicitly configured but
cannot be initialized, the application will fail at startup rather than silently
degrading to local-only mode.
---
Outside diff comments:
In `@crates/gitlawb-node/src/storage/archive.rs`:
- Around line 162-184: The swap phase in the archive extraction does not clean
up tmp_dir if either the remove_dir_all(local_path) call or the rename call
fails, leaving orphaned temporary directories. Additionally, removing local_path
before confirming the rename succeeds risks losing the only valid copy if rename
fails. Ensure that tmp_dir is cleaned up in all failure paths during the swap
phase by wrapping the entire lock-acquisition and swap logic so that if any
error occurs after acquiring the publish lock, the tmp_dir is removed before
returning the error. Consider the order of operations to minimize data loss risk
by only removing the old local_path after confirming the rename will succeed.
---
Nitpick comments:
In `@crates/gitlawb-node/src/storage/fs.rs`:
- Around line 114-117: The list() function silently ignores directory read
failures by using an Err(_) => continue pattern in the std::fs::read_dir(&dir)
match statement, which causes it to return Ok(keys) with incomplete results.
Replace this silent failure handling with proper error propagation or handling
that ensures callers are informed of failures rather than acting on incomplete
inventory data. Instead of continuing on read_dir errors, return an Err result
or log and return the appropriate error indication so that admin, GC, and
migration processes receive accurate information about the success or failure of
listing all keys.
In `@crates/gitlawb-node/src/storage/ipfs.rs`:
- Around line 72-73: The multipart Part creation uses body.to_vec() which copies
the entire blob into memory before uploading, doubling memory usage for large
files. Replace the reqwest::multipart::Part::bytes() call with a streaming
approach such as reqwest::multipart::Part::stream() to avoid the unnecessary
copy and pass the body content directly as a stream instead of converting it to
a vector first.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3b25ebb6-e61a-4ce6-a7ae-1636bdde8f45
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
crates/gitlawb-node/Cargo.tomlcrates/gitlawb-node/src/api/repos.rscrates/gitlawb-node/src/config.rscrates/gitlawb-node/src/git/mod.rscrates/gitlawb-node/src/git/repo_store.rscrates/gitlawb-node/src/main.rscrates/gitlawb-node/src/storage/archive.rscrates/gitlawb-node/src/storage/fs.rscrates/gitlawb-node/src/storage/ipfs.rscrates/gitlawb-node/src/storage/mod.rscrates/gitlawb-node/src/storage/s3.rs
💤 Files with no reviewable changes (1)
- crates/gitlawb-node/src/git/mod.rs
Major: - repo_store: add require_fresh to sync_down_if_stale. Write path (acquire_write) fails closed instead of falling back to a stale local copy that a later upload would use to clobber a newer remote archive (lost update); read path (acquire_fresh) keeps self-heal. - fs: unique per-write temp name (uuid) + cleanup on failure, so concurrent puts to the same key can't corrupt each other. - storage::build: return Result and fail closed — a configured-but-unavailable or misconfigured backend now aborts boot instead of silently downgrading to local-only (durability drift). main.rs propagates the error. - ipfs: give the HTTP client connect/total timeouts so an unresponsive Kubo API can't hang push/write flows. Minor/trivial: - repo_store::acquire: propagate storage HEAD errors on cache miss instead of collapsing them into a misleading 404. - repo_store: log advisory-lock release failures instead of discarding them. - archive: clean up the extracted temp dir if the swap step fails. - fs::list / ipfs::list: propagate read/HTTP errors instead of silently returning a partial listing as success. - ipfs::put: stream the body (Part::stream_with_length) instead of to_vec, halving peak upload memory. - storage::build doc: document IPFS auto-detection. Adds a concurrent-put FS test. Deferred (follow-up): integrating the write-back release task with graceful-shutdown drain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
please have a look into this @beardthelion |
beardthelion
left a comment
There was a problem hiding this comment.
Reviewed the merged state against current main (clean merge), built it, ran the storage and repo_store unit tests, and traced the write and consistency paths. The BlobStore abstraction and the Tigris-to-archive migration are clean and complete, and the fs backend's atomic swap and per-method key validation hold up. My concerns are all in the push-latency changes: the write-back default has a data-loss path, and it leans on an advisory-lock pattern that isn't sound over a connection pool. Highest first.
Findings
-
[P1] Default
async_uploadto false; ack-before-upload can lose an acked push on a rolling deploy.crates/gitlawb-node/src/config.rs(GITLAWB_ASYNC_UPLOADisdefault_value_t = true),api/repos.rs:561. With it on,git_receive_packacks the client and uploads in a detached task. If the process stops in that window (a SIGTERM during a rolling deploy is the routine trigger, not just a crash), storage keeps the old archive. On restart the in-processversionscache is empty, so the write-pathsync_down_if_stale(require_fresh=true) finds no cached etag, downloads the stale remote, and overwrites the newer local copy atrepo_store.rs:102-113. Lazy migration does not save it:acquireonly uploads when storage has no object (:154-175), so an already-present stale archive is never re-synced. The "local copy survives; lazy migration re-syncs" comment does not hold. Default this to false, or add a durable local-newer marker plus a visible signal on upload failure and a shutdown drain before making it the default. -
[P2, pre-existing] The advisory lock is acquired and released on different pooled connections.
repo_store.rs:233locks viafetch_one(&self.pool);:260/:494unlock viaexecute(&self.pool). Session-scoped Postgres advisory locks only release on the connection that took them, and sqlx hands out an arbitrary pooled connection per query, so the unlock can land on a different session and silently no-op (the boolean result is not checked) while the lock persists. A later same-session acquire is re-entrant and succeeds spuriously; a different session blocks 60s then 500s. This is already on main, so the PR does not introduce it, but the write-back change leans on the lock being held through the upload and now runs the unlock from a detached task that is dropped on crash, so it depends on a guarantee that is not actually there. Worth fixing as part of leaning on it: hold one connection (pool.acquire()) for the lock's lifetime and run lock+unlock on it, or use a transaction-scopedpg_advisory_xact_lock. -
[P2] The S3 client sets no request timeout.
storage/s3.rs:31.S3BlobStore::newbuilds the client with no timeout config, and the AWS Rust SDK sets none by default, so a hung S3 endpoint blocks the calling task unbounded. Underasync_uploadthat task holds the advisory lock, so one stuck upload blocks every later push to the repo. The IPFS client already sets connect and total timeouts; S3 should match, viaaws_sdk_s3::config::timeout::TimeoutConfigon the client builder or atokio::time::timeoutwrapper. -
[P2] "Fail closed when the backend is unreachable" only covers missing config, not connectivity.
storage/mod.rs:69. The doc says a configured-but-unavailable backend returns Err, butS3BlobStore::newonly loads credentials and builds the client with no network call, so an unreachable bucket boots clean and fails on the first live request. Scope the comment to config validation, or add a cheap probe against the bucket at boot if real fail-closed is intended. -
[P2] The risky new paths have no tests.
compress_repo/decompress_repoare pure filesystem functions but there is no round-trip test and nothing covering the "corrupt archive leaves the original intact" atomicity claim, which is the whole download-durability story.sync_down_if_stale'srequire_fresh=true(the lost-update guard) and the etag-skip are also untested (for_testingforcesarchive=None). Both are cheap to add with the fs backend and a test constructor that accepts an archive.
P3s, briefly: IPFS list() doesn't reject traversal in its prefix the way the other four methods do (ipfs.rs:159), latent since it has no caller (a traversal-segment check that still allows the empty list-all prefix, or just drop list()); absent-vs-error detection keys on Kubo error-string substrings (ipfs.rs:68), and since Kubo's error Code is always 0 there's nothing structured to switch on, so a pinned comment is the realistic fix; IPFS put() returns Err after a successful write when the follow-up stat fails (ipfs.rs:104); and several doc comments still say "Tigris" after the rename (repo_store.rs:131-133,212,282, archive.rs:4).
The abstraction is the right call and the migration is complete. Once the async_upload default and the advisory-lock soundness are sorted I'm glad to take another pass. @kevincodex1
…, tests) P1: - config: default GITLAWB_ASYNC_UPLOAD to false. Write-back ack opened a data-loss window (crash between ack and upload → stale remote overwrites newer local on restart); strict upload-before-ack is now the default and the latency optimization is opt-in. P2: - repo_store: hold ONE pooled connection for the advisory lock's lifetime. pg_advisory_lock is session-scoped, so acquiring and releasing on different pooled connections let unlock silently no-op while the lock lingered. Acquire, sync-error release, and final release now all run on the pinned connection (RepoWriteGuard owns a PoolConnection instead of the pool). - s3: add TimeoutConfig (60s attempt / 300s operation) so a hung endpoint can't block the task — and, under async_upload, the advisory lock — unbounded. - storage::build: scope the fail-closed doc to config + client construction; note that live connectivity (e.g. unreachable S3 bucket) surfaces on first use. P3: - ipfs: reject `..`/`.` traversal in list() prefixes; make a failed post-write stat non-fatal (the write succeeded — return without an etag). - docs: replace stale generic "Tigris" references with "storage" across repo_store/api/state/archive; keep legitimate Tigris-as-S3-example mentions. Tests: archive compress/decompress round-trip + atomicity (corrupt archive leaves existing copy intact) + fs upload/download; repo_store sync_down_if_stale etag-skip and require_fresh fail-closed-vs-fallback. Skipped with reason: IPFS string-matched error detection — Kubo's HTTP API returns 500 + JSON message with no stable error codes, so substring matching is the pragmatic norm; a structured-error layer is disproportionate here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
please check again @beardthelion |
Superseded: the five findings from this review are resolved on d35a1c8; re-reviewing the new head separately.
beardthelion
left a comment
There was a problem hiding this comment.
Thanks, this addresses the prior round cleanly. I re-reviewed the merged state on the new head: async_upload now defaults false, the advisory lock is acquired and released on a single pinned connection (sound on every exit path I traced), the S3 client has operation/attempt timeouts, ipfs put no longer fails on a post-write stat, and the new compress/decompress and sync_down_if_stale tests exercise real behavior. All five prior findings are resolved.
One regression the lock fix introduces, which is why I'm holding for another round.
Findings
-
[P2] Bound concurrent pushes per DID before pinning a pool connection per push.
git/repo_store.rs:236(thepool.acquire()now held for the whole write),server.rs:178(push routes).acquire_writeholds one connection from the shared pool across the write, the 60s lock-retry, and the upload, andgit_write_routeshas no per-DID rate limit (unlikecreation_routesatserver.rs:96). Because identities are permissionless, a single actor can fire enough concurrent or slow pushes to exhaust the default 10-connection pool (db/mod.rs:269, noPgPoolOptions), after which every DB-touching handler node-wide blocks onpool.acquire()for the ~30s acquire timeout and then fails. It's bounded (30s acquire timeout, 300s S3 op timeout) and self-recovers, but it's repeatable and reachable, and the previous connection-per-query design didn't have it. Bound concurrent writes per DID ongit_write_routes(therate_limit_by_didlayer creation already uses, or a semaphore), and/or give lock connections a dedicated, sized pool. Notpg_advisory_xact_lockhere: the lock has to outlive the transaction (it is held across the post-commit upload), so a transaction-scoped lock would release before the upload and reopen the stale-archive window the lock exists to close. -
[P3] Add
#[must_use]toRepoWriteGuard.git/repo_store.rs:456. The connection pinning is correct, but a guard dropped withoutrelease()returns its connection to the pool with the session advisory lock still held (sqlx does not reset session state on check-in), so it lingers until that backend connection is evicted. Every current caller releases on all paths, so this is not reachable today;#[must_use]is cheap insurance against a future caller that forgets.
The lock-soundness approach is right and the rest is solid. Once the push path is bounded I'll approve. @kevincodex1
Follow-up to the connection-pinning fix (beardthelion review): - Pool exhaustion (P2): pinning a connection per push from the shared pool let a burst of concurrent/slow pushes drain it and block every DB handler node-wide. Give advisory locks a SEPARATE bounded pool (16 conns) created in main; acquire_write pins from it. RepoStore now holds only that lock pool — it touches Postgres solely for locks — so the handler pool is never starved by pushes. Removed the now-unused Db::pool() accessor. - Lock leak via dropped guard (P3): add #[must_use] to RepoWriteGuard. Dropping it without release() returns the connection to the pool with the session advisory lock still held until that backend connection is evicted; the annotation makes an accidental drop a compile-time warning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@coderabbitai please do full review of this PR |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/gitlawb-node/src/git/repo_store.rs (1)
161-190: 🗄️ Data Integrity & Integration | 🔴 Critical | 🏗️ Heavy liftSerialize background archive uploads with the write lock.
Both lazy migration and
init()upload archives from background tasks without holding the repo advisory lock. A first push can write and upload a newer archive while one of these older background uploads is still compressing; if the backgroundputfinishes last, it overwrites the newer storage object. This is especially reachable for a newly created repo where the empty initial archive races the first push.Move these uploads behind the same advisory-lock path and re-check storage under the lock before
upload, or makeinit()upload synchronously before the repo is exposed.Also applies to: 316-329
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-node/src/git/repo_store.rs` around lines 161 - 190, The background archive uploads in the spawned async task lack serialization with the repo advisory lock, creating a race condition where an older background upload can overwrite a newer archive pushed by the user. Acquire the repo advisory lock before performing the archive.upload() call within the tokio::spawn task, re-check if the archive already exists in storage after acquiring the lock (to avoid redundant uploads), and only proceed with the upload if the lock is held and the storage check confirms it's needed. Apply this same fix to the other location mentioned at lines 316-329 where archive uploads occur outside the advisory lock path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/api/repos.rs`:
- Around line 556-560: The comment describing the write-back ack behavior
incorrectly claims that lazy migration provides recovery for the
crash-durability window when async uploads are deferred. This is inaccurate
because lazy migration does not reliably re-sync when an older archive already
exists in storage. Update the comment to accurately reflect the actual
durability tradeoff: if the node stops after sending the ack but before the
upload completes, storage remains stale and a restart may roll the local cache
back from storage. Remove the claim that lazy migration re-syncs and instead
describe the realistic recovery behavior and the latency-durability tradeoff
being made.
In `@crates/gitlawb-node/src/config.rs`:
- Around line 155-162: The `guard.release()` method returns unit type `()`
instead of a `Result`, causing upload errors to be silently logged and never
propagated to callers. This violates the durability guarantee of the strict path
when `async_upload=false`, where clients receive acknowledgment even though the
durable upload failed. Modify the `release()` method to return a `Result` type
so that errors can be properly propagated, then update all write endpoint call
sites (repos, pulls, issues) that call `guard.release().await` to handle and
propagate the `Result` in the strict path, ensuring the client only receives an
acknowledgment after confirmed successful durable upload.
In `@crates/gitlawb-node/src/git/repo_store.rs`:
- Around line 346-356: The archive.upload() error handling currently only logs
warnings without propagating the failure, causing synchronous write operations
to succeed even when durable storage updates fail. Modify the function
containing this match statement to return a Result type, and change the Err(e)
case to return an error instead of just logging it. For synchronous callers
(async_upload=false), let them handle the returned error to fail the request
appropriately. For asynchronous paths, spawn a background task that can still
log any returned errors. Apply these changes to all similar upload failure
handling patterns throughout the file, including the related code at lines
503-516.
- Around line 519-521: When the write fails and is skipped (in the else block
where the warn! log is), the cached etag for this repo must be invalidated to
prevent subsequent write operations from using stale cache information. Add a
call after the warn! statement to drop or clear the cached etag version
associated with self.repo_name, ensuring that the next write operation will
perform a fresh pre-write download and not operate on inconsistent local state.
In `@crates/gitlawb-node/src/storage/archive.rs`:
- Around line 180-190: The current implementation deletes the old repo before
attempting the rename, which means if the rename fails the old repo is lost. To
fix this, instead of immediately removing the old repo with remove_dir_all when
local_path exists, first move it to a temporary backup location using
std::fs::rename. Then attempt to rename the extracted tmp_dir into place. If the
rename succeeds, clean up the backup directory. If the rename fails (in the
error handling block), restore the original repo by renaming the backup back to
local_path before removing the failed tmp_dir, ensuring the old repo is
preserved on failure.
In `@crates/gitlawb-node/src/storage/fs.rs`:
- Around line 98-103: In the `head` method, replace the error detection logic
that uses `path.exists()` with explicit error kind checking by checking if
`e.kind() == std::io::ErrorKind::NotFound` instead, since `exists()` returns
false for both missing files and permission/IO errors, causing real errors to be
silently masked. Additionally, locate the code around line 127 that contains
`rd.flatten()` in what appears to be a directory listing operation and remove
the `flatten()` call to ensure per-entry errors are properly propagated with
context instead of being silently dropped, as dropping errors causes partial
results to be reported as success and misleads callers that depend on complete
directory listings.
In `@crates/gitlawb-node/src/storage/mod.rs`:
- Around line 47-50: The BlobStore trait methods `get` and `put` require
buffering entire objects as Bytes in memory, which causes out-of-memory issues
for large repository archives. Add streaming/spooling APIs to the BlobStore
trait (such as stream-based get and put methods) that can handle objects
incrementally without loading the full content into memory at once. Modify the
archive upload/download code paths to use these new streaming methods instead of
the current whole-object `get` and `put` contract, allowing large `.tar.zst`
files to be processed in chunks.
- Around line 148-152: The validate_key function is missing validation for
backslashes in blob keys, which could cause path confusion on Windows since blob
keys are documented as forward-slash-delimited. Add a check in the validation
logic that rejects any key containing a backslash character using the contains
method, similar to the existing null byte check. Include this validation
alongside the other blob key checks (the traversal segment check and the
absolute path/null byte check) to ensure backslashes are caught during input
validation rather than relying on downstream conversion like the list function
does.
---
Outside diff comments:
In `@crates/gitlawb-node/src/git/repo_store.rs`:
- Around line 161-190: The background archive uploads in the spawned async task
lack serialization with the repo advisory lock, creating a race condition where
an older background upload can overwrite a newer archive pushed by the user.
Acquire the repo advisory lock before performing the archive.upload() call
within the tokio::spawn task, re-check if the archive already exists in storage
after acquiring the lock (to avoid redundant uploads), and only proceed with the
upload if the lock is held and the storage check confirms it's needed. Apply
this same fix to the other location mentioned at lines 316-329 where archive
uploads occur outside the advisory lock path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 604e1276-dc51-4f03-a84b-b577d38cfffe
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (16)
crates/gitlawb-node/Cargo.tomlcrates/gitlawb-node/src/api/issues.rscrates/gitlawb-node/src/api/pulls.rscrates/gitlawb-node/src/api/repos.rscrates/gitlawb-node/src/config.rscrates/gitlawb-node/src/db/mod.rscrates/gitlawb-node/src/git/mod.rscrates/gitlawb-node/src/git/repo_store.rscrates/gitlawb-node/src/git/tigris.rscrates/gitlawb-node/src/main.rscrates/gitlawb-node/src/state.rscrates/gitlawb-node/src/storage/archive.rscrates/gitlawb-node/src/storage/fs.rscrates/gitlawb-node/src/storage/ipfs.rscrates/gitlawb-node/src/storage/mod.rscrates/gitlawb-node/src/storage/s3.rs
💤 Files with no reviewable changes (3)
- crates/gitlawb-node/src/git/mod.rs
- crates/gitlawb-node/src/db/mod.rs
- crates/gitlawb-node/src/git/tigris.rs
| async fn get(&self, key: &str) -> Result<Option<Bytes>>; | ||
|
|
||
| /// Store an object, returning its metadata (including the new etag). | ||
| async fn put(&self, key: &str, body: Bytes) -> Result<ObjectMeta>; |
There was a problem hiding this comment.
🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy lift
Avoid making large repo archives a whole-memory BlobStore operation.
get/put require entire objects as Bytes, so repo archive upload/download paths must buffer the full .tar.zst in memory. Large repositories can OOM the node during push/fetch; consider streaming/spooling APIs for archive objects instead of a whole-object contract.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/storage/mod.rs` around lines 47 - 50, The BlobStore
trait methods `get` and `put` require buffering entire objects as Bytes in
memory, which causes out-of-memory issues for large repository archives. Add
streaming/spooling APIs to the BlobStore trait (such as stream-based get and put
methods) that can handle objects incrementally without loading the full content
into memory at once. Modify the archive upload/download code paths to use these
new streaming methods instead of the current whole-object `get` and `put`
contract, allowing large `.tar.zst` files to be processed in chunks.
| if key.split('/').any(|seg| seg == ".." || seg == ".") { | ||
| anyhow::bail!("blob key contains traversal segment: {key}"); | ||
| } | ||
| if key.starts_with('/') || key.contains('\0') { | ||
| anyhow::bail!("blob key is absolute or contains null byte: {key}"); |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
# First, let's explore the storage module structure
fd -t f -e rs . crates/gitlawb-node/src/storage/
# Also check what's in the directory
ls -la crates/gitlawb-node/src/storage/Repository: Gitlawb/node
Length of output: 750
🏁 Script executed:
# Find and examine FsBlobStore implementation
rg -t rs "FsBlobStore" --max-count 20 -A 5 -B 2Repository: Gitlawb/node
Length of output: 177
🏁 Script executed:
# Look at the full storage/mod.rs file around the lines in question
sed -n '140,160p' crates/gitlawb-node/src/storage/mod.rsRepository: Gitlawb/node
Length of output: 642
🏁 Script executed:
# Search for documentation about blob keys
rg -t rs "blob key" --max-count 30 -B 2 -A 2Repository: Gitlawb/node
Length of output: 177
🏁 Script executed:
# Search for FsBlobStore with proper rg syntax
rg "FsBlobStore" --max-count 30 -A 5 -B 2 crates/gitlawb-node/src/storage/Repository: Gitlawb/node
Length of output: 5691
🏁 Script executed:
# Look at fs.rs to understand how FsBlobStore handles blob keys
cat -n crates/gitlawb-node/src/storage/fs.rsRepository: Gitlawb/node
Length of output: 10079
🏁 Script executed:
# Search for how validate_key is used
rg "validate_key" --max-count 20 -B 3 -A 3 crates/gitlawb-node/src/storage/Repository: Gitlawb/node
Length of output: 6806
🏁 Script executed:
# Search for blob key documentation in comments
rg "blob key" -B 3 -A 3 crates/gitlawb-node/src/Repository: Gitlawb/node
Length of output: 1661
🏁 Script executed:
# Check if there are any tests for backslash rejection in blob keys
rg "backslash|\\\\|platform separator" crates/gitlawb-node/src/storage/ -A 3 -B 3Repository: Gitlawb/node
Length of output: 6043
🏁 Script executed:
# Look for any documentation about blob key format
rg -B 5 -A 5 "forward-slash|forward slash|blob key format" crates/gitlawb-node/src/storage/Repository: Gitlawb/node
Length of output: 931
🏁 Script executed:
# Check the list function to see why they handle backslashes there
sed -n '116,143p' crates/gitlawb-node/src/storage/fs.rsRepository: Gitlawb/node
Length of output: 1376
🏁 Script executed:
# Search for any comments or issues related to Windows path handling
rg "Windows|windows|platform|separator" crates/gitlawb-node/src/storage/ -B 2 -A 2Repository: Gitlawb/node
Length of output: 680
Reject backslashes in blob keys during validation.
Blob keys are documented as forward-slash-delimited. On Windows, PathBuf::join() interprets backslashes as path separators, which could lead to path confusion despite the secondary starts_with check. The list() function already converts backslashes (line 132 in fs.rs), indicating platform awareness—input validation should enforce the documented format. Add a backslash check to validate_key():
Suggested fix
- if key.starts_with('/') || key.contains('\0') {
- anyhow::bail!("blob key is absolute or contains null byte: {key}");
+ if key.starts_with('/') || key.contains('\\') || key.contains('\0') {
+ anyhow::bail!("blob key is absolute, contains a platform separator, or contains null byte: {key}");
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/storage/mod.rs` around lines 148 - 152, The
validate_key function is missing validation for backslashes in blob keys, which
could cause path confusion on Windows since blob keys are documented as
forward-slash-delimited. Add a check in the validation logic that rejects any
key containing a backslash character using the contains method, similar to the
existing null byte check. Include this validation alongside the other blob key
checks (the traversal segment check and the absolute path/null byte check) to
ensure backslashes are caught during input validation rather than relying on
downstream conversion like the list function does.
CodeRabbit round (data-integrity/consistency): - release()/release_after_write() now return Result. The push strict path, merge, fork, and issue writes surface a durable-upload failure to the client instead of acking a write that never reached storage; the async write-back path logs it. (findings 3, 4) - Cache invalidation on failure: on a failed write OR a failed upload, drop the cached etag so the next write re-downloads and reconciles rather than trusting un-persisted local state. (finding 5) - Background uploads serialized: new upload_under_lock() takes the per-repo advisory lock. init() uploads synchronously under it (fail closed); lazy migration uploads under it (skip-if-exists). Stops an older snapshot from clobbering a concurrent locked push. (finding 2) - archive swap no longer destroys the old repo up front: move it to a backup, rename the new copy in, delete the backup on success, restore it on rename failure — local_path is never left missing. (finding 6) - fs head() distinguishes NotFound from permission/IO errors via ErrorKind instead of path.exists(). (finding 1) - fs list() propagates per-entry read errors instead of dropping them via flatten(). (finding 7) - Reworded the async-upload comment to state the real tradeoff: on a crash after ack, storage stays stale until the repo's next successful push — it is not auto-reconciled. (finding 8) Already in place from earlier commits (verified): IPFS list() traversal guard, IPFS non-fatal post-write stat, S3 client timeouts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/api/repos.rs`:
- Around line 1187-1197: In the error handler for the release_after_write call
(within the map_err block), add cleanup logic to remove the local fork directory
stored in disk_path before returning the AppError. This prevents orphan data
from accumulating when the durable upload fails, which can cause issues on retry
attempts due to the destination path already existing. The cleanup should happen
in the error path before constructing and returning the AppError with the
failure message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5d006b80-0156-44ab-a8a3-778966abe6f7
📒 Files selected for processing (6)
crates/gitlawb-node/src/api/issues.rscrates/gitlawb-node/src/api/pulls.rscrates/gitlawb-node/src/api/repos.rscrates/gitlawb-node/src/git/repo_store.rscrates/gitlawb-node/src/storage/archive.rscrates/gitlawb-node/src/storage/fs.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- crates/gitlawb-node/src/storage/archive.rs
- crates/gitlawb-node/src/storage/fs.rs
- crates/gitlawb-node/src/git/repo_store.rs
| // Upload fork to storage — fail closed if the durable upload fails rather | ||
| // than reporting a fork that only exists on this node's local disk. | ||
| state | ||
| .repo_store | ||
| .release_after_write(&forker_did, &fork_name) | ||
| .await; | ||
| .await | ||
| .map_err(|e| { | ||
| AppError::Git(format!( | ||
| "fork created locally but durable upload failed: {e}" | ||
| )) | ||
| })?; |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Clean up the local fork directory on durable-upload failure.
If release_after_write(...) fails, this returns an error but leaves the cloned disk_path behind. That creates orphan data and can make retries fail due to an existing destination path.
♻️ Proposed fix
- state
- .repo_store
- .release_after_write(&forker_did, &fork_name)
- .await
- .map_err(|e| {
- AppError::Git(format!(
- "fork created locally but durable upload failed: {e}"
- ))
- })?;
+ if let Err(e) = state
+ .repo_store
+ .release_after_write(&forker_did, &fork_name)
+ .await
+ {
+ if let Err(cleanup_err) = tokio::fs::remove_dir_all(&disk_path).await {
+ tracing::warn!(
+ path = %disk_path.display(),
+ err = %cleanup_err,
+ "failed to clean up local fork after durable upload failure"
+ );
+ }
+ return Err(AppError::Git(format!(
+ "fork created locally but durable upload failed: {e}"
+ )));
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/api/repos.rs` around lines 1187 - 1197, In the error
handler for the release_after_write call (within the map_err block), add cleanup
logic to remove the local fork directory stored in disk_path before returning
the AppError. This prevents orphan data from accumulating when the durable
upload fails, which can cause issues on retry attempts due to the destination
path already existing. The cleanup should happen in the error path before
constructing and returning the AppError with the failure message.
jatmn
left a comment
There was a problem hiding this comment.
Findings
-
[P2] Clean up the local fork directory when fork creation fails after the clone
crates/gitlawb-node/src/api/repos.rs:1187-1218
Aftergit clone --mirrorsucceeds,release_after_writeis called without any cleanup on failure. If the durable upload fails, the handler returns an error but leaves the cloned mirror atdisk_path. A later fork with the same name (or a retry of this one) will then fail because the directory already exists. The same leak happens ifstate.db.create_repofails after the upload. Removedisk_pathbefore returning from either error path. CodeRabbit's latest review explicitly requested this cleanup. -
[P2] Don't block the Tokio runtime in
FsBlobStore::headand::put
crates/gitlawb-node/src/storage/fs.rs:39-51andcrates/gitlawb-node/src/storage/fs.rs:69-107
headandputboth callSelf::meta_of, which usesstd::fs::metadatadirectly inside an async trait method. That blocks the async executor thread while stat'ing local files. Forput, move the stat into the existingspawn_blockingclosure; forhead, usetokio::fs::metadata(or aspawn_blockingcall) instead of the synchronous helper. -
[P3] Advisory-lock acquisition timeout is shorter than the backend operation timeout
crates/gitlawb-node/src/git/repo_store.rs:234-250
acquire_writeretriespg_try_advisory_lockfor 60 seconds, but the S3 backend sets a 300-second operation timeout (crates/gitlawb-node/src/storage/s3.rs:44-47) and the IPFS backend sets the same 300-second total timeout (crates/gitlawb-node/src/storage/ipfs.rs:30-31). A single large-repo upload can therefore hold the advisory lock long enough for concurrent pushes to the same repo to be rejected after only 60 seconds. Align the lock-acquisition timeout with the configured storage operation timeout, or use a blocking advisory-lock call with a matching timeout.
Both findings from this review are resolved on the current head: advisory locks now use a dedicated bounded pool (no handler-pool starvation), and RepoWriteGuard is #[must_use]. Dismissing so the state matches the current head; jatmn's open findings still stand.
Replace the hardcoded Tigris client with a backend-agnostic object store.
BlobStore trait (get/put/head/delete/list) with backends:
RepoArchive composes a bare repo into one repos/v1/{slug}/{repo}.tar.zst object on top of any backend (tar+zstd moved here; git/tigris.rs removed). RepoStore rewired from Option to Option.
Config (backward compatible): GITLAWB_STORAGE_BACKEND, GITLAWB_S3_BUCKET, GITLAWB_S3_ENDPOINT, GITLAWB_S3_FORCE_PATH_STYLE, GITLAWB_STORAGE_FS_DIR, GITLAWB_ASYNC_UPLOAD. GITLAWB_TIGRIS_BUCKET kept as a legacy alias.
Push-latency quick wins:
cargo check clean; storage::fs + repo_store unit tests pass.
Closes #89
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
Chores