test(jepsen): add ZSet safety workload with model-based checker#550
test(jepsen): add ZSet safety workload with model-based checker#550bootjp wants to merge 37 commits into
Conversation
Adds a Jepsen workload that goes beyond add->read visibility and verifies ZSet-specific safety properties under faults (network partitions, node kills): - score correctness: a ZRANGE result's score for a member must equal the model's latest committed score for that member, OR equal a score written by an operation that is concurrent with the read (since the client cannot linearise concurrent writes to the same member). - order preservation: ZRANGE 0 -1 must be sorted by (score asc, member lex asc). - ZRANGEBYSCORE correctness: bounded range queries return exactly the members whose score falls in the bound, modulo concurrent mutations. - no phantom members: every read member must have been introduced by some successful or in-flight ZADD/ZINCRBY. Concurrent-ZADD handling uses an invoke/complete windowing approach: mutations whose complete index < read's invoke index are committed before the read; mutations whose intervals overlap are concurrent and contribute to the per-member allowed-score set. Indeterminate (:info) mutations are treated as possibly-concurrent. Workload entry point added to jepsen_test.clj as elastickv-zset-safety-test.
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Warning Review limit reached
More reviews will be available in 21 minutes and 59 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis pull request introduces a new Redis ZSet safety Jepsen workload ( ChangesRedis ZSet Safety Workload
Sequence DiagramsequenceDiagram
participant Client
participant Redis
participant Nemesis
participant Checker
Client->>Redis: ZADD/ZINCRBY/ZREM
Redis-->>Client: Response (:ok/:info)
Note over Client: Record invoke/complete<br/>pair with type
Client->>Redis: ZRANGE-ALL/ZRANGEBYSCORE
Redis-->>Client: Member list with scores
Note over Client: Record read result
Nemesis->>Redis: Inject faults
Redis-->>Client: Degraded/timeout/error
Client-->>Client: Mark as :info (uncertain)
Nemesis->>Redis: Heal
Redis-->>Client: Recovery
Checker->>Checker: Reconstruct invoke/complete pairs
Checker->>Checker: Build mutation windows per member
Checker->>Checker: Enumerate admissible linearizations
Checker->>Checker: Validate ordering (score↑, member↑)
Checker->>Checker: Check no duplicates
Checker->>Checker: Enforce no phantom members
Checker->>Checker: Validate score correctness
Checker->>Checker: Verify completeness
Checker-->>Checker: Produce :valid? outcome
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
- jepsen-test.yml: 5s smoke run on every push, mirroring the other workloads. - jepsen-test-scheduled.yml: 150s default run (overridable via workflow_dispatch inputs) every 6 hours. Workload entry: elastickv.redis-zset-safety-workload (added in the previous commit via -main).
There was a problem hiding this comment.
Pull request overview
Adds a new Jepsen workload to validate elastickv’s Redis ZSet safety properties (scores, ordering, range correctness, and phantom detection) under faults using a custom model-based checker, and wires a helper entrypoint into the Jepsen test namespace.
Changes:
- Introduces
redis_zset_safety_workload.cljwith a Carmine-based client, randomized op generator, and a custom checker for ZSet safety properties. - Adds a new wrapper function in
jepsen_test.cljintended to expose the new workload.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| jepsen/src/elastickv/redis_zset_safety_workload.clj | New ZSet safety Jepsen workload with custom model/checker and CLI entrypoint. |
| jepsen/src/elastickv/jepsen_test.clj | Adds a wrapper function and require for the new ZSet safety test. |
| (defn elastickv-zset-safety-test | ||
| "Builds a Jepsen test map that drives elastickv's Redis ZSet safety | ||
| workload." | ||
| ([] (elastickv-zset-safety-test {})) | ||
| ([opts] | ||
| (let [nodes (or (:nodes opts) default-nodes) | ||
| redis-ports (or (:redis-ports opts) | ||
| (repeat (count nodes) (or (:redis-port opts) 6379))) | ||
| node->port (or (:node->port opts) | ||
| (cli/ports->node-map redis-ports nodes)) | ||
| local? (:local opts) | ||
| db (if local? | ||
| jdb/noop | ||
| (ekdb/db {:grpc-port (or (:grpc-port opts) 50051) | ||
| :redis-port node->port | ||
| :raft-groups (:raft-groups opts) | ||
| :shard-ranges (:shard-ranges opts)})) | ||
| rate (double (or (:rate opts) 10)) | ||
| time-limit (or (:time-limit opts) 60) | ||
| faults (if local? | ||
| [] | ||
| (cli/normalize-faults (or (:faults opts) [:partition :kill]))) | ||
| nemesis-p (when-not local? | ||
| (combined/nemesis-package {:db db | ||
| :faults faults | ||
| :interval (or (:fault-interval opts) 40)})) | ||
| nemesis-gen (if nemesis-p | ||
| (:generator nemesis-p) | ||
| (gen/once {:type :info :f :noop})) | ||
| workload (elastickv-zset-safety-workload | ||
| (assoc opts :node->port node->port))] | ||
| (merge workload | ||
| {:name (or (:name opts) "elastickv-redis-zset-safety") | ||
| :nodes nodes | ||
| :db db | ||
| :redis-host (:redis-host opts) | ||
| :os (if local? os/noop debian/os) | ||
| :net (if local? net/noop net/iptables) | ||
| :ssh (merge {:username "vagrant" | ||
| :private-key-path "/home/vagrant/.ssh/id_rsa" | ||
| :strict-host-key-checking false} | ||
| (when local? {:dummy true}) | ||
| (:ssh opts)) | ||
| :remote control/ssh | ||
| :nemesis (if nemesis-p (:nemesis nemesis-p) nemesis/noop) | ||
| :final-generator nil | ||
| :concurrency (or (:concurrency opts) 5) | ||
| :generator (->> (:generator workload) | ||
| (gen/nemesis nemesis-gen) | ||
| (gen/stagger (/ rate)) | ||
| (gen/time-limit time-limit))})))) |
There was a problem hiding this comment.
Stale. Unit tests for this workload were added in commit ecb3983 under jepsen/test/elastickv/redis_zset_safety_workload_test.clj — they cover test-spec construction (builds-test-spec, custom-options-override-defaults) plus checker edge cases: no-op ZREM, :info ZINCRBY, phantoms, stale reads, superseded scores, concurrent ZADD/ZREM, :info ZREM absence, :info ZINCRBY completeness, infinity parsing, duplicate-member flagging. Latest commit 0c0efc4 adds two more for the single-ok / multi-ZINCRBY score check.
| :zrem | ||
| (let [m (:value invoke)] | ||
| {:f :zrem :member m :score nil :zrem? true | ||
| :type t :invoke-idx inv-idx :complete-idx cmp-idx}))))) |
There was a problem hiding this comment.
Stale. ZREM now carries :removed? from the server reply into the mutation record (see completed-mutation-window lines 283-295 and apply-mutation-to-state/existence-evidence?). A no-op ZREM (returns 0) does NOT mutate model state, and :info/:pending ZREMs are treated as possibly-concurrent deletions via the concurrent window. Regression test: noop-zrem-does-not-flag-correct-read and no-op-zrem-alone-does-not-false-positive. Landed in 2a194a4 / ecb3983.
| :zincrby | ||
| (let [[m _delta] (:value invoke) | ||
| s (when (and (= :ok t) (vector? (:value complete))) | ||
| (second (:value complete)))] | ||
| {:f :zincrby :member m :score (some-> s double) | ||
| :type t :invoke-idx inv-idx :complete-idx cmp-idx}) |
There was a problem hiding this comment.
Stale. completed-mutation-window sets :unknown-score? on each :info/:pending ZINCRBY (line 277-279), and allowed-scores-for-member propagates that to the caller so check-zrange-all/check-zrangebyscore skip strict score validation when the resulting score is unknown. Regression test: info-zincrby-skips-strict-score-check, info-before-read-is-considered-uncertain. Landed in 2a194a4 / ecb3983, refined in 0c0efc4 per Codex P1.
| strictly precede `read-inv-idx`. Model maps member -> {:score s} or | ||
| marks member as :deleted. Returns {:members map :ok-members set}. | ||
| Only considers :ok mutations for the authoritative model; :info | ||
| mutations are treated as uncertain (neither strictly applied nor not)." |
There was a problem hiding this comment.
Stale. Docstring rewritten on line 323-327: "Construct authoritative per-member state from mutations whose completions strictly precede read-inv-idx. Returns {member -> {:present? bool :score s}}". The {:members map :ok-members set} wording no longer appears in the file.
| (defn elastickv-zset-safety-test [] | ||
| (zset-safety-workload/elastickv-zset-safety-test {})) | ||
|
|
||
| (defn -main | ||
| [& args] | ||
| (cli/run! (cli/single-test-cmd {:test-fn elastickv-test}) args)) |
There was a problem hiding this comment.
Stale. jepsen_test.clj now dispatches on the first positional arg via the test-fns map (lines 21-29), auto-prepending test if the user omitted the jepsen subcommand. elastickv-zset-safety-test is reachable via lein run -m elastickv.jepsen-test elastickv-zset-safety-test .... The workload namespace's own -main also works for direct invocation.
Three Copilot findings on PR #550: 1. :zincrby indeterminate handling. Pending or :info ZINCRBY left the resulting score unknown, but the checker still required the observed read score to be in the finite allowed-scores set. A read that legitimately observed an in-flight increment was flagged as a score mismatch (false positive). completed-mutation-window now sets :unknown-score? on a ZINCRBY when the completion is :info or pending. allowed-scores-for-member returns :unknown-score? when any concurrent ZINCRBY carries the flag, and check-zrange-all / check-zrangebyscore skip the strict score-membership check in that case. 2. :zrem no-op handling. ZREM of a never-added member returns 0 server-side (no-op). The previous model treated every ZREM as a deletion, producing missing-member false positives and score-mismatch false negatives. invoke! already exposes the actual removed? boolean as the second element of the completion value. completed-mutation-window now threads :removed? through, and the new apply-mutation-to-state helper leaves state unchanged when :removed? is false. 3. model-before docstring claimed it returned {:members map :ok-members set}, but it returned the model map directly. Docstring rewritten to match the actual return value. Adds jepsen/test/elastickv/redis_zset_safety_workload_test.clj covering test-spec construction, the no-op ZREM edge case, the :info ZINCRBY skip, and a positive-control score-mismatch detection. The checker tests bypass the timeline.html sub-checker (which writes to the test store) by invoking zset-safety-checker directly.
|
94be1bd で Copilot 指摘を fix:
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
…re-mismatch CI failure on 94be1bd surfaced a remaining checker false positive: a read that observes a member whose only prior ops are no-op ZREMs was classified as :score-mismatch with :allowed #{} rather than treated as a never-existed member. allowed-scores-for-member returned any-known? = true because (seq concurrent) was truthy even when every concurrent op was a :zrem that didn't actually remove anything. Fix: any-known? (renamed internally to existence-evidence?) now only counts concurrent mutations that provide evidence the member ever existed -- :zadd, :zincrby, or a :zrem whose :removed? boolean is true. A concurrent no-op ZREM contributes nothing. Adds no-op-zrem-alone-does-not-false-positive as a regression test. All 6 workload unit tests pass under Java 21.
|
9bfcc13 で CI 失敗原因の checker false positive を追加 fix:
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
jepsen/test/elastickv/redis_zset_safety_workload_test.clj (1)
42-89: Add a direct:zrangebyscorechecker regression.The checker has a separate bounded-range path, but these edge-case tests only exercise
:zrange-all. A small missing-member or out-of-range regression would protect the advertised ZRANGEBYSCORE property.Example test to cover bounded-range completeness
+(deftest zrangebyscore-missing-member-is-detected + (let [history [{:type :invoke :process 0 :f :zadd :value ["m1" 5] :index 0} + {:type :ok :process 0 :f :zadd :value ["m1" 5] :index 1} + {:type :invoke :process 0 :f :zrangebyscore :value [0 10] :index 2} + {:type :ok :process 0 :f :zrangebyscore + :value {:bounds [0 10] :members []} + :index 3}] + result (run-checker history)] + (is (not (:valid? result)) (str "expected range mismatch, got: " result))))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jepsen/test/elastickv/redis_zset_safety_workload_test.clj` around lines 42 - 89, Add a new test that exercises the bounded-range code path by invoking :zrangebyscore instead of :zrange-all so the checker’s ZRANGEBYSCORE logic is covered; create a test (e.g. noop-zrem-does-not-flag-bounded-zrangebyscore or similar) that mirrors one of the existing edge-case histories but uses :zrangebyscore (with appropriate score bounds/values) and asserts run-checker returns :valid? or not as expected, ensuring you reference the same run-checker invocation and operation symbols (:zadd, :zrem, :zincrby, :zrangebyscore) so the new test hits the bounded-range branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@jepsen/src/elastickv/redis_zset_safety_workload.clj`:
- Around line 386-416: The read result must be checked for duplicate members to
prevent a ZSet read returning the same member twice; update check-zrange-all
(and the similar function handling ranges around lines 430-464) to detect
duplicate member entries before per-entry validation by scanning entries and
maintaining a seen-members set, and if a member is already seen swap! the errors
atom with a {:kind :duplicate :index cmp-idx :member member :entries entries}
(or similar) error entry so duplicate-member reads are reported and rejected.
- Around line 313-318: The current logic builds applied as (->> muts (filter ...
(< (:complete-idx %) read-inv-idx)) (sort-by :complete-idx)) and then reduces
via apply-mutation-to-state, which incorrectly linearizes overlapping :ok
mutations by completion time; change this to detect overlapping committed
mutations (using their :invoke-idx and :complete-idx intervals from muts) and do
not sort/serialize ambiguous pairs by :complete-idx. Instead compute either the
set of all possible latest states for the read by merging non-overlapping
mutations deterministically and treating overlapping/conflicting writes
conservatively (e.g., allow values from any write whose interval is not ordered
before the read), and update the applied construction and reduction via
apply-mutation-to-state to use that conservative/possible-states approach; apply
the same fix at the other occurrence around lines 347-352.
---
Nitpick comments:
In `@jepsen/test/elastickv/redis_zset_safety_workload_test.clj`:
- Around line 42-89: Add a new test that exercises the bounded-range code path
by invoking :zrangebyscore instead of :zrange-all so the checker’s ZRANGEBYSCORE
logic is covered; create a test (e.g.
noop-zrem-does-not-flag-bounded-zrangebyscore or similar) that mirrors one of
the existing edge-case histories but uses :zrangebyscore (with appropriate score
bounds/values) and asserts run-checker returns :valid? or not as expected,
ensuring you reference the same run-checker invocation and operation symbols
(:zadd, :zrem, :zincrby, :zrangebyscore) so the new test hits the bounded-range
branch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e12c0d39-43a2-4d57-a8a3-f49ade75baba
📒 Files selected for processing (5)
.github/workflows/jepsen-test-scheduled.yml.github/workflows/jepsen-test.ymljepsen/src/elastickv/jepsen_test.cljjepsen/src/elastickv/redis_zset_safety_workload.cljjepsen/test/elastickv/redis_zset_safety_workload_test.clj
Three Major-severity CodeRabbit findings on PR #550: 1. Duplicate-member detection (line 416): a ZSet read must return each member at most once. Previously, if ZRANGE returned the same member twice with an allowed score, the checker accepted it because sort and score-membership checks passed independently per entry. duplicate-members helper now flags :duplicate-members (and :duplicate-members-range for ZRANGEBYSCORE) before the per-entry loop. 2. Overlapping committed writes (line 318): two :ok mutations whose invoke/complete windows overlap have ambiguous serialization order. Pinning allowed-scores to a single last-wins linearization by :complete-idx was unsound. allowed-scores-for-member now unions all :zadd/:ok-:zincrby scores from committed mutations (over-approx that stays sound), and must-be-present? is relaxed when any pair of committed writes for the same member overlaps in time. 3. Pre-read :info mutations (line 328): a mutation recorded as :info whose completion precedes a later read's invoke may have taken effect server-side. Previously it was ignored by both model-before (:ok only) and the concurrent window (complete-idx >= read-inv-idx required). Now collected as pre-read-info, contributing to allowed scores and flipping unknown-score? for :zincrby with unknown resulting score. 3 new regression tests (duplicate-members-are-flagged, overlapping- committed-zadds-allow-either-score, info-before-read-is-considered- uncertain). Workload unit test count now 9, all PASS under Java 21.
|
6d0b4c3 で CodeRabbit Major 3 件を fix:
regression test 3 件追加 ( /gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new Jepsen workload and model-based checker to verify Redis ZSet safety properties in elastickv. It includes a client for Redis operations, a generator for high-contention workloads, and a checker that validates score correctness, ordering, and atomicity. Feedback focuses on handling infinite scores in Redis to prevent parsing exceptions, improving the soundness of the linearizability model in allowed-scores-for-member to prevent stale reads, and optimizing the efficiency of mutation checks.
| (defn- parse-withscores | ||
| "Carmine returns a flat [member score member score ...] vector for | ||
| ZRANGE WITHSCORES. Convert to a sorted vector of [member (double score)] | ||
| preserving server-returned order (score ascending, then member)." | ||
| [flat] | ||
| (->> flat | ||
| (partition 2) | ||
| (mapv (fn [[m s]] | ||
| [(if (bytes? m) (String. ^bytes m) (str m)) | ||
| (Double/parseDouble (str s))])))) |
There was a problem hiding this comment.
The use of Double/parseDouble directly on Redis score strings will fail if the score is infinite. Redis returns "inf", "+inf", or "-inf" for infinite values, but Java's Double/parseDouble expects "Infinity" or "-Infinity". This will cause a NumberFormatException during the test if infinite scores are encountered.
(defn- parse-double-safe
"Parses a Redis score string into a Double, handling 'inf' representations."
[s]
(let [s-str (str/lower-case (str s))]
(cond
(str/includes? s-str "inf") (if (str/starts-with? s-str "-")
Double/NEGATIVE_INFINITY
Double/POSITIVE_INFINITY)
:else (Double/parseDouble (str s)))))
(defn- parse-withscores
"Carmine returns a flat [member score member score ...] vector for
ZRANGE WITHSCORES. Convert to a sorted vector of [member (double score)]
preserving server-returned order (score ascending, then member)."
[flat]
(->> flat
(partition 2)
(mapv (fn [[m s]]
[(if (bytes? m) (String. ^bytes m) (str m))
(parse-double-safe s)]))))
There was a problem hiding this comment.
Stale. parse-double-safe (lines 71-87) normalizes inf / +inf / -inf (and JVM-style Infinity) before falling through to Double/parseDouble. parse-withscores calls it on every returned score. Regression test: parse-withscores-handles-inf-strings. Landed in ecb3983.
| (let [[member delta] (:value op) | ||
| new-score (car/wcar cs (car/zincrby zset-key (double delta) member))] | ||
| (assoc op :type :ok | ||
| :value [member (Double/parseDouble (str new-score))])) |
There was a problem hiding this comment.
Stale. The :zincrby branch of invoke! already calls parse-double-safe on the server-returned score (line 134). Covered by parse-withscores-handles-inf-strings plus the infinity branch in parse-double-safe. Landed in ecb3983.
| (defn- allowed-scores-for-member | ||
| "Compute the set of scores considered valid for `member` by a read | ||
| whose window is [read-inv-idx, read-cmp-idx], based on committed state | ||
| and any concurrent mutations. | ||
|
|
||
| Returns: | ||
| :scores - set of acceptable scores (committed + concurrent | ||
| :zadd / :ok :zincrby). | ||
| :unknown-score? - true iff any concurrent ZINCRBY's resulting score | ||
| is unknown (in-flight or :info). When set, the | ||
| caller MUST skip the strict score-membership | ||
| check to stay sound. | ||
| :must-be-present? - committed state says present and no concurrent | ||
| mutation could have removed/changed it. | ||
| :any-known? - some op claims to have touched this member." | ||
| [mutations-by-m member read-inv-idx read-cmp-idx] | ||
| (let [muts (get mutations-by-m member []) | ||
| ;; :ok mutations that completed strictly before the read. They | ||
| ;; may have overlapped with each other in wall-clock time, so | ||
| ;; the serialization order among them is ambiguous. | ||
| committed (->> muts | ||
| (filter #(and (= :ok (:type %)) | ||
| (some? (:complete-idx %)) | ||
| (< (:complete-idx %) read-inv-idx)))) | ||
| ;; :info mutations that completed before the read: they may or | ||
| ;; may not have taken effect server-side. We must account for | ||
| ;; their possible scores just like concurrent ones. | ||
| pre-read-info (->> muts | ||
| (filter #(and (= :info (:type %)) | ||
| (some? (:complete-idx %)) | ||
| (< (:complete-idx %) read-inv-idx)))) | ||
| ;; Concurrent mutations: windows overlap the read. Include both | ||
| ;; :ok and :info since either may have taken effect. | ||
| concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx) | ||
| ;; A conservative last-wins linearization for the must-be-present? | ||
| ;; check only. Ambiguous when committed writes overlap each other. | ||
| committed-sorted (sort-by :complete-idx committed) | ||
| committed-state (reduce apply-mutation-to-state nil committed-sorted) | ||
| committed-overlap? (boolean | ||
| (some (fn [[a b]] | ||
| (and (not (identical? a b)) | ||
| (<= (:invoke-idx a) (:complete-idx b)) | ||
| (<= (:invoke-idx b) (:complete-idx a)))) | ||
| (for [a committed, b committed] [a b]))) | ||
| ;; Union of every score that any committed / pre-read :info / | ||
| ;; concurrent op could have produced. This over-approximates the | ||
| ;; legitimate post-state set when writes overlap, keeping the | ||
| ;; checker sound at the cost of being slightly less strict on | ||
| ;; overlapping concurrent writers. | ||
| add-scores (fn [acc m] | ||
| (case (:f m) | ||
| :zadd (conj acc (:score m)) | ||
| :zincrby (cond-> acc (some? (:score m)) (conj (:score m))) | ||
| :zrem acc)) | ||
| scores (as-> #{} s | ||
| (reduce add-scores s committed) | ||
| (reduce add-scores s pre-read-info) | ||
| (reduce add-scores s concurrent)) | ||
| unknown-score? (or | ||
| (some #(and (= :zincrby (:f %)) (:unknown-score? %)) | ||
| concurrent) | ||
| (some #(and (= :zincrby (:f %)) (:unknown-score? %)) | ||
| pre-read-info)) | ||
| ;; any-known? must only be true when something provides evidence | ||
| ;; the member actually existed at some point. A no-op ZREM | ||
| ;; (:removed? false) does NOT prove existence. | ||
| existence-evidence? (or (some #(case (:f %) | ||
| (:zadd :zincrby) true | ||
| :zrem (:removed? %)) | ||
| committed) | ||
| (some #(case (:f %) | ||
| (:zadd :zincrby) true | ||
| :zrem (:removed? %)) | ||
| pre-read-info) | ||
| (some #(case (:f %) | ||
| (:zadd :zincrby) true | ||
| :zrem (:removed? %)) | ||
| concurrent))] | ||
| {:scores scores | ||
| :unknown-score? (boolean unknown-score?) | ||
| ;; must-be-present? is relaxed when committed writes overlap | ||
| ;; among themselves or when any :info / concurrent mutation could | ||
| ;; have removed the member before the read. | ||
| :must-be-present? (boolean (and committed-state | ||
| (:present? committed-state) | ||
| (not committed-overlap?) | ||
| (empty? pre-read-info) | ||
| (empty? concurrent))) | ||
| :any-known? (boolean existence-evidence?)})) |
There was a problem hiding this comment.
The current implementation of allowed-scores-for-member is too permissive and potentially unsound for linearizability checks.
-
Stale Reads: It includes all successful committed scores in the
scoresset (line 384), even if they were strictly followed by a later committed mutation. In a linearizable system, a read should only observe scores from the "potentially latest" committed writes (those not strictly followed by another) or concurrent ones. -
Stale Presence: The checker does not verify if a member is allowed to be present. If a member was added and then deleted (both committed before the read),
any-known?remains true, and the read will accept the member even though it should be absent. -
Efficiency: The
committed-overlap?check (line 368) is$O(N^2)$ relative to the number of mutations for a member. While the member universe is small, this can be optimized to$O(N)$ by tracking the maximuminvoke-idxof preceding mutations.
(defn- allowed-scores-for-member
"Compute the set of scores considered valid for `member` by a read
whose window is [read-inv-idx, read-cmp-idx], based on committed state
and any concurrent mutations."
[mutations-by-m member read-inv-idx read-cmp-idx]
(let [muts (get mutations-by-m member [])
;; Mutations that completed strictly before the read started.
preceding (filter #(and (= :ok (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))
muts)
;; A mutation is a candidate for the 'latest' state if no other
;; committed mutation strictly follows it (m2.invoke > m.complete).
max-inv (reduce max -1 (map :invoke-idx preceding))
candidates (filter #(>= (:complete-idx %) max-inv) preceding)
concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx)
pre-read-info (filter #(and (= :info (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))
muts)
;; Scores allowed by committed state or uncertain ops.
scores (into (set (keep :score candidates))
(concat (keep :score concurrent)
(keep :score pre-read-info)))
unknown-score? (some #(and (= :zincrby (:f %)) (:unknown-score? %))
(concat concurrent pre-read-info))
;; Allowed to be present if any candidate or concurrent/info op is a write.
can-be-present? (or (some #(#{:zadd :zincrby} (:f %)) candidates)
(some #(#{:zadd :zincrby} (:f %)) concurrent)
(some #(#{:zadd :zincrby} (:f %)) pre-read-info))
;; Must be present if all candidates are writes and no concurrent/info op can delete.
must-be-present? (and (seq candidates)
(every? #(#{:zadd :zincrby} (:f %)) candidates)
(not (some #(= :zrem (:f %)) concurrent))
(not (some #(= :zrem (:f %)) pre-read-info)))]
{:scores scores
:unknown-score? (boolean unknown-score?)
:must-be-present? must-be-present?
:can-be-present? can-be-present?}))
There was a problem hiding this comment.
Stale. allowed-scores-for-member now applies a real-time last-wins max-inv filter to committed mutations (lines 408-411): a preceding mutation is admissible only when no other preceding mutation's invoke strictly follows it. Superseded scores are excluded from :scores. Presence-under-stale-reads is caught by can-be-present? (lines 466-469), which is checked in both check-zrange-all (line 529) and check-zrangebyscore (line 590). Regression tests: superseded-committed-score-is-not-allowed, stale-read-after-committed-zrem-is-flagged, phantom-member-is-flagged. Landed in 2a194a4.
| (let [{:keys [scores any-known? unknown-score?]} | ||
| (allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)] | ||
| (cond | ||
| (not any-known?) | ||
| (swap! errors conj {:kind :phantom | ||
| :index cmp-idx | ||
| :member member | ||
| :score score}) |
There was a problem hiding this comment.
The phantom check should be replaced with a more robust can-be-present? check to catch both phantoms (never existed) and stale reads (present after a committed deletion).
(doseq [[member score] entries]
(let [{:keys [scores can-be-present? unknown-score?]}
(allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
(cond
(not can-be-present?)
(swap! errors conj {:kind :unexpected-presence
:index cmp-idx
:member member
:score score})
There was a problem hiding this comment.
Stale. check-zrange-all (line 529) and check-zrangebyscore (line 590) both use (not can-be-present?) to emit :unexpected-presence / :unexpected-presence-range. can-be-present? (returned from allowed-scores-for-member) covers both phantoms (no existence evidence anywhere) and stale reads (committed-removed before the read). Regression tests: phantom-member-is-flagged, stale-read-after-committed-zrem-is-flagged. Landed in 2a194a4.
…o ops Address correctness issues in the Redis ZSet safety workload raised in PR #550 review: - parse-double-safe: Redis emits "inf"/"+inf"/"-inf" for infinite ZSET scores, which Double/parseDouble rejects. Route parse-withscores and the ZINCRBY reply through a tolerant parser that maps those to +/- Double/POSITIVE_INFINITY. - allowed-scores-for-member: tighten committed-score admissibility to real-time "candidates" (preceding :ok mutations not strictly followed in real time by another preceding :ok mutation). Superseded committed scores are no longer admissible, closing a stale-read soundness gap. - can-be-present?: replace the phantom-only check with a presence check that rejects both phantoms (never existed) and stale reads (member committed-removed before the read with no concurrent re-add). - CLI dispatch: -main now selects a workload by the first positional arg (elastickv-test, elastickv-zset-safety-test, elastickv-dynamodb- test, elastickv-s3-test) and auto-prepends the jepsen "test" subcommand when missing, matching the invocation documented in the PR description. - Tests: cover phantom, stale-read-after-committed-zrem, superseded committed score, and parse-withscores infinity handling.
|
Addressed all 9 review items in commit gemini HIGH (4)
Copilot (5)
Validation
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new Jepsen workload, elastickv-zset-safety-test, designed to verify the safety properties of Redis ZSet implementations under faults, including score correctness, ordering, and atomicity. It adds a custom model-based checker and corresponding unit tests. Feedback focuses on refining the allowed-scores-for-member logic to better handle concurrent operations and uncertain states, specifically addressing potential false positives in presence detection and improving the completeness checks for ZRANGE and ZRANGEBYSCORE operations.
| (let [muts (get mutations-by-m member []) | ||
| ;; :ok mutations that completed strictly before the read. | ||
| preceding (->> muts | ||
| (filter #(and (= :ok (:type %)) | ||
| (some? (:complete-idx %)) | ||
| (< (:complete-idx %) read-inv-idx)))) | ||
| ;; Real-time "last-wins" candidate filter: a preceding mutation | ||
| ;; m is admissible iff no OTHER preceding mutation m' has | ||
| ;; m'.invoke-idx > m.complete-idx (i.e. m' strictly follows m). | ||
| ;; Equivalent: m.complete-idx >= max(invoke-idx) over preceding. | ||
| max-inv (reduce max -1 (map :invoke-idx preceding)) | ||
| candidates (filterv #(>= (:complete-idx %) max-inv) preceding) | ||
| ;; :info mutations that completed before the read: they may or | ||
| ;; may not have taken effect server-side. | ||
| pre-read-info (->> muts | ||
| (filter #(and (= :info (:type %)) | ||
| (some? (:complete-idx %)) | ||
| (< (:complete-idx %) read-inv-idx)))) | ||
| ;; Concurrent mutations: windows overlap the read. Include both | ||
| ;; :ok and :info since either may have taken effect. | ||
| concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx) | ||
|
|
||
| add-scores (fn [acc m] | ||
| (case (:f m) | ||
| :zadd (conj acc (:score m)) | ||
| :zincrby (cond-> acc (some? (:score m)) (conj (:score m))) | ||
| :zrem acc)) | ||
| ;; Admissible scores: candidate committed + pre-read :info + | ||
| ;; concurrent writes (with a known score). | ||
| scores (as-> #{} s | ||
| (reduce add-scores s candidates) | ||
| (reduce add-scores s pre-read-info) | ||
| (reduce add-scores s concurrent)) | ||
|
|
||
| has-unknown-incr? (fn [coll] | ||
| (some #(and (= :zincrby (:f %)) | ||
| (:unknown-score? %)) | ||
| coll)) | ||
| unknown-score? (or (has-unknown-incr? concurrent) | ||
| (has-unknown-incr? pre-read-info)) | ||
|
|
||
| ;; Did any candidate commit establish presence (write, or | ||
| ;; ZREM with :removed? -- either way the member existed)? | ||
| candidate-state (reduce apply-mutation-to-state nil | ||
| (sort-by :complete-idx candidates)) | ||
| candidate-present? (boolean (:present? candidate-state)) | ||
|
|
||
| any-concurrent-could-write? (or (some write-op? concurrent) | ||
| (some write-op? pre-read-info)) | ||
| any-concurrent-could-remove? (or (some #(= :zrem (:f %)) concurrent) | ||
| (some #(= :zrem (:f %)) pre-read-info)) | ||
|
|
||
| can-be-present? (or candidate-present? | ||
| any-concurrent-could-write? | ||
| ;; A :zrem with :removed? true still proves | ||
| ;; existence; if a concurrent ZREM raced | ||
| ;; with an earlier write whose window is | ||
| ;; not captured as a candidate, presence is | ||
| ;; uncertain rather than forbidden. | ||
| (and (some existence-evidence? (concat concurrent | ||
| pre-read-info)) | ||
| any-concurrent-could-remove?)) | ||
|
|
||
| must-be-present? (boolean (and candidate-present? | ||
| (empty? pre-read-info) | ||
| (empty? concurrent)))] | ||
| {:scores scores | ||
| :unknown-score? (boolean unknown-score?) | ||
| :can-be-present? (boolean can-be-present?) | ||
| :must-be-present? must-be-present?})) |
There was a problem hiding this comment.
The logic in allowed-scores-for-member has several correctness issues that will lead to false positives and false negatives in safety checking:
- False Positives (Strictness):
candidate-present?relies oncomplete-idxorder to determine presence. In a linearizable system, concurrent committed operations can be serialized in any order. If aZADDandZREMare concurrent but both finish before the read, the read may legitimately observe either state. The current logic flags one of these as an error. - False Negatives (Completeness):
must-be-present?is too strict. It returnsfalseif any uncertain operation exists, even if that operation is aZADDwhich cannot remove the member. It should only befalseif an uncertain or concurrent removal exists. - ZINCRBY Score Set: For multiple concurrent
ZINCRBYoperations, the set of allowed scores should include all possible prefix sums of the increments. The current logic only includes the final score of each increment, which is too restrictive.
The suggested fix relaxes can-be-present? to allow any valid serialization of candidates and uncertain operations, and refines must-be-present? to correctly identify when a member is required to appear.
(let [muts (get mutations-by-m member [])
;; :ok mutations that completed strictly before the read.
preceding (->> muts
(filter #(and (= :ok (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))))
;; Real-time "last-wins" candidate filter: a preceding mutation
;; m is admissible iff no OTHER preceding mutation m' has
;; m'.invoke-idx > m.complete-idx (i.e. m' strictly follows m).
max-inv (reduce max -1 (map :invoke-idx preceding))
candidates (filterv #(>= (:complete-idx %) max-inv) preceding)
;; :info mutations that completed before the read: they may or
;; may not have taken effect server-side.
pre-read-info (->> muts
(filter #(and (= :info (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))))
;; Concurrent mutations: windows overlap the read. Include both
;; :ok and :info since either may have taken effect.
concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx)
all-uncertain (concat pre-read-info concurrent)
add-scores (fn [acc m]
(case (:f m)
:zadd (conj acc (:score m))
:zincrby (cond-> acc (some? (:score m)) (conj (:score m)))
:zrem acc))
;; Admissible scores: candidate committed + pre-read :info +
;; concurrent writes (with a known score).
scores (as-> #{} s
(reduce add-scores s candidates)
(reduce add-scores s all-uncertain))
;; Relax score check if any ZINCRBY is concurrent or uncertain.
;; Multiple concurrent increments can produce intermediate sums not in `scores`.
unknown-score? (some #(= :zincrby (:f %)) all-uncertain)
;; can-be-present?: true if any candidate or uncertain op results in presence.
can-be-present? (or (some write-op? candidates)
(some write-op? all-uncertain)
(some existence-evidence? (concat candidates all-uncertain)))
;; must-be-present?: true if ALL valid serializations result in presence.
must-be-present? (and (some write-op? candidates)
(not (some #(= :zrem (:f %)) candidates))
(not (some #(= :zrem (:f %)) all-uncertain)))]
{:scores scores
:unknown-score? (boolean unknown-score?)
:can-be-present? (boolean can-be-present?)
:must-be-present? must-be-present?})
There was a problem hiding this comment.
Partially stale, partially rebutted.
False Positives (Strictness) — addressed in 2a194a4. Overlapping-window committed ops are both admissible: max-inv = max of preceding :invoke-idx; candidates = preceding ops with :complete-idx >= max-inv. When multiple candidates remain their windows overlap, so any can serialize last, and candidate-can-be-present? / candidate-can-be-absent? both consider them. Regression tests: concurrent-zadd-zrem-both-completed-accepts-either-outcome (both [["m1" 1.0]] and [] accepted), overlapping-committed-zadds-allow-either-score.
False Negatives (Completeness on must-be-present?) — I don't see this in the code. must-be-present? (lines 475-477) guards on any-candidate-write?, (not candidate-can-be-absent?), and (not any-uncertain-zrem?). Uncertain ZADDs do NOT force it to false; only uncertain ZREMs do, which is correct (a ZREM could reorder last and remove the member). Could you point to the exact path that triggers the false negative?
| (let [model (model-before mutations-by-m inv-idx) | ||
| observed-members (into #{} (map first) entries)] | ||
| (doseq [[member {:keys [present?]}] model] | ||
| (when (and present? (not (contains? observed-members member))) | ||
| (let [muts (get mutations-by-m member []) | ||
| concurrent (concurrent-mutations-for-member muts inv-idx cmp-idx)] | ||
| (when (empty? concurrent) | ||
| (swap! errors conj {:kind :missing-member | ||
| :index cmp-idx | ||
| :member member})))))) | ||
| @errors)) |
There was a problem hiding this comment.
The completeness check for ZRANGE is currently too simple and doesn't account for :info operations that might have removed a member before the read. It should use the must-be-present? logic from allowed-scores-for-member which correctly handles uncertain removals.
(let [model (model-before mutations-by-m inv-idx)
observed-members (into #{} (map first) entries)]
(doseq [[member _] model]
(let [{:keys [must-be-present?]} (allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
(when (and must-be-present? (not (contains? observed-members member)))
(swap! errors conj {:kind :missing-member
:index cmp-idx
:member member}))))
@errors))
There was a problem hiding this comment.
Stale. check-zrange-all completeness (lines 549-558) already uses must-be-present? from allowed-scores-for-member, which correctly skips members that an :info or concurrent ZREM might have removed. Regression test: info-zrem-concurrent-with-read-allows-missing-member. Landed in 2a194a4.
| ;; Completeness within bounds: any model member whose committed score | ||
| ;; is in [lo,hi] with no concurrent mutation must appear. | ||
| (let [model (model-before mutations-by-m inv-idx) | ||
| observed-members (into #{} (map first) members)] | ||
| (doseq [[member {:keys [present? score]}] model] | ||
| (when (and present? | ||
| (<= lo score hi) | ||
| (not (contains? observed-members member))) | ||
| (let [muts (get mutations-by-m member []) | ||
| concurrent (concurrent-mutations-for-member muts inv-idx cmp-idx)] | ||
| (when (empty? concurrent) | ||
| (swap! errors conj {:kind :missing-member-range | ||
| :index cmp-idx | ||
| :bounds bounds | ||
| :member member | ||
| :expected-score score})))))) | ||
| @errors)) |
There was a problem hiding this comment.
The completeness check for ZRANGEBYSCORE has the same issue as ZRANGE: it doesn't account for uncertain removals. Additionally, it should only enforce completeness if the score is definitively known to be within the requested range (i.e., no uncertain increments).
;; Completeness within bounds: any model member whose committed score
;; is in [lo,hi] with no concurrent mutation must appear.
(let [model (model-before mutations-by-m inv-idx)
observed-members (into #{} (map first) members)]
(doseq [[member _] model]
(let [{:keys [must-be-present? scores unknown-score?]} (allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
(when (and must-be-present?
(not unknown-score?)
(every? #(<= lo % hi) scores)
(not (contains? observed-members member)))
(swap! errors conj {:kind :missing-member-range
:index cmp-idx
:bounds bounds
:member member
:expected-score (first scores)}))))
@errors))
There was a problem hiding this comment.
Stale. check-zrangebyscore completeness (lines 610-623) uses both must-be-present? AND score-definitely-in-range? (which requires (not unknown-score?) and every candidate score inside [lo, hi]). Uncertain ZINCRBYs → unknown-score? → no completeness failure. Regression test: info-zincrby-does-not-flag-zrangebyscore-completeness. Landed in 2a194a4.
…ations in checker - allowed-scores-for-member: replace complete-idx-ordered candidate-present? with linearization-aware semantics. When multiple candidates share overlapping windows, presence is allowed iff some linearization keeps the member present; presence is required only when every linearization does. Uncertain ZREMs (pre-read :info / concurrent) now correctly relax must-be-present?. - check-zrange-all / check-zrangebyscore: delegate completeness to must-be-present? so :info or concurrent ZREMs don't false-positive missing-member. - check-zrangebyscore: gate completeness on a new score-definitely-in-range? helper so uncertain ZINCRBY (unknown resulting score) can't trigger false :missing-member-range. - Tests: concurrent ZADD+ZREM accepts either outcome; :info ZREM allows absent read; :info ZINCRBY with out-of-range guess does not flag completeness; sanity tests still flag truly-missing members.
|
Addressed 3 new gemini HIGH items (commit 2a194a4): 1.
This correctly accepts reads observing EITHER outcome when a ZADD and ZREM both commit concurrently, and correctly flags superseded-score / stale reads / phantom members. 2. 3. New test coverage (
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
@codex code review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 49c5e0cea1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| unknown-score? (or (has-unknown-incr? uncertain) | ||
| (some #(= :zincrby (:f %)) uncertain)) |
There was a problem hiding this comment.
Keep score validation enabled for known concurrent ZINCRBY
allowed-scores-for-member marks :unknown-score? true whenever any uncertain mutation is :zincrby, even when that op completed :ok with a known resulting score. Both check-zrange-all and check-zrangebyscore then skip score-membership validation when :unknown-score? is true, so a read concurrent with a known ZINCRBY can return an impossible score (e.g., 999.0) without being flagged. This creates false negatives in the new safety checker and weakens the core score-correctness guarantee under concurrency.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in 0c0efc4. unknown-score? is now set only when an individual ZINCRBY's own score is unknown (:info/:pending — the :unknown-score? flag set in completed-mutation-window), or when there are >=2 concurrent/uncertain ZINCRBYs where prefix-sum intermediates can occur. A single :ok concurrent ZINCRBY no longer waves through impossible scores. Regression test: single-ok-concurrent-zincrby-still-validates-scores (read returning 999.0 alongside one :ok ZINCRBY is flagged :score-mismatch).
… known
Codex P1: `allowed-scores-for-member` was marking `:unknown-score?`
true whenever ANY uncertain mutation was `:zincrby` — including
completed `:ok` ops whose resulting score is fully recovered from the
server reply. That over-relaxed the check in `check-zrange-all` /
`check-zrangebyscore`, letting a read concurrent with a single known
ZINCRBY return an arbitrary impossible score without being flagged.
Refine the relaxation rule so it only fires when the score is
genuinely unrecoverable:
* `:unknown-score?` is set on an individual ZINCRBY only when its
own result is unknown (:info/:pending), OR
* there are >=2 concurrent/uncertain ZINCRBYs whose relative
serialization produces prefix-sum intermediates not present in
`:scores`.
With a single :ok concurrent ZINCRBY the read can observe either the
pre-op score (in `:scores` via candidate committed ops) or the post-op
score (also in `:scores`), so the strict score-membership check stays
sound.
Add two regression tests:
* `single-ok-concurrent-zincrby-still-validates-scores` flags a
read observing 999.0 while a single :ok ZINCRBY is concurrent.
* `two-concurrent-zincrbys-relax-score-check` accepts the
intermediate prefix-sum score under multi-ZINCRBY overlap.
Some Carmine versions / RESP3 codepaths surface ZREM's reply as a
numeric string ("1") or raw bytes rather than a Long. The previous
(long (or removed 0)) guard handled nil but still threw
ClassCastException on string/bytes, falling through to the general
Exception handler and masking the real signal.
Add coerce-zrem-count helper (parallel to coerce-zincrby-score) that
accepts Number / String / bytes / nil / other and returns a
non-negative long, with unparseable values treated as 0 so the op
stays :ok.
Covers regression with tests exercising string "1", string "0", bytes,
and unexpected keyword replies.
:missing-member-range previously reported :expected-score (first scores), which is misleading when concurrent writers leave multiple admissible scores -- picking an arbitrary element hides the other valid linearizations from consumers reading the error map. Add :allowed scores to the error map (matching the sibling :score-mismatch-range convention). :expected-score is retained for backward compatibility but is only populated when the admissible set has exactly one element; with >1 admissible scores it is nil, forcing consumers to look at :allowed instead of silently reading a half-truth. Covers the new error shape with tests for both the multi-score (several concurrent writers) and single-score (sanity / backcompat) cases.
|
Addressed both gemini medium items.
Local /gemini review |
…rkers Addresses gemini medium on 559e83d. The `invoke!` `(catch Exception e)` won't catch `java.lang.Error` subclasses (NoClassDefFoundError, OutOfMemoryError, LinkageError, ...). In Jepsen those normally propagate, crash the worker thread, and can hang the run or leave history gaps that mask the underlying failure. Widen to `(catch Throwable t)`, matching the pattern already used one call site up in zincrby's coerce branch at line 224. The op is still recorded as :info with a descriptive :error field, so the history shows exactly which operation hit the Error and why. Tests: 47 tests / 83 assertions, 0 failures.
|
Addressed gemini medium (catch The Widened to (catch Throwable t
(warn t (str "ZSet safety op failed: " (:f op)))
(assoc op :type :info :error (or (.getMessage ^Throwable t) (str t))))Tests: 47 tests / 83 assertions, 0 failures. /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new Jepsen workload, elastickv-zset-safety-workload, designed to verify safety properties of Redis ZSet implementations under faults. The workload includes a custom model-based checker that validates score correctness, ordering, range queries, and atomicity. Additionally, the PR updates the main Jepsen test entry point to support workload dispatching and adds comprehensive unit tests for the new workload's client and checker logic. I have no feedback to provide as there are no review comments.
|
Codex Review: Didn't find any major issues. Breezy! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Code Review
This pull request introduces a new Jepsen workload, redis-zset-safety-workload, designed to verify advanced safety properties of Redis ZSet implementations, such as score correctness, ordering, and atomicity under faults. It includes a custom model-based checker and comprehensive unit tests for the workload's logic. Feedback was provided regarding the handling of Redis bulk strings returned as byte arrays in parse-double-safe and coerce-zincrby-score to ensure correct parsing of scores.
| (let [raw (str s) | ||
| lower (str/lower-case raw)] |
There was a problem hiding this comment.
In Clojure, calling (str s) on a byte array returns the object's string representation (e.g., "[B@..."), which is not the intended behavior when parsing numeric scores from Redis. Since Redis scores can be returned as bulk strings (which Carmine may surface as byte arrays depending on the RESP version or configuration), they should be explicitly converted to a UTF-8 string before parsing.
| (let [raw (str s) | |
| lower (str/lower-case raw)] | |
| (let [raw (if (bytes? s) (String. ^bytes s "UTF-8") (str s)) | |
| lower (str/lower-case raw)] |
| (number? response) | ||
| [:ok (double response)] | ||
|
|
||
| (string? response) |
There was a problem hiding this comment.
Carmine may return the result of a ZINCRBY operation as a byte array (bulk string). The current implementation only handles string? and number?, which causes byte array responses to fall through to the :else case and be treated as :unexpected (resulting in an :info operation type). Adding bytes? support ensures these operations are correctly parsed and recorded as :ok.
| (string? response) | |
| (or (string? response) (bytes? response)) |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d03672e3d4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| any-candidate-zrem? (some #(= :zrem (:f %)) candidates) | ||
| any-uncertain-write? (some write-op? uncertain) | ||
| any-uncertain-zrem? (some #(= :zrem (:f %)) uncertain) |
There was a problem hiding this comment.
Ignore no-op ZREM when deriving possible absence
any-candidate-zrem?/any-uncertain-zrem? currently classify every :zrem as a possible deletion, even when the op completed :ok with removed? = false (a confirmed no-op). That relaxes must-be-present? incorrectly and can let impossible empty reads pass: e.g., overlapping ZADD(m1) and ZREM(m1)->0 both complete before a read, and the checker accepts a read that omits m1 even though ZREM’s return value constrains any valid linearization to end with m1 present. This creates false negatives in the safety checker.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29e62ca83f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pre-read-info (->> muts | ||
| (filter #(and (= :info (:type %)) | ||
| (some? (:complete-idx %)) | ||
| (< (:complete-idx %) read-inv-idx)))) |
There was a problem hiding this comment.
Exclude superseded pre-read :info ops from uncertainty
Treating every pre-read :info mutation as uncertain here causes false negatives when a later committed write to the same member exists before the read. For example, an :info :zincrby that completed before a subsequent :ok :zadd is still added to uncertain, which can set :unknown-score? (or admit stale scores) and let impossible read values pass even though the later committed write should deterministically overwrite any earlier effect. The same pattern can also relax required-presence checks via stale :info :zrem entries.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 03ce992af2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| (some (fn [n] | ||
| (and (not (identical? n m)) | ||
| (= :zincrby (:f n)) | ||
| (some? (:score n)) | ||
| (some? (:delta n)) | ||
| (= (:score n) (+ (:score m) (:delta n))))) |
There was a problem hiding this comment.
Constrain ZINCRBY candidates superseded by ZADD
When a committed ZINCRBY overlaps a committed ZADD, the ZINCRBY's returned score can still prove it serialized before that ZADD, so its return value is not a possible final post-read score. For example, from an empty set, ZADD m 10 overlapping ZINCRBY m +1 -> 1 can only serialize as ZINCRBY then ZADD, so a read after both complete must see 10; this helper only prunes ZINCRBY scores using other ZINCRBYs, leaving 1 in the allowed set and letting stale reads pass.
Useful? React with 👍 / 👎.
| ;; states: the return values constrain the serialization | ||
| ;; order, and no legitimate read can observe a partial sum | ||
| ;; that doesn't match any recorded post-op score. | ||
| unknown-score? (has-unknown-incr? uncertain) |
There was a problem hiding this comment.
Avoid making deterministic :info ZINCRBY scores arbitrary
For a pre-read :info ZINCRBY with no overlapping/superseding writes, the possible scores are still bounded by the known committed base and the known delta (e.g. after ZADD m 1, :info ZINCRBY m +5 can leave m at either 1 or 6). Setting unknown-score? for every such op makes the read checks skip score membership entirely, so a later ZRANGE returning m=999 is accepted and score-corruption anomalies after lost ZINCRBY replies are hidden.
Useful? React with 👍 / 👎.
| (catch Throwable t | ||
| (warn t (str "ZSet safety op failed: " (:f op))) | ||
| (assoc op :type :info :error (or (.getMessage ^Throwable t) (str t)))))))) |
There was a problem hiding this comment.
Surface malformed read payloads as checker failures
Fresh evidence relative to the earlier odd-length-payload concern: parse-withscores now throws, but this catch converts that protocol violation into a :info read, and zset-safety-checker only analyzes :ok reads. If one malformed WITHSCORES reply occurs in a run that also has successful reads, the workload can still report :valid? true instead of failing on the corrupted Redis response.
Useful? React with 👍 / 👎.
| ;; present. Because candidates have overlapping windows (they | ||
| ;; all share the same max-inv), any of them can serialize last. | ||
| ;; So presence is allowed iff at least one candidate is a write. | ||
| candidate-can-be-present? (boolean any-candidate-write?) |
There was a problem hiding this comment.
Respect ZREM return values when allowing final presence
When overlapping committed candidates include a ZREM that returned true, that response can constrain the only valid order. Starting from an absent member, ZADD m 1 overlapping ZREM m -> 1 forces the ZADD to serialize before the ZREM, so a later read after both complete must not see m; allowing presence whenever any candidate is a write lets exactly that stale read pass.
Useful? React with 👍 / 👎.
| ok? (= :ok t) | ||
| s (when (and ok? (vector? (:value complete))) | ||
| (second (:value complete)))] | ||
| {:f :zincrby :member m :delta delta :score (some-> s double) |
There was a problem hiding this comment.
Validate ZINCRBY scores instead of trusting them as state
For a non-concurrent ZINCRBY with a known prior score, the returned score is itself part of Redis semantics and should equal prior + delta. This records the server reply as the model score, so if the system returns and stores 999 for ZADD m 1; ZINCRBY m +5, a later read of 999 is treated as valid instead of exposing the arithmetic/score-corruption bug.
Useful? React with 👍 / 👎.
| (some (fn [n] | ||
| (and (not (identical? n m)) | ||
| (= :zincrby (:f n)) | ||
| (some? (:score n)) | ||
| (some? (:delta n)) | ||
| (= (:score n) (+ (:score m) (:delta n))))) |
There was a problem hiding this comment.
Preserve valid ZINCRBY tails with negative deltas
With negative or zero increments, this pairwise supersession test can remove the actual final score and produce false Jepsen failures. For example, from score 0, overlapping committed ZINCRBY +5 -> 5 and ZINCRBY -5 -> 0 leave 0 as the valid final state, but 5 == 0 + 5 also makes the final 0 look like an intermediate prefix, so both returned scores can be pruned and a correct read of 0 is flagged.
Useful? React with 👍 / 👎.
| (defn elastickv-zset-safety-test [] | ||
| (zset-safety-workload/elastickv-zset-safety-test {})) |
There was a problem hiding this comment.
Accept Jepsen CLI option maps in named workloads
Jepsen 0.3.11 documents single-test-cmd's :test-fn as a function receiving the parsed option map (https://jepsen-io.github.io/jepsen/jepsen.cli.html#single-test-cmd), but the newly registered ZSet wrapper is zero-arity. Running the documented path lein run -m elastickv.jepsen-test elastickv-zset-safety-test test ... therefore dispatches to this var and crashes with an arity error instead of constructing the workload with CLI options.
Useful? React with 👍 / 👎.
|
@codex review Current HEAD already addresses the blocking ZSET safety review items I checked here: duplicate members are rejected, concurrent failed ops are excluded from candidates, candidate ordering is not completion-time-only, ZREM return values are modeled through removed? state, ZINCRBY unknown handling is scoped to unknown scores, and read completeness uses must-be-present? expectations. Validation: from jepsen/, ran |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 03ce992af2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ;; present. Because candidates have overlapping windows (they | ||
| ;; all share the same max-inv), any of them can serialize last. | ||
| ;; So presence is allowed iff at least one candidate is a write. | ||
| candidate-can-be-present? (boolean any-candidate-write?) |
There was a problem hiding this comment.
Respect true ZREM as ordering evidence
In histories where setup starts empty and the only write for a member is an overlapping ZADD while an overlapping ZREM returns removed? true, this lets a post-completion read that still shows the member pass. The true ZREM response proves it linearized after some prior write; with no earlier committed presence, that write can only be the overlapping ZADD, so the final state must be absent. Treating any candidate write as enough for presence masks stale reads after a successful remove.
Useful? React with 👍 / 👎.
| candidate-scores (remove #(superseded-zincrby-score? candidates %) candidates) | ||
| scores (as-> #{} s | ||
| (reduce add-scores s candidate-scores) | ||
| (reduce add-scores s uncertain)) |
There was a problem hiding this comment.
Reject lost-update ZINCRBY return sets
When two :ok ZINCRBYs overlap, both complete before the read, and each returns as if it saw the same base, neither return value is removed by superseded-zincrby-score? because no return equals the other plus its delta. scores then admits a partial return (for example starting at 1, +2 -> 3 and +3 -> 4, then a read after both returns 4), even though every linearization of two successful increments must end at 6. This makes the safety workload miss the concurrent increment lost-update case it is intended to catch.
Useful? React with 👍 / 👎.
| max-inv (reduce max -1 (map :invoke-idx preceding)) | ||
| candidates (filterv #(>= (:complete-idx %) max-inv) preceding) |
There was a problem hiding this comment.
Don't let no-op ZREM supersede writes
When a confirmed no-op ZREM is the latest preceding mutation (for example, an :ok ZADD, then an :ok ZREM with [member false], then a read), max-inv keeps only that ZREM in candidates. Because that candidate is neither a write nor an effective remove, can-be-present? and must-be-present? both become false, so the checker rejects a correct read that still contains the member and accepts an absent read even though the no-op should have left the prior score intact.
Useful? React with 👍 / 👎.
|
|
||
| :zrange-all | ||
| (let [flat (car/wcar cs (car/zrange zset-key 0 -1 "WITHSCORES"))] | ||
| (assoc op :type :ok :value (parse-withscores flat))) |
There was a problem hiding this comment.
Don't hide malformed ZRANGE replies as timeouts
If Redis returns a malformed WITHSCORES payload, such as an odd element count or an unparsable score, parse-withscores throws here but the surrounding catch Throwable converts the read to :info. The checker ignores non-:ok reads, so an intermittent protocol/serialization bug is discarded and the run can still pass as long as another read succeeds; malformed successful replies should be reported as invalid read evidence instead of downgraded to indeterminate operations.
Useful? React with 👍 / 👎.
| (defn elastickv-zset-safety-test [] | ||
| (zset-safety-workload/elastickv-zset-safety-test {})) |
There was a problem hiding this comment.
Accept CLI options in the dispatched ZSet test
When the new dispatcher is used as documented (lein run -m elastickv.jepsen-test elastickv-zset-safety-test ...), Jepsen 0.3.11's single-test-cmd invokes the selected :test-fn as (test-fn options). This wrapper only has zero arity, so the ZSet entrypoint added to test-fns crashes with an ArityException before running and cannot honor any supplied Jepsen options; it needs to accept the options map and forward it to zset-safety-workload/elastickv-zset-safety-test.
Useful? React with 👍 / 👎.
|
Addressed the current-head ZSET safety checker findings in d31d8b9. What changed:
Verification:
Note: full cd jepsen && lein test still fails before running tests because elastickv.redis-workload cannot locate jepsen.redis.client on the classpath; this appears unrelated to the ZSET workload changes. @codex review |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/jepsen-test-scheduled.yml:
- Line 186: Replace the uses declaration for the upload-artifact action from
using a version tag (v7) to using a full commit SHA. Identify the current commit
SHA associated with upload-artifact v7 and replace `actions/upload-artifact@v7`
with `actions/upload-artifact@<full-commit-sha>` to pin the workflow to a
specific immutable commit rather than a mutable version tag, which strengthens
supply-chain security.
- Around line 156-159: The direct expansion of ${{ inputs.rate || '5' }} inside
the shell command creates a command injection vulnerability. Instead, bind the
input value to an environment variable in the GitHub Actions workflow (in the
env section or step env), then reference that variable in the shell command with
proper quoting like "${RATE_VAR}" to safely pass the value at runtime rather
than during workflow expansion.
In @.github/workflows/jepsen-test.yml:
- Line 140: In the jepsen-test.yml workflow file, replace the mutable version
tag `v7` in the `actions/upload-artifact` action with a specific full commit
SHA. Instead of using `actions/upload-artifact@v7`, pin it to a complete commit
digest (e.g., `actions/upload-artifact@abc123...`) to ensure the action version
is immutable and cannot change unexpectedly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0e42d95a-85ef-448e-9fb4-cdd32cc75e30
📒 Files selected for processing (5)
.github/workflows/jepsen-test-scheduled.yml.github/workflows/jepsen-test.ymljepsen/src/elastickv/jepsen_test.cljjepsen/src/elastickv/redis_zset_safety_workload.cljjepsen/test/elastickv/redis_zset_safety_workload_test.clj
💤 Files with no reviewable changes (3)
- jepsen/src/elastickv/jepsen_test.clj
- jepsen/test/elastickv/redis_zset_safety_workload_test.clj
- jepsen/src/elastickv/redis_zset_safety_workload.clj
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/jepsen-test-scheduled.yml:
- Line 186: Replace the uses declaration for the upload-artifact action from
using a version tag (v7) to using a full commit SHA. Identify the current commit
SHA associated with upload-artifact v7 and replace `actions/upload-artifact@v7`
with `actions/upload-artifact@<full-commit-sha>` to pin the workflow to a
specific immutable commit rather than a mutable version tag, which strengthens
supply-chain security.
- Around line 156-159: The direct expansion of ${{ inputs.rate || '5' }} inside
the shell command creates a command injection vulnerability. Instead, bind the
input value to an environment variable in the GitHub Actions workflow (in the
env section or step env), then reference that variable in the shell command with
proper quoting like "${RATE_VAR}" to safely pass the value at runtime rather
than during workflow expansion.
In @.github/workflows/jepsen-test.yml:
- Line 140: In the jepsen-test.yml workflow file, replace the mutable version
tag `v7` in the `actions/upload-artifact` action with a specific full commit
SHA. Instead of using `actions/upload-artifact@v7`, pin it to a complete commit
digest (e.g., `actions/upload-artifact@abc123...`) to ensure the action version
is immutable and cannot change unexpectedly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0e42d95a-85ef-448e-9fb4-cdd32cc75e30
📒 Files selected for processing (5)
.github/workflows/jepsen-test-scheduled.yml.github/workflows/jepsen-test.ymljepsen/src/elastickv/jepsen_test.cljjepsen/src/elastickv/redis_zset_safety_workload.cljjepsen/test/elastickv/redis_zset_safety_workload_test.clj
💤 Files with no reviewable changes (3)
- jepsen/src/elastickv/jepsen_test.clj
- jepsen/test/elastickv/redis_zset_safety_workload_test.clj
- jepsen/src/elastickv/redis_zset_safety_workload.clj
🛑 Comments failed to post (3)
.github/workflows/jepsen-test-scheduled.yml (2)
156-159: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Avoid direct
${{ inputs.* }}expansion inside shell commands.
--rate ${{ inputs.rate || '5' }}is rendered into shell source before execution, which can become command injection if a crafted workflow input is passed. Bind input toenvand validate/quote at runtime instead.Suggested hardening
- name: Run DynamoDB per-type Jepsen workloads against elastickv working-directory: jepsen timeout-minutes: 30 env: + INPUT_RATE: ${{ inputs.rate || '5' }} TYPE_TL: "60" TYPE_CONCURRENCY: "4" TYPE_KEY_COUNT: "8" TYPE_MAX_WRITES: "80" run: | + [[ "$INPUT_RATE" =~ ^[0-9]+([.][0-9]+)?$ ]] || { echo "Invalid rate: $INPUT_RATE"; exit 2; } PER_TYPE_TIMEOUT=$((TYPE_TL + 180)) declare -A RESULT FAILED=() for t in string number binary bool null string-set number-set binary-set list map; do echo "::group::value-type=${t}" set +e timeout "${PER_TYPE_TIMEOUT}" ~/lein run -m elastickv.dynamodb-types-workload --local \ --time-limit "${TYPE_TL}" \ - --rate ${{ inputs.rate || '5' }} \ + --rate "${INPUT_RATE}" \ --concurrency "${TYPE_CONCURRENCY}" \ --key-count "${TYPE_KEY_COUNT}" \ --max-writes-per-key "${TYPE_MAX_WRITES}" \ --value-type "${t}" \ --dynamo-ports 63801,63802,63803 \ --host 127.0.0.1🧰 Tools
🪛 zizmor (1.26.1)
[error] 158-158: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/jepsen-test-scheduled.yml around lines 156 - 159, The direct expansion of ${{ inputs.rate || '5' }} inside the shell command creates a command injection vulnerability. Instead, bind the input value to an environment variable in the GitHub Actions workflow (in the env section or step env), then reference that variable in the shell command with proper quoting like "${RATE_VAR}" to safely pass the value at runtime rather than during workflow expansion.Source: Linters/SAST tools
186-186: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Pin
upload-artifactto a full commit SHA.Using
actions/upload-artifact@v7without SHA pinning weakens workflow supply-chain integrity.🧰 Tools
🪛 zizmor (1.26.1)
[error] 186-186: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/jepsen-test-scheduled.yml at line 186, Replace the uses declaration for the upload-artifact action from using a version tag (v7) to using a full commit SHA. Identify the current commit SHA associated with upload-artifact v7 and replace `actions/upload-artifact@v7` with `actions/upload-artifact@<full-commit-sha>` to pin the workflow to a specific immutable commit rather than a mutable version tag, which strengthens supply-chain security.Source: Linters/SAST tools
.github/workflows/jepsen-test.yml (1)
140-140: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Pin
upload-artifactto a full commit SHA.
actions/upload-artifact@v7is mutable and should be pinned to a specific commit digest per policy.🧰 Tools
🪛 zizmor (1.26.1)
[error] 140-140: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/jepsen-test.yml at line 140, In the jepsen-test.yml workflow file, replace the mutable version tag `v7` in the `actions/upload-artifact` action with a specific full commit SHA. Instead of using `actions/upload-artifact@v7`, pin it to a complete commit digest (e.g., `actions/upload-artifact@abc123...`) to ensure the action version is immutable and cannot change unexpectedly.Source: Linters/SAST tools
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d31d8b9d50
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| (and (= :ok (:type m)) (false? (:removed? m))) | ||
| #{st} |
There was a problem hiding this comment.
Constrain no-op ZREM to an absent prior state
When a member was definitely present before ZREM—for example, a :zadd completes before the :zrem is invoked—and ZREM returns 0 while the member remains visible, this branch keeps the present state and the checker accepts the history. Redis's 0 count is only compatible with the member being absent at the ZREM linearization point, so a false ZREM should return #{st} only for absent states (and #{} for present states); otherwise failed removals can pass the safety workload.
Useful? React with 👍 / 👎.
| "True iff the mutation adds/updates the member's score (i.e. would | ||
| make the member present). :zrem is NOT a write-op here." | ||
| [m] | ||
| (#{:zadd :zincrby} (:f m))) |
There was a problem hiding this comment.
Keep uncertain increments before later ZINCRBYs
With an :info ZINCRBY that completes before a later :ok ZINCRBY starts, the earlier increment may be required to make the later reply consistent (e.g. absent -> ZINCRBY +5 returns :info, then ZINCRBY +1 returns 6). Because write-op? includes :zincrby, superseded-by-preceding-state-change? drops the pre-read :info, so the checker rejects the valid read of 6. Only absolute overwrites/deletions should supersede prior uncertain increments; relative increments need to remain in the enumerated prefix.
Useful? React with 👍 / 👎.
| base-preceding (->> preceding | ||
| (remove (set candidates)) | ||
| (sort-by :complete-idx)) |
There was a problem hiding this comment.
Enumerate overlapping base mutations before tail candidates
These base mutations can still overlap each other even though they all complete before the read and before the tail candidate's invoke. Sorting them by :complete-idx explores only one legal serialization; e.g. two overlapping ZADDs to scores 1 and 2 followed by a non-concurrent ZINCRBY +1 returning 3 is valid if the score-2 ZADD serialized last, but this code fixes the base at score 1 and rejects the history. The base prefix needs the same real-time-consistent enumeration as the tail candidates.
Useful? React with 👍 / 👎.
| states (apply-sequential-mutations #{absent-state} applied) | ||
| state (first states)] | ||
| (if state | ||
| (assoc model member state) | ||
| model))) |
There was a problem hiding this comment.
Fail impossible mutation chains instead of dropping them
apply-sequential-mutations uses an empty state set to mean a successful mutation reply is impossible, but this branch just omits the member from the model. In a run such as ZADD m1 1, ZINCRBY m1 5 returning 999, followed by an empty ZRANGE, the bad successful ZINCRBY reply is silently discarded and no completeness check runs for m1, so the checker can pass histories with impossible write responses unless a later read happens to expose the bad score.
Useful? React with 👍 / 👎.
| (when (odd? (count flat)) | ||
| (throw (ex-info "WITHSCORES reply has odd element count" | ||
| {:count (count flat) | ||
| :payload flat}))) | ||
| (->> flat |
There was a problem hiding this comment.
Reject nil WITHSCORES replies as malformed
count on nil is 0 in Clojure, so a nil Carmine reply reaches partition and becomes an empty successful read instead of a malformed-read failure. If ZRANGE/ZRANGEBYSCORE starts returning nil against an empty key, this checker records a successful empty read and can return :valid? true, which defeats the malformed-payload and no-successful-read guards.
Useful? React with 👍 / 👎.
| {"elastickv-test" elastickv-test | ||
| "elastickv-zset-safety-test" elastickv-zset-safety-test | ||
| "elastickv-dynamodb-test" elastickv-dynamodb-test | ||
| "elastickv-s3-test" elastickv-s3-test}) |
There was a problem hiding this comment.
Let every selected workload accept the option map
The new dispatcher exposes these zero-arity wrappers through single-test-cmd, but Jepsen's single-test-cmd documentation says the :test-fn receives the parsed option map. Selecting elastickv-test, elastickv-dynamodb-test, or elastickv-s3-test now calls a zero-arity function with one argument, so those newly advertised named workloads fail before running; they need the same ([] ...) ([opts] ...) shape as the zset wrapper.
Useful? React with 👍 / 👎.
|
@codex review Addressed the current-head ZSET safety review findings in d094a85:
Caller audit: grep checked parse-withscores, model-before, superseded-by-preceding-state-change?, and the workload wrapper functions; affected callers are the zset checker/read paths and the CLI dispatcher tests. Validation:
Note: elastickv.jepsen-test-test still cannot be run in this local classpath because loading elastickv.redis-workload fails on missing jepsen.redis.client, which is the same local full-suite blocker seen previously. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d094a85aec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| base-states (linearized-read-states #{absent-state} base-preceding []) | ||
| possible-states (linearized-read-states base-states candidates uncertain) |
There was a problem hiding this comment.
Keep overlapping base mutations in the candidate search
When a pre-read mutation falls into base-preceding but overlaps one of the tail candidates, this applies it before every candidate, which can reject valid linearizations. For example, ZADD m 1 invoked at 0/completing at 5, ZINCRBY m 1 invoked at 1/completing at 2 returning 2, and another ZINCRBY m 1 invoked at 3/completing at 4 returning 3 is valid as ZADD -> first increment -> second increment; however the first increment is put in base-preceding and applied to the absent state before the overlapping ZADD candidate, making the checker report a false failure for a later read of 3. This can make the new Jepsen workload fail correct executions with slow overlapping writes.
Useful? React with 👍 / 👎.
| ;; consistent order and any of their return values is a valid | ||
| ;; final state. | ||
| max-inv (reduce max -1 (map :invoke-idx preceding)) | ||
| candidates (filterv #(>= (:complete-idx %) max-inv) preceding) |
There was a problem hiding this comment.
Preserve one serialization across later reads
When overlapping writes have all completed before multiple later reads, this candidate set is recomputed independently for each read, so each read may choose a different order for the same completed writes. For example, two overlapping ZADDs can both complete, then a read can return score 2 and a later read return score 1 with no intervening mutation; both reads are accepted because each score is in its own candidate set, even though no single linearization can satisfy both. This makes the new Jepsen checker miss stale/non-monotonic reads after ambiguous writes.
Useful? React with 👍 / 👎.
| (let [{:keys [scores can-be-present?]} | ||
| (allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)] |
There was a problem hiding this comment.
Use one read prefix across all members
Each entry is validated with a fresh per-member state search, so a single read can combine visibility choices that no one linearization point permits. For instance, if ZADD m2 overlaps the read but completes before ZADD m1 is invoked, and the read returns m1 while omitting m2, the checker can accept m1 as a visible concurrent write and independently skip m2; however any read linearized late enough to see m1 must also include the earlier-completed m2. This misses fractured ZRANGE/ZRANGEBYSCORE snapshots across members.
Useful? React with 👍 / 👎.
|
@codex review 対応しました。
検証:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c5147914b3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| (let [observed (into {} entries) | ||
| all-mutations (mapcat second mutations-by-m) | ||
| kind (if bounds :fractured-read-prefix-range :fractured-read-prefix)] | ||
| (->> (for [[member score] entries |
There was a problem hiding this comment.
Anchor prefix checks on visible ZREM omissions
When the read's only evidence of a concurrent mutation is an omission caused by ZREM, this loop never considers that removed member because it only iterates over returned entries. For example, if m1 is definitely present before a full read, m2's ZADD completes before a concurrent ZREM m1 is invoked, and the overlapping read returns [], the checker accepts it: m1 may be absent because of the ZREM and m2 may be absent because its ZADD is concurrent, but no single prefix can omit both once the ZREM is visible. This lets fractured snapshots involving deletes pass; infer missing members made optional by a concurrent ZREM and run the same predecessor-prefix check for those anchors.
Useful? React with 👍 / 👎.
| :zincrby | ||
| (and (= :ok (:type m)) | ||
| (score-eq? (:score m) score)) |
There was a problem hiding this comment.
Treat visible info ZINCRBY as a prefix anchor
For :info ZINCRBYs, this predicate returns false even when the observed score is only reachable by applying that in-flight increment. A history where m2's ZADD completes during a read before an :info ZINCRBY m1 is invoked, and the read returns m1 at the incremented score while omitting m2, currently passes: allowed-scores-for-member admits the derived score, but read-prefix-errors never anchors on the visible ZINCRBY, so it does not force the real-time predecessor m2 into the same read prefix. This misses fractured snapshots whenever the visible concurrent mutation is an unknown-response increment.
Useful? React with 👍 / 👎.
| :when (and (not= member (:member predecessor)) | ||
| (= :ok (:type predecessor)) | ||
| (real-time-before? predecessor anchor) | ||
| (not (contains? observed (:member predecessor))) |
There was a problem hiding this comment.
Validate forced predecessors when already present
This condition only reports a forced real-time predecessor when its member is omitted, so the checker misses fractured reads that include that member at an older state. For example, if a read starts before ZADD m2 2, that update completes, then a visible concurrent ZADD m1 1 starts, a result containing m1=1 and stale m2=1 passes because allowed-scores-for-member permits either concurrent score for m2 and read-prefix-errors skips it once m2 is present. When a visible anchor forces a predecessor into the same prefix, compare the observed predecessor state against the forced-prefix states instead of only checking membership.
Useful? React with 👍 / 👎.
|
@codex review 対応しました。
Validation:
|
|
Codex Review: Didn't find any major issues. Chef's kiss. Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Adds a Jepsen workload (
elastickv-zset-safety-test) that verifies ZSet-specific safety properties under faults (network partitions, node kills), using a custom model-based checker. Goes beyond the simple add->read visibility check.Properties verified:
Concurrent-ZADD handling uses an invoke/complete windowing approach. A mutation is "committed before" a read iff its
:completeindex is strictly less than the read's:invokeindex. Mutations whose intervals overlap are "concurrent" and contribute to a per-member allowed-score set. Indeterminate (:info) mutations are treated as possibly-concurrent. ZINCRBY whose response is unknown sets:unknown-score?so the checker skips the strict score check for concurrent reads. ZREM carries the actualremoved?boolean from the server reply so a no-op ZREM does not falsely mark the member as deleted.Files
jepsen/src/elastickv/redis_zset_safety_workload.cljjepsen/test/elastickv/redis_zset_safety_workload_test.clj(test-spec construction + checker edge cases: no-op ZREM, :info ZINCRBY, deterministic score mismatch).jepsen/src/elastickv/jepsen_test.clj(entry point added)..github/workflows/jepsen-test.yml(5s smoke run on every push)..github/workflows/jepsen-test-scheduled.yml(150s default run every 6h).Running locally
The workload is invoked directly via its own
-main, not throughjepsen-test/-main:cd jepsen lein run -m elastickv.redis-zset-safety-workload \ --time-limit 60 --rate 10 --concurrency 5 \ --ports 63791,63792,63793 --host 127.0.0.1(
elastickv.jepsen-testexposeselastickv-zset-safety-testonly as a Clojure function for REPL use; CI and ad-hoc runs use the namespace's own-main.Test plan
EOF
)
Summary by CodeRabbit
New Features
CI/CD
Tests