HDDS-15643 redundant OM lookupKey RPC or EC checksum by yandrey321 · Pull Request #10594 · apache/ozone

yandrey321 · 2026-06-23T21:48:30Z

Before the fix, the 6-argument constructor called fetchBlocks(), which issues a lookupKey RPC to OM when keyInfo is null. Because the 7-argument constructor assigned this.keyInfo only after the 6-argument constructor returned, keyInfo was always null inside fetchBlocks(), so the RPC was always issued, even when the caller already had OmKeyInfo in hand.

The fix reverses the chain: the 6-arg constructor delegates to the 7-arg with null, and the 7-arg constructor does the full initialization, assigning this.keyInfo before calling fetchBlocks().
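The corrected ordering can be sketched in isolation. This is a minimal illustration, not the actual Ozone class: ChecksumHelperSketch, its 3- and 4-argument constructors (standing in for the 6- and 7-argument ones), the String used in place of OmKeyInfo, and the LOOKUP_RPCS counter are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the checksum helper; all names are illustrative.
class ChecksumHelperSketch {
  // Counts simulated lookupKey RPCs so the effect is observable.
  static final AtomicInteger LOOKUP_RPCS = new AtomicInteger();

  private String keyInfo; // stands in for OmKeyInfo

  // Shorter constructor: delegates with null ("caller has no key info").
  ChecksumHelperSketch(String volume, String bucket, String key) {
    this(volume, bucket, key, null);
  }

  // Longer constructor: performs the full initialization.
  ChecksumHelperSketch(String volume, String bucket, String key,
      String keyInfo) {
    this.keyInfo = keyInfo;            // assigned BEFORE fetchBlocks() reads it
    fetchBlocks(volume, bucket, key);
  }

  private void fetchBlocks(String volume, String bucket, String key) {
    if (keyInfo == null) {
      // Simulated OM round trip, needed only when nothing was supplied.
      LOOKUP_RPCS.incrementAndGet();
      keyInfo = volume + "/" + bucket + "/" + key;
    }
  }

  String keyInfo() {
    return keyInfo;
  }
}
```

With this ordering, a caller that passes key info triggers no simulated lookup, while a caller that passes none triggers exactly one.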

getChunkInfos() builds a standalone Pipeline containing only the selected nodes (data replica index 1 and all parity nodes). The old code used pipeline.toBuilder(), which copies the full EC nodeStatus map (all 5 nodes for RS-3-2, or all 9 for RS-6-3). When setNodes() was then called with the smaller selected-node list, Pipeline.Builder detected the size mismatch and replaced the pipeline ID with PipelineID.randomId(), which calls SecureRandom.nextBytes(), for every file. The old code never called setId() with a deterministic ID, so the random ID was the one the pipeline kept.

Because the pipeline ID was random per file, XceiverClientManager could never reuse a cached gRPC connection: every file opened a new connection regardless of whether it hit the same physical datanodes.
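The effect on caching can be shown with a toy model. This sketch assumes a cache keyed only by pipeline ID; ClientCacheSketch is hypothetical and far simpler than the real XceiverClientManager, whose actual cache key and eviction policy are not described here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical, simplified client cache keyed by pipeline ID.
class ClientCacheSketch {
  private final Map<UUID, Object> clients = new HashMap<>();
  private int connectionsOpened = 0;

  // Returns the cached client for this ID, opening one only on a miss.
  Object acquire(UUID pipelineId) {
    return clients.computeIfAbsent(pipelineId, id -> {
      connectionsOpened++;   // simulated connection setup
      return new Object();   // placeholder for a real client
    });
  }

  int connectionsOpened() {
    return connectionsOpened;
  }
}
```

Acquiring with a fresh UUID.randomUUID() per file opens one connection per file, whereas acquiring repeatedly with the same ID opens only one.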

The fix does two things:
a) Compute a deterministic pipeline ID from the sorted UUIDs of the selected nodes
b) Switch from toBuilder() to Pipeline.newBuilder() so that nodeStatus starts as null. Pipeline.Builder.setNodes() only calls PipelineID.randomId() when nodeStatus != null and the node count changes. With newBuilder(), nodeStatus is always null, so the random replacement never triggers.
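One way step a) could be realized is a name-based UUID over the sorted node UUIDs. This is only a sketch under assumptions: the PR text does not say how the ID is derived, and DeterministicPipelineIdSketch, idFor, and the comma-joined encoding are hypothetical choices; only java.util.UUID calls from the JDK are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical helper: derives a stable ID from a set of node UUIDs.
class DeterministicPipelineIdSketch {
  static UUID idFor(List<UUID> selectedNodes) {
    // Sort a copy so the result does not depend on the caller's ordering.
    List<UUID> sorted = new ArrayList<>(selectedNodes);
    Collections.sort(sorted);

    // Join the sorted UUIDs into one string (assumed encoding).
    StringBuilder joined = new StringBuilder();
    for (UUID node : sorted) {
      joined.append(node).append(',');
    }

    // Name-based (type 3) UUID: same bytes in, same UUID out, no SecureRandom.
    return UUID.nameUUIDFromBytes(
        joined.toString().getBytes(StandardCharsets.UTF_8));
  }
}
```

Because the input is sorted first, two files that select the same datanodes in a different order map to the same ID, which is what allows a cache keyed by pipeline ID to be hit.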

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15643

How was this patch tested?

Run existing unit and integration tests
Run the benchmark FileChecksumBenchmark with 5 threads, 10 s warmup + 20 s measurement per combination. Two configurations (RS-3-2, RS-6-3) × three simulated OM latencies (0 ms / 5 ms / 10 ms). The benchmark is designed to cap the cache hit rate at about 75% and intentionally force cache misses.

Config	Latency	Before	After	Ratio
RS-3-2	0ms	62,203	80,691	1.30×
RS-3-2	5ms	405	637	1.57×
RS-3-2	10ms	204	416	2.04×
RS-6-3	0ms	42,613	53,837	1.26×
RS-6-3	5ms	348	812	2.34×
RS-6-3	10ms	184	416	2.26×

The gain is largest at higher latencies because connection reuse (Fix 2) saves one full RPC round-trip per file. At 0 ms the dominant gain is Fix 1 (halving OM calls). The OM/file counter confirms 2.00 → 1.00 for all rows; CacheHit% confirms 0% → 74-75% for all rows.

yandrey321 added 2 commits June 23, 2026 14:03

HDDS-15643 redundant OM lookupKey RPC and per-file gRPC connection creation for EC checksum

ddb5b98

fixed crc algorithm in the benchmark

5311aa2

yandrey321 wants to merge 2 commits into apache:master from yandrey321:HDDS-15643
