Antalya 26.3 Backport of #99935 - Support more flexible data paths validation in iceberg#1769
Antalya 26.3 Backport of #99935 - Support more flexible data paths validation in iceberg#1769mkmkme wants to merge 2 commits into
Conversation
…_iceberg Support more flexible data paths validation in iceberg
| { | ||
| /// connection://bucket | ||
| auto prefix = table_location.substr(0, table_location.size() - common_path.size()); | ||
| if (data_path.size() < prefix.size()) |
There was a problem hiding this comment.
As I see this change is from ClickHouse#100420.
Do we want to backport this huge PR too?
Verification: PR #1769Title: Antalya 26.3 Backport of ClickHouse#99935 — Support more flexible data paths validation in Iceberg ChangesProduction fix (1 file):
New tests (3 stateless tests, 6 files):
PR-added tests — all GREEN3 tests × 6 stateless jobs each = 18 OK runs, 0 failures. (1 SKIPPED per test in
Jobs covered: Both error paths ( CI overview (head commit)
PR test-workflow failures
Regression-workflow failures (chronic baseline on
|
| Suite | Fails |
|---|---|
| Swarms (Aarch64 + Release) | 226 |
| Parquet (Aarch64 + Release) | 34 |
S3Export partition (Aarch64 + Release) |
20 |
S3Export part (Aarch64 + Release) |
16 |
Same fingerprint as sibling antalya-26.3 PRs (1770, 1767, 1762, 1759, 1748, …). No new failure modes.
Caveat — partial frontport
This PR lands on antalya-26.3 while companion features from antalya-26.1 are still being frontported in parallel. Some regression suites still rely on settings/CLI args not yet wired in this base. Final re-verify is recommended once the rest of the bundle lands.
Verdict
Safe to merge.
- All 3 newly added stateless tests pass 100% (18/18 stateless runs), covering both the new
BAD_ARGUMENTSguards and the positive Spark-style path resolution. - Only test-level failure is
03443_shared_storage_snapshots, a well-documented pre-existing flake. - The other 5 GitHub-level red checks are job-level/infra errors and the recurring
antalya-26.3chronic baseline.
Audit: PR #1769 — Antalya 26.3 Backport of #99935 — Support more flexible data paths validation in icebergAI audit note: This review comment was generated by AI (Cursor agent, audit-review skill). Scope reviewedSingle-commit backport (PR head Changes
Call graph (in scope)
Transition matrix
Confirmed defects1.
|
| Category | Outcome |
|---|---|
Old std::length_error on find_first_of(...) == npos (e.g. fuzzer input '.*') |
Fixed for both branches with explicit npos/degenerate-index throws |
Sibling-prefix false match in first branch of getProperFilePathFromMetadataInfo (table vs tableX) |
Fixed by child-boundary check |
Second-branch std::out_of_range when data_path.size() < prefix.size() |
Fixed by explicit BAD_ARGUMENTS with diagnostic |
Spark-style absolute-path tables (different scheme/bucket in location) |
Now works (test 04034) |
| Catalog vs filesystem-only Iceberg | No new code path beyond the listed branches |
| Lifetime / iterator invalidation / data races | None — pure string parsing |
| Exception safety | Only throws added; no resource acquired before the new guards |
C++ bug classes (delta)
| Class | Assessment |
|---|---|
std::out_of_range / std::length_error via iterator + npos |
Fixed by explicit npos guards |
| Sibling-prefix accidental match | Fixed by child-boundary check |
| Iterator / reference invalidation | None |
| Data races, deadlocks | None |
UB / integer wraparound (size_t underflow in prefix.size() arithmetic) |
Fixed for the second branch via explicit size guard |
| Filename-format regression introduced by reverting #1640 | See defect above |
Test review
| Test | Coverage |
|---|---|
04033_iceberg_mismatched_location.sh |
Forces table location to a much longer path than the manifest-list entries; pre-fix → std::out_of_range; post-fix → BAD_ARGUMENTS |
04034_iceberg_spark_style_location.sh |
Spark-style metadata where location = s3a://spark-bucket/… and manifest entries use the full Spark prefix; positive end-to-end SELECT |
04061_iceberg_bad_metadata_file_name.sh |
Fuzzer input iceberg_metadata_file_path = '.*' → BAD_ARGUMENTS instead of std::length_error |
| Missing coverage | No test for the vN-<uuid>.metadata.json filename produced by this branch's own FileNamesGenerator under a transactional catalog — that's why the regression slips through |
Audit update for PR #1769 (Iceberg path validation backport)
Confirmed defects
High: getMetadataFileAndVersion regression — rejects Antalya's own vN-<uuid>.metadata.json metadata filenames after this PR
- Impact: Iceberg tables that use a transactional catalog (REST / Glue) become unreadable; all subsequent
SELECT/INSERT/ALTERthrowBAD_ARGUMENTS. - Anchor:
src/Storages/ObjectStorage/DataLakes/Iceberg/Utils.cpp::getMetadataFileAndVersion, v-prefix branch. - Trigger: any Iceberg operation against a tree that wrote metadata as
vN-<uuid>.metadata.json(the formatFileNamesGenerator::generateMetadataNameemits whenevercatalog != nullptr && catalog->isTransactional()). - Why defect: silently reverts the dash-aware search set added by #1640 while keeping the doc comment that claims the format is supported; the new BAD_ARGUMENTS error message also omits
vN-<uuid>.metadata.jsonfrom the list of accepted forms. - Fix direction: in the v-branch, use
find_first_of(".-")(or split into v-then-N-then-./-logic) and extend the error message to include all three accepted forms. - Regression test direction: a stateless test that constructs (or causes a write of) a
vN-<uuid>.metadata.jsonfile and asserts thatgetMetadataFileAndVersionand a follow-upSELECTsucceed.
Coverage summary
| Item | Detail |
|---|---|
| Scope reviewed | The two Utils.cpp hunks (filename parser, path resolver), three new stateless tests, the catalog/FileNamesGenerator emission paths, and the upstream IcebergPath.cpp::resolve the author cited as the reference. |
| Categories failed | Filename-format compatibility regression (high). |
| Categories passed | npos hardening in both branches of getMetadataFileAndVersion; sibling-prefix tightening and size-guard in getProperFilePathFromMetadataInfo; Spark-style positive case; non-introduction of new lifetime / race / UB; tests added for the cases that motivated the PR. |
| Assumptions / limits | Static audit. Upstream master itself emits vN-<uuid>.metadata.json from FileNamesGenerator::generateMetadataPathWithInfo, so this audit's defect description may also apply upstream; that is out of scope for the Antalya backport, but worth flagging upstream separately. Reviewer's already-raised concern that the size-guard hunk overlaps with upstream #100420 is a scope question, not a correctness defect. |
References
- PR Antalya 26.3 Backport of #99935 - Support more flexible data paths validation in iceberg #1769: Antalya 26.3 Backport of #99935 - Support more flexible data paths validation in iceberg #1769
- PR head commit: a879f09
- Upstream: Support more flexible data paths validation in iceberg ClickHouse/ClickHouse#99935
- Earlier Antalya fix that PR Antalya 26.3 Backport of #99935 - Support more flexible data paths validation in iceberg #1769 inadvertently reverts: 26.3 Antalya port - Alternative syntax for cluster functions #1640
- Reviewer's note about overlap with upstream Resolve problems with paths and compatibility problems with Spark in Azure (v2) ClickHouse/ClickHouse#100420: Resolve problems with paths and compatibility problems with Spark in Azure (v2) ClickHouse/ClickHouse#100420
- Upstream issue motivating the original fix: Iceberg std::out_of_range ClickHouse/ClickHouse#92348
|
@alsugiliazova I've fixed the regression. Could you verify one more time please? Thanks! |
Re-audit after developer fix — commit
|
| Input filename | delim_pos |
version_str |
isdigit |
Outcome |
|---|---|---|---|---|
v1.metadata.json |
2 (.) |
"1" |
pass | OK |
v123.metadata.json |
4 (.) |
"123" |
pass | OK |
v1-<uuid>.metadata.json |
2 (-) |
"1" |
pass | OK (defect fixed) |
v123-<uuid>.metadata.json |
4 (-) |
"123" |
pass | OK (defect fixed) |
v.json |
1 (.) |
— | — | BAD_ARGUMENTS via delim_pos <= 1 |
v-uuid.metadata.json |
1 (-) |
— | — | BAD_ARGUMENTS via delim_pos <= 1 |
vfoo (no . or -) |
npos |
— | — | BAD_ARGUMENTS via npos |
vABC.metadata.json |
4 (.) |
"ABC" |
fail | BAD_ARGUMENTS via isdigit |
'.*' (fuzzer) |
— | — | — | rejected earlier by isTemporaryMetadataFile / starts_with('v') not matched, then else branch's dash_pos == npos → BAD_ARGUMENTS |
All previously broken cases now succeed; all previously rejected malformed cases continue to throw BAD_ARGUMENTS (no narrowing of guards). The npos/degenerate-position hardening from the first commit is preserved.
C++ bug classes (delta after the fix)
| Class | Before fix commit | After fix commit |
|---|---|---|
Filename-format regression — Antalya transactional-catalog vN-<uuid>.metadata.json rejected |
High defect | Resolved — find_first_of(".-") restored, guard delim_pos <= 1 still rejects v./v- degenerate cases |
BAD_ARGUMENTS message lists all accepted forms |
Inaccurate (only two of three) | Accurate (all three: vN, vN-<uuid>, N-<uuid>) |
std::out_of_range / std::length_error via iterator + npos |
Fixed | Fixed (unchanged) |
Sibling-prefix accidental match in getProperFilePathFromMetadataInfo |
Fixed | Fixed (unchanged) |
size_t underflow in second branch of getProperFilePathFromMetadataInfo |
Fixed | Fixed (unchanged) |
| Iterator / reference invalidation, data races, deadlocks | None | None |
Test review delta
| Test | Status |
|---|---|
04033_iceberg_mismatched_location.sh |
unchanged — still asserts BAD_ARGUMENTS for short data_path vs long table_location |
04034_iceberg_spark_style_location.sh |
unchanged — still asserts positive Spark-style absolute-path SELECT |
04061_iceberg_bad_metadata_file_name.sh |
unchanged — still asserts BAD_ARGUMENTS for iceberg_metadata_file_path = '.*' |
New regression test for vN-<uuid>.metadata.json |
Still missing — the fix commit modifies only Utils.cpp; no stateless or integration test asserts a clean round-trip for the filename form Antalya itself emits under transactional catalogs. Recommend adding such a test (e.g. construct a synthetic v2-<uuid>.metadata.json in S3 and assert SELECT succeeds, or wire an IcebergCatalog integration test against the existing REST mock with a transactional configuration). |
Coverage summary (re-audit)
| Item | Detail |
|---|---|
| Scope reviewed | The single Utils.cpp hunk in commit c8f3d6c, plus re-verification that no test files were added/removed alongside it. |
| Categories failed | None (the previously High-severity filename-format regression is resolved). |
| Categories passed | Filename parsing for all three documented forms; npos/degenerate-position hardening; error-message completeness; non-introduction of new lifetime / race / UB / exception-safety issues. |
| Assumptions / limits | Static audit. Runtime verification (e.g. exercising a transactional catalog write that emits vN-<uuid>.metadata.json and asserting a subsequent SELECT succeeds) is recommended before merge, and would close the "missing coverage" item above. Upstream IcebergPath.cpp was not re-checked in this re-audit — if the same regression exists upstream it is out of scope for the Antalya backport but worth a separate report. |
Verdict
The previously identified High-severity defect (Antalya vN-<uuid>.metadata.json rejected by getMetadataFileAndVersion) is fixed by commit c8f3d6c. No new defects were introduced. The only remaining audit item is a non-blocking test-coverage gap: there is no regression test for the vN-<uuid>.metadata.json filename form. The PR is otherwise safe to merge from a correctness perspective.
Additional references
- Fix commit: c8f3d6c
Verification: PR #1769 (re-verified after latest commit)PR-added tests — all GREEN on new head3 tests × 6 stateless jobs each = 18 OK runs, 0 failures. (1 SKIPPED per test in
Jobs covered: The new CI overview (latest head)
Test-level failures in DBZero. No Regression-workflow failures (chronic baseline on
|
| Suite | Result |
|---|---|
| Swarms (Aarch64 + Release) | chronic baseline |
| Parquet (Aarch64 + Release) | chronic baseline |
S3Export partition (Aarch64 + Release) |
chronic baseline |
S3Export part (Aarch64 + Release) |
chronic baseline |
Verdict
Safe to merge — improved since the previous verification.
Note for reviewer
While applying this backport, I faced conflicts related to the fact that upstream introduced
IcebergPath.cppin 755a11b. I found the old place of the same code inUtils.cppand fixed it there instead.Also, with just these changes, stateless test
04033was still failing, so I had to apply an additional check ingetProperFilePathFromMetadataInfowhich is aligned with the state ofIcebergPath.cppin the upstream.All the added tests are passing locally for me.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Now ClickHouse should properly handle spark-style tables (where we have full absolute path for each file or relative path to common table path) (ClickHouse#99935 by @alesapin)
Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: