Skip to content

feat(transaction): Add SnapshotValidator#2590

Open
CTTY wants to merge 2 commits into
apache:mainfrom
CTTY:ctty/snapshot-validator
Open

feat(transaction): Add SnapshotValidator#2590
CTTY wants to merge 2 commits into
apache:mainfrom
CTTY:ctty/snapshot-validator

Conversation

@CTTY
Copy link
Copy Markdown
Collaborator

@CTTY CTTY commented Jun 5, 2026

Which issue does this PR close?

Here is a diagram to picture what I have in mind about the class heirachy for the snapshot producing.

flowchart TB
    subgraph entry["Transaction entry point"]
        TA["«trait» TransactionAction"]
        FAA["FastAppendAction"]
        RFA["RewriteFilesAction<br/>(planned)"]
    end

    subgraph behavior["Behavior traits — passed into SnapshotProducer::commit"]
        SPO["«trait» SnapshotProduceOperation<br/>what to include / tombstone"]
        SV["«trait» SnapshotValidator<br/>conflict checks (default no-op)"]
        MP["«trait» ManifestProcess<br/>how to assemble manifests"]
    end

    subgraph impls["Concrete behavior"]
        FAO["FastAppendOperation"]
        RFO["RewriteFilesOperation<br/>(planned)"]
        DMP["DefaultManifestProcess"]
    end

    SP["SnapshotProducer<br/>(engine)"]

    %% who implements the entry trait
    FAA -- impl --> TA
    RFA -. impl planned .-> TA

    %% supertrait
    SPO -- "requires (supertrait)" --> SV

    %% concrete impls of behavior traits
    FAO -- impl --> SPO
    RFO -. impl planned .-> SPO
    DMP -- impl --> MP

    %% actions drive the producer with an operation + manifest process
    FAA -- "commit() builds" --> SP
    SP -- "uses OP" --> SPO
    SP -- "uses MP" --> MP
    SP -- "calls validate() via supertrait" --> SV

    FAA -. "passes" .-> FAO
    FAA -. "passes" .-> DMP
    RFA -. "passes (planned)" .-> RFO


Loading

What changes are included in this PR?

  • Added a new trait SnapshotValidator to help validate the snapshot to avoid conflicts

Are these changes tested?

Not yet, this should be tested when concrete implementations that validate history or no new delete files are added

let metadata = base.metadata_ref();
let snapshots = ancestors_between(&metadata, to_snapshot_id, from_snapshot_id);

for current_snapshot in snapshots {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and other for-loops have opportunity to be parallelized using runtime, we can revisit this once we are aligned on the general design or as a follow-up

/// # Errors
///
/// Returns an error if the history between the snapshots cannot be determined.
#[allow(dead_code)]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

marking them as dead_code for now and they will be used by delete or rewrite actions later on

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some unit tests?

use crate::{Error, ErrorKind};

/// Operations whose snapshots may add delete files.
static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The price for pay for defining hash set is the changes in public-api.txt etc. No strong opinion but it seems that we could avoid it by defining

/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
    matches!(op, Operation::Overwrite | Operation::Delete)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think having a reference to that function improves readability anyway.

Maybe not worth it since I don't expect a new operation type anytime soon, but I was thinking along the lines of this.

/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
    match op {
        Operation::Append | Operation::Replace => false,
        Operation::Overwrite | Operation::Delete => true,
    }
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this mirrors the Java implementation, however I find it confusing as a replace operation will add delete files, when it rewrites the delete files to retain deletes it could not apply.

What we're really checking for here is that no new deletes apply to the data files we've rewritten.

Is adds_deletes(op: &Operation) better here?

/// # Errors
///
/// Returns an error if the history between the snapshots cannot be determined.
#[allow(dead_code)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some unit tests?

@blackmwk
Copy link
Copy Markdown
Contributor

blackmwk commented Jun 5, 2026

I'm still quite confused about how the overall things works. I think the key point is the design of MergingSnapshotProduder, which contains a lot of indices to speed up the confliction check.

Copy link
Copy Markdown
Contributor

@dannycjones dannycjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Shawn!

Comment on lines +52 to +66
/// Validates a snapshot against a table.
///
/// # Arguments
///
/// * `base` - The base table to validate against.
/// * `parent_snapshot_id` - The ID of the parent snapshot, if any. This is usually
/// the latest snapshot of the base table, unless it's a non-main branch
/// (note: writing to branches is not currently supported).
///
/// # Returns
///
/// A `Result` indicating success or an error if validation fails.
async fn validate(&self, _base: &Table, _parent_snapshot_id: Option<i64>) -> Result<()> {
Ok(())
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not leave this abstract, such that append is documented as requiring no validation in its implementation?

use crate::{Error, ErrorKind};

/// Operations whose snapshots may add delete files.
static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think having a reference to that function improves readability anyway.

Maybe not worth it since I don't expect a new operation type anytime soon, but I was thinking along the lines of this.

/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
    match op {
        Operation::Append | Operation::Replace => false,
        Operation::Overwrite | Operation::Delete => true,
    }
}

use crate::{Error, ErrorKind};

/// Operations whose snapshots may add delete files.
static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this mirrors the Java implementation, however I find it confusing as a replace operation will add delete files, when it rewrites the delete files to retain deletes it could not apply.

What we're really checking for here is that no new deletes apply to the data files we've rewritten.

Is adds_deletes(op: &Operation) better here?

Comment on lines +221 to +224
match base.metadata().snapshot_by_id(from_snapshot_id) {
Some(snapshot) => snapshot.sequence_number(),
None => INITIAL_SEQUENCE_NUMBER,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it not an error if we cannot find the snapshot associated with from_snapshot_id? Is it safe to default to INITIAL_SEQUENCE_NUMBER here?

Comment on lines +240 to +246
return Err(Error::new(
ErrorKind::DataInvalid,
format!(
"Cannot commit, found new positional delete for added data file: {}",
data_file.file_path
),
));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data files are not 'added', right?

I was reviewing the Java implementation, the wording is: "Cannot commit, found new position delete for replaced data file: ". Shall we align with that?

https://github.com/apache/iceberg/blob/bead8dfab3249cbc6ba7713e862ff2698fd2122c/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L538-L549

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement SnapshotValidator

4 participants