feat(transaction): Add SnapshotValidator#2590
Conversation
| let metadata = base.metadata_ref(); | ||
| let snapshots = ancestors_between(&metadata, to_snapshot_id, from_snapshot_id); | ||
|
|
||
| for current_snapshot in snapshots { |
There was a problem hiding this comment.
This and other for-loops have opportunity to be parallelized using runtime, we can revisit this once we are aligned on the general design or as a follow-up
| /// # Errors | ||
| /// | ||
| /// Returns an error if the history between the snapshots cannot be determined. | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
marking them as dead_code for now and they will be used by delete or rewrite actions later on
There was a problem hiding this comment.
Should we add some unit tests?
| use crate::{Error, ErrorKind}; | ||
|
|
||
| /// Operations whose snapshots may add delete files. | ||
| static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> = |
There was a problem hiding this comment.
nit: The price for pay for defining hash set is the changes in public-api.txt etc. No strong opinion but it seems that we could avoid it by defining
/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
matches!(op, Operation::Overwrite | Operation::Delete)
}
There was a problem hiding this comment.
+1, I think having a reference to that function improves readability anyway.
Maybe not worth it since I don't expect a new operation type anytime soon, but I was thinking along the lines of this.
/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
match op {
Operation::Append | Operation::Replace => false,
Operation::Overwrite | Operation::Delete => true,
}
}There was a problem hiding this comment.
I know this mirrors the Java implementation, however I find it confusing as a replace operation will add delete files, when it rewrites the delete files to retain deletes it could not apply.
What we're really checking for here is that no new deletes apply to the data files we've rewritten.
Is adds_deletes(op: &Operation) better here?
| /// # Errors | ||
| /// | ||
| /// Returns an error if the history between the snapshots cannot be determined. | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Should we add some unit tests?
|
I'm still quite confused about how the overall things works. I think the key point is the design of MergingSnapshotProduder, which contains a lot of indices to speed up the confliction check. |
| /// Validates a snapshot against a table. | ||
| /// | ||
| /// # Arguments | ||
| /// | ||
| /// * `base` - The base table to validate against. | ||
| /// * `parent_snapshot_id` - The ID of the parent snapshot, if any. This is usually | ||
| /// the latest snapshot of the base table, unless it's a non-main branch | ||
| /// (note: writing to branches is not currently supported). | ||
| /// | ||
| /// # Returns | ||
| /// | ||
| /// A `Result` indicating success or an error if validation fails. | ||
| async fn validate(&self, _base: &Table, _parent_snapshot_id: Option<i64>) -> Result<()> { | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
Why not leave this abstract, such that append is documented as requiring no validation in its implementation?
| use crate::{Error, ErrorKind}; | ||
|
|
||
| /// Operations whose snapshots may add delete files. | ||
| static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> = |
There was a problem hiding this comment.
+1, I think having a reference to that function improves readability anyway.
Maybe not worth it since I don't expect a new operation type anytime soon, but I was thinking along the lines of this.
/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
match op {
Operation::Append | Operation::Replace => false,
Operation::Overwrite | Operation::Delete => true,
}
}| use crate::{Error, ErrorKind}; | ||
|
|
||
| /// Operations whose snapshots may add delete files. | ||
| static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> = |
There was a problem hiding this comment.
I know this mirrors the Java implementation, however I find it confusing as a replace operation will add delete files, when it rewrites the delete files to retain deletes it could not apply.
What we're really checking for here is that no new deletes apply to the data files we've rewritten.
Is adds_deletes(op: &Operation) better here?
| match base.metadata().snapshot_by_id(from_snapshot_id) { | ||
| Some(snapshot) => snapshot.sequence_number(), | ||
| None => INITIAL_SEQUENCE_NUMBER, | ||
| } |
There was a problem hiding this comment.
Why is it not an error if we cannot find the snapshot associated with from_snapshot_id? Is it safe to default to INITIAL_SEQUENCE_NUMBER here?
| return Err(Error::new( | ||
| ErrorKind::DataInvalid, | ||
| format!( | ||
| "Cannot commit, found new positional delete for added data file: {}", | ||
| data_file.file_path | ||
| ), | ||
| )); |
There was a problem hiding this comment.
The data files are not 'added', right?
I was reviewing the Java implementation, the wording is: "Cannot commit, found new position delete for replaced data file: ". Shall we align with that?
Which issue does this PR close?
Here is a diagram to picture what I have in mind about the class heirachy for the snapshot producing.
flowchart TB subgraph entry["Transaction entry point"] TA["«trait» TransactionAction"] FAA["FastAppendAction"] RFA["RewriteFilesAction<br/>(planned)"] end subgraph behavior["Behavior traits — passed into SnapshotProducer::commit"] SPO["«trait» SnapshotProduceOperation<br/>what to include / tombstone"] SV["«trait» SnapshotValidator<br/>conflict checks (default no-op)"] MP["«trait» ManifestProcess<br/>how to assemble manifests"] end subgraph impls["Concrete behavior"] FAO["FastAppendOperation"] RFO["RewriteFilesOperation<br/>(planned)"] DMP["DefaultManifestProcess"] end SP["SnapshotProducer<br/>(engine)"] %% who implements the entry trait FAA -- impl --> TA RFA -. impl planned .-> TA %% supertrait SPO -- "requires (supertrait)" --> SV %% concrete impls of behavior traits FAO -- impl --> SPO RFO -. impl planned .-> SPO DMP -- impl --> MP %% actions drive the producer with an operation + manifest process FAA -- "commit() builds" --> SP SP -- "uses OP" --> SPO SP -- "uses MP" --> MP SP -- "calls validate() via supertrait" --> SV FAA -. "passes" .-> FAO FAA -. "passes" .-> DMP RFA -. "passes (planned)" .-> RFOWhat changes are included in this PR?
SnapshotValidatorto help validate the snapshot to avoid conflictsAre these changes tested?
Not yet, this should be tested when concrete implementations that validate history or no new delete files are added