-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract Delta Lake deletion vectors #627
base: main
Are you sure you want to change the base?
Extract Delta Lake deletion vectors #627
Conversation
This change extracts deletion vectors represented as roaring bitmaps in delta lake files and converts them into the XTable intermediate representation. Previously, XTable only detected tables changes that included adding or removing of data files. Now the detected table change also includes any deletion vectors files added in the commit. Note that, in Delta Lake, the Deletion vectors are represented in a compressed binary format. However, once extracted by Xtable, the offset are currently extracted into a list of long offsets. This representation is not the most efficient for large datasets. Optimization is pending to prioritize end-to-end conversion completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an RFC or documentation on the approach? It will make it easier to review the code with some high level design in mind.
|
||
public static class Builder { | ||
public Builder deleteRecordSupplier(Supplier<Iterator<Long>> recordsSupplier) { | ||
this.deleteRecordSupplier = recordsSupplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this required? It seems very similar to what the Builder annotation will provide
dataFilePath = getFullPathToFile(snapshot, dataFilePath); | ||
Path deletionVectorFilePath = deletionVector.absolutePath(snapshot.deltaLog().dataPath()); | ||
|
||
// TODO assumes deletion vector file. Need to handle inlined deletion vectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to detect that it is not a file? Can we throw some unsupported operation exception for now if we hit that path?
// A commit can add deletion vectors when some records are deleted. New deletion vectors can be | ||
// added even if no new data files are added. However, as deletion vectors are always associated | ||
// with a data file, they are implicitly removed when a corresponding data file is removed. | ||
List<InternalDeletionVector> deletionVectorsAdded; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we read a snapshot, will there be deletionVectors present there as well?
long countRecordsDeleted; | ||
|
||
@Getter(AccessLevel.NONE) | ||
Supplier<Iterator<Long>> deleteRecordSupplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a doc for this as well? What does the long represent?
Fixes: #345
This change extracts deletion vectors represented as roaring bitmaps in delta lake files and converts them into the XTable intermediate representation.
Previously, XTable only detected tables changes that included adding or removing of data files. Now the detected table change also includes any deletion vectors files added in the commit.
Note that, in Delta Lake, the Deletion vectors are represented in a compressed binary format. However, once extracted by Xtable, the offset are currently extracted into a list of long offsets. This representation is not the most efficient for large datasets. Optimization is pending to prioritize end-to-end conversion completion.