Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract Delta Lake deletion vectors #627

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ashvina
Copy link
Contributor

@ashvina ashvina commented Jan 23, 2025

Fixes: #345

This change extracts deletion vectors represented as roaring bitmaps in delta lake files and converts them into the XTable intermediate representation.

Previously, XTable only detected tables changes that included adding or removing of data files. Now the detected table change also includes any deletion vectors files added in the commit.

Note that, in Delta Lake, the Deletion vectors are represented in a compressed binary format. However, once extracted by Xtable, the offset are currently extracted into a list of long offsets. This representation is not the most efficient for large datasets. Optimization is pending to prioritize end-to-end conversion completion.

This change extracts deletion vectors represented as roaring bitmaps in delta lake files and
converts them into the XTable intermediate representation.

Previously, XTable only detected tables changes that included adding or removing of data files. Now
the detected table change also includes any deletion vectors files added in the commit.

Note that, in Delta Lake, the Deletion vectors are represented in a compressed binary format.
However, once extracted by Xtable, the offset are currently extracted into a list of long offsets.
This representation is not the most efficient for large datasets. Optimization is pending to
prioritize end-to-end conversion completion.
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an RFC or documentation on the approach? It will make it easier to review the code with some high level design in mind.


public static class Builder {
public Builder deleteRecordSupplier(Supplier<Iterator<Long>> recordsSupplier) {
this.deleteRecordSupplier = recordsSupplier;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this required? It seems very similar to what the Builder annotation will provide

dataFilePath = getFullPathToFile(snapshot, dataFilePath);
Path deletionVectorFilePath = deletionVector.absolutePath(snapshot.deltaLog().dataPath());

// TODO assumes deletion vector file. Need to handle inlined deletion vectors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to detect that it is not a file? Can we throw some unsupported operation exception for now if we hit that path?

// A commit can add deletion vectors when some records are deleted. New deletion vectors can be
// added even if no new data files are added. However, as deletion vectors are always associated
// with a data file, they are implicitly removed when a corresponding data file is removed.
List<InternalDeletionVector> deletionVectorsAdded;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we read a snapshot, will there be deletionVectors present there as well?

long countRecordsDeleted;

@Getter(AccessLevel.NONE)
Supplier<Iterator<Long>> deleteRecordSupplier;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a doc for this as well? What does the long represent?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read and translate the deletion vectors in Delta source table to XTables internal representation
2 participants