
fix prefetch of page index #6999

Merged: 6 commits, Jan 22, 2025
Conversation

@adriangb (Contributor)

Fixes the case where a metadata prefetch on a Parquet file includes the page index, e.g. if you prefetch the entire file.

@github-actions bot added the `parquet` (Changes to the parquet crate) label on Jan 20, 2025
@adriangb (Contributor, Author)

As a side note, I think one of the biggest bottlenecks in systems working from object storage tends to be latency, so it's important to minimize it (this is well known, including in the comments/docstrings in this file).

Would it be beneficial to have the right APIs to make it possible to pre-fetch the entire file? E.g. if I'm going to load a <1MB Parquet file, I might want to make a single request to object storage and know I have everything I need, instead of loading the metadata and then making another request to load the data. This would be especially beneficial in the scenario where you don't know the metadata size but do know the file size: you make 1 request instead of potentially 3+.

@adriangb (Contributor, Author)

cc @tustvold

@tustvold (Contributor)

Would it be beneficial to have the right APIs to make it possible to pre-fetch the entire file? E.g. if I'm going to load a <1MB parquet file I might want to just make a single request to object storage and know I have everything I need instead of loading the metadata

In such a scenario you're best off just fetching the entire file and feeding the Bytes to the synchronous readers
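To illustrate why a single whole-file fetch suffices, here is a minimal, std-only Rust sketch (the real code would hand the fetched `Bytes` to the parquet crate's synchronous readers; the helper name `footer_metadata_range` and the synthetic buffer below are mine, not part of the crate). It relies on the Parquet format's trailer layout: the last 8 bytes of a file are a 4-byte little-endian footer-metadata length followed by the magic bytes `PAR1`, so once the whole file is in memory the metadata can be located with zero additional I/O.

```rust
use std::ops::Range;

/// Returns the byte range of the footer metadata within `file`, or None if
/// the buffer is too short or the trailing magic is wrong.
fn footer_metadata_range(file: &[u8]) -> Option<Range<usize>> {
    if file.len() < 8 || &file[file.len() - 4..] != b"PAR1" {
        return None;
    }
    let len_bytes: [u8; 4] = file[file.len() - 8..file.len() - 4].try_into().ok()?;
    let meta_len = u32::from_le_bytes(len_bytes) as usize;
    let end = file.len() - 8;
    let start = end.checked_sub(meta_len)?;
    Some(start..end)
}

fn main() {
    // Synthetic buffer, not a real Parquet file: 100 bytes standing in for
    // row group data, 16 bytes standing in for the footer metadata, then
    // the 8-byte trailer (length + magic).
    let mut file = vec![0u8; 100];
    file.extend_from_slice(&[1u8; 16]);
    file.extend_from_slice(&16u32.to_le_bytes());
    file.extend_from_slice(b"PAR1");

    assert_eq!(footer_metadata_range(&file), Some(100..116));
    assert_eq!(footer_metadata_range(b"PAR1"), None);
}
```

With the whole buffer in hand, everything else (metadata decode, page index, data pages) is a local slice, which is exactly the "one request, then synchronous decode" pattern suggested above.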

@tustvold (Contributor)

FYI @etseidl this looks to have been introduced by #6431

@adriangb (Contributor, Author)

In such a scenario you're best off just fetching the entire file and feeding the Bytes to the synchronous readers

I guess that makes sense yeah

@etseidl (Contributor) left a comment

Thanks @adriangb, guess I forgot to test some cases for the async side. I think we can fix this with a little less thrash, though. What do you think?

Comment on lines 412 to 426
let bytes = match &fetched {
Some((fetched_start, fetched)) if *fetched_start <= range.start => {
// `fetched` is an amount of data spanning from fetched_start to the end of the file.
// We want to slice out the range we need from that data, but need to adjust the
// range we are looking for to be relative to fetched_start.
let fetched_start = *fetched_start;
let range = range.start - fetched_start..range.end - fetched_start;
// sanity check: `fetched` should always go until the end of the file,
// so if our range is beyond that, something is wrong!
assert!(
range.end <= fetched_start + fetched.len(),
"range: {range:?}, fetched: {}, fetched_start: {fetched_start}",
fetched.len()
);
fetched.slice(range)
@etseidl (Contributor)
Suggested change
let bytes = match &fetched {
Some((fetched_start, fetched)) if *fetched_start <= range.start => {
// `fetched` is an amount of data spanning from fetched_start to the end of the file.
// We want to slice out the range we need from that data, but need to adjust the
// range we are looking for to be relative to fetched_start.
let fetched_start = *fetched_start;
let range = range.start - fetched_start..range.end - fetched_start;
// sanity check: `fetched` should always go until the end of the file,
// so if our range is beyond that, something is wrong!
assert!(
range.end <= fetched_start + fetched.len(),
"range: {range:?}, fetched: {}, fetched_start: {fetched_start}",
fetched.len()
);
fetched.slice(range)
let offset = range.start - *remainder_start;
let end = offset + range.end - range.start;
assert!(end <= remainder.len());
remainder.slice(offset..end)

Instead of all the other changes, I think this will properly compute the end of the slice.
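A minimal, std-only sketch of the arithmetic in this suggestion (the function name `slice_from_suffix` is hypothetical, and a plain byte slice stands in for the `Bytes` type the real code uses):

```rust
// `remainder` is a prefetched suffix of the file beginning at absolute
// offset `remainder_start`; `range` is an absolute byte range to extract.
fn slice_from_suffix(
    remainder: &[u8],
    remainder_start: usize,
    range: std::ops::Range<usize>,
) -> &[u8] {
    // Translate the absolute range into buffer-relative coordinates.
    let offset = range.start - remainder_start;
    let end = offset + range.end - range.start;
    // The suffix runs to end-of-file, so any in-bounds range must fit.
    assert!(end <= remainder.len());
    &remainder[offset..end]
}

fn main() {
    // A 1000-byte file of which the last 200 bytes (offsets 800..1000)
    // were prefetched.
    let remainder: Vec<u8> = (0..200u32).map(|i| i as u8).collect();
    // Absolute range 850..860 maps to relative 50..60.
    let out = slice_from_suffix(&remainder, 800, 850..860);
    assert_eq!(out, &remainder[50..60]);
}
```

The key point is that `end` is computed from `offset` plus the requested length, so it stays relative to the buffer rather than mixing absolute and relative coordinates.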

@adriangb (Contributor, Author)
done 😄

parquet/src/file/metadata/reader.rs (resolved)
@adriangb adriangb requested review from tustvold and etseidl January 20, 2025 22:06
@etseidl (Contributor) left a comment
Awesome! Thanks again @adriangb

@etseidl (Contributor) commented Jan 20, 2025

Looks like some whitespace is causing CI to fail.

@adriangb (Contributor, Author)

Thanks, fixed.

@alamb (Contributor) left a comment
Thank you @adriangb and @etseidl for the review. This looks great to me

@alamb (Contributor) commented Jan 22, 2025

In such a scenario you're best off just fetching the entire file and feeding the Bytes to the synchronous readers

I guess that makes sense yeah

Perhaps this is worth adding / stating explicitly in comments somewhere

@alamb alamb merged commit ffeda12 into apache:main Jan 22, 2025
16 checks passed