-
Notifications
You must be signed in to change notification settings - Fork 846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix prefetch of page index #6999
Conversation
As a side note I think one of the biggest bottlenecks in systems working from object storage tends to be latency, so it's important to minimize latency (this is well known, including in the comments/docstrings in this file). Would it be beneficial to have the right APIs to make it possible to pre-fetch the entire file? E.g. if I'm going to load a <1MB parquet file I might want to just make a single request to object storage and know I have everything I need instead of loading the metadata, then making another request to load the data. This would especially be beneficial for the scenario where you don't know the metadata size but maybe know the file size, then you do 1 request instead of potentially 3+. |
cc @tustvold |
In such a scenario you're best off just fetching the entire file and feeding the |
I guess that makes sense yeah |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @adriangb, guess I forgot to test some cases for the async side. I think we can fix this with a little less thrash, though. What do you think?
parquet/src/file/metadata/reader.rs
Outdated
let bytes = match &fetched { | ||
Some((fetched_start, fetched)) if *fetched_start <= range.start => { | ||
// `fetched`` is an amount of data spanning from fetched_start to the end of the file | ||
// We want to slice out the range we need from that data, but need to adjust the | ||
// range we are looking for to be relative to fetched_start. | ||
let fetched_start = *fetched_start; | ||
let range = range.start - fetched_start..range.end - fetched_start; | ||
// santity check: `fetched` should always go until the end of the file | ||
// so if our range is beyond that, something is wrong! | ||
assert!( | ||
range.end <= fetched_start + fetched.len(), | ||
"range: {range:?}, fetched: {}, fetched_start: {fetched_start}", | ||
fetched.len() | ||
); | ||
fetched.slice(range) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let bytes = match &fetched { | |
Some((fetched_start, fetched)) if *fetched_start <= range.start => { | |
// `fetched`` is an amount of data spanning from fetched_start to the end of the file | |
// We want to slice out the range we need from that data, but need to adjust the | |
// range we are looking for to be relative to fetched_start. | |
let fetched_start = *fetched_start; | |
let range = range.start - fetched_start..range.end - fetched_start; | |
// santity check: `fetched` should always go until the end of the file | |
// so if our range is beyond that, something is wrong! | |
assert!( | |
range.end <= fetched_start + fetched.len(), | |
"range: {range:?}, fetched: {}, fetched_start: {fetched_start}", | |
fetched.len() | |
); | |
fetched.slice(range) | |
let offset = range.start - *remainder_start; | |
let end = offset + range.end - range.start; | |
assert!(end <= remainder.len()); | |
remainder.slice(offset..end) |
Instead of all the other changes, I think this will properly compute the end of the slice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Thanks again @adriangb
Looks like some whitespace is causing CI to fail. |
thanks fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this is worth addng / stating explicitly in comments somewhere |
Fixes the case where a metadata prefetch on a Parquet file includes the page index e.g. if you prefetch the entire file.