
fix prefetch of page index #6999

Merged: 6 commits, Jan 22, 2025
Conversation

@adriangb (Contributor)

Fixes the case where a metadata prefetch on a Parquet file includes the page index, e.g. if you prefetch the entire file.

@github-actions bot added the `parquet` (Changes to the parquet crate) label on Jan 20, 2025
@adriangb (Contributor, Author)

As a side note, I think one of the biggest bottlenecks in systems working from object storage tends to be latency, so it's important to minimize it (this is well known, including in the comments/docstrings in this file).

Would it be beneficial to have the right APIs to make it possible to pre-fetch the entire file? E.g. if I'm going to load a <1MB Parquet file, I might want to make a single request to object storage and know I have everything I need, instead of loading the metadata and then making another request to load the data. This would be especially beneficial in the scenario where you don't know the metadata size but do know the file size: you make 1 request instead of potentially 3+.

@adriangb (Contributor, Author)

cc @tustvold

@tustvold (Contributor)

Would it be beneficial to have the right APIs to make it possible to pre-fetch the entire file? E.g. if I'm going to load a <1MB parquet file I might want to just make a single request to object storage and know I have everything I need instead of loading the metadata

In such a scenario you're best off just fetching the entire file and feeding the Bytes to the synchronous readers
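To illustrate why a single whole-file fetch suffices, here is a minimal, std-only Rust sketch (the real code would hand the fetched `Bytes` to the parquet crate's synchronous readers; the helper name `footer_metadata_range` and the synthetic buffer below are mine, not part of the crate). It relies on the Parquet format's trailer layout: the last 8 bytes of a file are a 4-byte little-endian footer-metadata length followed by the magic bytes `PAR1`, so once the whole file is in memory the metadata can be located with zero additional I/O.

```rust
use std::ops::Range;

/// Returns the byte range of the footer metadata within `file`, or None if
/// the buffer is too short or the trailing magic is wrong.
fn footer_metadata_range(file: &[u8]) -> Option<Range<usize>> {
    if file.len() < 8 || &file[file.len() - 4..] != b"PAR1" {
        return None;
    }
    let len_bytes: [u8; 4] = file[file.len() - 8..file.len() - 4].try_into().ok()?;
    let meta_len = u32::from_le_bytes(len_bytes) as usize;
    let end = file.len() - 8;
    let start = end.checked_sub(meta_len)?;
    Some(start..end)
}

fn main() {
    // Synthetic buffer, not a real Parquet file: 100 bytes standing in for
    // row group data, 16 bytes standing in for the footer metadata, then
    // the 8-byte trailer (length + magic).
    let mut file = vec![0u8; 100];
    file.extend_from_slice(&[1u8; 16]);
    file.extend_from_slice(&16u32.to_le_bytes());
    file.extend_from_slice(b"PAR1");

    assert_eq!(footer_metadata_range(&file), Some(100..116));
    assert_eq!(footer_metadata_range(b"PAR1"), None);
}
```

With the whole buffer in hand, everything else (metadata decode, page index, data pages) is a local slice, which is exactly the "one request, then synchronous decode" pattern suggested above.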

@tustvold (Contributor)

FYI @etseidl this looks to have been introduced by #6431

@adriangb (Contributor, Author)

In such a scenario you're best off just fetching the entire file and feeding the Bytes to the synchronous readers

I guess that makes sense yeah

@etseidl (Contributor) left a comment

Thanks @adriangb, guess I forgot to test some cases for the async side. I think we can fix this with a little less thrash, though. What do you think?

Comment on lines 412 to 426
let bytes = match &fetched {
Some((fetched_start, fetched)) if *fetched_start <= range.start => {
// `fetched` is an amount of data spanning from fetched_start to the end of the file.
// We want to slice out the range we need from that data, but need to adjust the
// range we are looking for to be relative to fetched_start.
let fetched_start = *fetched_start;
let range = range.start - fetched_start..range.end - fetched_start;
// sanity check: `fetched` should always go until the end of the file,
// so if our range is beyond that, something is wrong!
assert!(
range.end <= fetched_start + fetched.len(),
"range: {range:?}, fetched: {}, fetched_start: {fetched_start}",
fetched.len()
);
fetched.slice(range)
@etseidl (Contributor)
Suggested change
let bytes = match &fetched {
Some((fetched_start, fetched)) if *fetched_start <= range.start => {
// `fetched` is an amount of data spanning from fetched_start to the end of the file.
// We want to slice out the range we need from that data, but need to adjust the
// range we are looking for to be relative to fetched_start.
let fetched_start = *fetched_start;
let range = range.start - fetched_start..range.end - fetched_start;
// sanity check: `fetched` should always go until the end of the file,
// so if our range is beyond that, something is wrong!
assert!(
range.end <= fetched_start + fetched.len(),
"range: {range:?}, fetched: {}, fetched_start: {fetched_start}",
fetched.len()
);
fetched.slice(range)
let offset = range.start - *remainder_start;
let end = offset + range.end - range.start;
assert!(end <= remainder.len());
remainder.slice(offset..end)

Instead of all the other changes, I think this will properly compute the end of the slice.
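A minimal, std-only sketch of the arithmetic in this suggestion (the function name `slice_from_suffix` is hypothetical, and a plain byte slice stands in for the `Bytes` type the real code uses):

```rust
// `remainder` is a prefetched suffix of the file beginning at absolute
// offset `remainder_start`; `range` is an absolute byte range to extract.
fn slice_from_suffix(
    remainder: &[u8],
    remainder_start: usize,
    range: std::ops::Range<usize>,
) -> &[u8] {
    // Translate the absolute range into buffer-relative coordinates.
    let offset = range.start - remainder_start;
    let end = offset + range.end - range.start;
    // The suffix runs to end-of-file, so any in-bounds range must fit.
    assert!(end <= remainder.len());
    &remainder[offset..end]
}

fn main() {
    // A 1000-byte file of which the last 200 bytes (offsets 800..1000)
    // were prefetched.
    let remainder: Vec<u8> = (0..200u32).map(|i| i as u8).collect();
    // Absolute range 850..860 maps to relative 50..60.
    let out = slice_from_suffix(&remainder, 800, 850..860);
    assert_eq!(out, &remainder[50..60]);
}
```

The key point is that `end` is computed from `offset` plus the requested length, so it stays relative to the buffer rather than mixing absolute and relative coordinates.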

@adriangb (Contributor, Author)
done 😄

parquet/src/file/metadata/reader.rs (resolved)
@adriangb adriangb requested review from tustvold and etseidl January 20, 2025 22:06
@etseidl (Contributor) left a comment
Awesome! Thanks again @adriangb

@etseidl (Contributor) commented Jan 20, 2025

Looks like some whitespace is causing CI to fail.

@adriangb (Contributor, Author)

Thanks, fixed.

@alamb (Contributor) left a comment
Thank you @adriangb and @etseidl for the review. This looks great to me

@alamb (Contributor) commented Jan 22, 2025

In such a scenario you're best off just fetching the entire file and feeding the Bytes to the synchronous readers

I guess that makes sense yeah

Perhaps this is worth adding / stating explicitly in comments somewhere

@alamb alamb merged commit ffeda12 into apache:main Jan 22, 2025
16 checks passed