Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MMap support for IPC files #6709

Closed
fabianmurariu opened this issue Nov 9, 2024 · 13 comments · Fixed by #6986
Closed

MMap support for IPC files #6709

fabianmurariu opened this issue Nov 9, 2024 · 13 comments · Fixed by #6986
Labels
documentation Improvements or additions to documentation enhancement Any new improvement worthy of a entry in the changelog

Comments

@fabianmurariu
Copy link
Contributor

fabianmurariu commented Nov 9, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We are using polars-arrow and would prefer to move to arrow-rs as we are looking into using Datafusion as well

Describe the solution you'd like
We are dependent on mmap support for IPC format

Describe alternatives you've considered
keep using polars-arrow

Additional context
It looks like Buffer and Bytes have the right functions and support for Custom dealocation

@fabianmurariu fabianmurariu added the enhancement Any new improvement worthy of a entry in the changelog label Nov 9, 2024
@tustvold
Copy link
Contributor

The docs could definitely be improved, but the way to do this is to use FileDecoder with a buffer constructed from your mmap file - https://docs.rs/arrow-ipc/latest/arrow_ipc/reader/struct.FileDecoder.html

@fabianmurariu
Copy link
Contributor Author

I'll have a look these days, if it works out I'll raise a PR

@totoroyyb
Copy link

Hi, I also have the need for reading the IPC files via mmap for a low-latency workload. Thanks @tustvold for redirecting the document. The approach of using FileDecoder indeed work for the reading purpose.

However, looks like the Buffer is rathor undesired for zero-copy (not entirely true). The documented approach does incur one extra copy due to the Buffer allocation. I feel like many people who are looking into mmap capability, may want to avoid such copy. Here is one approach (assume memmap2 is used):

let file = std::fs::File::open("...").unwrap();
let mmap = unsafe { Mmap::map(&file).unwrap() };
let buffer = unsafe {
   Buffer::from_custom_allocation(
      NonNull::new_unchecked(mmap.as_ptr() as *mut u8),
      mmap.len(),
      Arc::new(mmap),
   )
};

// ... use the buffer as suggested in the document...

Nevertheless, may I still request a more structured way (like implemented in c++ library) of reading IPC file via mmap, which can be more ergnomic to use?

@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

Nevertheless, may I still request a more structured way (like implemented in c++ library) of reading IPC file via mmap, which can be more ergnomic to use?

@totoroyyb sounds like a good idea to figure out a better way to do this

I noticed that Mmap implements as_ref() for [u8]

https://docs.rs/memmap/latest/memmap/struct.Mmap.html#impl-AsRef%3C%5Bu8%5D%3E-for-Mmap

So i think that means you can make a cursor over it like

let cursor = std::io::Cursor::new(mmap.as_ref());

And then you can use it with a FileReader

@tustvold
Copy link
Contributor

tustvold commented Jan 14, 2025

So i think that means you can make a cursor over it like

This would largely defeat the purpose of using mmap, as it will perform a copy.

You want to use FileDecoder as documented here with a Buffer created from the mmap region. This can be done using the recently added Bytes::from_owner and then converting this into a Buffer

@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

This would largely defeat the purpose of using mmap, as it will perform a copy.

This code compiles and I don't think it makes a copy 🤔 Maybe I am missing something

    let file = std::fs::File::open("/tmp/foo.txt").unwrap();
    let mmap = unsafe { Mmap::map(&file).unwrap() };
    let mut cursor = std::io::Cursor::new(&mmap[..]);
    let mut reader = arrow::ipc::reader::FileReader::try_new(&mut cursor, None).unwrap();
    for b in reader.next().unwrap() {
        println!("{:?}", b);
    }

@tustvold
Copy link
Contributor

tustvold commented Jan 14, 2025

This code compiles and I don't think it makes a copy 🤔 Maybe I am missing something

It copies data from the mmap region into the Buffer for the arrays, the FileDecoder avoids this, creating arrays directly by slicing the mmap region.

@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

I verified that this does indeed compile:

    let file = std::fs::File::open("/tmp/foo.txt").unwrap();
    let mmap = unsafe { Mmap::map(&file).unwrap() };
    let bytes = Bytes::from_owner(mmap);
    let buffer = Buffer::from_bytes(bytes.into());

(and then you can use the example @tustvold mentions on FileDecoder)

For anyone following along this means the underlying arrow arrays will then (re)use the mmap region

I agree an example would make this much easier to understand.

@alamb alamb added the documentation Improvements or additions to documentation label Jan 14, 2025
@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

FYI @andygrove this could be another source of improvement for comet -- avoid a copy of the spill files

@tustvold
Copy link
Contributor

I agree an example would make this much easier to understand.

I think adding Buffer::from_owner that uses Bytes::from_owner under the hood would likely be a good API addition, and avoid people needing to deal with Buffer::from_custom_allocation.

@andygrove
Copy link
Member

This is very interesting. Thanks the the ping @alamb

@alamb
Copy link
Contributor

alamb commented Jan 15, 2025

I created a PR with a proposed example here:

@alamb
Copy link
Contributor

alamb commented Jan 21, 2025

I think making Buffer::from_owned is a nice to have -- it would be great if someone else wanted to make a PR to do so

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants