Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The sorter should avoid sorting and use a smarter inserting algorithm #31

Open
Kerollmops opened this issue Mar 3, 2022 · 1 comment

Comments

@Kerollmops
Copy link
Member

Kerollmops commented Mar 3, 2022

We use grenad in Meilisearch, and when we benchmark and flamegraph the indexing process, which is intensively utilizing this library, we can see that 12-15% of the time is passed in the sort_unstable_by_key of the sorter. This sort is executed to ensure that the values are when we need to dump all the in-memory entries into a writer.

According to the documentation, sorting the entries by using sort_unstable_by_key is O(m * n * log(n)) where m is the key function. As our key function, the function that computes our key, can be considered O(1), our current sorting solution should be O(n * log(n)).

Inserting an entry in our current data structure is O(1) but not when we need to dump into the writer.

grenad/src/sorter.rs

Lines 246 to 253 in dd79a33

/// Sorts the entry bounds by the entries keys, after a sort
/// the `iter` method will yield the entries sorted.
pub fn sort_unstable_by_key(&mut self) {
let bounds_end = self.bounds_count * size_of::<EntryBound>();
let (bounds, tail) = self.buffer.split_at_mut(bounds_end);
let bounds = cast_slice_mut::<_, EntryBound>(bounds);
bounds.sort_unstable_by_key(|b| &tail[tail.len() - b.key_start..][..b.key_length as usize]);
}

The sorter constraints

Some of the rules that apply to the sorter are:

  1. We never read entries before the need to dump the entries into a writer.
  2. We need to read the values in order only once to merge them before dumping them into a writer.
  3. We need to support entries with the same key and only merge it once at the end.

Those constraints made me think of a more straightforward implementation of a binary heap, where we do not need to implement removing values but only read them in order, i.e. the binary heap's into_sorted_iter.

Using a binary heap

The only downside that I see about using a heap is about supporting entries with duplicate keys. One solution I see to keep the algorithm simple on both sides, the binary heap inserting system and the current merging system, is to postfix every key by an incrementing big-endian number. This way, we can keep the order and uniqueness of the entries. The downside of this solution is it can impact memory usage. The impact depends on the workload and the size of the inserted keys.

According to the documentation of the standard binary heap, inserting a value is O(1), and poping values is O(log(n)).

Adding benchmarks

We need to introduce more benchmarks to measure our progress on this subject and avoid regressions. To do so, we need to find suitable real-world datasets or find a way to generate them properly. Measuring is complex.

@Kerollmops
Copy link
Member Author

We had a conversation with @MarinPostma and we thought that we could experiment something with the standard binary heap and some RefCells. Storing the EntryBoundAlignedBuffer in a RefCell and an Rc. Using this Rc<RefCell<EntryBoundAlignedBuffer>> in the EntryBounds to be able to implement PartialOrd/Eq and use these in the binary heap.

Regarding the possibility of duplicate keys, I proposed that we store the bounds_count number directly into the EntryBound struct to be able to differentiate them, this number is always incrementing and reset when the sorter is dumped into the writer. We must be careful to Ordering::reverse as the binary heap is a max-heap and we want a min-heap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant