We use grenad in Meilisearch, and when we benchmark and flamegraph the indexing process, which uses this library intensively, we can see that 12-15% of the time is spent in the sort_unstable_by_key of the sorter. This sort is executed to ensure that the values are sorted when we need to dump all the in-memory entries into a writer.

According to the documentation, sorting the entries by using sort_unstable_by_key is O(m * n * log(n)), where m is the cost of the key function. As our key function, the function that computes our key, can be considered O(1), our current sorting solution should be O(n * log(n)). Inserting an entry in our current data structure is O(1), but dumping into the writer is not (see grenad/src/sorter.rs, lines 246 to 253 at dd79a33).
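For context, here is a minimal sketch of that dump path. It is a simplified stand-in, not grenad's actual code (grenad sorts entry bounds pointing into a shared buffer); the Entry struct is hypothetical:

```rust
struct Entry {
    key: Vec<u8>,
    value: Vec<u8>,
}

/// Simplified stand-in for the dump path: insertion is a plain O(1)
/// push, and the whole O(n * log(n)) cost is paid here, in one go,
/// right before writing everything out.
fn dump(mut entries: Vec<Entry>) -> Vec<Entry> {
    // Equivalent to sort_unstable_by_key with an O(1) key function:
    // the comparison only borrows the key bytes, it computes nothing.
    entries.sort_unstable_by(|a, b| a.key.cmp(&b.key));
    entries
}
```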
The sorter constraints

Some of the rules that apply to the sorter are:

- We never read entries before we need to dump them into a writer.
- We need to read the values in order only once, to merge them before dumping them into a writer.
- We need to support entries with the same key and merge them only once, at the end.
Those constraints made me think of a more straightforward implementation of a binary heap, where we never need to remove arbitrary values but only read them all back in order, i.e. the binary heap's into_sorted_iter.
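As a sketch of that read-only consumption: stable std does not expose into_sorted_iter, but repeatedly popping gives the same behaviour (largest-first for the default max-heap; the ascending, min-heap direction is addressed in the comments below):

```rust
use std::collections::BinaryHeap;

// Reading a heap in order without ever removing arbitrary values:
// each pop is O(log(n)) and yields the current maximum.
fn drain_in_order<T: Ord>(mut heap: BinaryHeap<T>) -> impl Iterator<Item = T> {
    std::iter::from_fn(move || heap.pop())
}
```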
Using a binary heap
According to the documentation of the standard binary heap, inserting a value is O(1) and popping a value is O(log(n)).

The only downside I see to using a heap is supporting entries with duplicate keys. One solution that keeps the algorithm simple on both sides, the binary heap insertion system and the current merging system, is to suffix every key with an incrementing big-endian number. This way, we keep both the order and the uniqueness of the entries. The downside of this solution is that it can increase memory usage; the impact depends on the workload and the size of the inserted keys.
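A sketch of that suffixing scheme, assuming a u64 counter (the counter width is my assumption, not something specified above):

```rust
/// Appends an incrementing big-endian counter to the key: big-endian
/// bytes compare like the integers they encode, so duplicate keys
/// become unique while keeping their insertion order.
fn suffixed_key(key: &[u8], counter: &mut u64) -> Vec<u8> {
    let mut out = Vec::with_capacity(key.len() + 8);
    out.extend_from_slice(key);
    out.extend_from_slice(&counter.to_be_bytes());
    *counter += 1;
    out
}
```

Since the suffix is fixed-width, the original key can be recovered at merge time by stripping the last 8 bytes; those 8 extra bytes per entry are the memory impact mentioned above.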
Adding benchmarks

We need to introduce more benchmarks to measure our progress on this subject and avoid regressions. To do so, we need to find suitable real-world datasets or find a way to generate them properly. Measuring is complex.
We had a conversation with @MarinPostma, and we thought that we could experiment with the standard binary heap and some RefCells: storing the EntryBoundAlignedBuffer in a RefCell wrapped in an Rc, and using this Rc<RefCell<EntryBoundAlignedBuffer>> in the EntryBounds to be able to implement PartialOrd/Eq and use them in the binary heap.
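A rough sketch of that experiment; only the names EntryBound and EntryBoundAlignedBuffer come from the discussion, the fields and the with_key helper are assumptions:

```rust
use std::cell::RefCell;
use std::cmp::Ordering;
use std::rc::Rc;

// Stand-in for the aligned buffer holding the raw entry bytes.
struct EntryBoundAlignedBuffer {
    bytes: Vec<u8>,
}

struct EntryBound {
    // Shared handle to the buffer so that every bound can reach
    // its own key bytes when two bounds are compared.
    buffer: Rc<RefCell<EntryBoundAlignedBuffer>>,
    key_offset: usize,
    key_len: usize,
}

impl EntryBound {
    fn with_key<T>(&self, f: impl FnOnce(&[u8]) -> T) -> T {
        let buffer = self.buffer.borrow();
        f(&buffer.bytes[self.key_offset..self.key_offset + self.key_len])
    }
}

impl Ord for EntryBound {
    fn cmp(&self, other: &Self) -> Ordering {
        // Two simultaneous shared borrows of the same RefCell are fine.
        self.with_key(|key| other.with_key(|other_key| key.cmp(other_key)))
    }
}

impl PartialOrd for EntryBound {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for EntryBound {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for EntryBound {}
```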
Regarding the possibility of duplicate keys, I proposed that we store the bounds_count number directly in the EntryBound struct to be able to differentiate them; this number always increments and is reset when the sorter is dumped into the writer. We must be careful to use Ordering::reverse, as the standard binary heap is a max-heap and we want a min-heap.
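Putting the two ideas together in miniature, with a simplified stand-in struct rather than the real EntryBound; the std::cmp::Reverse wrapper is one way to apply the reversed ordering mentioned above:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Derived Ord compares fields in declaration order: the key first,
// then bounds_count, which therefore only breaks ties between
// duplicate keys and preserves their insertion order.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Bound {
    key: Vec<u8>,
    bounds_count: u64, // always incrementing, reset on every dump
}

fn main() {
    // The standard BinaryHeap is a max-heap, so wrapping every entry
    // in Reverse makes pop return the smallest key first.
    let mut heap = BinaryHeap::new();
    heap.push(Reverse(Bound { key: b"b".to_vec(), bounds_count: 0 }));
    heap.push(Reverse(Bound { key: b"a".to_vec(), bounds_count: 1 }));
    heap.push(Reverse(Bound { key: b"a".to_vec(), bounds_count: 2 }));

    while let Some(Reverse(bound)) = heap.pop() {
        // Pops "a"/1, then "a"/2, then "b"/0: duplicates come out
        // adjacent and in insertion order, ready to be merged once.
        println!("{:?} {}", bound.key, bound.bounds_count);
    }
}
```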