performance boost related to memory allocation #85

digizeph · 2023-02-23T21:57:17Z

Originally posted by @jmeggitt in #81 (comment)

I thought it might be interesting to do a profile to see which parts actually have the largest impacts on performance.

The setup was fairly simple. I just wrote a simple test program which parsed the first 5 million entries from a table dump then exited. This was then compiled in release mode with debug symbols using bgpkit-parser 6055612. I used Intel VTune to perform the profile and it gave me the following results.

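use std::hint::black_box;
use std::time::Instant;

use bgpkit_parser::BgpkitParser;

fn main() {
    let start_time = Instant::now();
    let parser = BgpkitParser::new("C:\\Users\\Jasper\\Downloads\\bview.20220911.0800.gz").unwrap();
    let mut count = 0;
    for elem in parser {
        // black_box keeps the optimizer from discarding the parsed element.
        black_box(elem);
        count += 1;
        // Stop after the first 5 million entries.
        if count == 5000000 {
            break;
        }
    }
    println!("Elapsed: {:?}", start_time.elapsed());
}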

Here was the result of that run. I included the image for context, but much of it is unreadable without clicking the various segments.

Here are a couple of the parts I found interesting:

Elementor::record_to_elems took up 11.9% of the total CPU time, but the vast majority (67.7%) of that time was spent waiting on the system allocator. From a quick glance, all of these cases involved using Vec.
The function that took the most CPU time (42.0%) was ReadUtils::read_nlri_prefix. This is not that surprising given the type of file being parsed, but it looks like there are a number of ways that this could be improved.
26.1% of the entire application runtime was spent allocating and freeing memory.

Because profiling only a table dump leads to somewhat biased results, I also ran it again on one of the largest updates files I could find for rrc15 (updates.20230124.0750.gz, 31MB). The test code was exactly the same except for switching out the file path.

In this case, the majority (59.1%) of the CPU time was spent allocating and freeing memory using the system allocator. This is a bit alarming since it means more time was spent waiting on allocations than actually performing any meaningful processing. An additional 7.8% of the CPU time was spent in memcpy. It is a bit harder to tell if memcpy is being overused, but roughly a third of that seems to involve values being cloned in bgp_update_to_elems.
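As a rough cross-check on numbers like these, allocation counts can be measured without a profiler by wrapping the system allocator. The following is a minimal sketch using only the standard library; `CountingAlloc`, `count_allocations`, `fresh_per_record`, and `reused_buffer` are hypothetical names for illustration and are not part of bgpkit-parser. It compares building a fresh Vec per record against reusing one buffer.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper around the system allocator that counts allocations.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Returns how many allocations were recorded while `f` ran.
fn count_allocations<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    f();
    ALLOCATIONS.load(Ordering::Relaxed) - before
}

/// Allocates a new Vec for every simulated record.
fn fresh_per_record(records: u32) {
    for i in 0..records {
        let mut v = Vec::new();
        v.push(i);
        black_box(&v);
    }
}

/// Reuses a single buffer across all simulated records.
fn reused_buffer(records: u32) {
    let mut v = Vec::with_capacity(1);
    for i in 0..records {
        v.clear(); // keeps the capacity, so no new allocation is needed
        v.push(i);
        black_box(&v);
    }
}

fn main() {
    let fresh = count_allocations(|| fresh_per_record(1000));
    let reused = count_allocations(|| reused_buffer(1000));
    println!("fresh: {fresh}, reused: {reused}");
}
```

The counter is global, so allocations made by other threads during the measurement are counted as well. Treat the numbers as an approximation, not a replacement for the VTune results.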

An easy way to get a sizable performance boost might be to use a crate like smallvec, tinyvec, or arrayvec. With some slight variations, they all provide Vec-like data structures that reserve a certain amount of space on the stack before allocating space on the heap. This could have a massive impact on performance for cases where you need the flexibility of a Vec, but know that in most cases it will only hold a small number of elements. In fact, if you enable the union feature for smallvec, it can reuse the space a Vec would normally use for the base pointer and capacity to store values inline instead. This means that if the items kept on the stack before moving to the heap fit within 2 machine words (16 bytes on x64), then it will be the exact same size as a Vec would be, minus the heap allocation.
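To illustrate the general idea of inline-first storage, here is a minimal sketch in plain Rust. `InlineVec` is a hypothetical type written for this example; it is not the API or the layout of smallvec, tinyvec, or arrayvec, and it is restricted to `Copy + Default` element types to stay in safe code. It keeps up to `N` elements in an inline array and only moves to a heap-backed Vec when that space runs out.

```rust
/// Hypothetical inline-first vector: up to N elements live inline,
/// and the contents spill to a heap-allocated Vec only when N is exceeded.
#[derive(Debug)]
enum InlineVec<T: Copy + Default, const N: usize> {
    Inline { len: usize, buf: [T; N] },
    Heap(Vec<T>),
}

impl<T: Copy + Default, const N: usize> InlineVec<T, N> {
    fn new() -> Self {
        InlineVec::Inline { len: 0, buf: [T::default(); N] }
    }

    fn push(&mut self, value: T) {
        if let InlineVec::Inline { len, buf } = self {
            if *len < N {
                // Fast path: no allocator call at all.
                buf[*len] = value;
                *len += 1;
                return;
            }
            // Inline space is full: copy the elements to the heap once.
            let mut spilled = Vec::with_capacity(N * 2 + 1);
            spilled.extend_from_slice(&buf[..]);
            *self = InlineVec::Heap(spilled);
        }
        if let InlineVec::Heap(v) = self {
            v.push(value);
        }
    }

    fn as_slice(&self) -> &[T] {
        match self {
            InlineVec::Inline { len, buf } => &buf[..*len],
            InlineVec::Heap(v) => v.as_slice(),
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, InlineVec::Inline { .. })
    }
}

fn main() {
    // Example values only: a short list of numbers that usually stays small.
    let mut path: InlineVec<u32, 4> = InlineVec::new();
    for asn in 64512u32..64515 {
        path.push(asn);
    }
    assert!(path.is_inline()); // 3 elements fit in the inline buffer

    path.push(64515);
    path.push(64516);
    assert!(!path.is_inline()); // the fifth element forced a move to the heap
    println!("{:?}", path.as_slice());
}
```

The real crates handle arbitrary element types and use more compact layouts, so in practice one of them is likely a better choice than a hand-rolled type like this.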

digizeph added this to the V0.10 milestone Feb 23, 2023

digizeph added the enhancement New feature or request label Mar 9, 2023

jmeggitt mentioned this issue Sep 12, 2023

Performance Improvements #125

Draft

digizeph removed this from the V0.10 milestone Dec 23, 2023
