
AVX2 optimization of group lookup. #167

Open · wants to merge 1 commit into base: master
Conversation

bashimao

Thanks for your help earlier today. We have been using your flat_hash_map in Merlin HugeCTR for quite some time now. I have implemented AVX2-accelerated groups for parallel-hashmap, which I hope you can absorb into the next release version. Looking forward to your review.

Test project /scratch/proj/parallel-hashmap/build
      Start  1: test_compressed_tuple
 1/18 Test  #1: test_compressed_tuple ...............   Passed    0.02 sec
      Start  2: test_container_memory
 2/18 Test  #2: test_container_memory ...............   Passed    0.02 sec
      Start  3: test_hash_policy_testing
 3/18 Test  #3: test_hash_policy_testing ............   Passed    0.01 sec
      Start  4: test_node_hash_policy
 4/18 Test  #4: test_node_hash_policy ...............   Passed    0.02 sec
      Start  5: test_raw_hash_set
 5/18 Test  #5: test_raw_hash_set ...................   Passed    2.22 sec
      Start  6: test_raw_hash_set_allocator
 6/18 Test  #6: test_raw_hash_set_allocator .........   Passed    0.02 sec
      Start  7: test_flat_hash_set
 7/18 Test  #7: test_flat_hash_set ..................   Passed    0.05 sec
      Start  8: test_flat_hash_map
 8/18 Test  #8: test_flat_hash_map ..................   Passed    0.42 sec
      Start  9: test_node_hash_map
 9/18 Test  #9: test_node_hash_map ..................   Passed    0.04 sec
      Start 10: test_node_hash_set
10/18 Test #10: test_node_hash_set ..................   Passed    0.05 sec
      Start 11: test_parallel_flat_hash_map
11/18 Test #11: test_parallel_flat_hash_map .........   Passed    0.47 sec
      Start 12: test_parallel_flat_hash_set
12/18 Test #12: test_parallel_flat_hash_set .........   Passed    0.06 sec
      Start 13: test_parallel_node_hash_map
13/18 Test #13: test_parallel_node_hash_map .........   Passed    0.46 sec
      Start 14: test_parallel_node_hash_set
14/18 Test #14: test_parallel_node_hash_set .........   Passed    0.05 sec
      Start 15: test_parallel_flat_hash_map_mutex
15/18 Test #15: test_parallel_flat_hash_map_mutex ...   Passed    0.47 sec
      Start 16: test_dump_load
16/18 Test #16: test_dump_load ......................   Passed    0.03 sec
      Start 17: test_erase_if
17/18 Test #17: test_erase_if .......................   Passed    0.02 sec
      Start 18: test_btree
18/18 Test #18: test_btree ..........................   Passed   69.74 sec

100% tests passed, 0 tests failed out of 18

To replicate, compile with `cmake -DPHMAP_BUILD_TESTS=ON -DPHMAP_BUILD_EXAMPLES=ON -DCMAKE_CXX_FLAGS=-mavx2 ..` and run on a machine with AVX2 support.

@greg7mdp
Owner

Wow, this is great, I really appreciate it @bashimao! I did not investigate the AVX2 code, but the change looks very safe as it is guarded by a compilation flag. Did you see better performance when using AVX2 versus SSE2?

@greg7mdp
Owner

@bashimao I'll check this evening after work how much this improves performance.

@greg7mdp
Owner

@bashimao I did some performance testing and unfortunately I cannot see any performance improvement with the avx2 implementation. Maybe it would show only when the hash map is close to full (just before resizing). I'll do some more testing, but I'd like to know what you saw in your own testing.

@bashimao
Author

bashimao commented Sep 18, 2022

Hi, sorry for the delay. I am currently travelling overseas. I get back to you later this week with some benchmark results.

@bashimao
Author

bashimao commented Sep 28, 2022

Well, that is now a big question, and very hard to answer. I admit I may have been chasing a ghost here. I had a use case where it was faster, but only marginally. And here you can see why:

[benchmark chart: random-insertion performance of the non-SIMD, SSE, and AVX2 group implementations at various max occupancies]

This is mostly your random insertion benchmark, but for reproducibility reasons it only benchmarks the non-parallel flat_map, so only one core is used. The benchmarks were run on an AMD EPYC 7742 64-core processor. All displayed results are the median of 5 runs, so the graphic is quite stable.

As can be seen, we already achieve quite decent performance without SSE. SSE is the quickest, but not always. AVX2 occasionally overtakes it, but only if you do not keep the max occupancy at 87.5%. The reason is that the AVX2 implementation reads 32 slots per group. If we limit occupancy to 7/8 (87.5%), each group we read contains 4 blanks on average, so the likelihood that we need to read another group is low. But the same is true with SSE, where we read 16 slots per group with an average of 2 blanks, so there is no real gain for the majority of reads. I think the number of bytes read doesn't really matter, because the cache line size of this CPU is 64 bytes anyway. However, some of the AVX2 instructions have a slightly higher latency than their SSE counterparts, which I assume is why it is slower; IMHO, that has to do with the larger register file in AVX2. Anyhow, the same AVX instructions will either be as fast as their SSE counterparts or a little slower.

As you can see, the AVX2 implementation has the potential to overtake the SSE version, but only if we reduce to an average of 3, or even 2, blanks per group. As the SSE implementation shows, having 2 blanks per group is simply enough; after all, we only need to find a single blank before we can stop searching.

However, as you may see, the side-by-side comparison breaks down there: I need to increase the max occupancy of the hash table to achieve a lower blanks-per-group ratio. In a way that is nice, because we make more efficient use of our memory cells with the AVX2 version. But since it also means that we expand the table at different times, the comparison no longer holds. At certain sizes the AVX2 implementation is certainly faster, but that is incidental, because the table has a different fill state at those times.

I think merging this is fine (with some minor adjustments), but I would by default still select SSE. It is at least always faster than the non-optimized version.

Two things I want to explore next:

  • This was measured on an AMD system. AVX on Intel CPUs is known to occasionally behave differently, so I should definitely check.
  • Given the current findings, I doubt that switching to AVX-512 would yield a much different outcome. However, while going through the AVX-512 instruction set, I found a couple of new instructions that provide backports for SSE and AVX registers. In particular, one of them would allow getting rid of the movemask instruction, which could increase the throughput of both the SSE and AVX implementations. With that, this merge request would at least add a concrete improvement for many users. However, I first need to find a CPU that supports the right AVX-512 instruction subset to test on. That may take a couple of days.

@bashimao
Author

bashimao commented Oct 6, 2022

Actually, this could be the root cause of the issue. With 64-byte cache lines, an unaligned 32-byte memory access will frequently hit 2 cache lines. With a 16-byte lookup like SSE, you may, more often than not, get by with reading only a single cache line. I need to investigate. Maybe it is possible to align the group lookup.

https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/

@greg7mdp
Owner

greg7mdp commented Oct 6, 2022

Hey @bashimao, thanks for following up on this, and for the hard work. I think the group lookup starts at the index determined by the hash, so I'm not sure how it can be aligned. And you are right: maybe the cost of accessing more memory on every lookup is higher than the benefit on the rare occasions where the match is not found in the first 16 slots. I'm really not very knowledgeable about SSE/AVX2, so I'm afraid I don't have any specific input, but kudos for the work and good luck finding a faster implementation.

@greg7mdp
Owner

greg7mdp commented Oct 6, 2022

Actually, you can probably align the lookup and discard the matches before the starting index, so you hit only one cache line.

@bashimao
Author

bashimao commented Oct 6, 2022

> following up with this and the hard work. I think the group lookup starts at the index determined by the hash, so I'm not sure how it can be aligned, and you are right th

Thanks for sharing your concern. It turned out to be not that tricky, and I managed to implement a memory-aligned version. Timings didn't change on the AMD EPYC CPU. Assuming I did it correctly, that could suggest that cache-line access doesn't matter here. I will try to compile and run on an Intel machine (with considerably less cache) next.

In any case, I will learn something in the process.

My private laptop ("Intel Tigerlake"), curiously, also supports AVX-512 plus the backported instructions. I will need to set up an environment though, so don't expect progress overnight.

@greg7mdp
Owner

greg7mdp commented Oct 6, 2022

Good luck!
