Add an inline version of bgzf_read for small reads. #1772
Conversation
Force-pushed 14d0582 to eca5c35
/usr/bin/time was reporting ~810% CPU utilisation for develop and ~940% for this PR, demonstrating the reduced time spent in the main thread.

Commenting out the sanity checks in …

This is why multi-threaded decoding of SAM and SAM.gz is sometimes more performant than BAM, despite the string-to-integer conversion overheads: SAM parsing is entirely threadable.
Force-pushed 27b2c5a to cda6d59
sam.c (outdated):
    for (i = 0; i < 8; ++i) ed_swap_4p(x + i);
if (fp->block_length - fp->block_offset > 32) {
    // Avoid bgzf_read and a temporary copy to a local buffer
    uint8_t *x = (uint8_t *)fp->uncompressed_block + fp->block_offset;
This shadows the earlier x, and also adds an almost-duplicate block to fill out the BAM structure. A way to avoid both problems might be something like this:
uint8_t tmp[32], *x;
uint32_t new_l_data;
// ...
if (fp->block_length - fp->block_offset > 32) {
    x = (uint8_t *)fp->uncompressed_block + fp->block_offset;
    fp->block_offset += 32;
} else {
    x = tmp;
    if (bgzf_read(fp, x, 32) != 32) return -3;
}
c->tid = le_to_u32(x);
c->pos = le_to_i32(x+4);
// ... etc.
I see what you mean by duplicate now. The code is completely different in style and implementation, but the end result is the same thing. I've taken your suggestion and refactored it.
sam.c (outdated):
uint32_t x2 = le_to_u32(x+8);
c->bin = x2>>16;
c->qual = x2>>8&0xff;
c->l_qname = x2&0xff;
The intermediate variable could be avoided by:
c->l_qname = le_to_u8(x + 8);
c->qual = le_to_u8(x + 9);
c->bin = le_to_u16(x + 10);
I'm not sure if it would be any quicker though...
I was mainly mirroring what it did before, but yes it may make things better (or worse). I'll experiment.
sam.c (outdated):
uint32_t x3 = le_to_u32(x+12);
c->flag = x3>>16;
c->n_cigar = x3&0xffff;
Similarly, could be:
c->n_cigar = le_to_u16(x + 12);
c->flag = le_to_u16(x + 14);
sam.c (outdated):
@@ -759,6 +774,7 @@ static int fixup_missing_qname_nul(bam1_t *b) {
 * Note a second interface that returns a bam pointer instead would avoid bam_copy1
 * in multi-threaded handling. This may be worth considering for htslib2.
 */
#define bgzf_read bgzf_read_small
I think it would be less obscure to replace the individual bgzf_read() calls, as there aren't many of them. Also bgzf_read(fp, x, 32) should remain as-is, as the amount of data available would be known to be too small in that case.
You could also merge the bgzf_read(fp, &block_len, 4) call into the one for the rest of the core header. That would eliminate an entire bgzf_read call, and I don't really see a good reason why it wouldn't work as long as we were careful about checking error cases and returning the right values.
I'd previously tested merging the 4-byte and 32-byte reads together and concluded it made no difference in speed once the inline version was in (it did before that), and it meant more changes, which would have made reviewing more work. I can try it again if you want, but it's more changes to evaluate.
Also, I don't get the comment about bgzf_read(fp, x, 32) remaining as-is. Initially it wasn't, because it's small enough to be inlined and a fixed 32-byte read may still allow the memcpy to be optimised away (I don't know), and because we weren't yet subverting things by copying out the struct directly. Since then, however, it may not matter much either way, as that code path will be very rare. I think it's probably an irrelevance at best.
You also have to think about what's actually happening here. Why is it that bgzf_read(fp, x, 4), for example, is faster with the bgzf_read_fast inline? Some of it is the removal of a function call, but I think a lot of it is that when we inline a memcpy of a constant 4 bytes it becomes a straight 32-bit load into a register. When we call the function, the length is a variable, so it has to call memcpy directly, which contains a lot of code for dealing with arbitrary lengths and is probably optimised for long copies more than short ones.
So anywhere we have a bgzf_read with a small-ish constant size, it's generally good practice to call the inline variant so the compiler can optimise away the memcpy.
The optimisation can never work in the specific case of bgzf_read_small(fp, x, 32) due to the earlier line:

if (fp->block_length - fp->block_offset > 32) {

i.e. the same optimisation (plus a zero-copy one) happens elsewhere.
Ah yes, I see what you mean. It makes no difference to CPU from what I can see (maybe the compiler detects it as an impossibility and removes the code?), but yes, it's pointless to go via the static inline.
Now that we have the same code for both paths in parsing the x[] array/pointer, there's not as much gain to be had from the fp->block_length - fp->block_offset > 32 check anyway (if we kept the "small" variant); however, it does still seem to be about 5% faster on datasets with small reads, so we may as well keep it.
The bgzf_read function is long and not something that's likely to get inlined even if it was in the header. However, most of the time our calls request a small amount of data that fits within the buffer we've already read, so we offer a static inline to do the memcpy when we can, falling back to the long function when we cannot.

In terms of CPU time it's not much difference, but the key thing is that it's often CPU time saved in the main thread, given the bulk of the decode is often threaded. An example of test_view -B -@16:

develop:
real 0m48.158s  user 6m2.901s  sys 0m28.134s
real 0m48.730s  user 6m3.707s  sys 0m28.473s
real 0m48.653s  user 6m5.215s  sys 0m28.637s

This PR:
real 0m41.731s  user 5m59.780s  sys 0m30.393s
real 0m41.945s  user 6m0.367s  sys 0m30.426s

So we can see it's a consistent win when threading, potentially 10-15% faster throughput depending on work loads.
When running with a high thread count, the decompression stage in bgzf_read is often the only threaded part, meaning whatever is left in the main thread can become the bottleneck once we have a sufficient number of threads running. Hence speeding up anything in bam_read1 is key.

- sam_realloc_bam_data allocates an extra 32 bytes. This may not seem much, especially after rounding up to a power of 2. However, in tests it makes a significant reduction in memory copies (and also, strangely, memory size). Tested with both GNU malloc and tcmalloc.
- bam_tag2cigar sped up by reordering the checks and simplifying the expression that looks for the necessary CIGAR field.
- Avoid a bgzf read and copying from the bgzf buffer to the temporary x[] when we know we can copy direct. This subverts the bgzf interface, but it's internal code.

Some benchmarks of test_view -B -@16 in.bam:

develop:     9.73 sec
prev commit: 8.32 sec
this commit: 7.81 sec

Combined this is ~25% speed up.
As we did with bgzf_read_small, shortcutting the big bgzf_write function for the common case of a small write that fits into the buffer can help reduce the pressure on the main thread. Benchmarks with test_view:

Previous commit:
for i in `seq 1 3`;do taskset -c 0-31 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./test/test_view -@32 -b ~/scratch/data/novaseq.10m.bam -p /tmp/_.bam ;done;md5sum /tmp/_.bam
89.81 user  2.22 system  4.11 elapsed  2237% %CPU
89.57 user  2.43 system  4.20 elapsed  2189% %CPU
88.44 user  2.30 system  3.96 elapsed  2291% %CPU
bc9ca86ebef3b6669fc7b6fdd7e1acb6  /tmp/_.bam

This commit:
for i in `seq 1 3`;do taskset -c 0-31 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./test/test_view -@32 -b ~/scratch/data/novaseq.10m.bam -p /tmp/_.bam ;done;md5sum /tmp/_.bam
86.45 user  1.91 system  3.49 elapsed  2531% %CPU
86.28 user  1.84 system  3.43 elapsed  2562% %CPU
86.81 user  2.19 system  3.54 elapsed  2509% %CPU
bc9ca86ebef3b6669fc7b6fdd7e1acb6  /tmp/_.bam

So that's about 14% faster throughput. It harms some other places, so this isn't a blanket bgzf_write to bgzf_write_small define. Also, following the observation above, similarly restricted bgzf_read_small to bam_read1. The indirection via the small function harms big reads, which affects SAM reading. Benchmarks on ./test/test_view -@32 -b /tmp/_.sam.gz -p /tmp/_.bam show this speeds up from 3.7s elapsed to 3.4s elapsed. Small, but consistent.
HTSlib starts a new block if an alignment is likely to overflow the current one, so for its own data this will only happen for records longer than 64kbytes. As other implementations may not do this, check that reading works correctly on some BAM files where records have been deliberately split between BGZF blocks. Additionally, check the writing side by making a record with enough CIGAR entries to make it split into multiple BGZF blocks.
Added a couple of extra tests...