Add an inline version of bgzf_read for small reads. #1772
Conversation
Force-pushed 14d0582 to eca5c35
/usr/bin/time was reporting ~810% CPU utilisation for develop and ~940% for this PR, demonstrating the reduced time spent in the main thread.

Commenting out the sanity checks in …

This is why multi-threaded decoding of SAM and SAM.gz is sometimes more performant than BAM, despite the string-to-integer conversion overheads: SAM parsing is entirely threadable.
Force-pushed 27b2c5a to cda6d59
sam.c (outdated):
    for (i = 0; i < 8; ++i) ed_swap_4p(x + i);
if (fp->block_length - fp->block_offset > 32) {
    // Avoid bgzf_read and a temporary copy to a local buffer
    uint8_t *x = (uint8_t *)fp->uncompressed_block + fp->block_offset;
This shadows the earlier x, and also adds an almost-duplicate block to fill out the BAM structure. A way to avoid both problems might be something like this:
uint8_t tmp[32], *x;
uint32_t new_l_data;
// ...
if (fp->block_length - fp->block_offset > 32) {
    x = (uint8_t *)fp->uncompressed_block + fp->block_offset;
    fp->block_offset += 32;
} else {
    x = tmp;
    if (bgzf_read(fp, x, 32) != 32) return -3;
}
c->tid = le_to_u32(x);
c->pos = le_to_i32(x+4);
// ... etc.
I see what you mean by duplicate now. The code is completely different in style and implementation, but the end result is the same thing. I've taken your suggestion and refactored it.
sam.c (outdated):
uint32_t x2 = le_to_u32(x+8);
c->bin = x2>>16;
c->qual = x2>>8&0xff;
c->l_qname = x2&0xff;
The intermediate variable could be avoided by:
c->l_qname = le_to_u8(x + 8);
c->qual = le_to_u8(x + 9);
c->bin = le_to_u16(x + 10);
I'm not sure if it would be any quicker though...
I was mainly mirroring what it did before, but yes it may make things better (or worse). I'll experiment.
sam.c (outdated):
uint32_t x3 = le_to_u32(x+12);
c->flag = x3>>16;
c->n_cigar = x3&0xffff;
Similarly, could be:
c->n_cigar = le_to_u16(x + 12);
c->flag = le_to_u16(x + 14);
sam.c (outdated):
@@ -759,6 +774,7 @@ static int fixup_missing_qname_nul(bam1_t *b) {
 * Note a second interface that returns a bam pointer instead would avoid bam_copy1
 * in multi-threaded handling. This may be worth considering for htslib2.
 */
#define bgzf_read bgzf_read_small
I think it would be less obscure to replace the individual bgzf_read() calls, as there aren't many of them. Also bgzf_read(fp, x, 32) should remain as-is, as the amount of data available would be known to be too small in that case.
You could also merge the bgzf_read(fp, &block_len, 4) call into the one for the rest of the core header. That would eliminate an entire bgzf_read call, and I don't really see a good reason why it wouldn't work as long as we were careful about checking error cases and returning the right values.
I'd previously tested merging the 4-byte and 32-byte reads together and concluded it made no difference in speed once the inline version was in (it did before that), and it meant more changes, which would have made reviewing more work. I can try it again if you want, but it's more changes to evaluate.
Also, I don't get the comment about bgzf_read(fp, x, 32) remaining as-is. Initially it wasn't, because it's small enough to be inlined and a fixed 32-byte read may still allow the memcpy to be optimised away (I don't know), and because we weren't yet subverting things by copying out the struct directly. Since then, however, it may not matter much either way, as that code path will be very rare. I think it's probably an irrelevance at best.
You also have to think about what's actually happening here. Why is it that bgzf_read(fp, x, 4), for example, is faster with the bgzf_read_fast inline? Some of it is the removal of a function call, but I think a lot of it is that when we inline a memcpy of a constant 4 bytes it becomes a straight 32-bit load into a register. When we call the function, the length is a variable, so it has to call memcpy directly, which contains a lot of code for dealing with arbitrary lengths and is probably optimised for long copies more than short ones.
So anywhere we have a bgzf_read with a small-ish constant size, it's generally good practice to call the inline variant so the compiler can optimise away the memcpy.
The optimisation can never work in the specific case of bgzf_read_small(fp, x, 32) due to the earlier line:

if (fp->block_length - fp->block_offset > 32) {

i.e. the same optimisation (plus a zero-copy one) happens elsewhere.
Ah yes, I see what you mean. It makes no difference to CPU from what I can see (maybe the compiler detects it as an impossibility and removes the code?), but yes, it's pointless to go via the static inline.
Now that we have the same code for both paths in parsing the x[] array/pointer, there's not as much gain to be had from the fp->block_length - fp->block_offset > 32 check anyway (if we kept the "small" variant); however, it does still seem to be about 5% faster on datasets with small reads, so we may as well keep it.
The bgzf_read function is long and not something that's likely to get inlined even if it was in the header. However, most of the time our calls request a small amount of data that fits within the buffer we've already read, so we offer a static inline to do the memcpy when we can, falling back to the long function when we cannot.

In terms of CPU time it's not much difference, but the key thing is that it's often CPU time saved in the main thread, given the bulk of the decode is often threaded. An example of test_view -B -@16:

develop:
real 0m48.158s  user 6m2.901s  sys 0m28.134s
real 0m48.730s  user 6m3.707s  sys 0m28.473s
real 0m48.653s  user 6m5.215s  sys 0m28.637s

This PR:
real 0m41.731s  user 5m59.780s  sys 0m30.393s
real 0m41.945s  user 6m0.367s  sys 0m30.426s

So we can see it's a consistent win when threading, potentially 10-15% faster throughput depending on work loads.
When running with a high thread count, the decompression stage in bgzf_read is often the only threaded part, meaning whatever is left in the main thread can become the bottleneck once we have a sufficient number of threads running. Hence speeding up anything in bam_read1 is key.

- sam_realloc_bam_data allocates an extra 32 bytes. This may not seem much, especially after rounding up to a power of 2. However, in tests it makes a significant reduction in memory copies (and also, strangely, memory size). Tested with both GNU malloc and tcmalloc.
- bam_tag2cigar sped up by reordering the checks and simplifying the expression that looks for the necessary CIGAR field.
- Avoid a bgzf read and copying from the bgzf buffer to the temporary x[] when we know we can copy direct. This subverts the bgzf interface, but it's internal code.

Some benchmarks of test_view -B -@16 in.bam:

develop:     9.73 sec
prev commit: 8.32 sec
this commit: 7.81 sec

Combined this is ~25% speed up.
As we did with bgzf_read_small, shortcutting the big bgzf_write function for the common case of a small write that fits into the buffer can help reduce the pressure on the main thread. Benchmarks with test_view:

Previous commit:
for i in `seq 1 3`;do taskset -c 0-31 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./test/test_view -@32 -b ~/scratch/data/novaseq.10m.bam -p /tmp/_.bam ;done;md5sum /tmp/_.bam
89.81 user  2.22 system  4.11 elapsed  2237% %CPU
89.57 user  2.43 system  4.20 elapsed  2189% %CPU
88.44 user  2.30 system  3.96 elapsed  2291% %CPU
bc9ca86ebef3b6669fc7b6fdd7e1acb6  /tmp/_.bam

This commit:
for i in `seq 1 3`;do taskset -c 0-31 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./test/test_view -@32 -b ~/scratch/data/novaseq.10m.bam -p /tmp/_.bam ;done;md5sum /tmp/_.bam
86.45 user  1.91 system  3.49 elapsed  2531% %CPU
86.28 user  1.84 system  3.43 elapsed  2562% %CPU
86.81 user  2.19 system  3.54 elapsed  2509% %CPU
bc9ca86ebef3b6669fc7b6fdd7e1acb6  /tmp/_.bam

So that's about 14% faster throughput. It harms some other places, so this isn't a blanket bgzf_write to bgzf_write_small define. Also, following the observation above, similarly restricted bgzf_read_small to bam_read1. The indirection via the small function harms big reads, which affects SAM reading. Benchmarks on ./test/test_view -@32 -b /tmp/_.sam.gz -p /tmp/_.bam show this speeds up from 3.7s elapsed to 3.4s elapsed. Small, but consistent.
HTSlib starts a new block if an alignment is likely to overflow the current one, so for its own data this will only happen for records longer than 64kbytes. As other implementations may not do this, check that reading works correctly on some BAM files where records have been deliberately split between BGZF blocks. Additionally, check the writing side by making a record with enough CIGAR entries to make it split into multiple BGZF blocks.
Added a couple of extra tests...