Add an inline version of bgzf_read for small reads. #1772

Merged: 5 commits into samtools:develop from bgzf_inline, Jun 4, 2024

Conversation

jkbonfield (Contributor):

The bgzf_read function is long and not something that's likely to get inlined even if it were in the header.

However, most of the time our calls request a small amount of data that fits within the buffer we've already read, so we offer a static inline that does the memcpy when it can, falling back to the long function when it cannot.
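In shape, the inline variant looks something like the following sketch (the exact committed code may differ; the BGZF fields used here are the same ones that appear in the review comments below):

    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>     // ssize_t
    #include "htslib/bgzf.h"   // BGZF, bgzf_read

    // Sketch: serve the request straight from the already-decompressed
    // block when it fits; otherwise fall back to the full bgzf_read.
    static inline ssize_t bgzf_read_small(BGZF *fp, void *data, size_t length)
    {
        if (fp->block_offset + length <= fp->block_length) {
            memcpy(data,
                   (uint8_t *)fp->uncompressed_block + fp->block_offset,
                   length);
            fp->block_offset += length;
            return length;
        }
        return bgzf_read(fp, data, length); // crosses a block boundary
    }

With a small compile-time-constant length, the compiler can then turn the memcpy into a handful of loads and stores.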

In terms of total CPU time there's not much difference, but the key thing is that it's often CPU time saved in the main thread, given the bulk of the decode is often threaded. An example of test_view -B -@16:

develop:

real    0m48.158s
user    6m2.901s
sys     0m28.134s

real    0m48.730s
user    6m3.707s
sys     0m28.473s

real    0m48.653s
user    6m5.215s
sys     0m28.637s

This PR:

real    0m41.731s
user    5m59.780s
sys     0m30.393s

real    0m41.945s
user    6m0.367s
sys     0m30.426s

So we can see it's a consistent win when threading, potentially 10-15% faster throughput depending on workloads.

@jkbonfield force-pushed the bgzf_inline branch 2 times, most recently from 14d0582 to eca5c35 on April 19, 2024 14:41
@jkbonfield (Contributor, Author):

/usr/bin/time was reporting ~810% CPU utilisation for develop and ~940% for this PR, demonstrating the reduced time spent in the main thread.

Commenting out the sanity checks in bam_read1 from if (c->n_cigar > 0) onwards increases that CPU utilisation to 1060% (1120% if I also get rid of the tag2cigar call), leading to 37s real time. This shows there is room for improvement, as these sorts of checks ought to be executed in the multi-threaded decode instead; we thread the bgzf_read and not the bam_read.

This is why multi-threaded decoding of SAM and SAM.gz is sometimes more performant than BAM, despite the string->integer conversion overheads, as SAM parsing is entirely threadable.

@jkbonfield force-pushed the bgzf_inline branch 2 times, most recently from 27b2c5a to cda6d59 on April 29, 2024 13:21
@daviesrob self-assigned this on May 2, 2024
sam.c (outdated):

    for (i = 0; i < 8; ++i) ed_swap_4p(x + i);
    if (fp->block_length - fp->block_offset > 32) {
        // Avoid bgzf_read and a temporary copy to a local buffer
        uint8_t *x = (uint8_t *)fp->uncompressed_block + fp->block_offset;
daviesrob (Member):

This shadows the earlier x, and also adds an almost-duplicate block to fill out the BAM structure. A way to avoid both problems might be something like this:

    uint8_t tmp[32], *x;
    uint32_t new_l_data;

    // ...

    if (fp->block_length - fp->block_offset > 32) {
        x = (uint8_t *)fp->uncompressed_block + fp->block_offset;
        fp->block_offset += 32;
    } else {
        x = tmp;
        if (bgzf_read(fp, x, 32) != 32) return -3;
    }
    c->tid = le_to_u32(x);
    c->pos = le_to_i32(x+4);
    // ... etc.

jkbonfield (Contributor, Author):

I see what you mean by "duplicate" now. It's totally different code, not duplicated at all in style or implementation, but the end result is still the same thing. I've taken your suggestion and refactored it.

sam.c (outdated), comment on lines 798 to 801:

    uint32_t x2 = le_to_u32(x+8);
    c->bin = x2>>16;
    c->qual = x2>>8&0xff;
    c->l_qname = x2&0xff;
daviesrob (Member):

The intermediate variable could be avoided by:

    c->l_qname = le_to_u8(x + 8);
    c->qual = le_to_u8(x + 9);
    c->bin = le_to_u16(x + 10);

I'm not sure if it would be any quicker though...

jkbonfield (Contributor, Author):

I was mainly mirroring what the code did before, but yes, it may make things better (or worse). I'll experiment.

sam.c (outdated), comment on lines 803 to 805:

    uint32_t x3 = le_to_u32(x+12);
    c->flag = x3>>16;
    c->n_cigar = x3&0xffff;
daviesrob (Member):

Similarly, could be:

    c->n_cigar = le_to_u16(x + 12);
    c->flag = le_to_u16(x + 14);

sam.c (outdated):

    @@ -759,6 +774,7 @@ static int fixup_missing_qname_nul(bam1_t *b) {
      * Note a second interface that returns a bam pointer instead would avoid bam_copy1
      * in multi-threaded handling. This may be worth considering for htslib2.
      */
    +#define bgzf_read bgzf_read_small
daviesrob (Member):

I think it would be less obscure to replace the individual bgzf_read() calls, as there aren't many of them. Also bgzf_read(fp, x, 32) should remain as-is as the amount of data available would be known to be too small in that case.

You could also merge bgzf_read(fp, &block_len, 4) into the one for the rest of the core header. That would eliminate an entire bgzf_read call, and I don't really see a good reason why it wouldn't work as long as we were careful about checking error cases and returning the right values.
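A hypothetical sketch of that merged read inside bam_read1 (illustrative only, not the committed code; buf and the return values mirror the earlier suggestion, and le_to_u32 comes from htslib/hts_endian.h):

    // Read the 4-byte block_len and the 32-byte fixed core header in one go.
    uint8_t buf[36];
    ssize_t ret = bgzf_read(fp, buf, 36);
    if (ret == 0) return -1;              // clean EOF before a new record
    if (ret != 36) return -3;             // truncated record
    // (A fuller version would distinguish a short block_len read, which
    // historically returned -2, from a short core-header read.)
    uint32_t block_len = le_to_u32(buf);
    uint8_t *x = buf + 4;                 // fixed fields start after block_len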

jkbonfield (Contributor, Author):

I'd previously tested merging the 4-byte and 32-byte reads together and concluded it made no difference in speed once I'd put the inline version in (it did make a difference before that), and it meant more changes, which would have made reviewing more work.

I can try it again if you want, but it's more changes to evaluate.

jkbonfield (Contributor, Author):

Also, I don't get the comment about bgzf_read(fp, x, 32) remaining as-is. Initially it wasn't left alone because it's small enough to be inlined, because a fixed 32-byte read may still let the memcpy be optimised away (I don't know), and because at that point we weren't yet subverting things and copying out the struct directly. Since then, however, it may not matter much either way, as that code path will be very rare. I think it's probably an irrelevance at best.

jkbonfield (Contributor, Author), May 9, 2024:

You also have to think about what's actually happening here. Why is it that bgzf_read(fp, x, 4), for example, is faster with the bgzf_read_small inline? Some of it is the removal of a function call, but I think a lot of it is because when we inline a memcpy of a constant 4 bytes it becomes a straight 32-bit load to a register. When we call the function, the length is a variable, so it has to call memcpy directly, and that in turn has a lot of code in it for dealing with arbitrary lengths and is probably optimised for copying long things more than short things.

So anywhere we have a bgzf_read with a small-ish constant size, it's generally good practice to call the inline variant so the compiler can optimise away memcpy.
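To illustrate the point (a standalone example, not htslib code): with the length known at compile time, GCC and Clang lower the memcpy to plain loads and stores rather than emitting a call into libc's general-purpose routine.

    #include <stdint.h>
    #include <string.h>

    // Constant size 4: typically compiles to a single 32-bit mov on x86-64,
    // while staying free of strict-aliasing and alignment problems.
    static inline uint32_t load_u32(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p, 4);
        return v;
    }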

daviesrob (Member):

The optimisation can never work in the specific case of bgzf_read_small(fp, x, 32) due to the earlier line:

    if (fp->block_length - fp->block_offset > 32) {

i.e. the same optimisation (plus a zero-copy one) happens elsewhere.

jkbonfield (Contributor, Author):

Ah yes, I see what you mean. It makes no difference to CPU time from what I can see (maybe the compiler detects it as an impossibility and removes the code?), but yes, it's pointless to go via the static inline.

Now that we have the same code for both paths when parsing the x[] array/pointer, there's not as much gain to be had from the fp->block_length - fp->block_offset > 32 check anyway (if we kept the "small" variant); however, it does still seem to be about 5% faster on small-read datasets, so we may as well keep it.

Commit messages (the first commit's message repeats the PR description above):
When running with a high thread count, the decompression stage in bgzf_read is often the only threaded part, meaning whatever is left in the main thread can become the bottleneck once we have a sufficient number of threads running. Hence speeding up anything in bam_read1 is key.

- sam_realloc_bam_data allocates an extra 32 bytes.  This may not seem
  much, especially after rounding up to a power of 2.  However, in tests
  it makes a significant reduction to memory copies (and also, strangely,
  to memory size).  Tested with both GNU malloc and tcmalloc.  (See the
  sketch after this list.)

- bam_tag2cigar sped up by reordering the checks and simplifying the
  expression that looks for the necessary CIGAR field.

- Avoid a bgzf read and a copy from the bgzf buffer to the temporary x[]
  when we know we can copy directly.  This subverts the bgzf interface,
  but it's internal code.
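A sketch of the over-allocation idea from the first bullet (hypothetical code; the committed sam_realloc_bam_data differs in detail):

    #include <stdlib.h>
    #include "htslib/sam.h"      // bam1_t
    #include "htslib/kstring.h"  // kroundup32

    // Grow the bam1_t data buffer with 32 bytes of headroom before rounding
    // up, so small follow-on growths (e.g. a CIGAR moved out of a tag)
    // rarely force another realloc and copy.
    static int realloc_bam_data_sketch(bam1_t *b, uint32_t desired)
    {
        uint32_t new_m = desired + 32;  // the extra 32 bytes from the commit
        kroundup32(new_m);              // round up to the next power of two
        uint8_t *p = realloc(b->data, new_m);
        if (!p) return -1;
        b->data = p;
        b->m_data = new_m;
        return 0;
    }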

Some benchmarks of test_view -B -@16 in.bam

develop:     9.73 sec
prev commit: 8.32 sec
this commit: 7.81 sec

Combined, this is a ~25% speed-up.
As we did with bgzf_read_small, short-cutting the big bgzf_write function for the common use case of a small write that fits into the buffer can help reduce the pressure on the main thread.
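The write-side counterpart mirrors the read-side inline; roughly (a sketch, not necessarily the committed code):

    #include <string.h>
    #include "htslib/bgzf.h"   // BGZF, bgzf_write, BGZF_BLOCK_SIZE

    // Append straight into the current in-memory block when the data fits;
    // leave flushing and compression to the full bgzf_write otherwise.
    static inline ssize_t bgzf_write_small(BGZF *fp, const void *data,
                                           size_t length)
    {
        if (fp->is_compressed
            && fp->block_offset + length < BGZF_BLOCK_SIZE) {
            memcpy((uint8_t *)fp->uncompressed_block + fp->block_offset,
                   data, length);
            fp->block_offset += length;
            return length;
        }
        return bgzf_write(fp, data, length);  // slow path
    }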

Benchmarks with test_view:

Previous commit:

    for i in `seq 1 3`;do taskset -c 0-31 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./test/test_view -@32 -b ~/scratch/data/novaseq.10m.bam -p /tmp/_.bam ;done;md5sum /tmp/_.bam
    89.81 user	2.22 system	4.11 elapsed	2237% %CPU
    89.57 user	2.43 system	4.20 elapsed	2189% %CPU
    88.44 user	2.30 system	3.96 elapsed	2291% %CPU
    bc9ca86ebef3b6669fc7b6fdd7e1acb6  /tmp/_.bam

This commit:

    for i in `seq 1 3`;do taskset -c 0-31 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./test/test_view -@32 -b ~/scratch/data/novaseq.10m.bam -p /tmp/_.bam ;done;md5sum /tmp/_.bam
    86.45 user	1.91 system	3.49 elapsed	2531% %CPU
    86.28 user	1.84 system	3.43 elapsed	2562% %CPU
    86.81 user	2.19 system	3.54 elapsed	2509% %CPU
    bc9ca86ebef3b6669fc7b6fdd7e1acb6  /tmp/_.bam

So that's about 14% faster throughput.  It harms some other places, so this isn't a blanket bgzf_write-to-bgzf_write_small #define.

Also, following the observation above, bgzf_read_small is now similarly restricted to bam_read1.  The indirection via the small function harms big reads, which affects SAM reading.

Benchmarks of ./test/test_view -@32 -b /tmp/_.sam.gz -p /tmp/_.bam show this speeds up from 3.7s elapsed to 3.4s elapsed.  Small, but consistent.
HTSlib starts a new block if an alignment is likely to overflow the current one, so for its own data this will only happen for records longer than 64 kbytes.  As other implementations may not do this, check that reading works correctly on some BAM files where records have been deliberately split between BGZF blocks.

Additionally, check the writing side by making a record with enough CIGAR entries that it splits across multiple BGZF blocks.
@daviesrob (Member):

Added a couple of extra tests...

@daviesrob daviesrob merged commit 11205a9 into samtools:develop Jun 4, 2024
9 checks passed