Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcf_hdr_remove leaves hash keys behind #1533

Open
pd3 opened this issue Dec 6, 2022 · 10 comments
Open

bcf_hdr_remove leaves hash keys behind #1533

pd3 opened this issue Dec 6, 2022 · 10 comments
Assignees

Comments

@pd3
Copy link
Member

pd3 commented Dec 6, 2022

Somehow this issue #842 has not been fully addressed. The following code shows the problem:

$ echo '#include <htslib/vcf.h>

int main() {
  htsFile *in = hts_open("-", "r");
  bcf_hdr_t *hdr = bcf_hdr_read(in);
  fprintf(stderr, "before: rid=%d n_contigs=%d\n", bcf_hdr_id2int(hdr, BCF_DT_CTG, "chr22"), hdr->n[BCF_DT_CTG]);
  // remove all contigs
  bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
  // add a new contig
  bcf_hdr_printf(hdr, "##contig=<ID=chr22,length=50818468>");
  bcf_hdr_sync(hdr);
  // show that the number of contigs is increasing
  fprintf(stderr, "after: rid=%d n_contigs=%d\n", bcf_hdr_id2int(hdr, BCF_DT_CTG, "chr22"), hdr->n[BCF_DT_CTG]);
  htsFile *out = hts_open("-", "wb");
  bcf_hdr_write(out, hdr);
  bcf_hdr_destroy(hdr);
  hts_close(in);
  hts_close(out);
  return 0;
}' > /tmp/main.c

Then compling:

$ gcc /tmp/main.c -o /tmp/main -lhts

Create a toy VCF file:

$ (echo -e "##fileformat=VCFv4.2\n##contig=<ID=chr22,length=51304566>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO") > /tmp/main.vcf

Then running a first test:

$ cat /tmp/main.vcf | /tmp/main > /dev/null
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2

Shows that the number of contigs is increasing even if the code removed all contigs with the bcf_hdr_remove() function

The number of contigs increases at each iteration:

$ cat /tmp/main.vcf | /tmp/main | /tmp/main > /dev/null
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
before: rid=1 n_contigs=2
after: rid=2 n_contigs=3

Unless the VCF is converted to text file in between:

$ cat /tmp/main.vcf | /tmp/main | bcftools view | /tmp/main > /dev/null
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2

This must not be the intended behavior as it leads to unwanted scenarios:

$ cat /tmp/main.vcf | /tmp/main > /tmp/main.bcf && bcftools index -f /tmp/main.bcf
$ bcftools isec -C /tmp/main.bcf /tmp/main.bcf
bcftools: vcf.c:2221: bcf_hdr_seqnames: Assertion `tid<m' failed.
Aborted (core dumped)

While it works fine if converted to text file:

$ cat /tmp/main.vcf | /tmp/main | bcftools view -Oz > /tmp/main.vcf.gz && bcftools index -f -t /tmp/main.vcf.gz
$ bcftools isec -C /tmp/main.vcf.gz /tmp/main.vcf.gz
Note: -w option not given, printing list of sites...

Notice that in htslib/vcf.h the function bcf_hdr_remove() is explained as follows:

    /**
     *  bcf_hdr_remove() - remove VCF header tag
     *  @param type:      one of BCF_HL_*
     *  @param key:       tag name or NULL to remove all tags of the given type
     */
    HTSLIB_EXPORT
    void bcf_hdr_remove(bcf_hdr_t *h, int type, const char *key);

Originally posted by @freeseek in #842 (comment)

@freeseek
Copy link

freeseek commented Dec 6, 2022

For now I am replacing the code:

bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);

with:

bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
kh_clear(vdict, hdr->dict[BCF_DT_CTG]);
for (int i=0; i<out->n[BCF_DT_CTG]; i++)
  free((void *)out->id[BCF_DT_CTG][i].key);
hdr->n[BCF_DT_CTG] = 0;

though it requires to redefine the kh_clear_vdict(h) function with:

KHASH_MAP_INIT_STR(vdict, bcf_idinfo_t)

while the hash tables should be opaque to the end users

pd3 added a commit to pd3/htslib that referenced this issue Dec 8, 2022
The bcf_hdr_remove() call can create gaps in tid blocks which fail
assertion in bcf_hdr_seqnames().

This problem was encountered in samtools#1533, but is only a partial fix
of the problem
@pd3
Copy link
Member Author

pd3 commented Dec 8, 2022

This is trickier than it seems. The bcf_hdr_remove() code is not cleaning the header entirely on purpose, because that would make it impossible to parse subsequent data records.

So one should not be accessing the hdr->n fields directly, they are not informative if the header has been edited. Instead, bcf_hdr_seqnames() should be used, I just pushed a commit that makes the function work in this scenario as well.

However, the bcftools isec case is still failing regardless of this at a different place, as described in this related issue #1534

@freeseek
Copy link

freeseek commented Dec 8, 2022

I think I understand the problem a bit better now. A simpler way to see what is going on is also to run:

$ cat /tmp/main.vcf | /tmp/main | zcat
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=1>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

Or again to run:

$ cat /tmp/main.vcf | /tmp/main | /tmp/main | zcat
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
before: rid=1 n_contigs=2
after: rid=2 n_contigs=3
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=2>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

The IDX keeps increasing for the contigs and skipping the initial values. Converting to text format resets IDX:

$ cat /tmp/main.vcf | /tmp/main | bcftools view --no-version | bcftools view --no-version -Ou
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=0>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

If the intended behavior of bcf_hdr_remove(hdr, BCF_HL_CTG, NULL) is not to reset IDX to still be able to parse subsequent data records, is there an alternative way to do so? For reproducibility purposes, I would prefer to have control into what kind of information goes into the output binary VCF

@pd3
Copy link
Member Author

pd3 commented Dec 8, 2022

I am not sure what you are after exactly. The header and the body must stay consistent, the IDX field was introduced specifically for cases like this. It should be a considered a bug that htslib is not handling it properly.

@freeseek
Copy link

freeseek commented Dec 8, 2022

In my specific case I am writing a BCFtools plugin, so I have one in header for reading the input VCF and one out header for writing the output VCF, so I don't need to parse data records with the out header. I would prefer to have an option to output a VCF without the missing IDX numbers. I would prefer the output VCF to have no unneeded memory of what was in the input VCF

@pd3
Copy link
Member Author

pd3 commented Dec 9, 2022

Tags and contigs in BCF body are identified by their id which is defined implicitly by the order of their definitions in the header. This brings a problem: if, say, the first tag is removed from the header, ids of the remaining tags change by -1, and the entire BCF has to be recoded. The IDX field was introduced to preserve tag ids even when some are removed or reordered. This is part of the BCF specification and all readers must support it, you will not gain anything by doing this extra work.

@freeseek
Copy link

freeseek commented Dec 9, 2022

It seems to me like you are thinking about protecting the end users by preserving the contig table in the header. But what are the end users expected to use the modified header with the now missing header records for? The old contigs' rid's can be interpreted while reading the VCF but if the modified header was the one written in the output the old rid's should not be output as the corresponding table's entries would be missing from the printed header and will be discarded from memory once the executable is over.

To be specific, my use case is a plugin that lifts over a VCF from one reference to another. The in header is used to interpret the old rid's and the out header has the contig table first cleaned up with:

bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
kh_clear(vdict, hdr->dict[BCF_DT_CTG]);
for (int i=0; i<out->n[BCF_DT_CTG]; i++)
  free((void *)out->id[BCF_DT_CTG][i].key);
hdr->n[BCF_DT_CTG] = 0;

and then filled with a new contig table from an index FASTA structure. This approach effectively resets the IDX field and solves the issue on my side. What would be a use case for using bcf_hdr_remove(hdr, BCF_HL_CTG, NULL) while keeping the contig table in memory?

I am mostly curious. I am not necessarily advocating for one way or another and I was trying to understand if there was a canonical way to remove both the dictionary and the contig table from a header structure.

@pd3
Copy link
Member Author

pd3 commented Dec 12, 2022

It is not about protecting end users, but about preserving the integrity of the BCF, about programming convenience and about the speed of processing. Also it is not only about contigs, but also about FILTER, INFO, and FORMAT tags. For FILTER,INFO,FORMAT this has been used for quite a while and it works, for contigs you happened to test a combination of steps that revealed a bug.

Say the VCF header looks like this

##INFO=<ID=TAG0,...>
##INFO=<ID=TAG1,...>
##INFO=<ID=TAG2,...>
chr1 ... TAG0=0;TAG1=1;TAG2=2

then the BCF body encodes the tags as this (shown here in a simplified form)

chr1 ... 0=0;1=1;2=2

If we remove the TAG1 from the header and from the body, the modified BCF looks like this

##INFO=<ID=TAG1,...,IDX=1>
##INFO=<ID=TAG2,...,IDX=2>
chr1 ... 1=1;2=2

If we accept your suggestion and discard the indexes, we'd have to manually recode all subsequent tags (IDX -> IDX-1) in the entire BCF and make it look like this

##INFO=<ID=TAG1,...>
##INFO=<ID=TAG2,...>
chr1 ... 0=1;1=2

Even if we decided we wanted to do it this way, at this point it's too late, this would be a major breaking change. Luckily, there is no need for that, we can just fix bcf_hdr_seqnames (pd3@d64e710) and the indexing code (#1534).

@freeseek
Copy link

Another way for what I was trying to say is that you would not use bcf_hdr_write(fp, hdr) after you run bcf_hdr_remove(hdr, BCF_HL_CTG, NULL) as the header contig table would not output anymore and the output VCF would be malformed. In any case, I am happy with the current solution and explanation. This issue can be closed

pd3 added a commit to pd3/htslib that referenced this issue Dec 14, 2022
The bcf_hdr_remove() call can create gaps in tid blocks which fail
assertion in bcf_hdr_seqnames().

This problem was encountered in samtools#1533, but is only a partial fix
of the problem
pd3 added a commit to pd3/htslib that referenced this issue Dec 14, 2022
The bcf_hdr_remove() call can create gaps in tid blocks which fail
assertion in bcf_hdr_seqnames().

This problem was encountered in samtools#1533, but is only a partial fix
of the problem
daviesrob pushed a commit that referenced this issue Dec 14, 2022
The bcf_hdr_remove() call can create gaps in tid blocks which fail
assertion in bcf_hdr_seqnames().

This problem was encountered in #1533, but is only a partial fix
of the problem
@daviesrob
Copy link
Member

Fixed by #1535

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants