Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with VCF generated after running vg haplotypes #4258

Open
eblerjana opened this issue Mar 28, 2024 · 2 comments
Open

Issue with VCF generated after running vg haplotypes #4258

eblerjana opened this issue Mar 28, 2024 · 2 comments

Comments

@eblerjana
Copy link

1. What were you trying to do?

I'm trying to run the haplotype sampling algorithm to create a VCF of the resulting subgraph with vg deconstruct. I used the commands shown below based on (CHM13-based) MC graph "hgsvc3_and_hprc-dec16-mc-chm13" available here: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/scratch/2023_12_16_minigraph_cactus_hgsvc3_and_hprc/, and Illumina short reads for sample HG00733 from the 1000G.

GRAPH_HAPL="hgsvc3_and_hprc-dec16-mc-chm13.hapl"
GRAPH_GBZ="hgsvc3_and_hprc-dec16-mc-chm13.gbz"
READS="HG00733_ERR3988823.fasta.gz"

kmc -k31 -m64 -okff -fm -t24 -hp $READS HG00733 .
vg haplotypes -v 2 -t 24 --num-haplotypes 8 --include-reference -i $GRAPH_HAPL -k HG00733.kff -g HG00733.gbz $GRAPH_GBZ
vg deconstruct HG00733.gbz -P CHM13 -a -t 24 | bgzip > HG00733.vcf.gz

2. What did you want to happen?

I want to create a VCF from the subsampled graph to call bubbles relative to the reference genome (CHM13).

3. What actually happened?

The commands run successfully, but I noticed some issue with the resulting VCF file. According to the AT fields in the VCF, some reference alleles traverse nodes previously not traversed by the reference path in the GFA. In other words, nodes appear in these AT traversals that are not annotated as reference nodes in the GFA (missing SN:Z and SO:i tags) and additionally are not contained in the W line of the respective reference path. One such example is the node with ID "8521880". I attached a file containing an example of such a line in the VCF (for node: 8521880). I'm confused how this can happen. Did I do anything wrong?

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

not applicable

5. What data and command can the vg dev team use to make the problem happen?

see above (1.)

6. What does running vg version say?

vg version v1.54.0 "Parafada"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by xian@octo

vcf-line.txt

@adamnovak
Copy link
Member

@glennhickey Do you know what deconstruct might be thinking to come up with the wrong reference traversals?

@glennhickey
Copy link
Contributor

This looks like the usual confusion about the node ids being different between the GFA (unclipped) and GBZ (clipped to 1024bp). You can generally get the GFA node ids out of vg deconstruct by using its -O option

    -O, --gbz-translation    Use the ID translation from the input gbz to apply snarl names to snarl names and AT fields in output

A few more details about node clipping in general can be found here:

https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#node-chopping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants