Skip to content

Simulating reads with vg sim

xchang1 edited this page Mar 4, 2020 · 6 revisions

This page gives an overview of vg sim and its usage. Note that this process is automated in toil-vg and we especially recommend using toil-vg if working with large graphs.

Most of the time we might want to simulate reads from one of the sample that was used to create the graph. To do this we need to index each haplotype using the GBWT. This index will be used to identify the paths in the graph corresponding to the haplotype or sample we want to simulate the reads from.

Constructing a graph and indexing haplotypes

Assuming that x.fa contains reference sequence for a chromosome x and variants.vcf.gz contains variants for multiple samples including the sample for which we want to simulate reads (later called SAMP).

vg construct -r x.fa -v variants.vcf.gz -a -f -m 32 > graph.vg
vg index -x graph.xg -G graph.gbwt -v variants.vcf.gz graph.vg

Simulating reads from one particular haplotype

Before calling vg sim we'll create a new graph with only paths from the desired haplotype.

Haplotypes are represented as threads in the graph. Threads are prefixed by _thread_<SAMPLE>_<CHR>_<HAPLOTYPE> with <HAPLOTYPE> being 0 or 1 for diploid samples. To create a graph with only threads from sample SAMP, chromosome x and haplotype 0:

vg paths -d -v graph.vg > thread.merge.vg
vg paths --gbwt graph.gbwt --extract-vg -x graph.xg -Q _thread_SAMP_x_0 >> thread.merge.vg
vg mod -N thread.merge.vg > thread_SAMP_0.vg

Now we can index this new graph and simulate, for example, a thousand 100 bp reads:

vg index -x thread_SAMP_0.xg thread_SAMP_0.vg
vg sim -x thread_SAMP_0.xg -l 100 -n 1000 -a | vg view -X - > SAMP.fastq

By default vg sim outputs reads in a text format. Using -a it outputs reads in the GAM format that we can then convert into FASTQ using vg view -X.

Simulating reads for a diploid sample

Repeat the process above for both haplotypes and concatenate the output FASTQ files.

Paired-end reads

We can specify the fragment size and its standard deviation using the -p and -v arguments. For example:

vg sim -x thread_SAMP_0.xg -l 100 -n 1000 -a -p 500 -v 50 | vg view -X - > SAMP.fastq

Providing a real FASTQ file for error modeling

A FASTQ file can be used to model the error profile using the -F argument. For example:

FASTQ=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U5a/U5a_AGTCAA_L002_R1_007.fastq.gz

vg sim -x thread_SAMP_0.xg -n 1000 -a -p 500 -v 50 -F $FASTQ | vg view -X - > SAMP.fastq

Note that in that case the read length argument (-l) is not necessary.

Clone this wiki locally