Tutorial: Design sgRNAs for allele specific excision of the gene MFN2 in the WTC genome

Kathleen Keough edited this page Nov 9, 2021 · 47 revisions

Here are the instructions for applying AlleleAnalyzer to a genome to identify allele-specific CRISPR sites and design allele-specific sgRNAs. This tutorial is a "simplest-case" scenario; for more complex features, please look through the rest of the wiki and the descriptions accompanying each of the tools.

Getting Started

In order to use the tools described in this tutorial, you will need to have cloned this repo. For more information on cloning a repo, see this page.

Clone the repo with the following command in your terminal:

git clone https://github.com/keoughkath/AlleleAnalyzer.git

Next, make sure you have all of the tools required to run AlleleAnalyzer by consulting this wiki page and checking the requirements.txt file.

The first data you will need is the VCF or BCF file for your individual; these files contain information about an individual's genetic variants, usually identified by sequencing. You can find more information on this file format here. For this tutorial, we've made the phased VCF for iPSC line WTC, generated by the Conklin Lab at the Gladstone Institutes, available here. More information about this line may be found here.

Start by making a directory at the same level as the AlleleAnalyzer directory. For instance, if the ls command shows the AlleleAnalyzer directory in your current directory, make a new directory for this tutorial (mkdir tutorial_directory) and move into it (cd tutorial_directory); you can copy both commands from the code block below. Next, download the files named wtc_phased_hg19.bcf and wtc_phased_hg19.bcf.csi into the directory from which you're completing this tutorial:

mkdir tutorial_directory

cd tutorial_directory

curl https://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gRNA_tutorial_sample_data/sample_input/wtc_phased_hg19.bcf -o wtc_phased_hg19.bcf

curl https://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gRNA_tutorial_sample_data/sample_input/wtc_phased_hg19.bcf.csi -o wtc_phased_hg19.bcf.csi

In this tutorial we will analyze the gene MFN2, which is a dominant negative disease gene that causes Charcot-Marie-Tooth Disease. The locus for this gene in the reference genome GRCh37 is 1:12040238-12073572. This indicates that the gene is located on chromosome 1, starts at genomic coordinate 12040238 and ends at genomic coordinate 12073572.
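
The locus string passed to the commands in this tutorial follows a chromosome:start-end pattern. As a minimal sketch (the helper name parse_locus is hypothetical and not part of AlleleAnalyzer), the string can be split and sanity-checked like this before it is used:

```python
def parse_locus(locus):
    """Split a 'chrom:start-end' string into (chrom, start, end).

    Hypothetical helper for illustration; AlleleAnalyzer does its own parsing.
    """
    chrom, sep, span = locus.partition(":")
    start_text, dash, end_text = span.partition("-")
    if not (chrom and sep and dash):
        raise ValueError(f"expected chrom:start-end, got {locus!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start {start} is after end {end}")
    return chrom, start, end


chrom, start, end = parse_locus("1:12040238-12073572")
print(f"chromosome {chrom}, start {start}, end {end}")
```

Checking the string up front makes it easier to catch a mistyped coordinate before running the slower steps.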

Generate variant information files

In this step we're grabbing some information about variants in MFN2 in WTC including variant genomic location, reference allele and alternate allele.

Copy and paste:

python3 ../AlleleAnalyzer/preprocessing/generate_gens_dfs/get_gens_df.py wtc_phased_hg19.bcf 1:12040238-12073572 mfn2_wtc_hg19

Here is a legend to the above command:

../AlleleAnalyzer/preprocessing/generate_gens_dfs/get_gens_df.py: script name

wtc_phased_hg19.bcf: BCF filename

1:12040238-12073572: locus for MFN2

mfn2_wtc_hg19: prefix for output file

You should see the following in your terminal if it runs correctly:

{'--bed': False,
 '--chrom': False,
 '-f': False,
 '<locus>': '1:12040238-12073572',
 '<out>': 'mfn2_wtc_hg19',
 '<vcf_file>': 'wtc_phased_hg19.bcf'}
bcftools version 1.6 running
Running single locus
Lines   total/split/realigned/skipped:	7/0/0/0
finished

The outputted file will be:

mfn2_wtc_hg19.h5

To check your output against ours, check out the sample output here.

Generate variant annotation files

This section annotates variants that make, break or are near PAM sites. This part requires that you have downloaded the pre-computed locations of PAM sites for SpCas9 analyzed by AlleleAnalyzer (available here). Note that you can generate these files yourself for any genome for which you have a fasta file using the tool preprocessing/find_pams_in_reference/pam_pos_genome.py, but for this tutorial, it's easier to use the pre-generated files. Make a new directory in your current directory titled 'hg19_pams'. Download chr1_SpCas9_pam_sites_for.npy and chr1_SpCas9_pam_sites_rev.npy to the directory 'hg19_pams':

mkdir hg19_pams

curl https://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gRNA_tutorial_sample_data/sample_input/hg19_pams/chr1_SpCas9_pam_sites_for.npy -o hg19_pams/chr1_SpCas9_pam_sites_for.npy

curl https://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gRNA_tutorial_sample_data/sample_input/hg19_pams/chr1_SpCas9_pam_sites_rev.npy -o hg19_pams/chr1_SpCas9_pam_sites_rev.npy

Additionally, you will need the fasta file for GRCh37 (hg19), which you can download from the UCSC genome browser. Download the file chr1.fa.gz to your current directory and gunzip it:

curl http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz -o chr1.fa.gz

gunzip chr1.fa.gz
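
Before running the annotation step, it can help to confirm that the downloaded inputs are where the command expects them. The sketch below builds the two PAM file names from the pattern seen in this tutorial (chr1_SpCas9_pam_sites_for.npy and chr1_SpCas9_pam_sites_rev.npy). Treating that pattern as general for other chromosomes or Cas types is an assumption, and both function names are hypothetical.

```python
from pathlib import Path


def pam_site_files(pam_dir, chrom, cas):
    """Return forward and reverse PAM-site paths for one chromosome.

    Assumes the naming pattern of the two files downloaded in this tutorial.
    """
    pam_dir = Path(pam_dir)
    return [
        pam_dir / f"chr{chrom}_{cas}_pam_sites_{strand}.npy"
        for strand in ("for", "rev")
    ]


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [Path(p) for p in paths if not Path(p).is_file()]


expected = pam_site_files("hg19_pams", "1", "SpCas9") + [Path("chr1.fa")]
for path in missing_files(expected):
    print(f"missing input: {path}")
```

If nothing is printed, the three inputs for the next command are present in the current directory.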

Copy and paste:

Note: this is a bit slow (it's doing a lot of work).

python3 ../AlleleAnalyzer/preprocessing/annotate_variants/annot_variants.py mfn2_wtc_hg19.h5 SpCas9 hg19_pams/ chr1.fa mfn2_hg19_annots

Here is a legend to the above command:

../AlleleAnalyzer/preprocessing/annotate_variants/annot_variants.py: script name

mfn2_wtc_hg19.h5: File with explicit genotypes generated earlier

SpCas9: Type of Cas being evaluated

hg19_pams/: Directory containing PAM site locations for SpCas9 in hg19

chr1.fa: hg19 chromosome 1 Fasta file

mfn2_hg19_annots: Prefix for the output file, which holds the allele-specific sgRNA site annotations for each variant

The outputted file will be:

mfn2_hg19_annots.h5

To check your output against ours, check out the sample output here.

Design all possible allele-specific guides in MFN2 for WTC

This section designs all possible allele-specific guides in the gene MFN2 for WTC based on heterozygous variants.

Copy and paste:

python3 ../AlleleAnalyzer/scripts/gen_sgRNAs.py wtc_phased_hg19.bcf mfn2_hg19_annots.h5 1:12040238-12073572 hg19_pams/ chr1.fa mfn2_wtc_guides SpCas9 20

Here is a legend to the above command:

../AlleleAnalyzer/scripts/gen_sgRNAs.py: script name

wtc_phased_hg19.bcf: BCF genotype file

mfn2_hg19_annots.h5: Variant annotations in this locus for generating allele-specific sgRNA sites

1:12040238-12073572: MFN2 locus

hg19_pams/: Directory containing PAM site locations for SpCas9 in hg19

chr1.fa: hg19 chromosome 1 Fasta file

mfn2_wtc_guides: Prefix for outputted guides file

SpCas9: Type of Cas evaluated

20: Length of sgRNA

This should output mfn2_wtc_guides.tsv. In each of the last four sets of sgRNAs, one sgRNA will be all "C"s or all "G"s. This indicates that the heterozygous variant the sgRNA is designed around creates or destroys a PAM site, thereby rendering one of the alleles untargetable. With the -d option, these sgRNAs are output as "----" instead.
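
When reviewing the guides file, it may be useful to flag these placeholder entries automatically. The following is a rough sketch: the function name is hypothetical, the example sequences are made up for illustration, and it makes no assumption about the column names in mfn2_wtc_guides.tsv. It simply tests whether a sequence is one repeated "C", "G", or "-" character.

```python
def is_placeholder_guide(seq):
    """True if seq consists of a single repeated C, G, or '-' character.

    Such entries mark an allele with no usable PAM site rather than a real
    sgRNA. A genuine guide made of one repeated base would also match, so
    treat this as a heuristic only.
    """
    seq = seq.strip().upper()
    return len(seq) > 0 and len(set(seq)) == 1 and seq[0] in "CG-"


# Made-up example sequences, not taken from the tutorial output.
examples = ["C" * 20, "G" * 20, "----", "ACGTTGCATGCAAGCTTGGA"]
for seq in examples:
    label = "placeholder" if is_placeholder_guide(seq) else "guide"
    print(f"{seq}\t{label}")
```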

Determine whether there are any targetable pairs in MFN2 for WTC

This section identifies pairs of allele-specific sgRNA sites that are likely to disrupt a coding exon, thus meeting our definition of "putatively targetable", and outputs their guides. This requires you to have a GFF file in your current directory that describes where the coding exons are for genes in this reference genome annotation. One place to download these types of files is RefSeq.

Sample Usage:

python3 ../AlleleAnalyzer/scripts/ExcisionFinder.py -vg genes_hg19.gff MFN2 mfn2_hg19_annots.h5 10000 SpCas9 wtc_phased_hg19.bcf wtc_targ --guides=mfn2_wtc_guides.tsv

Here is a legend to the above command:

../AlleleAnalyzer/scripts/ExcisionFinder.py: script name

-vg: options specifying that we want "verbose" output (i.e. the script prints out messages as it runs) and we want guides outputted for the targetable variant pairs

genes_hg19.gff: GFF file detailing locations of coding exons for genes, necessary for determining targetability

MFN2: The gene we're analyzing

mfn2_hg19_annots.h5: Variant annotations in this locus for generating allele-specific sgRNA sites

10000: Maximum distance (in bp) for targetable variant pairs

SpCas9: The Cas variety being analyzed

wtc_phased_hg19.bcf: BCF filename

wtc_targ: Prefix for output files

--guides=mfn2_wtc_guides.tsv: All allele-specific guides available in this locus, as generated earlier
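
To make the maximum-distance argument concrete, here is a small sketch of how a distance limit narrows down candidate variant pairs. It is not ExcisionFinder's implementation: the function name and the positions are hypothetical, and the sketch covers only the distance criterion, not the additional requirement that a pair be likely to disrupt a coding exon.

```python
from itertools import combinations


def pairs_within(positions, max_dist):
    """Return all position pairs (a, b), a < b, at most max_dist bp apart."""
    ordered = sorted(set(positions))
    return [(a, b) for a, b in combinations(ordered, 2) if b - a <= max_dist]


# Hypothetical variant positions inside the MFN2 locus, for illustration only.
variant_positions = [12041000, 12049500, 12062000]
for first, second in pairs_within(variant_positions, 10000):
    print(f"{first} and {second} are {second - first} bp apart")
```

With a limit of 10000 bp, only pairs close enough to each other remain as candidates; a larger limit admits more pairs at the cost of larger excisions.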

This should output 3 files: wtc_targ.h5, wtc_targgenes_evaluated.txt, and wtc_targpair_guides.tsv. wtc_targ.h5 simply tells you whether MFN2 in WTC is targetable for allele-specific excision. wtc_targgenes_evaluated.txt is more useful when evaluating multiple genes/loci, as it lists all genes that had enough annotated variants and coding exons to be evaluated. wtc_targpair_guides.tsv contains the sgRNAs for the identified targetable variant pairs.

To check your output against ours, check out the sample output here.

Please send us a note with any questions or if anything in here is confusing!