Skip to content

De novo microsatellite loci identification pipeline as used in Schröder et al. (2021)

License

Notifications You must be signed in to change notification settings

schellt/msat-loci-identification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

De novo microsatellite loci identification pipeline

Description

Identification of microsatellite loci together with corresponding primer sequences and validating specificity and potential for multiplexing.

Microsatellite loci are identified with MISA within a sequence set (e.g. de novo assembly). Afterwards the results are filtered for non-compound loci with a maximum of 20 repetitions. The Primer3 interface modules of MISA where customized to be compatible with the used release version (libprimer3 release 2.3.7). The script p3_in.pl was adjusted to my_p3_in.pl to adapt and extend the options for Primer3. Afterwards primer3_core can be run with the corresponding output. The script p3_out.pl was rewritten as my_p3_out.pl to return information on primer sequences and melting temperatures, fragment size, motif and its length from the MISA and Primer3 output files in tabular format. To ensure accuracy the primer sequences are converted to fasta format and are aligned to the sequence set they were extracted from via blastn. The resulting blast hits are filtered for error free primer alignments over the complete primer. Primer pairs are kept only if both sequences survive filtering. To ensure specificity and potential for multiplexing the remaining blast hits are converted to bed format and checked if there are no primer pairs binding within 3kb up- and downstream as well as there is a larger distance than 3kb to the sequence edge.

Dependencies

Input

A set of sequences to screen for microsatellite loci. Be sure that the sequences potentially fulfill the requirements of your parameters, which is distance of a locus to a sequence edge.

Usage

There is no script (yet) to execute the whole pipeline at once. You can follow the commands below and adjust parameters manually for now.

#run MISA on the sequence set
#set parameters can be found in misa.ini
misa.pl contigs_ge100.fasta > misa.log 2> misa.err

#filter MISA output for non-compound loci with a maximum of 20 repetitions
awk -F'[\t)]' '$3!~/^c/&&$5<=20' contigs_ge100.fasta.misa > contigs_ge100.fasta.misa_filter

#create primer3 input file
my_p3_in.pl contigs_ge100.fasta.misa_filter

#run primer3
primer3_core contigs_ge100.fasta.p3in > contigs_ge100.fasta.p3out

#extract information from primer3 output
my_p3_out.pl contigs_ge100.fasta.misa_filter contigs_ge100.fasta.p3out > contigs_ge100.fasta.p3out.txt

#convert primer sequences into fasta format
awk '$1!="ID"{print ">"$1"_F\n"$2"\n>"$1"_R\n"$4}' contigs_ge100.fasta.p3out.txt > contigs_ge100.fasta.p3out.primer.fasta

#blast primer sequences against the sequence set
makeblastdb -in ./contigs_ge100.fasta -out ./contigs_ge100 -dbtype nucl > mbdb.log 2> mbdb.err
blastn -db ./contigs_ge100 -query ./contigs_ge100.fasta.p3out.primer.fasta -out contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100 -outfmt '6 std qlen' -evalue 1000 -word_size 7 -dust no -num_threads 80 > blastn.log 2> blastn.err

#filter for 100% identity matches over the full length of the query
awk '$3==100&&$4==$13' contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100 > contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches

#extract specific primer pair alignments
for i in `cut -f1 contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches | sed 's/_[F,R]$//' | sort | uniq -c | awk '$1==2{print $2}'`; do grep $i contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches; done > contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1

#extract one primer pair per locus
#one can limit the returned pairs in primer3 as well but you never know if the first returned pairs are always specific in your sequence set
for i in `cut -f1 contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1 | sed 's/_[0-4]_[F,R]$//' | sort -u`; do grep $i contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1 | head -n 2; done > contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1.1st-pair

#convert the blast hits to bed format
awk '{if($9>$10){print $2"\t"$10-1"\t"$9"\t"$1}else{print $2"\t"$9-1"\t"$10"\t"$1}}' contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1.1st-pair > contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1.1st-pair.bed
#sort and verify bed format
bedtools sort -i contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1.1st-pair.bed > contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1.1st-pair.sort.bed

#check multiplexing potential
#no other primer pair should bind within 3kb up- and downstream
bedtools merge -i contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1.1st-pair.sort.bed -d 3000 -c 4 -o collapse | awk -F'[\t,]' 'NF==5{gsub("_F","",$4);print $4}' > contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1.1st-pair.sort.bed.candidates
#no locus should be closer to a sequence edge than 3kb
awk -F'[_\t]' '$11>3000&&$4-$12>3000' contigs_ge100.fasta.misa_filter > contigs_ge100.fasta.misa_filter.candidates

#get a list of ids that fulfill all filtering criteria
for i in `awk '{print $1"_"$2}' contigs_ge100.fasta.misa_filter.candidates`; do grep $i contigs_ge100.fasta.p3out.primer.fasta_vs_contigs_ge100_100perc-matches.spec1.1st-pair.sort.bed.candidates; done > consensus.candidates

#return the final output
grep -f consensus.candidates contigs_ge100.fasta.p3out.txt | sort -nk8 > contigs_ge100.fasta.p3out.consensus.candidates.txt

Output

The final output is a tsv file containing:

  • Locus ID: Combined ID of fasta ID from sequence set, locus ID from MISA and primer pair ID from Primer3.
  • Forward sequence: Nucleotide sequence of the forward primer.
  • Forward Tm: Melting temperature of the forward primer.
  • Reverse sequence: Nucleotide sequence of the reverse primer.
  • Reverse Tm: Melting temperature of the reverse primer.
  • Length: Fragment size of the locus.
  • Motif: Nucleotide sequence and repetitions of the identified locus.
  • Motif length: Length of the microsatellite sequence in bp.

Citation

If you use this pipeline please cite:

  • Schröder O, Cavanaugh KK, Schneider JV, Schell T, Bonada N, Seifert L, Pauls SU (2021). Genetic data support local persistence in multiple glacial refugia in the montane net‐winged midge Liponeura cinerascens cinerascens (diptera, blephariceridae). Freshwater Biology. https://doi.org/10.1111/fwb.13682

Additional to the dependencies:

About

De novo microsatellite loci identification pipeline as used in Schröder et al. (2021)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages