Skip to content

Code for choosing the minimal set of genetic markers needed to differentiate all samples in a genotyping dataset

Notifications You must be signed in to change notification settings

pr0kary0te/minimalmarkers

Repository files navigation

minimalmarkers

Code for choosing the minimum set of genetic markers needed to differentiate all samples in a genotyping dataset Usage is ./select_minimal_markers.pl <input_file_name>

Sample SeqSNP data for Cider apples is included as a test dataset. To run this, put the perl script and data in the same directory and type:

./select_minimal_markers.pl AppleGenotypes.csv

The script will first pick the marker which discriminates the maximum number of varieties (iteration 1). For iteration 2, it will find the marker which discriminates the maximum varieties which were NOT discriminated by the marker in iteration 1. This process continues until either: 1) There are no markers left, 2) adding more markers doesn't add additional varietal discrimination or 3) all varieties are discriminated. Output is written to tab delimited text files which can be viewed in Excel or similar.

The Apple example data supplied has 1286 markers and 260 varieties and runs in a few minutes but larger datasets may take much longer! The example data file is comma separated and uses 0 for AA, 1 for AB and 2 for BB. It will also accept tab separated data and A, AB, B formatted genotype calls as input.

A more detailed explanation of the approach can be found in our paper: https://doi.org/10.1371/journal.pone.0242940

Also supplied is a script for checking the proportion of varieties in the original genotyping file that are resolved by the selected minimal marker set. To run this, give the minimal marker results file as the first command argument and the original genotype file as the second, e.g. ./check_results.pl AppleGenotypes_minimal_markers.txt AppleGenotypes.csv For the example data, you should find that 23 SNPs will discriminate all varieties except "Willy" versus "Connie" - which are not distinguishable.

If you are starting with genotypes in the commonly used VCF format, you may be able to convert these to 0, 1, 2 format using ./convert_vcf_to_genotypes.pl infile.vcf

This may not work for all VCFs so please check your output if you use this script and report any bugs.

About

Code for choosing the minimal set of genetic markers needed to differentiate all samples in a genotyping dataset

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages