This repository accompanies the Genomator paper and contains the experimental procedures that should reproduce the results featured in that paper. To execute these procedures, install the Python package in the "experiment_tools" directory into a Python environment. Installing this package should also install the required Python libraries. These libraries are expected to work on a system that has a GPU (GPU functionality can otherwise be disabled, as detailed in a following section) and that supports the PyTorch and TensorFlow system libraries.
In addition to the Python libraries, the following system utilities must be accessible to the script files:
- bcftools (including the 'bcftools' and 'bgzip' command utilities)
- plink2
- wget
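As a quick pre-flight check, the availability of the utilities listed above can be tested from Python. This is a minimal sketch and not part of the repository: the function name missing_tools is hypothetical, and the command names are assumed to match the list above exactly.

```python
import shutil

# Command names taken from the list above (assumed to be the executable names).
REQUIRED_TOOLS = ["bcftools", "bgzip", "plink2", "wget"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the commands from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing utilities:", ", ".join(missing))
    else:
        print("All required utilities found.")
```

An empty result means every utility was found on PATH; it does not verify versions.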
Once these are installed (the Python package 'experiment_tools' with PyTorch and TensorFlow, and the utilities bcftools, plink2 and wget), run a single script:
/experiment/run_me.sh
This script runs the /experiment/sources/script.py file, which downloads and processes all source data. It then scans all subdirectories of /experiment/* for files named script.sh and runs each of them.
Each of these script.sh files contains the code to execute its respective experiment.
The results of each experiment should be written to the corresponding subdirectory.
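The scan-and-run behaviour described above can be sketched as follows. This is an illustration only, not the contents of run_me.sh: the function names are hypothetical, the search is assumed to be recursive, and each script is assumed to be run with bash from its own directory.

```python
import subprocess
from pathlib import Path


def find_experiment_scripts(root):
    """Return every file named script.sh under `root`, sorted by path.

    Assumption: the search is recursive; the real run_me.sh may differ.
    """
    return sorted(Path(root).rglob("script.sh"))


def run_experiments(root, dry_run=True):
    """Run (or, with dry_run=True, only list) each discovered script.sh."""
    scripts = find_experiment_scripts(root)
    for script in scripts:
        if dry_run:
            print("would run", script)
        else:
            # Run from the script's own folder so relative outputs land there.
            subprocess.run(["bash", script.name], cwd=script.parent, check=True)
    return scripts
```

Calling run_experiments with dry_run=True lists what would be executed without starting any long-running experiment.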
Note that the run_me.sh script will take a LONG time to run. As an alternative, any specific experiment can be run by executing the script.sh file in the desired experiment folder (assuming /experiment/sources/script.py has already successfully downloaded and formatted the input datasets for these experiments).
Various scripts and Python files in /experiment/ pass the flag --gpu=True. To run on a system without a GPU, change this flag to --gpu=False.
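Changing the flag by hand across many files is error-prone, so the substitution can be automated. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, only .sh and .py files are touched, and the flag is assumed to appear literally as --gpu=True. It rewrites files in place, so run it on a clean checkout.

```python
import re
from pathlib import Path

# Matches the literal flag; \b avoids touching longer tokens such as --gpu=Truex.
GPU_FLAG = re.compile(r"--gpu=True\b")


def disable_gpu(text):
    """Return `text` with every --gpu=True replaced by --gpu=False."""
    return GPU_FLAG.sub("--gpu=False", text)


def disable_gpu_in_tree(root, suffixes=(".sh", ".py")):
    """Rewrite matching files under `root` in place; return the changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            original = path.read_text()
            updated = disable_gpu(original)
            if updated != original:
                path.write_text(updated)
                changed.append(path)
    return changed
```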
The '805' experiment contains code to reproduce the V-shaped PCAs associated with the 805 SNP data, including calculating the Wasserstein distance on the PCA for each of the methods. Each of the subfolders 1-10 should contain PNG images visualising these PCAs, as well as various .txt files containing the Wasserstein distances.
The Wasserstein distances are collated by running the /experiment/805/analyse_pca.sh script.
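For readers unfamiliar with the metric, the simplest case of a Wasserstein distance can be sketched in a few lines. This is only an illustration of the one-dimensional, equal-sample-size case (for example, along a single principal component); how the repository actually computes the distance on the PCA is not described here, and the function name is hypothetical.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1D samples.

    For equal-size empirical samples this is the mean absolute difference
    between the sorted values of each sample.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)
```

A smaller value indicates that the synthetic sample lies closer to the source sample along that component.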
The 'attribute' experiment contains the code to measure how closely the synthetic data produced by each method matches its source versus a similar dataset, which gives a measure of privacy. The results of this experiment are stored in the /results.txt file.
The 'ld' experiment produces pictures showing how well each method reproduces the LD structure on the AGBL4 gene. The LD scores between the first 2000 SNPs in the source AGBL4 dataset, and those reproduced in the output of each method, are written as .png files in this directory.
The 'ld_error' experiment contains 4 subfolders, one for each of the genes considered in the paper. Each sub-experiment computes 1000 synthetic genomes for each method and gene, and computes the squared error in LD reproduction across a range of genome window sizes. The output of these experiments is graphs in .png files.
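To make the idea of a windowed squared LD error concrete, here is a minimal sketch. It is not the repository's implementation: it assumes phased 0/1 alleles, uses the standard r^2 statistic as the LD score, treats monomorphic sites as having LD 0, and averages over SNP pairs at most `window` positions apart. All function names are hypothetical.

```python
def r_squared(a, b):
    """LD r^2 between two SNPs given as equal-length lists of 0/1 alleles."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # assumption: monomorphic site, LD treated as 0
    d = p_ab - p_a * p_b
    return d * d / denom


def ld_squared_error(source, synthetic, window):
    """Mean squared difference in r^2 over SNP pairs (i, j), 0 < j - i <= window.

    `source` and `synthetic` are lists of SNP columns covering the same sites.
    """
    errors = []
    for i in range(len(source)):
        last = min(i + window, len(source) - 1)
        for j in range(i + 1, last + 1):
            diff = r_squared(source[i], source[j]) - r_squared(synthetic[i], synthetic[j])
            errors.append(diff * diff)
    return sum(errors) / len(errors) if errors else 0.0
```

Evaluating ld_squared_error for several values of `window` gives one error value per window size, which is the kind of curve the experiment plots.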
The 'pharmacogenetic' experiment computes 1000 synthetic sets of chromosomes 10 and 16 with Genomator. This directory contains information on how to analyse these data to perform the relevant pharmacogenetic SNP analysis across ethnicities via PCA.
The 'quadruplets' experiment contains code to investigate how many private versus fictitious SNP quadruplets are generated by each of the methods on the AGBL4 gene dataset. The output of this experiment is a .png graph and a results.txt file.
The 'reverse' experiment contains a script to run Genomator and then use reverse Genomator to detect the number of privacy-exposed instances. The results are generated and stored in the /results subdirectory, where a .png file should contain the experiment results graph.
The 'runtimes' experiment calls each of the methods on iteratively larger portions of the human genome dataset, up to the full 22 chromosomes. The output is a 'runtime_results.txt' file that should show how long each call took to succeed (or fail).
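Timing a call that may either succeed or fail can be sketched as below. This is a hedged illustration rather than the experiment's own code: the function name is hypothetical, success is assumed to mean a zero exit status, and the format of 'runtime_results.txt' is not reproduced.

```python
import subprocess
import time


def time_command(command):
    """Run `command` (a list of arguments); return (elapsed_seconds, succeeded).

    Assumption: a zero exit status counts as success; a command that cannot
    be started at all counts as a failure.
    """
    start = time.perf_counter()
    try:
        completed = subprocess.run(command, check=False)
        succeeded = completed.returncode == 0
    except OSError:
        succeeded = False
    elapsed = time.perf_counter() - start
    return elapsed, succeeded
```

Recording the elapsed time for failed calls as well shows how long a method ran before giving up.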