Privacy-hardened, hallucination-resistant synthetic data generator.

aehrc/genomator

Experiment Scripts for Genomator Paper

This repository accompanies the Genomator paper. It contains the experimental procedures intended to reproduce the results featured in that paper. To run these procedures, install the Python package in the "experiment_tools" directory into a Python environment. Installing this package should also install the range of Python libraries it requires. These libraries are expected to work on a system that has a GPU (GPU functionality can otherwise be disabled, as detailed in the following sections) and that can support the PyTorch and TensorFlow system libraries.

In addition to the Python libraries, the following system utilities must be accessible to the script files:

  • bcftools (including the 'bcftools' and 'bgzip' command-line utilities)
  • plink2
  • wget
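
Before running anything, it can help to confirm that these utilities are on the PATH. The following is a minimal Python sketch and is not part of the repository: the function name `missing_tools` is hypothetical, and the command names are simply those given in the list above.

```python
import shutil

# Command names taken from the list of required system utilities.
REQUIRED_TOOLS = ["bcftools", "bgzip", "plink2", "wget"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing utilities: " + ", ".join(missing))
    else:
        print("All required utilities were found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the defaults are enough.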

Once everything is installed (the Python package 'experiment_tools' with PyTorch and TensorFlow, and the utilities bcftools, plink2 and wget), run a single script:

/experiment/run_me.sh

This script runs the /experiment/sources/script.py file, which downloads and processes all source data. It then scans all subdirectories of /experiment/* for any files named /script.sh and runs them. Each of these /script.sh files contains the code to execute the respective experiment, and the results for each experiment should end up in the corresponding subdirectory. Note that the run_me.sh script takes a LONG time to complete. As an alternative, any specific experiment can be run by executing the script.sh file in the desired experiment folder, provided that /experiment/sources/script.py has first successfully downloaded and formatted the input datasets for these experiments.
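
The discovery step described above can be pictured roughly as follows. This is a sketch only and does not reproduce the contents of run_me.sh: the function name `find_experiment_scripts` is hypothetical, and it assumes that each experiment's script.sh sits somewhere beneath the experiment directory.

```python
from pathlib import Path


def find_experiment_scripts(experiment_root):
    """Return every file named script.sh beneath experiment_root, sorted by path."""
    root = Path(experiment_root)
    return sorted(p for p in root.rglob("script.sh") if p.is_file())
```

Listing the scripts before running them gives a quick overview of which experiments are present, for example `for s in find_experiment_scripts("experiment"): print(s.parent)`.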

Running without GPU

Various scripts and Python files in /experiment/ set the flag --gpu=True. Change this to --gpu=False, which should allow them to run on systems without a GPU.
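
To see where the flag occurs before editing anything, a dry run along the following lines may be useful. This is a sketch under assumptions rather than a tool shipped with the repository: the function names are hypothetical, and it assumes the flag only appears in .sh and .py files, written exactly as `--gpu=True`.

```python
from pathlib import Path


def disable_gpu_flag(text):
    """Replace every '--gpu=True' with '--gpu=False' and count the replacements."""
    count = text.count("--gpu=True")
    return text.replace("--gpu=True", "--gpu=False"), count


def preview_changes(experiment_root, suffixes=(".sh", ".py")):
    """Dry run: map each matching file to the number of flags it contains."""
    report = {}
    for path in sorted(Path(experiment_root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            _, count = disable_gpu_flag(path.read_text(errors="ignore"))
            if count:
                report[str(path)] = count
    return report
```

`preview_changes` never writes to disk, so the files it reports can be reviewed and edited by hand or with `disable_gpu_flag`.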

Experiment details

805

The 805 experiment contains code to reproduce the V-shaped PCAs associated with the 805 SNP data, including calculating the Wasserstein distance on the PCA for each of the methods. Each of the subfolders 1-10 should contain PNG images visualising these PCAs, as well as various .txt files containing the Wasserstein distances. The Wasserstein distances are collated by running the /experiment/805/analyse_pca.sh script.
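
For readers unfamiliar with the metric, the sketch below shows one simple way a Wasserstein distance between source and synthetic PCA coordinates could be computed. It is an illustration under stated assumptions, not the repository's method: this README does not say how the distance is defined on the PCA, so the sketch assumes equally sized samples and compares each principal component separately as a 1-D distribution. Both function names are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equally sized 1-D samples.

    For equal sample sizes this is the mean absolute difference
    between the sorted values.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def per_component_distance(source_pcs, synthetic_pcs):
    """Apply wasserstein_1d to each principal component (column) in turn.

    Each argument is a list of rows, one row of PC coordinates per sample.
    """
    source_cols = list(zip(*source_pcs))
    synthetic_cols = list(zip(*synthetic_pcs))
    return [wasserstein_1d(list(s), list(t))
            for s, t in zip(source_cols, synthetic_cols)]
```

A smaller value on a component means the synthetic samples are distributed more like the source samples along that component.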

attribute

The 'attribute' experiment contains the code to measure how closely the synthetic data produced by each method matches its source dataset compared with a similar dataset, which gives a measure of privacy. The results of this experiment are stored in the /results.txt file.

ld

The 'ld' experiment produces pictures showing how well each method reproduces the LD structure of the AGBL4 gene. The LD scores between the first 2000 SNPs in the source AGBL4 dataset, and those reproduced in the output of each method, are written as .png files in this directory.

ld_error

The 'ld_error' experiment contains 4 subfolders, one for each of the genes considered in the paper. Each sub-experiment generates 1000 synthetic genomes for each method and gene, and computes the square error in LD reproduction across a range of genome window sizes. The output of these experiments is a set of graphs in .png files.
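
As a rough illustration of what a square error in LD reproduction can look like, the sketch below computes the standard r² statistic between pairs of SNPs and averages the squared difference between source and synthetic data over pairs within a window. It rests on assumptions that this README does not confirm: it uses haplotype-style 0/1 alleles, r² as the LD score, and a mean rather than a sum. The function names are hypothetical.

```python
def r_squared(snp_a, snp_b):
    """LD r^2 between two SNPs given as equal-length lists of 0/1 alleles."""
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(a * b for a, b in zip(snp_a, snp_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # monomorphic SNP: r^2 is undefined, treated as 0 here
    d = p_ab - p_a * p_b
    return d * d / denom


def ld_square_error(source, synthetic, window):
    """Mean squared difference in r^2 over SNP pairs at most `window` apart.

    source and synthetic are lists of SNP columns (one allele list per SNP),
    with the same number of SNPs in the same order.
    """
    errors = []
    for i in range(len(source)):
        for j in range(i + 1, min(i + window + 1, len(source))):
            diff = (r_squared(source[i], source[j])
                    - r_squared(synthetic[i], synthetic[j]))
            errors.append(diff * diff)
    return sum(errors) / len(errors) if errors else 0.0
```

Repeating the calculation for several values of `window` gives one error value per window size, which is the kind of series that can be plotted as a graph.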

pharmacogenetic

The 'pharmacogenetic' experiment generates 1000 synthetic sets of chromosomes 10 and 16 with Genomator. This directory contains information on how to analyse these data, via PCA, to examine relevant pharmacogenetic SNPs across ethnicities.

quadruplets

The 'quadruplets' experiment contains code to examine how many private versus fictitious SNP quadruplets each method generates on the AGBL4 gene dataset. The output of this experiment is a .png graph and a results.txt file.

reverse

The 'reverse' experiment contains a script that runs Genomator and then uses reverse Genomator to detect the number of privacy-exposed instances. The results are generated and stored in the /results subdirectory, where a .png file should contain the graph of the experiment results.

runtimes

The 'runtimes' experiment calls each of the methods on iteratively larger portions of the human genome dataset, up to the full 22 chromosomes. The output is a 'runtime_results.txt' file that should show how long each call took to succeed (or fail).
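
The general idea of timing a call and recording whether it succeeded or failed can be sketched as below. This is an illustration only: the function names are hypothetical, and the line format shown is an assumption, not the actual layout of 'runtime_results.txt'.

```python
import time


def time_call(label, func, *args):
    """Run func(*args) and return (label, elapsed seconds, status)."""
    start = time.perf_counter()
    try:
        func(*args)
        status = "succeeded"
    except Exception:
        status = "failed"
    return label, time.perf_counter() - start, status


def format_result(label, seconds, status):
    """Render one timing record as a tab-separated line (assumed format)."""
    return f"{label}\t{seconds:.3f}s\t{status}"
```

Recording failures alongside their elapsed time, rather than aborting, keeps the results comparable when a method cannot handle the larger inputs.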
