splam-analysis-results

All the scripts needed to reproduce the results in the Splam paper.

Generalization Tests

benchmark/src_generalization_test/

All outputs from each data-processing step are saved in that step's {#}_output/ folder.
{name} refers to the name of the gene database.

Part 1: Generating and pre-processing the data

Positive Dataset

The positive data is extracted from the complete genomic GFF annotation files of each species.
1. To start, retrieve all introns from the GFF file:
```
 $ python 1_get_pos_set.py
```
  Inputs (downloaded together from the NCBI Genome Database):
  - {name}.gff = annotations corresponding to a genome
  - {name}_genomic.fa = genome's fasta file
  - {name}_annotation_report.txt = annotation report
  Outputs:
  - databases/{name}.db = sqlite3-style databases parsed from the .gff annotation files
  - {name}_introns.bed = extracted introns
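
As a rough illustration of what intron retrieval involves, the sketch below derives introns as the gaps between consecutive exons of a transcript. It works on plain tuples rather than a gffutils database, and the function name `introns_from_exons`, the tuple layout, and the BED-like output columns are assumptions made for this example, not the contents of 1_get_pos_set.py.

```python
from collections import defaultdict

def introns_from_exons(exons):
    """Derive introns from exon records.

    exons: iterable of (transcript_id, chrom, start, end, strand) with
    1-based inclusive coordinates, as in GFF (hypothetical layout).
    Returns BED-like tuples (chrom, start, end, name, strand) with
    0-based half-open coordinates.
    """
    by_transcript = defaultdict(list)
    for tx, chrom, start, end, strand in exons:
        by_transcript[(tx, chrom, strand)].append((start, end))

    introns = []
    for (tx, chrom, strand), spans in by_transcript.items():
        spans.sort()
        # Each gap between two consecutive exons is one intron.
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start - prev_end > 1:
                # 1-based intron [prev_end + 1, next_start - 1]
                # becomes 0-based half-open [prev_end, next_start - 1).
                introns.append((chrom, prev_end, next_start - 1, tx, strand))
    return introns

toy_exons = [
    ("tx1", "chr1", 101, 200, "+"),
    ("tx1", "chr1", 301, 400, "+"),
    ("tx1", "chr1", 501, 600, "+"),
]
for row in introns_from_exons(toy_exons):
    print("\t".join(map(str, row)))
```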
2. Extract the sequences around each intron according to Splam's input specifications. This step also performs checks and filters to ensure high-quality positive data.
```
 $ python 2_extract_pos_splam.py
```
  Inputs:
  - {name}_introns.bed
  - {name}_genomic.fa
  - {name}_annotation_report.txt
  Outputs:
  - donor.bed, acceptor.bed = bed files referring to the specific 400nt donor and acceptor sequences
  - donor_seq.fa, acceptor_seq.fa = fasta files containing the 400nt sequences
  - d_a.bed = bed file referring to the intron coordinates of the splice junction (midpoint of the donor and acceptor coords)
  - input_neg_random.fa = fasta file containing the 800nt sequence that is given to Splam
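
The relationship between the 400nt donor and acceptor sequences and the 800nt Splam input can be sketched as follows. This is a minimal example that assumes each 400nt window is centered on its splice site (200nt on each side) and that the 800nt input is the donor window followed by the acceptor window. The function name `splam_input`, the `flank` parameter, and the bounds check are hypothetical, and strand handling is omitted.

```python
import random

def splam_input(chrom_seq, donor, acceptor, flank=200):
    """Build donor and acceptor windows plus their concatenation.

    donor, acceptor: 0-based splice-site positions on the forward strand
    (assumed convention). Returns None if a window would run off the
    sequence or if the sites are out of order.
    """
    d_start, d_end = donor - flank, donor + flank
    a_start, a_end = acceptor - flank, acceptor + flank
    if d_start < 0 or a_end > len(chrom_seq) or donor >= acceptor:
        return None
    donor_seq = chrom_seq[d_start:d_end]
    acceptor_seq = chrom_seq[a_start:a_end]
    return donor_seq, acceptor_seq, donor_seq + acceptor_seq

# Toy chromosome of random bases, for illustration only.
toy_chrom = "".join(random.Random(0).choices("ACGT", k=2000))
donor_seq, acceptor_seq, model_input = splam_input(toy_chrom, donor=600, acceptor=1400)
print(len(donor_seq), len(acceptor_seq), len(model_input))  # 400 400 800
```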
3. Extract the same sequences, but with the specifications for SpliceAI.
```
 $ python 3_extract_pos_spliceai.py
```
  Inputs:
  - d_a.bed
  - {name}_genomic.fa
  - {name}_annotation_report.txt
  Outputs:
  - coords.bed = bed file referring to the start and end positions of the whole SpliceAI input
  - seq_noN.fa = fasta file containing the SpliceAI input, with the full flanking sequence (coords refer to splice junction)
  - seq_N.fa = fasta file containing the SpliceAI input, with repeating N flanking sequence (coords refer to splice junction)
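
The difference between the two SpliceAI inputs can be illustrated with a short sketch: one variant keeps the real genomic sequence on both sides of the region, the other replaces the flanks with runs of N. The function name `spliceai_inputs` and its parameters are hypothetical. The default flank of 5,000nt mirrors the 10,000nt context of the SpliceAI-10k model, but whether these scripts use that value, and how they treat chromosome ends, are assumptions.

```python
def spliceai_inputs(chrom_seq, start, end, flank=5000):
    """Return (seq_noN, seq_N) for the region chrom_seq[start:end].

    start, end: 0-based half-open coordinates (assumed convention).
    seq_noN uses real genomic flanks, padded with N only where the
    chromosome ends; seq_N uses N for the entire flank on both sides.
    """
    core = chrom_seq[start:end]
    left = chrom_seq[max(0, start - flank):start]
    right = chrom_seq[end:end + flank]
    seq_noN = ("N" * (flank - len(left)) + left + core
               + right + "N" * (flank - len(right)))
    seq_N = "N" * flank + core + "N" * flank
    return seq_noN, seq_N

# Toy example with a small flank so the result is easy to inspect.
toy_chrom = "ACGT" * 10
seq_noN, seq_N = spliceai_inputs(toy_chrom, start=15, end=25, flank=10)
print(seq_noN)
print(seq_N)
```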

Negative Dataset

You will need to randomly generate the negative splice junction dataset. This works by taking the existing protein-coding genes, then selecting the opposite strand to guarantee unique sequences, and creating pseudo-splice-junctions from GT-AG pairs found on this strand.
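
The idea can be sketched on a toy sequence: take the reverse complement of a gene, locate every GT and AG dinucleotide, and pair them into pseudo-introns. The names `pseudo_junctions`, `min_len`, `max_len`, and `seed` are hypothetical, and the exhaustive pairing is only practical for short sequences; the actual scripts may sample and filter differently.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the opposite-strand sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def pseudo_junctions(gene_seq, n, min_len=4, max_len=None, seed=0):
    """Sample up to n GT-AG pseudo-junctions from the opposite strand.

    Returns (opposite_strand_seq, pairs). Each pair (d, a) is a 0-based
    half-open interval that starts with GT and ends with AG.
    """
    rng = random.Random(seed)
    opp = reverse_complement(gene_seq).upper()
    donors = [i for i in range(len(opp) - 1) if opp[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(opp) - 1) if opp[i:i + 2] == "AG"]
    pairs = [
        (d, a)
        for d in donors
        for a in acceptors
        if a - d >= min_len and (max_len is None or a - d <= max_len)
    ]
    rng.shuffle(pairs)  # random selection, reproducible through the seed
    return opp, pairs[:n]

toy_gene = "GGCTCCCTGAAAACTTACGG"
opp, pairs = pseudo_junctions(toy_gene, n=3)
for d, a in pairs:
    print(d, a, opp[d:a])
```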
1. To start, retrieve the protein-coding genes from all four genomes, making use of the gffutils library:
```
 $ python 1_get_neg_set.py
```
  Inputs (downloaded together from the NCBI Genome Database):
  - {name}.gff = annotations corresponding to a genome
  - {name}_genomic.fa = genome's fasta file
  - {name}_annotation_report.txt = annotation report
  Outputs:
  - databases/{name}.db = sqlite3-style databases parsed from the .gff annotation files
  - {name}_genes.bed = protein-coding genes
2. Then, generate the dataset of splice junctions and process it into a format readable by Splam.
```
 $ python 2_extract_neg_splam.py
```
  Inputs:
  - {name}_genes.bed
  - {name}_genomic.fa
  - {name}_annotation_report.txt
  Outputs:
  - donor.bed, acceptor.bed = bed files referring to the specific 400nt donor and acceptor sequences
  - donor_seq.fa, acceptor_seq.fa = fasta files containing the 400nt sequences
  - d_a.bed = bed file referring to the intron coordinates of the splice junction (midpoint of the donor and acceptor coords)
  - input_neg_random.fa = fasta file containing the 800nt sequence that is given to Splam
3. Now process the same splice junctions into a format readable by SpliceAI (it will take a subset of the junctions for efficiency).
```
 $ python 3_extract_neg_spliceai.py
```
  Inputs:
  - d_a.bed
  - {name}_genomic.fa
  - {name}_annotation_report.txt
  Outputs:
  - coords.bed = bed file referring to the start and end positions of the whole SpliceAI input
  - seq_noN.fa = fasta file containing the SpliceAI input, with the full flanking sequence (coords refer to splice junction)
  - seq_N.fa = fasta file containing the SpliceAI input, with repeating N flanking sequence (coords refer to splice junction)

Part 2: Running Splam and SpliceAI

Run the following three steps in both folders of the pipeline. They are essentially the same for both positive and negative datasets.

Run Splam.
```
 $ ./4_splam_runner.sh
```
Inputs:
- input_neg_random.fa
Outputs:
- score.bed = bed file containing the Splam-scored splice junctions
Run SpliceAI. Depending on your system, this may take several days, so you can run each dataset separately:
```
 $ ./5_spliceai_prediction_wrapper.sh {name}
```
Inputs:
- seq_noN.fa
- seq_N.fa
Outputs (5 model output folders, each containing 4 folders named after the databases):
- spliceai_all_seq.name.noN.{name}.tsv, spliceai_all_seq.name.N.{name}.tsv = names and identifiers for the scored splice junctions
- spliceai_all_seq.score.a.noN.{name}.tsv, spliceai_all_seq.score.a.N.{name}.tsv = acceptor site scores for every nt in sequence
- spliceai_all_seq.score.d.noN.{name}.tsv, spliceai_all_seq.score.d.N.{name}.tsv = donor site scores for every nt in sequence
- spliceai_all_seq.score.n.noN.{name}.tsv, spliceai_all_seq.score.n.N.{name}.tsv = neutral (neither) scores for every nt in sequence
Post-process the Splam and SpliceAI scores into a single file for comparison:
```
 $ python 6_compile_data.py
```
Inputs:
- seq_noN.fa, seq_N.fa
- spliceai_all_seq.name.noN.{name}.tsv, spliceai_all_seq.name.N.{name}.tsv
- spliceai_all_seq.score.a.noN.{name}.tsv, spliceai_all_seq.score.a.N.{name}.tsv
- spliceai_all_seq.score.d.noN.{name}.tsv, spliceai_all_seq.score.d.N.{name}.tsv
- score.bed
Outputs:
- Splam/{name}.splam_data.csv = the compiled Splam scores with various identifiers and metrics
- SpliceAI/spliceai_data.v{#}.noN.{name}.csv, SpliceAI/spliceai_data.v{#}.N.{name}.csv = the compiled SpliceAI scores with various identifiers and metrics
- combine/aggregate_data.noN.{name}.csv, combine/aggregate_data.N.{name}.csv = the complete list of data compiled from both Splam and SpliceAI, with all data included (no averaging)
- combine/averaged_data.noN.{name}.csv, combine/averaged_data.N.{name}.csv = scores averaged across the 5 models; the most convenient form for downstream use
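
The averaging across the 5 SpliceAI models can be pictured with the sketch below, which groups per-model rows by junction and takes the mean of the donor and acceptor scores. The field names (`junction`, `model`, `donor_score`, `acceptor_score`) and the function `average_across_models` are hypothetical and are not the actual columns of the compiled CSV files.

```python
from collections import defaultdict
from statistics import mean

def average_across_models(rows):
    """Average per-model scores for each junction.

    rows: iterable of dicts with the hypothetical keys 'junction',
    'model', 'donor_score', and 'acceptor_score'.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["junction"]].append(row)

    averaged = {}
    for junction, group in grouped.items():
        averaged[junction] = {
            "donor_score": mean(r["donor_score"] for r in group),
            "acceptor_score": mean(r["acceptor_score"] for r in group),
            "n_models": len(group),
        }
    return averaged

# Toy scores for one junction from 5 models.
donor_scores = [0.9, 0.8, 0.85, 0.95, 0.9]
acceptor_scores = [0.7, 0.75, 0.8, 0.65, 0.7]
toy_rows = [
    {"junction": "chr1:200-300", "model": m + 1,
     "donor_score": d, "acceptor_score": a}
    for m, (d, a) in enumerate(zip(donor_scores, acceptor_scores))
]
averaged = average_across_models(toy_rows)
print(averaged["chr1:200-300"])
```

Keeping the per-model rows (as in the aggregate files) alongside the averaged ones makes it possible to inspect how much the models disagree on a given junction.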

Part 3: Plotting Results

Navigate to the figures/ folder, which contains a subdirectory for every figure we generated. Navigate to the figure you want and run its plot.py script.

For f1_scores, run stats.py, which generates a results.csv of various performance metrics at a specified threshold (editable in the code).
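
For readers unfamiliar with threshold-based metrics, the sketch below shows how precision, recall, and F1 follow from scores on the positive and negative datasets once a threshold is fixed. The function `metrics_at_threshold` is hypothetical, and treating a score equal to the threshold as a positive call is an assumption; stats.py may define these differently.

```python
def metrics_at_threshold(pos_scores, neg_scores, threshold):
    """Compute confusion counts, precision, recall, and F1.

    pos_scores: scores of true splice junctions.
    neg_scores: scores of pseudo (negative) junctions.
    A junction is called positive when score >= threshold (assumed).
    """
    tp = sum(s >= threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Toy scores, for illustration only.
toy_pos = [0.9, 0.8, 0.4, 0.95]
toy_neg = [0.1, 0.6, 0.2, 0.05]
toy_metrics = metrics_at_threshold(toy_pos, toy_neg, threshold=0.5)
print(toy_metrics)
```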
