MassiveSeq_Ataxia

This is a repostory for "massive" RNA-seq to analyze roughly 100 Friedreich's ataxia RNA-seq datasets.

pipeline overview

Here is an overview of the proposed pipeline . The key "guts" of our pipeline is a snakemake program to make the key sequence of mapping-quantification easily parralization within and across compute machines availaible to researchers.

Methods

We began by conducting a literature search of all publically deposited Friedreich's Ataxia RNA-seq datasets on SRA. This led to 152 initial samples. We then decided to limit our scope to only human datatests and excluded one small-RNA datasets, as the methods for analyzing smallRNA tend to be fairly different from a "typical" RNA-seq study.

As mentioned in the introduction, we planned the main pipeline in snakemake to automate dispatching of jobs depending on the available cores and memory of a machine (cite). Here, the core steps involved Hisat2 for alignment, followed by Stringtie for transcript annotation and de novo annotation. Finally, reads were quantified using featureCounts from the subRead package (cite). This quantification pipeline follows a common, recently published protocol on Stringtie and Hisat2 (Pertea et al 2016 Nature Protocols).

Once the gene counts were fully quantified for each sample, we analyzed the overall dataset comprised of all 4 studies using the R package metaSeq.

Methods v2 (from paper)

We began by conducting a brief literature survey of the Friedreich’s ataxia RNA-seq datasets. Then, we followed the following steps to download the RNA-seq datasets: Steps to get data:

Search on Google SRA NCBI
Search ataxia in SRA NCBI
Click on RNA (367) on left hand side of page
Click on Homo sapiens (157) on right hand side of page (click on 157)
Click on Send to
Click on Run Selector
Click on Go
Click on Instrument on the left hand side
On bottom left hand side, click on illumina hiseq 2000, illumina hiseq 2500, and nextseq 500
Should see 99 runs found (cases and controls)
Click on green plus sign on 99 runs found
Download data by clicking on RunInfo Table
Open excel file and use default for tab de-limited data to bring in data
This was main dataset we used to merge with other datasets
Go to GEO to get information on studies to merge with main data file:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104288
Get count data at this link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104288

  Restructure GEO file so that the gene numbers are rows and IDs are columns

Then, we merged the datasets using a common variable

We elected to focus on human data and as a result, excluded one dataset that was smallRNA, leaving 4 studies in total and 83 samples (41 samples were controls, 39 were cases; the remaining 3 were parental carriers) in our final dataset. Next, we conducted differential analyses using GEO and count datasets in R to compare differences in gene expression between cases and controls.

Running the pipeline

If you are new to Snakemake, you are in for a slight learning curve (it's not tooo bad). It's generally easiest to go with a conda install

conda create -n snakemake python=3
source activate snakemake
conda install -c bioconda snakemake

#example submission from our ginormous compute  machines on jetstream with 180GB ram and 44 cores
# if you have slurm/etc, you can distribute with a little extra legwork
snakemake -j 42 -s Snakefile_massiveseq --keep-going --latency-wait 900

Separately, once the results come out you can analyze the counts data with "metaseq_ataxia.R" , you'll need the counts.txt file. If you are doing it on a novel dataset, you'll need to generate the study and case/control flags from, for example the SRA sample sheet we began with.

Results

We have identified some promising truncated transcripts in the current main Friedreich's Ataxia gene, FXN. extra vis from the metaseq to come?

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
data		data
prep_input		prep_input
.gitignore		.gitignore
GSE104288_DESeq_counts.txt		GSE104288_DESeq_counts.txt
LICENSE		LICENSE
MassiveSeq Flow Diagram v3.docx		MassiveSeq Flow Diagram v3.docx
MassiveSeq Flow Diagram v3.pdf		MassiveSeq Flow Diagram v3.pdf
MassiveSeq PowerPoint condensed.pdf		MassiveSeq PowerPoint condensed.pdf
MassiveSeq PowerPoint.pptx		MassiveSeq PowerPoint.pptx
README.md		README.md
SRA_GEO_matchID_v2.txt		SRA_GEO_matchID_v2.txt
SRA_GEO_matchID_v3.txt		SRA_GEO_matchID_v3.txt
SRA_GEO_matchID_v3.xlsx		SRA_GEO_matchID_v3.xlsx
Snakefile_massiveseq.txt		Snakefile_massiveseq.txt
Snakefile_massiveseq_st		Snakefile_massiveseq_st
SraFilternomirna.txt		SraFilternomirna.txt
SraRunTable (Humans).txt		SraRunTable (Humans).txt
SraRunTable.txt		SraRunTable.txt
SraTableFiltered.txt		SraTableFiltered.txt
atax_expr_fastai.csv		atax_expr_fastai.csv
c_get_counts.sh		c_get_counts.sh
config.yml		config.yml
doCounts.sh		doCounts.sh
favicon.ico		favicon.ico
hi.txt		hi.txt
index.html.old		index.html.old
index.php		index.php
info.php		info.php
install.sh		install.sh
lesson4-tabular_genomic.ipynb		lesson4-tabular_genomic.ipynb
lol.cpp		lol.cpp
metaseq_ataxia.R		metaseq_ataxia.R
metaseq_prepfastai_v2.R		metaseq_prepfastai_v2.R
patrick.jpg		patrick.jpg
sig_genes_metaseq.png		sig_genes_metaseq.png
style.css		style.css

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MassiveSeq_Ataxia

pipeline overview

Methods

Methods v2 (from paper)

Running the pipeline

Results

About

Releases 1

Packages

Contributors 4

Languages

License

NCBI-Hackathons/MassiveSeq

Folders and files

Latest commit

History

Repository files navigation

MassiveSeq_Ataxia

pipeline overview

Methods

Methods v2 (from paper)

Running the pipeline

Results

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 4

Languages

Packages