henryjuho/parus_indel

Great tit (Parus major) INDEL analysis

Pipeline from Barton and Zeng (2019).

Henry Juho Barton
Department of Animal and Plant Sciences, The University of Sheffield

Introduction

This repository outlines the pipeline used to generate and analyse an INDEL dataset from 10 high-coverage (mean coverage = 44X) great tit (Parus major) genomes (described here: Corcoran et al. 2017). The repository is subdivided by processing step.

Programs required

Note: most scripts use the script 'qsub_gen.py', which is designed to submit jobs, in the form of shell scripts, to the Sun Grid Engine. If only the shell scripts are required, the '-OM' option in the 'qsub_gen.py' command line within the scripts can be changed from 'q' to 'w'. Alternatively, some scripts use the Python qsub wrapper module qsub.py, described here: https://github.com/henryjuho/python_qsub_wrapper.
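As a rough illustration of the '-OM' switch described above, the sketch below rewrites the value that follows '-OM' in an argument list. It assumes the option takes its value as the next argument and that 'q' means "submit" while 'w' means "write the shell script only"; the helper name `set_output_mode` is hypothetical and is not part of 'qsub_gen.py'. All other 'qsub_gen.py' arguments are deliberately left out.

```python
def set_output_mode(argv, mode):
    """Return a copy of argv with the value after '-OM' replaced by mode.

    Assumption: '-OM' is followed by a single value, 'q' or 'w'.
    """
    if mode not in ("q", "w"):
        raise ValueError("mode must be 'q' or 'w'")
    new_argv = list(argv)
    idx = new_argv.index("-OM")  # raises ValueError if '-OM' is absent
    if idx + 1 >= len(new_argv):
        raise ValueError("'-OM' has no value")
    new_argv[idx + 1] = mode
    return new_argv


# Example: switch a command from submitting to writing the shell script only.
command = ["qsub_gen.py", "-OM", "q"]
shell_only = set_output_mode(command, "w")
```

In practice the same change is made by editing the 'qsub_gen.py' command line inside the relevant script.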

Pre-prepared files required for analysis

  • Reference genome: /fastdata/bop15hjb/GT_ref/Parus_major_1.04.rename.fa
  • Reference genome index file: /fastdata/bop15hjb/GT_ref/Parus_major_1.04.rename.fa.fai
  • GFF annotation file: /fastdata/bop15hjb/GT_ref/GCF_001522545.1_Parus_major1.0.3_genomic.gff.gz
  • All sites VCF: /fastdata/bop15hjb/GT_data/BGI/bgi_10birds.raw.snps.indels.all_sites.vcf
  • Repeat masker bed file: /fastdata/bop15hjb/GT_data/BGI_10_repeats/ParusMajorBuild1_v24032014_reps.bed
  • BAM files for SAMtools calling: /fastdata/bop15hjb/GT_data/BGI_10_BAM/*.bam
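
The paths above are specific to the authors' cluster, so it may help to confirm the equivalent files exist before running any step. The following is a minimal sketch of such a check, not part of the repository; the function name `find_missing` is hypothetical, and the paths are simply copied from the list above.

```python
import glob
import os

# Paths copied from the list above; adjust them to your own file system.
REQUIRED_FILES = [
    "/fastdata/bop15hjb/GT_ref/Parus_major_1.04.rename.fa",
    "/fastdata/bop15hjb/GT_ref/Parus_major_1.04.rename.fa.fai",
    "/fastdata/bop15hjb/GT_ref/GCF_001522545.1_Parus_major1.0.3_genomic.gff.gz",
    "/fastdata/bop15hjb/GT_data/BGI/bgi_10birds.raw.snps.indels.all_sites.vcf",
    "/fastdata/bop15hjb/GT_data/BGI_10_repeats/ParusMajorBuild1_v24032014_reps.bed",
]
BAM_PATTERN = "/fastdata/bop15hjb/GT_data/BGI_10_BAM/*.bam"


def find_missing(paths, patterns=()):
    """Return paths that are not files, plus glob patterns that match nothing."""
    missing = [p for p in paths if not os.path.isfile(p)]
    missing += [pat for pat in patterns if not glob.glob(pat)]
    return missing


if __name__ == "__main__":
    for item in find_missing(REQUIRED_FILES, [BAM_PATTERN]):
        print("missing:", item)
```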

Pipeline

Generating the dataset

The variant calling and filtering pipeline for both SNPs and INDELs is described here: variant_calling/.

Multispecies alignment and INDEL polarisation

The generation of a multiple species alignment between zebra finch, great tit and flycatcher, and its use in polarising variants and identifying ancestral repeats, is described here: alignment_and_polarisation/.

Annotating the data

Variant annotation using the NCBI GFF file is described here: annotation/.

Summary statistics and analyses

The calculation of summary statistics and other data summary analyses are documented here: summary_analyses/.

Anavar analyses

Analysis of the INDEL data with the anavar package is described here: anavar_analyses/.

Proximity analyses

Analysis of INDEL data in windows of increasing distance from exons is described here: gene_proximity_analyses/.

Recombination analyses

Pipeline for relating INDEL diversity and Tajima's D with recombination rate is documented here: recombination_analyses/.
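
For readers unfamiliar with the statistic, the sketch below computes Tajima's D from an unfolded site frequency spectrum using the standard Tajima (1989) formulae. It is an illustration only and may differ from how the scripts in recombination_analyses/ calculate it; the function name `tajimas_d` and the example spectra are hypothetical. The sample size of 20 in the example assumes two chromosomes from each of the 10 birds.

```python
from math import sqrt


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS.

    sfs[i - 1] is the number of sites whose derived allele is seen i times,
    for i = 1..n-1, where n is the number of sampled chromosomes.
    """
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    if n < 4 or s == 0:
        raise ValueError("need at least 4 chromosomes and one segregating site")

    # Nucleotide diversity (pi): mean number of pairwise differences.
    pi = sum(x * 2.0 * i * (n - i) / (n * (n - 1))
             for i, x in enumerate(sfs, start=1))

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    theta_w = s / a1  # Watterson's estimator
    return (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical spectra for n = 20 chromosomes (19 frequency classes).
neutral_like = [10.0 / i for i in range(1, 20)]  # proportional to 1/i
singleton_rich = [30] + [0] * 18                 # excess of rare variants
d_neutral = tajimas_d(neutral_like)
d_rare = tajimas_d(singleton_rich)
```

A spectrum proportional to 1/i gives D close to zero, while an excess of low-frequency variants gives a negative D.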

Length analyses

Analysis of the impact of INDEL length on the site frequency spectrum (SFS) is documented here: length_analyses/.