GitHub - lauralwd/azolla_MYBs: Phylogeny and classification of Azolla filiculoides MYB genes

This repository contains a phylogenetic tree of R2R3 MYB transcription factors, together with all code and intermediate files used to infer that tree. Many of these results are intermediate and should be treated as such. For the final results, please refer to the quick links listed below.

Manuscript DOI: preprint on bioRxiv

Repository DOI:

Quick links:

treefile
Main text figure png, pdf and Inkscape_svg.
Input sequences fasta
Aligned input sequences fasta, or png
Trimmed input sequences fasta or png

Final figure as shown in Dijkhuizen et al. 2021 with added MSA

The MSA shown below is not included in the manuscript because of size limitations. It shows the region of R2R3 MYBs used to differentiate the subfamilies, as described by Jiang & Rao (2020). The figure actually included in the paper is available here.

Guide through folders and files

The data folder contains (unaligned) fasta files, lists of sequence names, and aligned sequences in both trimmed and untrimmed versions. File names reflect the history of that specific file and therefore tend to be rather long. For example, combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear_aligned-mafft-einsi_trim-gt4.fasta contains a combination of sequences from the subfamilies I to VIII and sequences from Azolla filiculoides and Arabidopsis thaliana. Those sequences were then aligned with mafft-einsi and trimmed with a gap threshold of 0.4 (40%).
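Because the file names encode their own history, they can be unpacked programmatically. The sketch below is not part of the repository: the function name describe_alignment_file is hypothetical, and the parsing rules (underscore-separated steps, an "aligned-" prefix for the aligner, and "trim-gt" followed by the decimal digits of the gap threshold) are assumptions inferred only from the single example file name above. Other files in the data folder may not follow them.

```python
import re
from pathlib import PurePath


def describe_alignment_file(filename):
    """Split a data file name into the processing steps it records.

    Assumed convention (inferred from one example, not documented):
    <dataset>_sequences_..._aligned-<aligner>_trim-gt<digits>.fasta
    """
    stem = PurePath(filename).stem  # drop the .fasta extension

    # Everything before "_sequences" is taken to describe the dataset.
    dataset = stem.split("_sequences")[0]

    # "aligned-<aligner>" runs up to the next underscore (or the end).
    aligned = re.search(r"aligned-([^_]+)", stem)

    # "trim-gt4" is read as a gap threshold of 0.4 (assumption).
    trimmed = re.search(r"trim-gt(\d+)", stem)

    return {
        "dataset": dataset,
        "aligner": aligned.group(1) if aligned else None,
        "gap_threshold": float("0." + trimmed.group(1)) if trimmed else None,
    }


info = describe_alignment_file(
    "combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear"
    "_aligned-mafft-einsi_trim-gt4.fasta"
)
print(info)
```

Fields that cannot be found are returned as None, so untrimmed or unaligned files can be passed through the same function.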

The analyses folder contains tree inferences and annotation information for use in iTOL. These are organised first in folders per starting dataset, and then in folders per alignment and trimming strategy. A single folder may still contain several tree inferences made with IQ-TREE. The final part of the filename summarises the settings used to create a particular tree file. Note that intermediate trees are just that: intermediate results.

The figures folder contains the final versions of the figures shown in the manuscript in several formats. These were made by importing a .treefile in iTOL, adding annotation manually, and downloading the result as an .svg file. Annotation files for use in iTOL can be found in the different directories in the analyses directory. The .svg files were then finalised in Inkscape to their published form and exported as pdf or png.

Jupyter notebooks

The workflows shared here are documented in Jupyter notebooks (*.ipynb). Most notebooks contain intermediate work that is shared for transparency and reproducibility purposes and should be treated as such. Alternatively, the git history may be explored for more information. Note that figures embedded in the Jupyter notebooks may not be displayed correctly online on GitHub. You may download the .ipynb files to display them locally, including images. An HTML export accompanies each Jupyter notebook file as well.

In step1_differentiate_subfamilies_VI_and_VII (html preview & ipynb preview) we gathered R2R3 MYB sequences of subfamilies VI & VII and reproduced findings by Jiang & Rao (2020).
In step2_classify-Azfi-RNAseq-targets (html preview & ipynb preview) we placed several Azolla filiculoides sequences in the phylogeny of subfamily VI & VII R2R3 MYBs and compared the differentiating domains as described by Jiang & Rao (2020).
In step3_VI-subfam_in_azolla (html preview & ipynb preview) missing type VI sequences were identified in the Azolla filiculoides genome with HMMs and added to the phylogeny.
In step4_expanding-phylogeny (html preview & ipynb preview) the phylogenetic analysis was expanded with R2R3 MYB sequences from all subfamilies (I to VIII). Sequences were taken from the Jiang & Rao (2020) paper.
Finally, in step5_supplement-with-arabidopsis-sequences (html preview & ipynb preview) some uninformative and rogue sequences were removed, Arabidopsis thaliana sequences and further Azolla filiculoides sequences were added, and the tree was annotated with RNA-seq data for A. filiculoides.

A template version of the workflow is maintained here.

Finally, the envs directory contains conda environment export files detailing all software names and versions that were used in this project. These files may be used to recreate the exact software environment for this analysis using miniconda. To do so, issue a command such as: conda env create -f ./condaenv.yaml
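If the envs directory holds more than one export file, the same command can be run once per file. The following is a minimal sketch under stated assumptions, not a script shipped with this repository: the helper names are hypothetical, the export files are assumed to end in .yaml (as in the example command above), and conda is assumed to be on the PATH. By default it only prints the commands it would run.

```python
import shlex
import subprocess
from pathlib import Path


def conda_create_command(env_file):
    """Build the `conda env create` command for one export file."""
    return ["conda", "env", "create", "-f", str(env_file)]


def recreate_environments(env_dir="envs", dry_run=True):
    """Print (and optionally run) one command per *.yaml file in env_dir.

    Assumption: every export file in env_dir uses the .yaml extension.
    """
    env_files = sorted(Path(env_dir).glob("*.yaml"))
    commands = [conda_create_command(path) for path in env_files]
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # Stops at the first failing environment.
            subprocess.run(command, check=True)
    return commands


# Dry run: show the commands without creating anything.
recreate_environments("envs", dry_run=True)
```

Calling recreate_environments("envs", dry_run=False) would actually create the environments, which may take a while and requires network access.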

Data sources used in this project

In building these trees, we have made use of publicly available data exclusively. Most notably, we build here upon the work of Jiang & Rao (2020). Azolla automated annotations are available on Fernbase. The manually re-annotated A. filiculoides R2R3 MYB sequence is made available in ENA and NCBI under accession number [....] . This sequence and all raw RNA-seq reads used in this project are also made available in ENA and NCBI under project accession number [....] .

All sequences taken from the several databases used here and their original accession numbers are listed in the data folder, organised in files per subfamily type. These sequences originate from several databases, each with a slightly different naming system. The Jiang & Rao (2020) paper lists each of the species used here, and where to find the right database to search for accession numbers. Those are predominantly:

NCBI nucleotide and protein.
Fernbase for Azolla filiculoides and Salvinia cucullata.
ConGenIE for Picea abies.
marchantia.info for Marchantia polymorpha.
UniProt for Arabidopsis thaliana sequences.

Links

The Azolla lab at Utrecht University
A MIKC phylogeny workflow, similar to this one and featured in the same preprint.
A blank version of this workflow

Authors

The analyses in this repository were conceived and executed by Dr. Henriette Schluepmann (orcid, Utrecht University) and PhD candidate Laura Dijkhuizen (orcid, Utrecht University website).
