List of ideas to improve assemblies #57

Open
d4straub opened this issue Aug 19, 2021 · 9 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)

@d4straub
Collaborator

d4straub commented Aug 19, 2021

This is a collection of ideas that should be considered after the DSL2 conversion #56 is finished. The list is subject to change. Any ideas or discussions are welcome.

Preprocessing (check out nf-core/mag, any other examples out there?)

  • Filtlong to filter ONT reads by quality (e.g. >7)
  • Bowtie2 to remove Illumina PhiX reads
  • Nanolyse (alternatively Minimap2) to remove ONT Lambda reads
  • add option to down-sample reads, because sometimes this can actually improve assembly

Assemblers:

  • MEGAHIT (a5-miseq, see "Add A5-miseq support" #23, ...) to have an alternative short-read assembler
  • Trycycler to have better hybrid and long read assembly than Unicycler
  • Flye (Tulip, Redbean, Raven) to have more long read assemblers at hand
  • Pilon to polish Nanopore-derived contigs with Illumina reads (for long read assemblers)

Assembly QC:

  • BUSCO to check completeness and contamination of assemblies (and possibly bins)
  • MaxBin2 (or any other binner) to separate the assembly (cleanup if contaminated). In contrast to other binners, MaxBin2 outputs "Completeness, Genome size, GC content" for each bin it finds, which comes in very handy when judging whether there is real contamination.

Structural:

  • Use only the most polished assembly for Prokka & QUAST (currently assemblies before polishing are used!)
  • By default, run all (or at least many) assemblers, including polishing (Medaka & Pilon), that are appropriate for a data set. That allows easy comparison (with e.g. QUAST and BUSCO) of the performance of different assemblers and choosing the best assembly.
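
To illustrate the first point (passing only the most polished assembly on to Prokka & QUAST), here is a minimal Python sketch of the selection logic. It is not pipeline code: the stage names and the order "raw, then Medaka, then Pilon" are assumptions made for the example, and the file paths are hypothetical.

```python
# Assumed polishing order, from least to most polished (hypothetical stage names).
POLISH_ORDER = ["raw", "medaka", "pilon"]


def most_polished(assemblies):
    """Pick the most polished assembly that is available.

    assemblies: dict mapping a stage name to the path of its FASTA file.
    Returns a (stage, path) tuple.
    """
    # Walk the stages from most to least polished and take the first one present.
    for stage in reversed(POLISH_ORDER):
        if stage in assemblies:
            return stage, assemblies[stage]
    raise ValueError("no assembly available")


if __name__ == "__main__":
    # Hypothetical sample: polished with Medaka only, no Illumina reads for Pilon.
    print(most_polished({"raw": "sample.fasta", "medaka": "sample.medaka.fasta"}))
```

Downstream annotation and QC would then take the returned path instead of the unpolished assembly.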

Defaults

  • In my opinion, --skip_kraken2 should be either removed (i.e. using --krakendb to determine whether Kraken2 is used) or a simple default (small, fast, but helpful) value should be chosen for --krakendb, e.g. "https://genome-idx.s3.amazonaws.com/kraken/16S_Greengenes13.5_20200326.tgz". This is a very small 16S database but should be sufficient to detect serious bacterial contamination.
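
The two options above can be sketched as plain decision logic. This is Python for illustration only, not the pipeline's Nextflow code; the function name and the `use_default_db` switch are hypothetical, and only the database URL comes from the proposal itself.

```python
# Small 16S database suggested above as a possible default.
DEFAULT_KRAKENDB = (
    "https://genome-idx.s3.amazonaws.com/kraken/16S_Greengenes13.5_20200326.tgz"
)


def resolve_kraken2(krakendb=None, use_default_db=False):
    """Return (run_kraken2, database) without needing a --skip_kraken2 flag.

    Option 1 (use_default_db=False): Kraken2 runs only if --krakendb was given.
    Option 2 (use_default_db=True): fall back to the small default database.
    """
    if krakendb:
        return True, krakendb
    if use_default_db:
        return True, DEFAULT_KRAKENDB
    return False, None


if __name__ == "__main__":
    print(resolve_kraken2())                     # option 1, no database given
    print(resolve_kraken2(use_default_db=True))  # option 2, default database
```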
@d4straub added the enhancement, help wanted, and question labels Aug 19, 2021
@d4straub added this to the 2.1.0 milestone Aug 19, 2021
@Daniel-VM
Contributor

Working on Flye and Pilon!

@erinyoung

add option to down-sample reads, because sometimes this can actually improve assembly

Filtlong can down-sample reads to the longest/highest quality reads and rasusa can downsample randomly.

I know there are more papers about the ideal depth for assembly, but I can only find this old one for now (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0060204).

In my own experience, there are a lot more sequencing artifacts once you get above 100X.
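
For anyone who wants to see the idea behind random down-sampling to a target depth, here is a minimal Python sketch. It is similar in spirit to what rasusa does but is not its implementation; the function names, the seed handling, and the assumption that the genome size is known in advance are all mine.

```python
import random


def downsample_to_depth(read_lengths, genome_size, target_depth, seed=0):
    """Randomly pick reads until roughly `target_depth` X coverage is reached.

    read_lengths: list of read lengths in bases
    genome_size:  estimated genome size in bases (assumed to be known)
    target_depth: desired coverage, e.g. 100 for 100X
    Returns the sorted indices of the reads to keep.
    """
    target_bases = genome_size * target_depth
    indices = list(range(len(read_lengths)))
    random.Random(seed).shuffle(indices)  # random order, reproducible via seed

    kept, total = [], 0
    for i in indices:
        if total >= target_bases:
            break  # enough bases collected
        kept.append(i)
        total += read_lengths[i]
    return sorted(kept)


def depth(read_lengths, genome_size):
    """Mean coverage given read lengths and genome size."""
    return sum(read_lengths) / genome_size


if __name__ == "__main__":
    lengths = [1000] * 1000  # toy data: 1 Mb of reads over a 5 kb "genome"
    keep = downsample_to_depth(lengths, genome_size=5000, target_depth=100)
    print(len(keep), depth([lengths[i] for i in keep], 5000))
```

If the input has less than the target depth, all reads are kept.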

@erinyoung

erinyoung commented Oct 18, 2023

Another idea I recommend adding is a rotation step. This ensures all bacterial chromosomes at least start at dnaA.

A case in point: these are two chromosomes from a clonal outbreak. They are actually very similar, but one wasn't rotated correctly.

[image]

There are a few tools that rotate circular sequences. I think circlator fixstart (abandonware) and dnaapler are the ones that I use most.
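
To make the rotation idea concrete, here is a toy Python sketch of rotating a circular sequence to a new start. It only looks for an exact forward-strand match of a marker string, so it is far simpler than what a real tool has to do (for example, handling a gene on the reverse strand or inexact matches); the function names are hypothetical.

```python
def rotate_circular(seq, start):
    """Rotate a circular sequence so that 0-based position `start` comes first."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def rotate_to_marker(seq, marker):
    """Rotate `seq` so it begins at the first exact occurrence of `marker`.

    The search also covers matches that span the end/start junction of the
    circular sequence. If the marker is not found, `seq` is returned unchanged.
    """
    # Append the first len(marker) - 1 bases so junction-spanning hits are seen.
    doubled = seq + seq[: max(len(marker) - 1, 0)]
    pos = doubled.find(marker)
    return seq if pos == -1 else rotate_circular(seq, pos)


if __name__ == "__main__":
    print(rotate_to_marker("GGGTTTATGAAA", "ATG"))
    print(rotate_to_marker("TGCCCA", "ATG"))  # marker spans the junction
```

Two assemblies of the same circular chromosome that differ only in start position become directly comparable once both are rotated to the same marker.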

@erinyoung

For Assembly QC, I'm a fan of gfastats for metrics about the created GFA files, and of NanoPlot. They have a lot of overlapping features, but gfastats does indicate if a sequence is circular. NanoPlot already has a module in MultiQC.

@d4straub
Collaborator Author

I've actually had very good experience with dragonflye for nanopore assembly (in nf-core modules: https://nf-co.re/modules/dragonflye). The results were close to identical to the Trycycler results, but dragonflye ran very fast (a few minutes), while Trycycler was a chore with many manual interventions.

@Daniel-VM
Contributor

Those are really good points @erinyoung and @d4straub 🙌🏾 🙌🏾 .

Downsample step

Yep, a down-sampling step is indeed necessary. We could try random subsampling with rasusa. De Maio N et al., 2019 mention that the random strategy generates better assemblies than the filtering strategy, but it always depends on the input data and the goal.
Nevertheless, we can think about adding Filtlong or NanoFilt in the quality filtering step (after adapter trimming with Porechop?).

Rotation step

Sure, but I think that Circlator is not supported either... What do you suggest? Adding Circlator together with dnaapler, or just dnaapler?

dragonflye - Long-read assembly

Interesting, I haven't tried this tool yet. But if it avoids the manual intervention of Trycycler, then it would be great to add this module. I know that Flye supports not only ONT but also PacBio reads.
dragonflye works with ONT reads only, doesn't it?

@Daniel-VM
Contributor

I have found these two papers that may help us to decide. Both include a detailed flowchart with some of the tools we already have included and additional tools/strategies:

Molina-Mora J.A. et al., 2020

LaSarre B. et al., 2022

@d4straub
Collaborator Author

Trycycler will require a large effort to automate. For example, see rrwick/Trycycler#47.
So Dragonflye is the way to go for now, I think.

@erinyoung

Here's a blog post from Dr. Wick about depth and quality : https://rrwick.github.io/2023/11/06/accuracy-vs-depth-update.html

You can see in the plot that accuracy improved up to ~100× depth, after which additional reads brought no benefit. In fact, some of the genomes got a bit worse with higher depth, which was surprising.


3 participants