How are the base quality scores generated? #50

nriddiford · 2022-02-15T14:39:16Z

Hi,

I am using tracy assemble to assemble between 2 and 4 trace files. I am outputting the consensus as a .fastq file, and then aligning this to a reference sequence.

Downstream, I am performing some analysis that filters on per-nucleotide quality scores, and I am not sure I understand how the base signal from the chromatogram is translated into the base quality of the consensus calculated within tracy assemble. Typically, I only see 2 different base quality scores on a consensus (e.g. 19 and 24).

Do you have any insight into this?

I'm calling tracy like so:

tracy assemble \
    --format fastq \
    --inccons \
    --trim 3 \
    --outprefix ${colony_id} \
    colony_1_p1.ab1 colony_1_p2.ab1
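
To check how many distinct quality values a consensus actually carries, something like the following could tally them. It is only a sketch: it assumes the FASTQ uses the common Phred+33 encoding and unwrapped four-line records, and the record in the example is made up for illustration rather than real tracy output. The function name `tally_qualities` is hypothetical.

```python
from collections import Counter


def tally_qualities(fastq_text, offset=33):
    """Count Phred scores in a FASTQ string with four lines per record.

    Assumption: qualities are encoded as ASCII code minus `offset` (Phred+33).
    """
    lines = fastq_text.strip().splitlines()
    counts = Counter()
    # The quality string is the fourth line of every record.
    for qual in lines[3::4]:
        counts.update(ord(ch) - offset for ch in qual)
    return counts


# Made-up record: '9' decodes to 24 and '4' decodes to 19 under Phred+33.
example = "@consensus\nACGTAC\n+\n994499\n"
for score, n in sorted(tally_qualities(example).items()):
    print(f"Q{score}: {n} bases")
```

Running this on a real consensus file (read with `open(path).read()`) would show whether only two values such as 19 and 24 occur.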

tobiasrausch · 2022-02-16T14:05:56Z

The quality scoring is indeed a bit of an issue because the input trace qualities are not very useful. The assemble command simply scales a flat quality prior by the fraction of traces supporting the consensus nucleotide. With 2 input traces, either 1 or 2 traces support the consensus nucleotide, so you only see two quality values. With more input traces, you should see a wider range of quality values.
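
As a rough illustration of that idea (not tracy's actual code), the scaling could look like the sketch below. The prior of 40 and the linear scaling in Phred space are assumptions made for the example; the thread does not state the real constant or formula, and this sketch does not reproduce the 19 and 24 reported above. The function name `scaled_quality` is hypothetical.

```python
def scaled_quality(supporting, total, flat_prior=40):
    """Scale a flat quality prior by the fraction of supporting traces.

    Assumptions: `flat_prior` is a placeholder value, and the scaling is
    linear in Phred space. Neither is confirmed for tracy.
    """
    if total <= 0 or not 0 <= supporting <= total:
        raise ValueError("need 0 <= supporting <= total and total > 0")
    return round(flat_prior * supporting / total)


# Number of distinct values grows with the number of input traces.
for total in (2, 4):
    values = sorted({scaled_quality(k, total) for k in range(1, total + 1)})
    print(f"{total} traces -> possible qualities {values}")
```

Under these assumptions, a position can take at most as many distinct quality values as there are input traces, which is why a 2-trace assembly shows only two.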

nriddiford · 2022-03-08T15:02:24Z

OK thanks - that's interesting. I'm using Tracy to detect errors in sequencing data, which can range from 1 trace (where I use basecall) to 4 traces (assemble).

From your explanation, it sounds as though forming a consensus between 2 traces for a given nucleotide doesn't consider the quality of the individual base calls, and instead just looks at the fraction of traces that support the consensus.

Below summarises my understanding for 4 different base quality configurations for the assembly of 2 trace files - is this accurate? To my mind, the 3rd and 4th scenarios should have lower quality values than the 1st.

Part of the problem for me is that I want to have some estimate of the per-base quality score, so that I can confidently calculate the per-base error rate. In practice, this is hard using tracy because the quality scores change depending on how many trace files I use, and don't seem too comparable between a 2-trace assembly and a 4-trace assembly.
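
For reference, this is the conversion I have in mind, sketched with the standard Phred definition P = 10^(-Q/10). The function names are hypothetical, and the result is only a meaningful error rate if the scores are calibrated, which may not hold when they reflect the fraction of supporting traces.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def mean_error_rate(quals):
    """Average the per-base error probabilities implied by Phred scores."""
    if not quals:
        raise ValueError("no quality scores given")
    return sum(phred_to_error(q) for q in quals) / len(quals)


# The two values seen on a 2-trace consensus, taken at face value.
for q in (19, 24):
    print(f"Q{q} -> error probability {phred_to_error(q):.4f}")
print(f"mean over [19, 24, 24, 24]: {mean_error_rate([19, 24, 24, 24]):.4f}")
```

Because the same score can mean different levels of support in a 2-trace and a 4-trace assembly, error rates derived this way would not be directly comparable between them.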

Is there a workaround?

blex-max · 2022-03-08T16:51:48Z

@tobiasrausch

"the input trace qualities are not very useful"

This piqued my interest; would you mind expanding on it a bit? In my department, one of the concerns I come across as a proponent of tracy is the lack of informative quality scoring and the fact that Ns appear in our sequences at a very low rate compared to other basecalling algorithms. Combined, these attributes make my colleagues cautious.

nriddiford mentioned this issue Apr 6, 2022

Base quality and consensus generation #58

Open
