How are the base quality scores generated? #50

Open
nriddiford opened this issue Feb 15, 2022 · 3 comments

Comments

@nriddiford

Hi,

I am using tracy assemble to assemble between 2 and 4 trace files. I am outputting the consensus as a .fastq file, and then aligning this to a reference sequence.

Downstream, I am performing some analysis that filters on per-nucleotide quality scores, and I am not sure I understand how these are translated from the base signal in the chromatogram to the base quality of the consensus calculated within tracy assemble. Typically, I only see 2 different base quality scores on a consensus (e.g. 19 and 24).

Do you have any insight into this?

I'm calling tracy like so:

tracy assemble \
            --format fastq \
            --inccons \
            --trim 3 \
            --outprefix ${colony_id} \
            colony_1_p1.ab1 colony_1_p2.ab1
@tobiasrausch
Member

The quality scoring is indeed a bit of an issue because the input trace qualities are not very useful. The assemble command simply scales a flat quality prior by the fraction of traces supporting the consensus nucleotide. With 2 input traces, the consensus nucleotide is supported by either 1 or 2 traces, so you only see two quality values. With more input traces, you should see a wider range of quality values.
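To make the scaling idea concrete, here is a minimal Python sketch of "a flat prior scaled by the fraction of supporting traces". It is not tracy's actual code: the function names, the prior of 40 and the rounding are all assumptions chosen for illustration, and with these placeholder values it will not reproduce the 19 and 24 reported above. It only shows why the number of distinct quality values is bounded by the number of input traces.

```python
def consensus_quality(n_supporting, n_traces, flat_prior=40):
    """Scale a flat quality prior by the fraction of traces that agree.

    flat_prior=40 is a hypothetical placeholder, not a value taken from tracy.
    """
    if n_traces <= 0 or not 0 <= n_supporting <= n_traces:
        raise ValueError("need 0 <= n_supporting <= n_traces and n_traces > 0")
    return round(flat_prior * n_supporting / n_traces)


def distinct_qualities(n_traces, flat_prior=40):
    """All quality values reachable when 1..n_traces traces support the base."""
    return sorted(
        {consensus_quality(k, n_traces, flat_prior) for k in range(1, n_traces + 1)}
    )


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "traces ->", distinct_qualities(n))
```

Under these assumptions a 2-trace assembly can only produce two distinct values, and a 4-trace assembly at most four, which matches the pattern described in the question.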

@nriddiford
Author

nriddiford commented Mar 8, 2022

OK thanks - that's interesting. I'm using Tracy to detect errors in sequencing data, which can range from 1 trace (where I use basecall) to 4 traces (assemble).

As per your explanation, it sounds like forming a consensus between 2 traces for a given nucleotide doesn't consider the quality of the base calls, and instead only looks at the fraction of traces supporting the consensus.

Below summarises my understanding for 4 different base quality configurations for the assembly of 2 trace files - is this accurate? To my mind, the 3rd and 4th scenarios should have lower quality values than the 1st.

[Image: Screenshot 2022-03-08 at 15 56 11]

Part of the problem for me is that I want to have some estimate of the per-base quality score, so that I can confidently calculate the per-base error rate. In practice, this is hard using tracy because the quality scores change depending on how many trace files I use, and don't seem too comparable between a 2-trace assembly and a 4-trace assembly.

Is there a workaround?
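For reference, the per-base error rate mentioned above follows from a Phred score Q via the standard relation P = 10^(-Q/10). The sketch below decodes a FASTQ quality line and converts it to error probabilities. It assumes the consensus FASTQ uses the common Phred+33 ASCII encoding, which is not confirmed in this thread, and the function names are illustrative. It also only converts scores into probabilities: whether the consensus scores are calibrated well enough for that conversion to be meaningful is exactly the open question here.

```python
def decode_fastq_qualities(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 assumes Phred+33 encoding (an assumption, not confirmed for tracy).
    """
    return [ord(char) - offset for char in quality_line.strip()]


def phred_to_error_prob(q):
    """Standard Phred definition: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_rate(qualities):
    """Average per-base error probability implied by a list of Phred scores."""
    if not qualities:
        raise ValueError("no quality values given")
    return sum(phred_to_error_prob(q) for q in qualities) / len(qualities)


if __name__ == "__main__":
    quals = decode_fastq_qualities("4499")  # hypothetical quality line
    print(quals)
    print(mean_error_rate(quals))
```

Because a support-fraction score depends on how many traces went into the assembly, the probabilities derived this way would not be directly comparable between a 2-trace and a 4-trace consensus.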

@blex-max

blex-max commented Mar 8, 2022

@tobiasrausch

"the input trace qualities are not very useful"

This piqued my interest. Would you mind expanding on it a bit? In my department, one of the concerns I come across as a proponent of tracy is the lack of informative quality scoring, along with the fact that Ns appear in our sequences at a very low rate compared to other basecalling algorithms. Combined, these attributes make my colleagues cautious.
