BUG?: KeyError(s) when running read2tree #58

Open
almosnow opened this issue Mar 6, 2024 · 3 comments

Comments

@almosnow

almosnow commented Mar 6, 2024

Hello,

I am trying to use read2tree. I was able to install and run it; the example runs without a hitch and I get the expected output files.

When I tried it with my own sequences, though, I got many "Invalid marker group" errors. That was relatively straightforward to take care of: I renamed the FASTA header lines accordingly.

Now I cannot get past an error that reads KeyError: 'U1810'.

I think this has to do with the five-letter code you use/infer, but I don't really know how to make it work properly.

Any ideas?

@sinamajidian
Contributor

Hi @almosnow

Thanks for reaching out. Could you please give us more information on how you obtained the gene markers? It would be great if you could share the mplog.log file and the command line(s) you used.

One thing to check: the FASTA record IDs of the gene markers must match between the amino acid level (OGXX.fa files) and the nucleotide level. This is needed when you concatenate all fna files into dna_ref.fa and pass it to read2tree with --dna_reference dna_ref.fa. Otherwise, read2tree uses the REST API to download the nucleotide sequences from the OMA browser, on the assumption that the gene markers were downloaded from there.
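The ID-matching requirement above can be checked before running read2tree. The following is only a minimal sketch, not part of read2tree: the helper names `fasta_ids` and `missing_dna_ids` are hypothetical, and it assumes that a record ID is the first whitespace-delimited word after `>` and that marker files end in `.fa`. read2tree itself may parse headers differently.

```python
from pathlib import Path


def fasta_ids(path):
    """Return the record IDs (first word after '>') of a FASTA file, in file order."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def missing_dna_ids(marker_dir, dna_reference):
    """Map each marker file name to the IDs it contains that the DNA reference lacks.

    Assumption: marker files are the '*.fa' files directly inside marker_dir.
    """
    dna_ids = set(fasta_ids(dna_reference))
    report = {}
    for marker_file in sorted(Path(marker_dir).glob("*.fa")):
        missing = [rid for rid in fasta_ids(marker_file) if rid not in dna_ids]
        if missing:
            report[marker_file.name] = missing
    return report
```

Calling `missing_dna_ids("marker_genes", "dna_ref.fa")` returns an empty dict when every amino acid record has a nucleotide record with the same ID; any entry it does return lists IDs that would have no counterpart in the DNA reference.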

Best,
Sina

@almosnow
Author

Hmm, OK, I see. I don't think I set up the gene markers properly.

Actually, now that I've read more, what I did was completely wrong.

Here's my scenario; perhaps you can advise on what to do.

We have a set of ~15 sequences (coding sequences of the same gene from the same organism, from different samples around the world) with minor variations between them. A phylogeny shows two major groups that are distinct from each other, but the differences between them are small (SNPs and the like).

We have another set of a few hundred SRA libraries, and we would like to find out which of the aforementioned 15 sequences each library is most similar to.

Is it OK to use those initial 15 sequences as marker genes and try to fit the reads to them?

@sinamajidian
Contributor

For this case, read2tree can generate a tree in multiple species mode. However, one gene might not be enough to describe the evolution of the organism or to provide enough resolution to distinguish all samples.

Anyway, you can put the amino acid sequences in a FASTA file in the marker_genes folder and the nucleotide sequences of the coding regions (in exactly the same order) in another FASTA file, passed with --dna_reference genes.nuc.fa. Note that the gene names must match in both files. Each name starts with a five-letter code for the strain, like this:

>ASTMX02439
>PYGNA12763 
>ELEEL42119 
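One way to keep the two files consistent is to generate both from the same list of coding sequences. The sketch below is an illustration only, not something read2tree provides: `translate` and `write_matching_fastas` are hypothetical helpers, the ID pattern (five uppercase letters or digits followed by a number) is inferred from the three example headers above, and the standard genetic code is assumed to apply to the organism.

```python
import re

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumed ID shape, inferred from the examples: 5 characters, then digits.
ID_PATTERN = re.compile(r"^[A-Z0-9]{5}\d+$")


def translate(cds):
    """Translate a coding sequence with the standard genetic code.

    Codons with ambiguous bases become 'X'; one trailing stop codon is
    dropped, while internal stops are kept as '*'.
    """
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3:
        raise ValueError("CDS length is not a multiple of 3")
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    return protein[:-1] if protein.endswith("*") else protein


def write_matching_fastas(records, aa_path, nuc_path):
    """Write amino acid and nucleotide FASTA files with the same IDs in the same order.

    records: list of (record_id, cds) tuples.
    """
    for record_id, _ in records:
        if not ID_PATTERN.match(record_id):
            raise ValueError(f"unexpected ID format: {record_id!r}")
    with open(aa_path, "w") as aa, open(nuc_path, "w") as nuc:
        for record_id, cds in records:
            aa.write(f">{record_id}\n{translate(cds)}\n")
            nuc.write(f">{record_id}\n{cds.upper()}\n")
```

For example, `write_matching_fastas([("SMPLA00001", "ATGAAATAA")], "marker_genes/OG1.fa", "genes.nuc.fa")` would write one record to each file under the same ID. The code `SMPLA` and both file names are made up for illustration; which five-letter codes to assign to each strain is left to the user.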
