OGs did not have any DNA #36

Alfred-RRu · 2023-08-16T07:35:14Z

Hello,

Thank you first for your interesting work.

I'm trying to apply your method to some analyses over corals.
OMA only contain few distant species of cnidarians, so I followed the steps in "obtaining marker genes for viral dataset", obtaining the CDS and cDNA of three species related to the ones I'm studiying.

After obtaining 16135 OGS, OMA was run.
Nevethelss, when this OGS are compared with my sequences the message "This OG did not have any DNA" for each OG.

After the analysis, the next lines are showed:
Traceback (most recent call last):
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 292, in _align_worker
align.dna = self._get_translated_alignment(codons, alignment)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 216, in _get_translated_alignment
codon = codons[rec.id]
KeyError: '04095'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/alfred/miniconda3/envs/r2t/bin/read2tree", line 4, in
import('pkg_resources').run_script('read2tree==0.1.5', 'read2tree')
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 1570, in run_script
exec(script_code, namespace, namespace)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 291, in main
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 51, in init
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 330, in _align
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
KeyError: '04095'
(base) alfred@adn10:/media/Data/Alfred/Read2tree$ read2tree --standalone_path /media/Data/Alfred/Read2tree/myWorkingDir/Output/OrthologousGroup
Traceback (most recent call last):
File "/home/alfred/miniconda3/envs/r2t/bin/read2tree", line 4, in
import('pkg_resources').run_script('read2tree==0.1.5', 'read2tree')
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 1570, in run_script
exec(script_code, namespace, namespace)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 287, in main
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/parser/OMAOutputParser.py", line 22, in init
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/parser/OMAOutputParser.py", line 64, in _check_oma_output_path
FileNotFoundError: [Errno 2] No such file or directory: '/media/Data/Alfred/Read2tree/myWorkingDir/Output/OrthologousGroup'
(base) alfred@adn10:/media/Data/Alfred/Read2tree$

The resulting outputs file doesn't contain any aligns or three files.

Might it be a problem with the installation of the software or is related with the OGs obtained from the species I selected?

Thank you

sinamajidian · 2023-08-16T08:03:39Z

Dear @Alfred-RRu
Thanks for using read2tree.
It is important that the gene ids follow specific pattern.
Could you possibly tell us the acession number of (or links to) the dna and cds of species that you use with OMA? I'll try to reproduce the error.
Best regards,
Sina

Alfred-RRu · 2023-08-16T12:07:41Z

Dear @sinamajidian

Thank for answering.

I use the following links:

The process to obtain marker genes took 10 days with these. I don't know if that is common.

Some of the sequences analyzed can be found on NCBI GenBank. I write you here the SRA accession numbers in case any it helps.

SRR20912049
SRR20912044
SRR19621796
SRR11206396
SRR11206406
SRR11207078
SRR12201157
SRR12620700
SRR12432494
SRR12626557
SRR12626618

Best regards,

Alfred

sinamajidian · 2023-08-18T08:12:47Z

I used the same faa and fna files that you mentioned but I couldn't reproduce the error yet. could you possibly share the file OG1.fa in the marker_genes folder with us here?

I put the unzipped downloaded faa and fna files in data folder.

$ ls data/
GCF_001417965.1_Aiptasia_genome_1.1_translated_cds.faa GCF_022113875.1_Hydra_105_v3_cds_from_genomic.fna  GCF_009602425.1_ASM960242v1_cds_from_genomic.fna   GCF_022113875.1_Hydra_105_v3_translated_cds.faa  GCF_001417965.1_Aiptasia_genome_1.1_cds_from_genomic.fna  GCF_009602425.1_ASM960242v1_translated_cds.faa

Then I ran clean_fasta

mkdir DB
python clean_fasta_cdna_cds.py data DB data/all_cdna_out.fa

This resulted in

$ ls data/ | grep -v "GCF"
all_cdna_out.fa   five_letter_species.tsv   
 $ ls DB/
02425.fa  13875.fa  17965.fa
$ head -n2 DB/02425.fa 
>02425|NW.022257946.1.XP.031553244.1.1
GDRLSKGARQKWTTLTKLDFITLTKSAQSHCFTTQNTQSPQSSSLLRRRKHALPRPLHGE

$ head -n 2 data/all_cdna_out.fa 
>02425|NW.022257946.1.XP.031553244.1.1
GGTGACAGACTATCGAAGGGAGCGAGACAAAAATGGACTACATTGACAAAATTGGACTTC

$ cat data/five_letter_species.tsv
GCF_022113875.1_Hydra_105_v3_translated_cds	13875
GCF_009602425.1_ASM960242v1_translated_cds	02425
GCF_001417965.1_Aiptasia_genome_1.1_translated_cds	17965

Then I ran OMA

OMA=/OMA.2.5.0/bin/oma
${OMA} -p
# OutgroupSpecies := ['13875'];

# we usually use HPC array of 500. Still it took some hours.
${OMA} -s

${OMA}

This resulted in

 $ grep OutgroupSpecies parameters.drw  
OutgroupSpecies := ['13875'];
# Outgro 
# ... 

$ ls myWorkingDir/Output/OrthologousGroupsFasta/ | head -n 2
OG10000.fa
OG10001.fa
OG10002.fa

$ head OG10000.fa
>02425|NW.022261782.1.XP.031572670.1.25016 [02425]
MVQQIEVDEILFEGDIVLDEKNRDVDEPQEQKRNARRDRSKIWKTKVVPYVIDQSLRSPTDAGPHIQQAIAEFHKHTCVK

The OG ids could be different in different run. But the format of the ID of fasta record is important.

To make the process faster I only copied top 10 biggest OGs.

ls -althS myWorkingDir/Output/OrthologousGroupsFasta  | head -n 10
mkdir marker_genes
cp   myWorkingDir/Output/OrthologousGroupsFasta/ XX marker_genes  # replace XX with those 10 files.

Then I ran the first step of read2tree (finished in 1 minutes )

read2tree  --standalone_path  marker_genes  --output_path output --reference --dna_reference  data/all_cdna_out.fa

the output log

$ read2tree  --standalone_path  marker_genes  --output_path output --reference --dna_reference  ../data/all_cdna_out.fa  
--- Load OGs with min 0 species from oma marker_genes - mode = marker_genes ---
Loading files for pre-filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 5799.25 OGs/s]
2023-08-18 10:10:06,257 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from ../data/all_cdna_out.fa ---
2023-08-18 10:10:06,257 - read2tree.OGSet - INFO - Loading all_cdna_out.fa into memory. This might take a while . . . 
Loading OGs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 1135.44 OGs/s]
2023-08-18 10:10:08,205 - read2tree.OGSet - INFO - : Gathering of DNA seq for 8 OGs took 0.007969141006469727.
--- Generating reference for mapping ---
Loading records: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 113743.84 record/s]
2023-08-18 10:10:08,235 - read2tree.ReferenceSet - INFO - : Extracted 3 reference species form 8 ogs took 0.0005331039428710938
--- Alignment of 8 OGs ---
/work/FAC/FBM/DBC/cdessim2/default/smajidi1/software/miniconda/envs/r2t_3.10.8b/lib/python3.10/subprocess.py:961: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdout = io.open(c2pread, 'rb', bufsize)
/work/FAC/FBM/DBC/cdessim2/default/smajidi1/software/miniconda/envs/r2t_3.10.8b/lib/python3.10/subprocess.py:966: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stderr = io.open(errread, 'rb', bufsize)
2023-08-18 10:10:22,284 - read2tree.Aligner - INFO - : Alignment of 8 OGs took 14.047221422195435.

this resulted in

$ ls
marker_genes  mplog.log  output
 ls output/
01_ref_ogs_aa  01_ref_ogs_dna  02_ref_dna  03_align_aa  03_align_dna
 $ ls marker_genes/
OG105.fa  OG14.fa  OG15.fa  OG17.fa  OG19.fa  OG1.fa  OG29.fa  OG6.fa

(I also ran for all OGs, and it looks fine: read2tree.Aligner - INFO - : Alignment of 18094 OGs took 2122 seconds)

could you possibly share the file OG1.fa in the marker_genes folder with us here?

You could also run step 4 and 5 again in a new folder.

Alfred-RRu · 2023-08-23T08:24:52Z

Hello Sina,

Sorry for the delay in the answer.
I've been checking the first analysis I made, and I have find a mistake in the faa and fna data. Since this I'am running the analysis as you do, to obtain the same results, but my computing capacity seems to be less powerful than yours as it seems to take 300 hours to run OMA, even when -s is used. Or is this maybe indicating a trouble with my analysis?

Once I have the results I'll check them in comparison with yours, and proceed, if everything is okay, with the next steps as you said.

Nevertheless, I understand it will take more time since I'll include the SRA data in the analysis.

Once everything is finish I will be back to you.

Thank you so much.

Alfred

P.S: I have include my marker_genes folder here in case it helps.
marker_genes.zip

Alfred-RRu closed this as completed Aug 23, 2023

Alfred-RRu reopened this Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OGs did not have any DNA #36

OGs did not have any DNA #36

Alfred-RRu commented Aug 16, 2023

sinamajidian commented Aug 16, 2023

Alfred-RRu commented Aug 16, 2023

sinamajidian commented Aug 18, 2023 •

edited

Loading

Alfred-RRu commented Aug 23, 2023

OGs did not have any DNA #36

OGs did not have any DNA #36

Comments

Alfred-RRu commented Aug 16, 2023

sinamajidian commented Aug 16, 2023

Alfred-RRu commented Aug 16, 2023

sinamajidian commented Aug 18, 2023 • edited Loading

Alfred-RRu commented Aug 23, 2023

sinamajidian commented Aug 18, 2023 •

edited

Loading