Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OGs did not have any DNA #36

Open
Alfred-RRu opened this issue Aug 16, 2023 · 4 comments
Open

OGs did not have any DNA #36

Alfred-RRu opened this issue Aug 16, 2023 · 4 comments

Comments

@Alfred-RRu
Copy link

Hello,

Thank you first for your interesting work.

I'm trying to apply your method to some analyses over corals.
OMA only contain few distant species of cnidarians, so I followed the steps in "obtaining marker genes for viral dataset", obtaining the CDS and cDNA of three species related to the ones I'm studiying.

After obtaining 16135 OGS, OMA was run.
Nevethelss, when this OGS are compared with my sequences the message "This OG did not have any DNA" for each OG.

After the analysis, the next lines are showed:
Traceback (most recent call last):
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 292, in _align_worker
align.dna = self._get_translated_alignment(codons, alignment)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 216, in _get_translated_alignment
codon = codons[rec.id]
KeyError: '04095'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/alfred/miniconda3/envs/r2t/bin/read2tree", line 4, in
import('pkg_resources').run_script('read2tree==0.1.5', 'read2tree')
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 1570, in run_script
exec(script_code, namespace, namespace)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 291, in main
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 51, in init
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 330, in _align
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
KeyError: '04095'
(base) alfred@adn10:/media/Data/Alfred/Read2tree$ read2tree --standalone_path /media/Data/Alfred/Read2tree/myWorkingDir/Output/OrthologousGroup
Traceback (most recent call last):
File "/home/alfred/miniconda3/envs/r2t/bin/read2tree", line 4, in
import('pkg_resources').run_script('read2tree==0.1.5', 'read2tree')
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 1570, in run_script
exec(script_code, namespace, namespace)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 287, in main
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/parser/OMAOutputParser.py", line 22, in init
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/parser/OMAOutputParser.py", line 64, in _check_oma_output_path
FileNotFoundError: [Errno 2] No such file or directory: '/media/Data/Alfred/Read2tree/myWorkingDir/Output/OrthologousGroup'
(base) alfred@adn10:/media/Data/Alfred/Read2tree$

The resulting outputs file doesn't contain any aligns or three files.

Might it be a problem with the installation of the software or is related with the OGs obtained from the species I selected?

Thank you

@sinamajidian
Copy link
Contributor

Dear @Alfred-RRu
Thanks for using read2tree.
It is important that the gene ids follow specific pattern.
Could you possibly tell us the acession number of (or links to) the dna and cds of species that you use with OMA? I'll try to reproduce the error.
Best regards,
Sina

@Alfred-RRu
Copy link
Author

Dear @sinamajidian

Thank for answering.

I use the following links:

The process to obtain marker genes took 10 days with these. I don't know if that is common.

Some of the sequences analyzed can be found on NCBI GenBank. I write you here the SRA accession numbers in case any it helps.

  • SRR20912049
  • SRR20912044
  • SRR19621796
  • SRR11206396
  • SRR11206406
  • SRR11207078
  • SRR12201157
  • SRR12620700
  • SRR12432494
  • SRR12626557
  • SRR12626618

Best regards,

Alfred

@sinamajidian
Copy link
Contributor

sinamajidian commented Aug 18, 2023

I used the same faa and fna files that you mentioned but I couldn't reproduce the error yet. could you possibly share the file OG1.fa in the marker_genes folder with us here?

  1. I put the unzipped downloaded faa and fna files in data folder.
$ ls data/
GCF_001417965.1_Aiptasia_genome_1.1_translated_cds.faa GCF_022113875.1_Hydra_105_v3_cds_from_genomic.fna  GCF_009602425.1_ASM960242v1_cds_from_genomic.fna   GCF_022113875.1_Hydra_105_v3_translated_cds.faa  GCF_001417965.1_Aiptasia_genome_1.1_cds_from_genomic.fna  GCF_009602425.1_ASM960242v1_translated_cds.faa
  1. Then I ran clean_fasta
mkdir DB
python clean_fasta_cdna_cds.py data DB data/all_cdna_out.fa 

This resulted in

$ ls data/ | grep -v "GCF"
all_cdna_out.fa   five_letter_species.tsv   
 $ ls DB/
02425.fa  13875.fa  17965.fa
$ head -n2 DB/02425.fa 
>02425|NW.022257946.1.XP.031553244.1.1
GDRLSKGARQKWTTLTKLDFITLTKSAQSHCFTTQNTQSPQSSSLLRRRKHALPRPLHGE

$ head -n 2 data/all_cdna_out.fa 
>02425|NW.022257946.1.XP.031553244.1.1
GGTGACAGACTATCGAAGGGAGCGAGACAAAAATGGACTACATTGACAAAATTGGACTTC

$ cat data/five_letter_species.tsv
GCF_022113875.1_Hydra_105_v3_translated_cds	13875
GCF_009602425.1_ASM960242v1_translated_cds	02425
GCF_001417965.1_Aiptasia_genome_1.1_translated_cds	17965
  1. Then I ran OMA
OMA=/OMA.2.5.0/bin/oma
${OMA} -p
# OutgroupSpecies := ['13875'];

# we usually use HPC array of 500. Still it took some hours.
${OMA} -s

${OMA} 

This resulted in

 $ grep OutgroupSpecies parameters.drw  
OutgroupSpecies := ['13875'];
# Outgro 
# ... 

$ ls myWorkingDir/Output/OrthologousGroupsFasta/ | head -n 2
OG10000.fa
OG10001.fa
OG10002.fa

$ head OG10000.fa
>02425|NW.022261782.1.XP.031572670.1.25016 [02425]
MVQQIEVDEILFEGDIVLDEKNRDVDEPQEQKRNARRDRSKIWKTKVVPYVIDQSLRSPTDAGPHIQQAIAEFHKHTCVK

The OG ids could be different in different run. But the format of the ID of fasta record is important.

  1. To make the process faster I only copied top 10 biggest OGs.
ls -althS myWorkingDir/Output/OrthologousGroupsFasta  | head -n 10
mkdir marker_genes
cp   myWorkingDir/Output/OrthologousGroupsFasta/ XX marker_genes  # replace XX with those 10 files.
  1. Then I ran the first step of read2tree (finished in 1 minutes )
read2tree  --standalone_path  marker_genes  --output_path output --reference --dna_reference  data/all_cdna_out.fa  

the output log

$ read2tree  --standalone_path  marker_genes  --output_path output --reference --dna_reference  ../data/all_cdna_out.fa  
--- Load OGs with min 0 species from oma marker_genes - mode = marker_genes ---
Loading files for pre-filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 5799.25 OGs/s]
2023-08-18 10:10:06,257 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from ../data/all_cdna_out.fa ---
2023-08-18 10:10:06,257 - read2tree.OGSet - INFO - Loading all_cdna_out.fa into memory. This might take a while . . . 
Loading OGs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 1135.44 OGs/s]
2023-08-18 10:10:08,205 - read2tree.OGSet - INFO - : Gathering of DNA seq for 8 OGs took 0.007969141006469727.
--- Generating reference for mapping ---
Loading records: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 113743.84 record/s]
2023-08-18 10:10:08,235 - read2tree.ReferenceSet - INFO - : Extracted 3 reference species form 8 ogs took 0.0005331039428710938
--- Alignment of 8 OGs ---
/work/FAC/FBM/DBC/cdessim2/default/smajidi1/software/miniconda/envs/r2t_3.10.8b/lib/python3.10/subprocess.py:961: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdout = io.open(c2pread, 'rb', bufsize)
/work/FAC/FBM/DBC/cdessim2/default/smajidi1/software/miniconda/envs/r2t_3.10.8b/lib/python3.10/subprocess.py:966: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stderr = io.open(errread, 'rb', bufsize)
2023-08-18 10:10:22,284 - read2tree.Aligner - INFO - : Alignment of 8 OGs took 14.047221422195435.

this resulted in

$ ls
marker_genes  mplog.log  output
 ls output/
01_ref_ogs_aa  01_ref_ogs_dna  02_ref_dna  03_align_aa  03_align_dna
 $ ls marker_genes/
OG105.fa  OG14.fa  OG15.fa  OG17.fa  OG19.fa  OG1.fa  OG29.fa  OG6.fa

(I also ran for all OGs, and it looks fine: read2tree.Aligner - INFO - : Alignment of 18094 OGs took 2122 seconds)

could you possibly share the file OG1.fa in the marker_genes folder with us here?

You could also run step 4 and 5 again in a new folder.

@Alfred-RRu
Copy link
Author

Hello Sina,

Sorry for the delay in the answer.
I've been checking the first analysis I made, and I have find a mistake in the faa and fna data. Since this I'am running the analysis as you do, to obtain the same results, but my computing capacity seems to be less powerful than yours as it seems to take 300 hours to run OMA, even when -s is used. Or is this maybe indicating a trouble with my analysis?

Once I have the results I'll check them in comparison with yours, and proceed, if everything is okay, with the next steps as you said.

Nevertheless, I understand it will take more time since I'll include the SRA data in the analysis.

Once everything is finish I will be back to you.

Thank you so much.

Alfred

P.S: I have include my marker_genes folder here in case it helps.
marker_genes.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants