Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No tree generated after running multiple species mode #8

Open
wrengs opened this issue Aug 26, 2022 · 13 comments
Open

No tree generated after running multiple species mode #8

wrengs opened this issue Aug 26, 2022 · 13 comments

Comments

@wrengs
Copy link

wrengs commented Aug 26, 2022

Dear Authors, dear Sina and Adrian,

Sorry to contact you once more in this short timeframe.
After the previous issue was fixed, I re-installed read2tree and successfully ran the test dataset.
Thereafter, I re-ran my previously attempted command with 3 additional species.
Aligning now seems to run well, however during the merge / tree inference I get the following error:

2022-08-25 11:22:26,623 - read2tree.Aligner - INFO - merge: Appending 0 reconstructed sequences to present Alignments took 0.06581997871398926.
[Errno 2] No such file or directory: '/scratch/iqtree6xdi1c5c/tmp_output.treefile'
--- Tree inference ---
2022-08-25 11:23:26,788 - read2tree.wrappers.treebuilders.parsers - ERROR - [Errno 2] No such file or directory: '/scratch/iqtree6xdi1c5c/tmp_output.treefile'
Traceback (most recent call last):
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/bin/read2tree", line 4, in <module>
    __import__('pkg_resources').run_script('read2tree==0.1.2', 'read2tree')
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/pkg_resources/__init__.py", line 672, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1479, in run_script
    exec(script_code, namespace, namespace)
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/EGG-INFO/scripts/read2tree", line 16, in <module>
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/main.py", line 392, in main
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/TreeInference.py", line 42, in __init__
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/TreeInference.py", line 54, in _infer_tree
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/wrappers/treebuilders/iqtree.py", line 73, in __call__
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/wrappers/treebuilders/iqtree.py", line 118, in _read_result
TypeError: 'NoneType' object is not subscriptable

Once more, the file mentioned /scratch/iqtree6xdi1c5c/tmp_output.treefile was not to be found.
The full command used was:

read2tree --standalone_path marker_genes/ --output_path output/ --reference  
read2tree --standalone_path marker_genes/ --output_path output/ --threads 10 --read_type long --reads MB.fq
read2tree --standalone_path marker_genes/ --output_path output/ --threads 10 --read_type long --reads specie2.fq
read2tree --standalone_path marker_genes/ --output_path output/ --threads 10 --read_type long --reads specie3.fq
read2tree --standalone_path marker_genes/ --output_path output/ --threads 10 --read_type long --reads specie4.fq
read2tree --standalone_path marker_genes/ --output_path output/ --merge_all_mappings --tree

Where the reads of specie 2 to 4 are similarly generated and subsampled as for the previously shared MB.fq

The nohup file is added here
nohup.txt

Since the mplog file was too big, I have uploaded it to wetransfer:
https://we.tl/t-UF9I5I3dqK

Kind regards,
Willem

@alpae
Copy link
Member

alpae commented Aug 26, 2022

Hi Willem, were the file concat_merge_aa.phy and concat_merge_dna.phy created? (they should be in your output/ path)

@wrengs
Copy link
Author

wrengs commented Aug 26, 2022

Hi Adrian,

Yes both files were created.
concat_merge_aa.phy is 14.911 KB
concat_merge_dna.phy is 44.733 KB

The concat files are also created per input species.

There are also text files called "input_species"_all_cov.txt which are only 1KB in size and only contain the following:
#species,og,gene_id,coverage,std

I am happy to share any further files if this helps.

All the best,
Willem

@alpae
Copy link
Member

alpae commented Sep 2, 2022

Hi Willem,

sorry for the slow progress on this. I had an initial look at the problem, but couldn't spend too much time. At first look, it seems that the subsampling of your input species for MD.fq.gz produces too few hits to place the sequences. For my test run, the produced alignment contains no sequence trace of the MD species at all.

I'm not sure how big the original coverage of the ENA dataset is, but maybe the subsampling of a very low coverage, but huge genome could be an issue, as we simply have too few hits. We try to verify this hypothesis further, but maybe it would make sense to run read2tree with the full reads set in the meanwhile to see if it produces some hits.

Best wishes
Adrian

@wrengs
Copy link
Author

wrengs commented Sep 2, 2022

Hi Adrian

Thanks for getting back to me.
I ran the command using the full input datasets which is about ~20-30 X coverage, depending on the species.

The same error popped up:
2022-09-02 14:18:10,709 - read2tree.Aligner - INFO - merge: Appending 0 reconstructed sequences to present Alignments took 0.1346571445465088. [Errno 2] No such file or directory: '/scratch/iqtree5zxpjcez/tmp_output.treefile' --- Tree inference --- 2022-09-02 14:19:21,487 - read2tree.wrappers.treebuilders.parsers - ERROR - [Errno 2] No such file or directory: '/scratch/iqtree5zxpjcez/tmp_output.treefile' Traceback (most recent call last): File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/bin/read2tree", line 4, in <module> __import__('pkg_resources').run_script('read2tree==0.1.2', 'read2tree') File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/pkg_resources/__init__.py", line 672, in run_script self.require(requires)[0].run_script(script_name, ns) File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1479, in run_script exec(script_code, namespace, namespace) File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/EGG-INFO/scripts/read2tree", line 16, in <module> File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/main.py", line 392, in main File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/TreeInference.py", line 42, in __init__ File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/TreeInference.py", line 54, in _infer_tree File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/wrappers/treebuilders/iqtree.py", line 73, in __call__ File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/wrappers/treebuilders/iqtree.py", line 118, in _read_result TypeError: 'NoneType' object is not subscriptable

Attached here is the mplog.log
mplog.log

Hope this helps in identifying the issue.

All the best and have a good weekend!
Willem

@dvdylus
Copy link
Contributor

dvdylus commented Sep 30, 2022

Dear Willem,

Sorry for the late reply. I managed to run your example successfully on our cluster after adapting the code that I just pushed. Please try it again with the new code. I basically used your marker genes and ran the following lines:

read2tree --standalone_path marker_genes --output_path output --reference
read2tree --standalone_path marker_genes --output_path output --reference --reads MD.fq.gz --read_type long

For each run read set you add you should basically see X_all_cov.tx files that should show to which OG and sequence a successful mapping took place.

Please let me know if there are still issues.

Best,
David

@wrengs
Copy link
Author

wrengs commented Oct 11, 2022

Dear David,

Thank you for getting back to me.

I re-ran the whole pipeline using the example command you mentioned above.

read2tree --standalone_path marker_genes/ --output_path output/ --reference  
read2tree --standalone_path marker_genes/ --output_path output/ --reference --reads /biodata/dep_mercier/grp_underwood/raw_sequencing_data/PacBio_project-4804/4804_B_CCS.fastq.gz --threads 10 --read_type long 
read2tree --standalone_path marker_genes/ --output_path output/ --reference --reads /biodata/dep_mercier/grp_underwood/raw_sequencing_data/PacBio_project-4804/4804_A_CCS.fastq.gz --threads 10 --read_type long 
read2tree --standalone_path marker_genes/ --output_path output/ --reference --reads /biodata/dep_mercier/grp_underwood/raw_sequencing_data/pacbio_project_5169_S.cheesmaniae_LA1039_HiFi/m64078e_210902_155849.hifi_reads.fastq.gz --threads 10 --read_type long 
read2tree --standalone_path marker_genes/ --output_path output/ --reference --reads /biodata/dep_mercier/grp_underwood/raw_sequencing_data/MPGC_project_5320/5320.A_data/m64253e_211210_204421.hifi_reads.fastq.gz --threads 10 --read_type long 
read2tree --standalone_path marker_genes/ --output_path output/ --reference --merge_all_mappings --tree

The --tree command was tried including and excluding --reference
Unfortunately, the same error (tree inference) was observed in both cases when running for multiple species.

X_all_cov.txt files are produced, however, empty as they only contain a single line #species,og,gene_id,coverage,std

Below is a copy of the log as observed within the terminal.
The previous mappings / alignments are found back and inferred.
However, when trying to merge, all of a sudden 0 reconstructed sequences are found back.

--- Re-load ogs and find their corresponding DNA seq from output folder ---
Re-loading files: 14506 OGs [00:56, 256.57 OGs/s]
--- Generating reference for mapping from folder ---
Re-loading references for mapping from folder: 100%|█| 2/2 [00:01<00:00,  1.03 s
Loading alignments : 14506 Alignment [01:30, 161.06 Alignment/s]
Adding mapped seq to alignments: 100%|█| 14506/14506 [00:00<00:00, 177669.53 ali
Adding mapped seq to OG: 100%|██████| 14506/14506 [00:00<00:00, 197721.86 OGs/s]
--- Retrieve mapped consensus sequences ---
Loading consensus read mappings : 100%|█████| 2/2 [00:01<00:00,  1.46 species/s]
--- Add inferred mapped sequence back to OGs ---
Adding mapped seq to OG: 100%|███████| 14506/14506 [00:01<00:00, 11321.50 OGs/s]
2022-10-11 09:12:54,969 - read2tree.OGSet - INFO - merge: Appending 1022 reconstructed sequences to present OG took 1.3366966247558594.
--- Add inferred mapped sequence back to alignment ---
Adding mapped seq to alignments: 100%|█| 14506/14506 [00:00<00:00, 79454.46 alig
2022-10-11 09:12:55,153 - read2tree.Aligner - INFO - merge: Appending 0 reconstructed sequences to present Alignments took 0.18392062187194824.
--- Retrieve mapped consensus sequences ---
Loading consensus read mappings : 100%|█████| 2/2 [00:00<00:00,  3.95 species/s]
--- Add inferred mapped sequence back to OGs ---
Adding mapped seq to OG: 100%|███████| 14506/14506 [00:01<00:00, 11593.92 OGs/s]
2022-10-11 09:12:59,507 - read2tree.OGSet - INFO - merge: Appending 832 reconstructed sequences to present OG took 1.2801895141601562.
--- Add inferred mapped sequence back to alignment ---
Adding mapped seq to alignments: 100%|█| 14506/14506 [00:00<00:00, 84283.27 alig
2022-10-11 09:12:59,681 - read2tree.Aligner - INFO - merge: Appending 0 reconstructed sequences to present Alignments took 0.17362451553344727.
--- Retrieve mapped consensus sequences ---
Loading consensus read mappings : 100%|█████| 2/2 [00:00<00:00,  4.66 species/s]
--- Add inferred mapped sequence back to OGs ---
Adding mapped seq to OG: 100%|███████| 14506/14506 [00:01<00:00, 12022.22 OGs/s]
2022-10-11 09:13:02,627 - read2tree.OGSet - INFO - merge: Appending 361 reconstructed sequences to present OG took 1.2268311977386475.
--- Add inferred mapped sequence back to alignment ---
Adding mapped seq to alignments: 100%|█| 14506/14506 [00:00<00:00, 87728.83 alig
2022-10-11 09:13:02,794 - read2tree.Aligner - INFO - merge: Appending 0 reconstructed sequences to present Alignments took 0.1669604778289795.
--- Retrieve mapped consensus sequences ---
Loading consensus read mappings : 100%|█████| 2/2 [00:00<00:00,  3.19 species/s]
--- Add inferred mapped sequence back to OGs ---
Adding mapped seq to OG: 100%|███████| 14506/14506 [00:01<00:00, 11719.80 OGs/s]
2022-10-11 09:13:07,597 - read2tree.OGSet - INFO - merge: Appending 843 reconstructed sequences to present OG took 1.2775540351867676.
--- Add inferred mapped sequence back to alignment ---
Adding mapped seq to alignments: 100%|█| 14506/14506 [00:00<00:00, 99287.48 alig
2022-10-11 09:13:07,745 - read2tree.Aligner - INFO - merge: Appending 0 reconstructed sequences to present Alignments took 0.14760518074035645.
--- Tree inference ---
[Errno 2] No such file or directory: '/scratch/iqtreeib3ffrle/tmp_output.treefile'
2022-10-11 09:16:13,971 - read2tree.wrappers.treebuilders.parsers - ERROR - [Errno 2] No such file or directory: '/scratch/iqtreeib3ffrle/tmp_output.treefile'
Traceback (most recent call last):
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/bin/read2tree", line 4, in <module>
    __import__('pkg_resources').run_script('read2tree==0.1.2', 'read2tree')
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/pkg_resources/__init__.py", line 672, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1479, in run_script
    exec(script_code, namespace, namespace)
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/EGG-INFO/scripts/read2tree", line 16, in <module>
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/main.py", line 392, in main
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/TreeInference.py", line 42, in __init__
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/TreeInference.py", line 54, in _infer_tree
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/wrappers/treebuilders/iqtree.py", line 73, in __call__
  File "/netscratch/dep_mercier/grp_underwood/software2/anaconda3/envs/read2tree/lib/python3.9/site-packages/read2tree-0.1.2-py3.9.egg/read2tree/wrappers/treebuilders/iqtree.py", line 118, in _read_result
TypeError: 'NoneType' object is not subscriptable

All the best,
Willem

@combosch
Copy link

combosch commented Apr 1, 2024

Hi David and Willem,
We are having exactly the same issues you've discussed here a while ago:

  1. X_all_cov.txt files are produced but are empty/contain only a single line
    #species,og,gene_id,coverage,std
  2. We have no tmp_output.treefiles.

We are also using our own reference dataset so chances are the problem might related to that... we have some promising looking output in the sample-specific fa files(!) but no alignments or trees are generated.
Please let me know if you have a solution, suggestions, ideas... ?
Thanks!!
David

@sinamajidian
Copy link
Contributor

sinamajidian commented Apr 1, 2024

Hi David @combosch
Could you please send us the mplog.log file and the output log? and please tell us how you generated the marker genes and the command line you ran.
when *_all_cov.txt are empty, it means that reads are not aligned. can you check whether samtools, ngm and ngmlr are installed correctly in the env. You can run them with -h.
(I've updated the github code and now logging is improved and read2tree prints more information, which helps to resolve the case).
It is also important not to use the same folder for re-runs (if you are using multi-species mode, run each step only once).

Ps. I just noticed that you are using own reference dataset. Could you describe it in more details? Did you run OMA standalone? read2tree requires very specific formatting of input fasta and fna files, some of which are described here. e.g. first five letter of fasta record should be unique. The nucleotide sequence should only include coding sequences.

$ head -n2 s0002.fa 
>s0002|AF092942.1.AAC96301.1.1
MGSETLSVIQVRLRNIYDNDKVALLKITCHTNRLILLTHTLAKSVIHTIKLSGIVFIHII

$ head -n2 all_cdna_out.fa 
>s0002|AF092942.1.AAC96301.1.1
ATGGGCAGTGAAACATTGAGTGTAATTCAAGTCAGATTAAGGAATATATATGATAATGAT

I guess I described many things at once. It might be better to discuss step by step and make sure each step works as intended.

@combosch
Copy link

combosch commented Apr 3, 2024

Thanks for your fast reply Sina!!

  • I'm attaching the mplog.log file and the output log.
  • I forgot to mentioned that our problem is definitely reference-related since everything completes well with default OMA files - the output just does not provide enough resolution for us...
  • We generated our reference by using the transcriptome (DNA & AA) of one species only - is that ok?
  • There appear to be soft-masked regions in the transcriptome, do we have to remove them or can read2tree deal with that?
  • Since we used only one species, this most likely doesn't appky to us but could you please specific what exactly you mean by "first five letter of fasta record should be unique"? We re-coded the first five letters of each reference fasta record with the same 5 letter code to indicate our species (NAUPO) - same error - and tried a unique code for every single fasta record (s0001, s0002...) but that seems to be interpreted as each fasta record being a separate species (run didn't conclude due to mem issues).
    Best,
    David
    mplog.log
    outlog.txt

@sinamajidian
Copy link
Contributor

You're welcome.

Right, when you have one species, uniqueness of the species code should be fine.

I assume you want use multiple species mode, several sequencing read dataset of different samples? otherwise one reference and one sample are not enough as a tree should have at least 3 leaves.
One reference is not ideal, since per each gene, there might not be enough resolution to distinguish different samples. I think read2tree will generate the tree, it might be needed to double check the alignment.

Regarding the reference (gene marker folder and the dna_ref.fa files), Read2tree uses the exact order of the codons, meaning that every 3 letters in dna should correspond to one letter in aa (and vice versa).
This would relate to your comment about "soft-masked regions in the transcriptome". Does it mean that the aa and dna sequences do not match? i.e. their length each DNA fasta record should be 3 times the aa fasta records.

And I can see these lines in the log

2024-04-02 07:24:45,915 - read2tree.OGSet - INFO - NAPOINLI31_S156_L002_R1_001: Appending 12247 reconstructed sequences to present OG took 0.007348537445068359.
2024-04-02 07:24:45,917 - read2tree.Aligner - INFO - NAPOINLI31_S156_L002_R1_001: Appending 0 reconstructed sequences to present Alignments took 0.001039266586303711.

It seems that read2tree couldn't append the "assembled" sequence to the reference gene. Read2tree does that at both levels of dna and amino-acid.

Could you check the output folder whether you have concat_*_aa.phy concat_*_dna.phy in your output folder, specifically whether each of them has the reference species NAUPO and sth like NAPOINLI31_S156_L002_R1_001.

Best,
Sina

@combosch
Copy link

combosch commented Apr 5, 2024

Hi Sina,
Let's see:

  • "I assume you want use multiple species mode, several sequencing read dataset of different samples? otherwise one reference and one sample are not enough as a tree should have at least 3 leaves."
    Yes, we have several species and several samples per species - should be good

  • "One reference is not ideal, since per each gene, there might not be enough resolution to distinguish different samples. I think read2tree will generate the tree, it might be needed to double check the alignment."
    Yes, We planned to double check and filter the alignment extensively

  • "Regarding the reference (gene marker folder and the dna_ref.fa files), Read2tree uses the exact order of the codons, meaning that every 3 letters in dna should correspond to one letter in aa (and vice versa). This would relate to your comment about "soft-masked regions in the transcriptome". Does it mean that the aa and dna sequences do not match? i.e. their length each DNA fasta record should be 3 times the aa fasta records."
    All AA sequences are exactly 1/3 of the DNA sequence (incl soft-masked regions) so that does not seem to be our problem. We have stop codons at the end of some contigs and * at the end of the corresponding amino acids - we already tried without the * but read2tree seemed to have simply cut the stop codon from the DNA but failed with the same error...

  • "It seems that read2tree couldn't append the "assembled" sequence to the reference gene. Read2tree does that at both levels of dna and amino-acid."
    Yes, agreed, that seems to be the problem...

  • "Could you check the output folder whether you have concat_aa.phy concatdna.phy in your output folder, specifically whether each of them has the reference species NAUPO and sth like NAPOINLI31_S156_L002_R1_001."
    We have concat
    aa.phy and concatdna.phy files BUT both only seem to have the reference sequence, not the sample sequence...
    We found sample sequences in the folder 04_mapping
    * in the *_OGs_consensus.fa and could pull them out manually but that seems very tedious... the other 3 files in there consist of mapping stats (2 txt files) and a bam file...

Thanks a lot for your help!!
Best,
David

@sinamajidian
Copy link
Contributor

Hi David. sorry I couldn't respond as I was travelling. Please let us know if you had some progress on this.

@combosch
Copy link

Hi Sina, no problem!
No progress unfortunately. We had ruled out several potential reasons (see my last response) and we identified and localized a few remaining issues (e.g. "read2tree couldn't append the "assembled" sequence to the reference gene" and "We have concataa.phy and concatdna.phy files BUT both only seem to contain the reference sequence, not the sample sequence" but we're not sure what else to do and were hoping for guidance from you.
Best,
D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants