OMA standalone vs. OMAmer HOGs #7

biomendi · 2022-01-25T17:06:25Z

As mentioned in my previous issue (#6), I am a bit confused when comparing the results produced by OMA standalone and OMAmer. In particular, despite using the same dataset, I found large differences in the number of HOGs (at the topmost taxonomic level) inferred by the two programs with default settings. OMAmer grouped the proteins into 4 HOGs, while OMA standalone divided them into 16 HOGs. In the case of OMAmer, three of the HOGs each included a large percentage of the species [around 80-90% of the total] and the fourth a much smaller share [around 10% of the species]. Nonetheless, the 4 HOGs detected by OMAmer each contained a comparable share of the sequences (49%, 20%, 16%, 15%). This was not the case with OMA standalone: 7 of the 16 HOGs consisted of only 2 sequences each, and the other HOGs also seem over-specific in general. I believe so because the dataset was built by pooling proteins that very likely have the same function, so I think there shouldn't be more than one or a few gene families.

I noticed that OMAmer includes a flag called "--threshold" which could help reduce the number of HOGs. In OMA standalone, however, there seem to be several parameters (MinScore, LengthTol, MinSeqLen) that could make the grouping more or less stringent. I believe the most important one is "MinScore" (default value is 181). Unfortunately, I have little knowledge of the algorithms behind the two programs, so it is not clear to me what a more suitable value would be in either case. How can I determine the best MinScore value for my dataset?

Moreover, I would like to know if the "top-down" algorithm in OMA standalone (combined with the use of "StableIdsForGroups") is able to determine the actual HOG id in a similar way as OMAmer does (e.g. "HOG:B0561231" instead of "HOG1")?

alpae · 2023-02-09T09:51:57Z

Hi @biomendi

The big difference between OMAmer and OmaStandalone is that in the first case we map sequences to a (large) existing dataset of precomputed Hierarchical Orthologous Groups (HOGs), usually all of OMA. The HOGs in there were initially also computed with the bottom-up approach implemented in OmaStandalone.

If you compute HOGs with OmaStandalone, you don't use the knowledge that all the other genomes in OMA could bring. The HOGs are computed by recursively checking the completeness of pairwise ortholog relations among the sub-HOGs along the species tree. If you think your groups are too fragmented, this could have several reasons:

- the gene models in the species are fragmented -> this leads to splits, as the LengthTol parameter breaks putative orthologs because they vary a lot in their length
- the sequences you are analyzing are relatively short -> the MinScore of 181 corresponds roughly to a bit score of 50 in a blast query, so it is quite stringent
- you can also play around with MinEdgeCompletenessFraction to allow more aggressive merging of clusters (lower values)
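
To illustrate what "completeness of pairwise ortholog relations" means when deciding whether two sub-HOGs should be merged, here is a minimal Python sketch. It is a simplified illustration and not the actual OmaStandalone implementation: the function names, the toy gene IDs, and the threshold values used below are assumptions made for the example only.

```python
from itertools import product


def edge_completeness(cluster_a, cluster_b, ortholog_pairs):
    """Fraction of all cross-cluster gene pairs that are inferred orthologs."""
    possible = len(cluster_a) * len(cluster_b)
    if possible == 0:
        return 0.0
    observed = sum(
        1
        for a, b in product(cluster_a, cluster_b)
        if frozenset((a, b)) in ortholog_pairs
    )
    return observed / possible


def should_merge(cluster_a, cluster_b, ortholog_pairs, min_fraction):
    """Merge two sub-HOGs only if enough of the possible edges are present.

    min_fraction plays the role of a completeness threshold in this toy model.
    """
    return edge_completeness(cluster_a, cluster_b, ortholog_pairs) >= min_fraction


# Toy example (hypothetical gene IDs): two sub-HOGs with two genes each,
# and three of the four possible cross-cluster ortholog relations inferred.
pairs = {frozenset(p) for p in [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]}
left = {"a1", "a2"}
right = {"b1", "b2"}

fraction = edge_completeness(left, right, pairs)
merge_lenient = should_merge(left, right, pairs, min_fraction=0.65)  # illustrative value
merge_strict = should_merge(left, right, pairs, min_fraction=0.80)   # illustrative value
```

In this toy model, missing pairwise relations (for example because a short or fragmented sequence fails the score or length criteria) lower the fraction, and a lower threshold lets the two clusters merge anyway.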

Regarding the question of HOG IDs: StableIdsForGroups won't produce IDs similar to those of OMAmer. Those IDs are actually OMA browser release dependent (we try to provide a forward mapping of those IDs). The StableIdsForGroups option instead produces an AA-fingerprint that is only found in this HOG. We haven't implemented this for the bottom_up version.

Sorry for the late response. I hope it is still helpful.

Best wishes
Adrian
