OMA standalone vs. OMAmer HOGs #7
Hi @biomendi, the big difference between OMAmer and OmaStandalone is that with OMAmer we map sequences to a (large) existing dataset of precomputed Hierarchical Orthologous Groups (HOGs), usually all of OMA. The HOGs in there were initially also computed with the bottom-up approach implemented in OmaStandalone. If you compute HOGs with OmaStandalone, you don't use the knowledge that all the other genomes in OMA could bring. The HOGs are computed by recursively checking the completeness of pairwise ortholog relations among the sub-HOGs along the species tree. If you think your groups are too fragmented, this could have several reasons:
Regarding the question of HOG-IDs: The
Sorry for the late response. I hope it is still helpful. Best wishes
As mentioned in my previous issue (#6), I am a bit confused when comparing the results produced by OMA standalone and OMAmer. In particular, despite using the same dataset, I found large differences in the number of HOGs (at the topmost taxonomic level) inferred by the two programs with default settings. That is, OMAmer grouped the proteins into 4 different HOGs, while OMA standalone divided them into 16 HOGs. In the case of OMAmer, three of the HOGs each included a relatively large percentage of the species (around 80-90% of the total) and the fourth HOG a much smaller share (around 10% of the species). Nonetheless, the 4 HOGs detected by OMAmer each included a substantial share of the sequences (49%, 20%, 16%, 15%). This was not the case with OMA standalone, where 7 of the 16 HOGs consisted of only 2 sequences each, and the other HOGs also seem to be over-specific in general. I believe so because the dataset was built by joining proteins that very likely have the same function. Thus, I think there shouldn't be more than one or a few gene families.
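For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how the per-HOG percentages above could be computed. It assumes the assignments from either tool have already been parsed into `(protein_id, species, hog_id)` tuples; that input format, the function names `summarize_hogs` and `small_hogs`, and the toy data are all hypothetical and are not something either program emits directly.

```python
from collections import defaultdict


def summarize_hogs(assignments):
    """Summarize HOG sizes from (protein_id, species, hog_id) tuples.

    The tuple layout is an assumption made for this sketch; the real output
    of OMAmer or OMA standalone would have to be parsed into it first.
    """
    seqs_per_hog = defaultdict(set)
    species_per_hog = defaultdict(set)
    all_species = set()
    for protein_id, species, hog_id in assignments:
        seqs_per_hog[hog_id].add(protein_id)
        species_per_hog[hog_id].add(species)
        all_species.add(species)

    total_seqs = sum(len(s) for s in seqs_per_hog.values())
    total_species = len(all_species)

    summary = {}
    for hog_id, seqs in seqs_per_hog.items():
        summary[hog_id] = {
            "n_sequences": len(seqs),
            # Share of all assigned sequences that fall into this HOG.
            "pct_sequences": 100.0 * len(seqs) / total_seqs,
            # Share of all species that have at least one member in this HOG.
            "pct_species": 100.0 * len(species_per_hog[hog_id]) / total_species,
        }
    return summary


def small_hogs(summary, max_size=2):
    """Return the sorted IDs of HOGs with at most `max_size` sequences."""
    return sorted(h for h, s in summary.items() if s["n_sequences"] <= max_size)


# Toy data, purely illustrative (not taken from the dataset discussed here).
toy = [
    ("p1", "speciesA", "HOG1"),
    ("p2", "speciesB", "HOG1"),
    ("p3", "speciesC", "HOG1"),
    ("p4", "speciesA", "HOG2"),
]
toy_summary = summarize_hogs(toy)
```

Running the same summary on the output of both programs makes it easy to see how many groups are tiny (for example with `small_hogs(toy_summary)`) and how the sequences are distributed across groups.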
I noticed that OMAmer includes a flag called "--threshold" which could help reduce the number of HOGs. In the case of OMA standalone, however, there seem to be several parameters (MinScore, LengthTol, MinSeqLen) that affect how stringent the grouping is. I believe the most important one is "MinScore" (default value is 181). Unfortunately, I have little knowledge about the algorithms behind the two programs, so it is not clear to me what a more stringent value would be in either case. How can I determine the best MinScore value for my dataset?
Moreover, I would like to know if the "top-down" algorithm in OMA standalone (combined with the use of "StableIdsForGroups") is able to determine the actual HOG ID in a similar way as OMAmer does (e.g. "HOG:B0561231" instead of "HOG1")?