[Initiative]: Annotate Ersilia's models following BioModels standards #1059

Open
2 of 5 tasks
miquelduranfrigola opened this issue Mar 11, 2024 · 68 comments

@miquelduranfrigola
Member

miquelduranfrigola commented Mar 11, 2024

Summary

We have partnered with BioModels at EMBL-EBI (Hinxton) to explore potential ways to incorporate Ersilia's models into the well-established BioModels resource.

Of note, BioModels model annotation is based on ontologies as reported in the Ontology Lookup Service. We expect to reach similar standards thanks to the current project.

Scope

Initiative 🐋

Objective(s)

The objectives of the project are the following:

  1. Incorporate Ersilia's models into BioModels (metadata only).
  2. Adopt an ontology-based model annotation procedure for Ersilia that is harmonized with that of BioModels.
  3. Set the basis for a more ambitious incorporation of models based on ONNX format.

Team

Role & Responsibility Username(s)
DRI / Lead Developer @Zainab-ik
Project Manager @miquelduranfrigola

@Zainab-ik is currently doing an internship at EMBL-EBI in the BioModels team.

Importantly, @Zainab-ik will meet with @miquelduranfrigola twice a week to report progress and decide next steps. Before each meeting, @Zainab-ik will update the corresponding model issues and, after the meeting, action items will be reflected in the issues.

Timeline

The project timeline is still up for discussion. These are some tentative milestones:

  • Incorporate metadata only of a simple model into BioModels (e.g. antimalarial activity prediction).
  • Incorporate metadata only of a more complex model into BioModels, potentially involving multiple outputs (e.g. H3D models).
  • Define ontology-based rules to improve Ersilia's metadata, harmonizing it with BioModels standards.
  • Incorporate metadata only for a substantial number of models.
  • Incorporate at least one model in ONNX format.

Documentation

A backlog of models can be found in the Ersilia BioModels Spreadsheet. This spreadsheet should act as a centralized resource to keep track of progress.

The shared folder in Google Drive can be accessed here.

@miquelduranfrigola added the documentation label Mar 11, 2024
@miquelduranfrigola
Member Author

Hello @Zainab-ik, as discussed, let's start by conceiving an issue template to prompt discussion about each model individually.

I suggest that we start by doing this in the current antimalarial model, then we can replicate the template to other models as we see fit. In my opinion, the template should not be too complex.

@miquelduranfrigola
Member Author

@Zainab-ik here are some questions in preparation for our meeting with Sheriff. Feel free to add more:

  • What is the minimum and maximum number of qualifiers in a model? How many are recommended?
  • Is there a convention for naming models? Is it the year & title of publication?
  • Is there a structure or guidelines for model descriptions?
  • Many papers have extra analysis not directly related to the model. For example, dimensionality reduction with UMAP, or clustering. Do we need to include these in the metadata?
  • Do you have any experience with the chemical information ontology?

@Zainab-ik

I'll work on the issue template.
Note: The Ersilia BioModels spreadsheet seems to be empty.

@Zainab-ik

Zainab-ik commented Mar 12, 2024

Adding to the questions above for our meeting with Sheriff:

  • For the machine learning ontology, which standard should we stick to: OBCS, MCRO, or STATO?
  • When no ontology term exists for a metadata item, what is the way forward?
  • When the ontology term and the research paper's term differ, which should we use?
  • BioModels identification numbers: how are they assigned? This might not be so important, but we can ask.
  • Requesting new ontology terms, for example for ZairaChem.

@miquelduranfrigola
Member Author

I'd be working on the issue template. Note: The Ersilia BioModel spreadsheet seems to be empty.

Yes it is empty for now. Please add the two models that we are currently working on and then we will add more.

@Zainab-ik

Zainab-ik commented Mar 13, 2024

Update
After meeting with Sheriff:

  • Convention for naming models: first author, year, and a one-line description.
    For example: Swanson2023 - ADMET Properties Predictions (reference: eos7d58).
  • Model descriptions are free text. The Ersilia model description is best used.
  • Extra analysis not related to the model's function should not be included in the annotation. It's a high-level annotation with a focus on discoverability.
  • The Chemical Information Ontology (CHEMINF) is quite interesting and suits the models better. It provides qualitative attributes for chemical entities.
  • A better standard for ML ontologies would be STATO, which encompasses quite a number of ML terms. I'm still exploring other options.
  • Other important ontologies to consider: BioAssay Ontology, Software Ontology.
  • If a metadata term doesn't exist in the ontology search, a new term can be added to it.
  • The ontology search term should be the standard in case of a metadata clash.
  • A resolver, identifiers.org, is used to standardize references, e.g. PubMed URLs.
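
As a small illustration of the naming convention above (first author, year, one-line description), here is a minimal Python sketch. The function name `biomodels_name` is hypothetical and not part of any BioModels or Ersilia tooling; it only shows how the three parts combine.

```python
def biomodels_name(first_author: str, year: int, description: str) -> str:
    """Build a name of the form '<FirstAuthor><Year> - <one-line description>'.

    Hypothetical helper for illustration only.
    """
    # Drop spaces so multi-word surnames still give a single token.
    author = first_author.strip().replace(" ", "")
    return f"{author}{year} - {description.strip()}"


name = biomodels_name("Swanson", 2023, "ADMET Properties Predictions")
print(name)  # Swanson2023 - ADMET Properties Predictions
```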

@miquelduranfrigola Am I missing anything?

@miquelduranfrigola
Member Author

Thanks @Zainab-ik - this is very useful. I don't think anything is missing.
Perhaps just mention that BAO is also an important ontology to consider.

@Zainab-ik

Update!!!

  • Regarding citation
    Sheriff mentioned there's an option to indicate the modeller while uploading the annotation files. The modeller is the person who incorporates the model into the Ersilia Model Hub. He mentioned he'd discuss this with @GemmaTuron.

  • Model annotation
    I've completed the first two annotations and compared them with the initial annotation. I think ours is more detailed.
    I included more model properties and used ontologies closer to the chemistry terms.
    However, there are a couple of things to be done before finalizing.
    Some ontologies aren't registered with the resolver, and I'm currently making requests for them. The annotations will be updated once the ontologies are published in the resolver registry. We are using the resolver for safe referencing and to standardize the URLs.
    Although they are not yet finalized, I've added the two models, eos80ch and eos7kp, for review.
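
To make the resolver idea concrete: identifiers.org resolves compact identifiers of the form `prefix:accession`. The sketch below builds such a URL for a PubMed reference. The helper name `compact_identifier_url` and the PubMed ID used are placeholders for illustration, and the prefix check is a simplifying assumption rather than the registry's actual validation rules.

```python
import re

IDENTIFIERS_ORG = "https://identifiers.org"


def compact_identifier_url(prefix: str, accession: str) -> str:
    """Return a resolvable identifiers.org URL for prefix:accession.

    Hypothetical helper; the prefix pattern below is a loose assumption.
    """
    prefix = prefix.strip().lower()
    accession = accession.strip()
    if not re.fullmatch(r"[a-z0-9_.]+", prefix):
        raise ValueError(f"unexpected prefix: {prefix!r}")
    if not accession:
        raise ValueError("accession must not be empty")
    return f"{IDENTIFIERS_ORG}/{prefix}:{accession}"


# Placeholder PubMed ID, not a reference to a real Ersilia model publication.
print(compact_identifier_url("pubmed", "12345678"))
```

A prefix that is not yet registered with the resolver will not resolve, which is why the ontology registration requests mentioned above need to be completed first.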

@Zainab-ik

GitHub Issue Template

While discussing with @miquelduranfrigola, he suggested I create an issue template, open an issue for each model I'm annotating, link them to this main issue to keep track of the work, and finally close them after the model is uploaded to the BioModels repository.

Using the Ersilia issue template as a sample, I came up with a draft, and I'd like a review before incorporating it into each model repository.
BioModel Incorporation Issue

I'd also like to ask how the issue will be used, considering we'd have to open it in each model repository and not in the general repository.

@GemmaTuron
Member

Hi @Zainab-ik

After our meeting today, please:

  • Go ahead and open the issues in the two models we are working on, following your proposed template. We will try it out and, once we are happy with it, we will upload it to all repos as a template.
  • Add the publications of the models in the folder
  • Finish the model annotations for both and add any questions / comments you might have on the issues, so we can initiate a discussion

From my side, I'll prioritize some further models for annotation. And we have decided that, once we have completed the annotation of at least 10 models, we will start thinking about:

  • validation of the models
  • automatically storing biomodel annotations in Ersilia

@Zainab-ik

Zainab-ik commented Mar 20, 2024

Following the meeting:

  • Step 1: I've created the issues in each model repository, linked here: eos80ch and eos7kbp.
  • Step 2: Both models' publications are uploaded to the folder and linked here: eos80ch and eos7kpb.

I'll work on completing the annotation. I've sorted out the compact identifiers with the EBI team. I'll also try uploading one model to BioModels with Sheriff to give a sample of what the issue template information would look like.

@GemmaTuron
Member

Hi @Zainab-ik

Thanks! This is looking good. As I stated in the model issues, I suggest we have two issues: one for discussion, and one we will only open once we know which data from BioModels we want to store in Ersilia as well.
If you agree, let's go ahead and use the open issues to create those "discussion" issues around models eos80ch and eos7kbp, so we can fully annotate these two and then proceed to the next ones.
I'd say the second issue, to collect data from BioModels for storing in Ersilia, can be built once we have at least 10 models annotated and know better what kind of information we want to collect.

@Zainab-ik

Hi @GemmaTuron

I've created the discussion issues for eos80ch and eos7kpb.

I've completed the annotation of eos80ch and I'd like your review before uploading.
The annotation of eos7kpb should be completed before tomorrow.
I'll make changes to the uploaded file since it's not a Google Sheet.

@GemmaTuron
Member

Thanks @Zainab-ik !
I have a few suggestions on the discussion template, let me know your thoughts

@Zainab-ik

Zainab-ik commented Mar 22, 2024

Hi @GemmaTuron

I've worked through the suggestions.
I completed the annotation for the 2 models, updated the link, and added metadata information for eos7kbp. I'm clear on the eos80ch model, and it's been uploaded. I'll share it when it's available to the public, which should be by tomorrow.

Do I go ahead and start working on the priority models in the sheet?

Also, there's an option of opening an account on BioModels to review submissions.
BioModels offers a few ways to collaborate on, review, or access models:

  1. Invite your team, colleagues, or contributors to open an account on BioModels, then add them as model contributors. You can grant these contributors write or read permission.
  2. Request a reviewer account. This option is meant for when a scientific manuscript is in the middle of the review process and reviewers ask to look into your model. Of course, this type of account only gives read access.

I think option 1 applies to us. I could share my submission for review.
Either @GemmaTuron or @miquelduranfrigola, or both, can have an account. What do you think?

@miquelduranfrigola
Member Author

@GemmaTuron feel free to take the lead here 👍
Thanks @Zainab-ik for a very clear update.

@GemmaTuron
Member

Hi @Zainab-ik !

Thanks, good start! Feedback from today's meeting:

  • Let's consolidate both models, eos80ch and eos7kbp: some fields are only present in one but apply to both (like Blood, Machine Learning...). We will annotate these two models with the maximum depth possible; please make the changes we discussed in the meeting.
  • Get feedback from the BioModels team on which fields are redundant and should not be added (like Machine Learning).
  • Improve DOME annotation with more granularity
  • Start working on the three other models in the list

If you are done with all the tasks before our next meeting, I suggest you have a look at the model incorporation that is still midway, but this is lower priority.

@Zainab-ik

Zainab-ik commented Mar 25, 2024

Feedback from BioModels (Sheriff)!

  • For proprietary data, a URL should be added if it's available. If not, the data should still be mentioned in the metadata for transparency. For eos7kbp, I added it and annotated it with a suitable ontology since there's no URL available.
  • A general output can be added for comprehensiveness; however, a specific output is preferred. I retained the general output; let me know if I should do otherwise.
  • Broader terms like machine learning and artificial intelligence should be added to enhance findability. BioModels will undergo an upgrade and broader terms might not be essential in the future, but for now they are important.
  • Infectious disease is a central theme across all Ersilia models, so annotating the models with infectious disease is important and should be standard.
  • Since active/inactive can be central to the training data, it should be included.
  • For terms with synonyms (e.g. compound/molecule), one is enough. I picked compound since it's the more suitable term.

I've incorporated all the feedback into the two models. I believe both models are fully annotated.

Based on the feedback

The following are (or will be) standard metadata in all models:

  • Infectious diseases
  • Compounds
  • Small molecules
  • Active & Inactive
  • Machine learning & Artificial Intelligence
  • Drug discovery
  • SMILES (input)
  • Ersilia implementation
  • PubMed ID
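
A checklist like this is easy to verify automatically before an upload. The sketch below is only an illustration: the term spellings in `STANDARD_TERMS` and the function `missing_standard_terms` are assumptions made for this example, not an agreed Ersilia or BioModels vocabulary.

```python
# Standard terms taken from the list above; spellings are an assumption.
STANDARD_TERMS = {
    "infectious diseases",
    "compounds",
    "small molecules",
    "active",
    "inactive",
    "machine learning",
    "artificial intelligence",
    "drug discovery",
    "smiles",
    "ersilia implementation",
    "pubmed id",
}


def missing_standard_terms(annotation_terms):
    """Return the standard terms absent from a model's annotation, sorted."""
    present = {term.strip().lower() for term in annotation_terms}
    return sorted(STANDARD_TERMS - present)


# Hypothetical annotation that lacks two of the standard terms.
example = [
    "Infectious diseases", "Compounds", "Small molecules", "Active",
    "Inactive", "Machine learning", "Artificial Intelligence",
    "Drug discovery", "SMILES",
]
print(missing_standard_terms(example))
```

Matching is case-insensitive here because the list above mixes capitalization styles.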

@Zainab-ik

Zainab-ik commented Mar 25, 2024

Update!!!

DOME annotation completed and both models are up on BioModels.
eos7kpb - https://www.ebi.ac.uk/biomodels/MODEL2403270001
eos80ch - https://www.ebi.ac.uk/biomodels/MODEL2403270002

This has been linked in the respective repository.

@Zainab-ik

Zainab-ik commented Mar 28, 2024

eos46ev !!!

  • I opened an issue here and listed a few comments from both papers and the Ersilia implementation, also listed below:
  1. For this model, four ML algorithms were used to build the models. I added all of them to the metadata, considering that the final model (deployed to the web server) is a combination of all four.
  2. Although XGBoost is stated to be the best, the final model is a fusion of four algorithms: Random Forest, Deep Neural Network, Support Vector Machine, and XGBoost.
  3. Looking at the repository, I realised Ersilia only implemented the XGBoost model. Does that make the other algorithms unimportant as metadata?

More detailed comments and questions are in the issue here.
The curation/annotation is complete and can be accessed here.

@Zainab-ik

Zainab-ik commented Mar 28, 2024

eos4e40 !!!

  • I opened an issue here and added comments, also listed below:
  1. Halicin was discovered with the DNN model. As an important part of the paper, would it be important metadata? And in which category (property or output)?
  2. Halicin has bactericidal activity against Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. Do these classify as biological properties of the model?

A quick question

I realized that the use of the terms active, inactive, hit, and non-hit when describing data binarization depends on the paper. How do we pick a standard, then? They are all mapped to ontology terms except non-hit.

The curation/annotation can be accessed here.

@Zainab-ik

Zainab-ik commented Mar 28, 2024

eos5xng !!!

  • I opened an issue here and added comments, also listed below:
  1. ESKAPE pathogen inhibition is the experimental validation of the AI model, if I'm right? If so, those pathogens do not classify as taxonomy in the metadata.
  2. For model training and prediction, both classification and regression tasks were performed. The Ersilia model only performs classification, so that should be the only one included in the metadata, right?
  3. Both RMSE and MAE scores are evaluation metrics for regression tasks, if 2 is yes, then both methods would apply.

The curation/annotation completed and linked here

@Zainab-ik

An open-ended Question

"How much of the model properties i.e. core model properties (e.g., packages, libraries, open source software) should be curated and annotated?"
Examples below;

  • XGBoost python package
  • Keras deep learning python package
  • TensorFlow
  • AtomPairs fingerprints, etc.

@GemmaTuron
Member

Hi @Zainab-ik,

Good job, thanks for the updates, please find below some comments:

  • I do not understand this sentence: For Proprietary data, URL should be added if it's available. If not, it should be included in the metadata for transparency. For eos7kbp, I added it and annotated it with a suitable ontology since there's no URL available. As it is proprietary data, it will never have an available URL as the data is not shared. What do you mean you have added it?
  • Regarding the updated models, please do not update them on BioModels until I have revised them and given the final OK. Remember to use this Excel file to track progress: if a model is still marked "To review", it has not yet been approved. This way we can be sure all the information in BioModels is 100% correct.
  • Some of the links on the BioModels website seem broken. Could you check that?
  • Let's consolidate the tags for all models. Can you share with me the list of available tags?
  • Are Active / Inactive properties or Outputs?

@Zainab-ik

Hi @Zainab-ik,

Good job, thanks for the updates, please find below some comments:

Thank you @GemmaTuron

  • I do not understand this sentence: For Proprietary data, URL should be added if it's available. If not, it should be included in the metadata for transparency. For eos7kbp, I added it and annotated it with a suitable ontology since there's no URL available. As it is proprietary data, it will never have an available URL as the data is not shared. What do you mean you have added it?

For this, I added the H3D Proprietary term as metadata and annotated it with a suitable ontology and the ontology link. I didn't mean that I added a link to the proprietary data. Sheriff mentioned the term should be added for transparency.

  • Regarding the updated models, please do not update them on BioModels until I have revised them and given the final OK. Remember to use this excel to track progress, if the model is still "To review" means it has not yet been approved - this way we can be sure all the information in biomodels is 100% correct

Noted @GemmaTuron. That was uploaded as a sample to give an idea of how the overview would look and to see whether the Ersilia team has any comments or would like any changes. I'd appreciate feedback on that. The upload can always be updated.

  • Some of the links in the BioModels website seem broken, could you check that?

I'll inform the BioModels team. Could you please specify which ones, so I can mention them exactly?

  • Let's consolidate the tags for all models. Can you share with me what is the list of available tags?
[Screenshot of the available BioModels tags]

This is the list of tags available. A new one can be proposed if that would be more suitable for Ersilia models.

  • Are Active / Inactive properties or Outputs?

They are properties. More like data properties very relevant to the model.

@Zainab-ik

Zainab-ik commented Apr 23, 2024

  • I created a new tag in BioModels called Ersilia and that'd be attached to all models.

Antimicrobial models annotation

  1. eos24jm - issue

  2. eos5cl7 - issue

  3. eos18ie - issue

Questions

  • Can all the drug discovery models be referred to as a QSAR model?
  • If an animal model is used to perform experimental validation of the model, should that be added as a biological property of the model, i.e. taxonomy?

@Zainab-ik

SARS-CoV-2 model annotation

  1. eos8fth - issue
  2. eos4cxk - issue
  3. eos9f6t - issue

Regarding eos9f6t: the publication here is the same as for eos4e40, but this model predicts SARS-CoV-2 inhibition. The paper discusses antibiotics, whereas a SARS-CoV-2 model should be antiviral. Can you clarify, please?

@GemmaTuron
Member

GemmaTuron commented Apr 26, 2024

Hi @Zainab-ik !

Good job, thanks for keeping it up! I have answered your questions in the respective models, and the general ones below:

  • "I created a new tag in BioModels called Ersilia and that'd be attached to all models." Fantastic!

Questions

  • "Can all the drug discovery models be referred to as a QSAR model?" At the moment, most of the models we have are QSAR, yes, but that might not be true in the future. @miquelduranfrigola what do you say here?

  • "If an animal model is used to perform experimental validation of the model, should that be added as a biological property of the model, i.e. taxonomy?" I don't think so. This is related to the validation, not to how the dataset for the model was built.

  • "The publication here is the same as eos4e40 but this is SARS-CoV-2 inhibition. The paper is discussing antibiotics but SARS-CoV-2 should be antiviral. Can you clarify please." The antiviral model does not have a publication per se, but they developed it in parallel with the antibiotic predictor, using ChemProp. Since the antibiotic prediction paper is the one that describes the original ChemProp development, it is the most appropriate citation.

@Zainab-ik

Thank you @GemmaTuron

  • On QSAR ("most of the models we have are QSAR yes, but that might not be true in the future"): That's great. That would mean QSAR should be a constant metadata term, right? Just a thought: can a generative model be classified as QSAR too?

  • On animal models used for experimental validation ("this is related to the validation but not how the dataset for the model was built"): Okay, that's clarified. What if an experimental method (in vivo, specifically) is used to generate the dataset? Should the experimental method and the in vivo model then be added as metadata?

  • On the eos9f6t citation ("Since the antibiotic prediction paper is the one which describes the original ChemProp development, is the most appropriate citation"): The metadata would be the same, except for the organism and output, plus an added antiviral metadata term.

@Zainab-ik

@GemmaTuron All the SARS-CoV-2 models listed above (eos8fth, eos4cxk, eos9f6t) are ready for review.

@Zainab-ik

Zainab-ik commented May 7, 2024

Hi @GemmaTuron

A few clarifications from the meeting:

  • Experimental methods appear in both data generation and model validation. They could be represented in the annotation and curation as follows:
  1. If it's data generation: a DOME annotation identifying the data source, using metadata like in vivo or in vitro, e.g.
    in-vivo model - data source
    in-vitro model - data source
  2. If it's model validation: a DOME annotation identifying evaluation, e.g.
    in-vivo model - model validation
    Does this best describe the experimentation part of the model?
  • Organisms without taxonomy are properties, right?
  • Model validation data sources aren't an essential part of the model and shouldn't be metadata.
  • All models are QSAR at this moment, so QSAR should be a constant metadata term.
  • Removal of not-so-important metadata, e.g. hits.
  • Evaluation metrics that were not used shouldn't be added.
  • Hackathon schedule.

@GemmaTuron
Member

Hi @Zainab-ik !
I have reviewed the models, please amend them and then upload to BioModels. A few general comments from our meeting:

  • There are general fields that do not add information. Please revise all the models and let's agree on which fields we do not want to include (Hit, Compound Identification...). Please list them here so we know we won't be using them.
  • Likewise, there are general fields that should be everywhere, like QSAR.
  • The in vitro model and in vivo model should only refer to the model validation in the laboratory, please make sure to annotate the models accordingly and use the DOME to specify
  • The libraries used for model validation should not be listed as data sources

After redoing the current models to review, let's get back to the old ones before we move onto the new ones. Feel free to reopen the issues and note the changes that should be made

@Zainab-ik

A clarification regarding the in-vivo and in-vitro, if it's used for data generation, it's not to be added, right @GemmaTuron

@GemmaTuron
Member

A clarification regarding the in-vivo and in-vitro, if it's used for data generation, it's not to be added, right @GemmaTuron

exactly, all data has been eventually generated experimentally, so it is not that relevant to collect this information

@Zainab-ik

General fields that do not add information;

  • Hit
  • Molecular representation
  • chemical libraries
  • I think MACCS keys are similar to RDKit descriptors and shouldn't be added either.

@GemmaTuron
Member

Hi @Zainab-ik

I agree with most of them, but MACCS keys are a different type of descriptor. If the model uses RDKit descriptors, we should annotate that; if it uses MACCS keys, we should annotate that instead. Maybe we should also think about whether we want to annotate all the different descriptors used.

@Zainab-ik

That's right. The only challenge is MACCS and RDKIT are the only descriptors present in OLS that can be annotated.

@Zainab-ik

Zainab-ik commented May 9, 2024

New Models

  • eos4zfy - issue
  • eos6hy3 - issue
    This publication is the same as for eos4cxk, and the same rule that applies to eos4e40 and eos9f6t can apply here, right?
  • eos42ez - issue
    This publication is the same as for eos18ie, and the same rule applies.
  • eos31ve - issue
    This is also the same publication as for eos9yy1.

@Zainab-ik

Antimicrobial and COVID models uploaded to BioModels

@GemmaTuron
Member

Hey @Zainab-ik

Before starting with new models, can you have a look at the existing ones and make sure they all comply with the latest decisions we have made? Note down here any changes that had to be made in the annotations.

thanks!

@Zainab-ik

Yes, working on that.

@Zainab-ik

Zainab-ik commented May 10, 2024

Previous model review
Summary: removed general metadata and confirmed experimental validation.

  • eos46ev - Removed unnecessary metadata (e.g. molecular representation); confirmed there's no experimental validation.
  • eos5xng - Edited the metadata. Removed: hit selection, chemical library, compound, validation dataset, in silico approach (it's also a general term). Added in vitro experimental validation.
  • eos4e40 - The model was validated experimentally in vivo and in vitro, so both metadata were added. QSAR added, data source confirmed, and non-specific metadata removed (e.g. chemical library, molecular representation).
  • NCATS CYP models: eos44zp, eos5jz9, eos7nno, eos3ev6.
    Metadata removed: chemical library, hit, molecular representation.
  • NCATS permeability models: eos81ew, eos9tyg.
    Metadata removed: molecular representation, permeability assay (there's already a PAMPA metadata term).
  • NCATS stability models: eos5505, eos9yy1.
    Metadata removed: in silico model, molecular representation, chemical library, CYP metabolism (doesn't fit the context of the model), compound stability.
  • NCATS solubility model: eos74bo.
    Metadata revised: organic molecule, hit.
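
The recurring removals in this review (hit, molecular representation, chemical library) could be applied as a simple filter over each model's term list. This is only a sketch under that assumption: `EXCLUDED_GENERAL_TERMS` and `drop_general_terms` are hypothetical names, and model-specific removals such as "CYP metabolism" would still need manual judgment.

```python
# General terms agreed to add no information; the set is an assumption
# based on the removals listed in this thread.
EXCLUDED_GENERAL_TERMS = {"hit", "molecular representation", "chemical library"}


def drop_general_terms(terms):
    """Return the terms with the agreed general terms removed, order kept."""
    return [t for t in terms if t.strip().lower() not in EXCLUDED_GENERAL_TERMS]


# Hypothetical term list for a single model.
before = ["QSAR", "Hit", "PAMPA", "Molecular representation", "Chemical library"]
print(drop_general_terms(before))  # ['QSAR', 'PAMPA']
```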

@Zainab-ik

Zainab-ik commented May 12, 2024

Regarding the first 2 models; eos7kpb, eos80ch

  • eos7kpb ;
    Physicochemical Assays
    Clearance
    Solubility assay
    cytotoxicity
    Aqueous solubility
    permeability assay
    Microsomal metabolic stability
    These metadata aren't integral to the ZairaChem model; I wanted to run this by you first.

  • eos80ch ;
    Removed the following metadata: compound screening, phenotype, molecular representation, parasites.

@GemmaTuron
Member

Hi @Zainab-ik

Good job on the corrections. As we discussed, let's leave all the biological endpoints on eos7kpb.

@Zainab-ik

Zainab-ik commented May 14, 2024

Update:
eos4zfy ready for review.

BioModels Upload;

  1. All revised models have been re-uploaded
  2. New model upload

To-do's

@Zainab-ik

Zainab-ik commented May 14, 2024

Automating Metadata Annotation using Zooma
This process involves automatically mapping metadata to the right ontology terms to speed up annotation.
I'll start with these two models.

Steps:

  1. Extract the relevant metadata manually.
  2. Paste the metadata into Zooma to annotate it.
  3. Compare the annotation accuracy with the manual annotation.

Comments/Observations

  • Biological component mapping for organisms has high accuracy.
  • Biological component mapping for properties is average.
  • Computational component mapping for properties is low.
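
Step 3 can be made quantitative by scoring, per category, how often the automated mapping matches the manual one. The sketch below assumes the Zooma results have already been exported alongside the manual annotations; it does not call Zooma itself. The function name, the category labels, and the `ONT:` identifiers are all placeholders, not real ontology terms or measured results.

```python
from collections import defaultdict


def agreement_by_category(rows):
    """Fraction of automated mappings that match the manual mapping.

    rows: iterable of (category, manual_term, automated_term) tuples, where
    automated_term is None when the tool returned no mapping.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for category, manual_term, automated_term in rows:
        totals[category] += 1
        if automated_term is not None and automated_term == manual_term:
            matches[category] += 1
    return {category: matches[category] / totals[category] for category in totals}


# Placeholder identifiers for illustration only.
example_rows = [
    ("organism", "ONT:0001", "ONT:0001"),
    ("organism", "ONT:0002", "ONT:0002"),
    ("biological property", "ONT:0003", "ONT:0003"),
    ("biological property", "ONT:0004", "ONT:0009"),
    ("computational property", "ONT:0005", None),
    ("computational property", "ONT:0006", "ONT:0010"),
]
scores = agreement_by_category(example_rows)
print(scores)
```

Reporting one score per category keeps the comparison aligned with the observations above, where accuracy differs between organism, biological property, and computational property mappings.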

@Zainab-ik

Zainab-ik commented May 17, 2024

Coloring molecules model annotation

All ready for review.

More permeability model annotation

  • eos97yu - issue
  • eos2hbd - issue
    Ready for review.

@Zainab-ik

Zainab-ik commented May 21, 2024

New model Annotation - In Progress

eos2lqb - issue
eos6oli - issue
eos7d58 - issue
eos8lok - issue

Note: I've been working with a lot of regression models recently, which is quite exciting. One of the evaluation metrics is root-mean-square error (RMSE), which, from my reading, I believe is also known as root-mean-square deviation (RMSD). On OLS, RMSE doesn't exist but RMSD does, so I've been using that in my annotation.

@GemmaTuron
Copy link
Member

GemmaTuron commented May 23, 2024

Hi @Zainab-ik !

I'm having a look at the models you are annotating. Let me know when the Excel files are ready. RMSE and RMSD are the same ;)
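
For reference, RMSE and RMSD both name the square root of the mean squared difference between observed and predicted values, so annotating with the RMSD term loses nothing. A minimal sketch follows, with MAE included for comparison since it is the other regression metric mentioned in this thread; the example values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error, also called root-mean-square deviation."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values: the errors are 3 and 4.
observed = [3.0, 5.0]
predicted = [0.0, 1.0]
print(rmse(observed, predicted))  # sqrt((9 + 16) / 2) = sqrt(12.5)
print(mae(observed, predicted))   # (3 + 4) / 2 = 3.5
```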

@Zainab-ik

Alright, Thanks @GemmaTuron

@Zainab-ik

Hi @GemmaTuron
All the new models (eos2lqb, eos6oli, eos8lok) are ready for review except eos7d58. It has a broad output, and I'd like to confirm whether all the outputs are incorporated into the Ersilia version.

@Zainab-ik

Zainab-ik commented May 29, 2024

Grover Models

  • Is Grover a framework/code base like Chemprop that is fine-tuned and trained on different datasets for different outputs?
  • What are labelled and unlabelled molecular data?
  • What's the difference between pre-training and training an ML/DL model?
  • There's no clear mention of how the models were evaluated, except for a comparison with other models based on the mean and standard deviation. There's also a mention of % relative improvement; can that be classified as accuracy? Are these regarded as the model evaluation metrics? (In the author-feedback section, AUC-ROC was mentioned as the metric for comparison, so this is the metric for Grover.)
  • How is the fine-tuning task evaluated? Say Grover was trained to predict water solubility, as is the case for grover-esol (eos8451): how is the model's performance judged to be good or not? In the supplementary file, ROC-AUC is the metric for the classification tasks, RMSE is the metric for the physical chemistry regression tasks, and MAE is the metric for the quantum mechanics regression tasks (it feels like I'm answering myself 🙂).
  • Can you kindly clarify validation loss and training loss?
    Thanks.

General comments about the Grover model

  • The metadata is determined by the task the Grover model is fine-tuned on.
  • Grover was leveraged for molecular property prediction and task-specific fine-tuning. We'll annotate the task-specific fine-tuning, taking note of the specific dataset, the type of task (classification/regression), and the predictions.
  • In the context of data-splitting for fine-tuning, active and inactive suits...

@Zainab-ik

Zainab-ik commented May 30, 2024

eos7w6n - This is the base model (GROVER) that was fine-tuned on task-specific datasets.

Grover Models - Annotation in Progress (Metadata extraction and curation done)

@Zainab-ik

All models ready for review.
