✍️ Contribution period: Adhithyan vp #1025

Closed
15 tasks done
Adhivp opened this issue Mar 7, 2024 · 32 comments

@Adhivp

Adhivp commented Mar 7, 2024

Week 1 - Get to know the community

  • Join the communication channels
  • Open a GitHub issue (this one!)
  • Install the Ersilia Model Hub and test the simplest model
  • Install Docker if needed, and test another model
  • Write a motivation statement to work at Ersilia
  • Submit your first contribution to the Outreachy site

Week 2 - Get Familiar with Machine Learning for Chemistry

  • Select a model from the list suggested in GitBook
  • Download and serve the model via the Ersilia Model Hub to ensure it works
  • Open a repository on your GitHub user with all the necessary files
  • Select and clean a dataset of 1000 molecules (example notebook 1)
  • Run predictions for the molecules on the selected model and evaluate the results

Week 3 - Validate a Model in the Wild

  • Find a suitable dataset with sufficient experimental results
  • Clean and standardize the dataset
  • Run predictions and calculate metrics.

Week 4 - Prepare your final application

  • Submit the final application in the Outreachy website
@Adhivp
Author

Adhivp commented Mar 9, 2024

image
image
Successfully Fetched the first simple model

@Adhivp
Author

Adhivp commented Mar 9, 2024

Docker installed successfully, and docker pull also worked. After the pull, I successfully served the model eos30gr.

I use a Mac M1, which is ARM-based, so some models are not supported here; sad to hear that.

Ran the second model, eos30gr, successfully, and here is the result:
image
image

@Adhivp
Author

Adhivp commented Mar 9, 2024

Motivation letter

Hi, my name is Adhithyan vp. I am a data science student from Kerala, India. The motivation that led me to choose data science is much the same as my motivation for joining this program.

It was during high school that I found my passion for computers and tech, and when I started learning Python my interest in tech grew hugely. After that, my old laptop became so slow that I couldn't use it, so on a friend's suggestion I changed my OS from Windows to Linux (Kubuntu); I recently bought a Mac M1 Air. That's when I first saw the amazing world of open source. I was really amazed to see people contributing to world-class software for free and maintaining this community. That's when I decided to choose the IT field as my career.

Then came the most difficult part: choosing a field inside tech. There were many options in front of me: cybersecurity, app development, web development, data science/AI, etc. I started trying every technology bit by bit: I took beginner hacking courses and went to some Web3 hackathons. While trying each technology, I stumbled upon DALL-E from OpenAI; ChatGPT was not famous at that time, it was just in its early stage. The ability of DALL-E to draw anything from scratch from plain text amazed me, and I decided to choose data science/AI/DL/ML as my career path.

I then chose data science as my degree in college and started following my dreams. I participated in many events and hackathons; details can be found on my LinkedIn - https://www.linkedin.com/in/adhithyanvp/. I worked on some open-source projects, all of them Python-based software. After that I really wanted to work on something that was both open source and ML-based, and these two criteria aligned perfectly with the Ersilia organisation. It also has clear documentation and guidelines on what to do and how to do it, and I found the Slack community to be very friendly. That's why I chose Ersilia.

To be honest, I don't like or want to study chemistry or to be perfect at it, but my love for ML and tech is so huge that I am willing to do the work. The Ersilia Model Hub really inspires me, as it has a lot of models in it, and my mind wants to test all of them, though I know that is not possible because of the time constraint. I really want to work on Ersilia even after this Outreachy contribution period. Please try to make it possible @DhanshreeA.

I hope I can make as many contributions to Ersilia as possible. Looking forward to completing all the tasks.
Thank you for having the patience to read my motivation letter. Have a nice day.

@Adhivp
Author

Adhivp commented Mar 16, 2024

image
Got the output successfully

@Adhivp
Author

Adhivp commented Mar 16, 2024

Successfully completed task_1 (model bias). Here is the link:
https://github.com/Adhivp/Ersilia_Tasks

@Adhivp
Author

Adhivp commented Mar 17, 2024

image
Output for reproducibility task

@Adhivp
Author

Adhivp commented Mar 17, 2024

Completed the reproducibility tasks - https://github.com/Adhivp/Ersilia_Tasks @DhanshreeA
Took Table S7 from the dataset of the original paper https://doi.org/10.1021/acs.jcim.8b00769

  • Was unable to reproduce the probability values in the paper
  • Was able to reproduce 22 molecules as hERG blockers, while the paper identified 49 molecules as hERG blockers
  • Check the notebook for the detailed analysis

@Adhivp
Author

Adhivp commented Mar 17, 2024

@DhanshreeA Please give me your valuable feedback so that I can fix anything that is wrong, and please suggest how to find a new dataset so that I can move on to the next week.
Thank you @DhanshreeA for your valuable time

@GemmaTuron
Member

Thanks @Adhivp
We will provide feedback today and you can then proceed :)

@DhanshreeA
Member

Completed the reproducibility tasks - https://github.com/Adhivp/Ersilia_Tasks @DhanshreeA Took Table S7 from the dataset of the original paper https://doi.org/10.1021/acs.jcim.8b00769

* Was unable to reproduce the probability values in the paper

* Was able to reproduce 22 molecules as hERG blockers, while the paper identified 49 molecules as hERG blockers

* Check the notebook for the detailed analysis

Thank you for your work so far, good job! It appears that the model we have retrained may not have been trained correctly thus explaining the discrepancies in the results you have obtained vs the results in the paper.

@Adhivp
Author

Adhivp commented Mar 22, 2024

OK, thank you @DhanshreeA for looking into the reproducibility problem. Can I get guidance on what to do next?

@Adhivp
Author

Adhivp commented Mar 24, 2024

I really wanted to do the third task from the task list and had the time to do so, but out of respect for @GemmaTuron's message in the Slack channel asking us not to start it, I didn't.
As both of my tasks were already finished with no further changes needed, I decided to run one more dataset, Table S6, for the second task and to improve the tasks as much as I can.

Took Table S6 from the dataset of the original paper https://doi.org/10.1021/acs.jcim.8b00769

@Adhivp
Author

Adhivp commented Mar 24, 2024

Then the model eos30gr started showing issues: it began returning null outcomes. I tried everything, including standardising the input and giving simple inputs. I also tried other models, and everything worked fine for them.

@Adhivp
Author

Adhivp commented Mar 24, 2024

I then searched the whole Slack channel and the GitHub issues. Finally, in a thread, I found that @GemmaTuron had suggested fetching with the --from_github flag; I tried that too, still with no result.

@Adhivp
Author

Adhivp commented Mar 24, 2024

Instead of giving up, I ran the model on Google Colab. It took a whole 4 hours to get the output (and a bug in my code wasted another 4 hours), so in total I got the output after 8 hours (don't worry, I just left it running before going to sleep). Here are the results.
Screenshot 2024-03-24 at 8 00 58 AM

Screenshot 2024-03-24 at 8 01 08 AM
Screenshot 2024-03-24 at 8 01 32 AM

@Adhivp
Author

Adhivp commented Mar 24, 2024

Then I did the analysis as usual, and here are the conclusions.

  • The predicted values don't match the values in the research paper
  • The values are entirely different from the paper, as the graphs above show
  • With a threshold of greater than 0.5, 410 molecules are predicted as blockers and 1318 as non-blockers
  • In the original research paper, out of 1,728 molecules, 526 are considered positive and the remaining 1202 negative
  • 324 molecules match as blockers in both datasets
  • So the probability values could not be reproduced
  • Of the 410 predicted blockers, only 324 agree with the paper; the remaining 86 are false positives
  • More details with charts can be seen in this notebook (https://github.com/Adhivp/Ersilia_Contributions/blob/main/notebooks/eos30gr%20(main)/01_model_reproducibility(Table%20S6).ipynb)
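
The comparison in the list above can be sketched as follows. This is a minimal illustration, not the code from the notebook: the function name `compare_blockers` and the toy probabilities are hypothetical, and only the three counts at the bottom (410, 526, 324) come from this comment.

```python
def compare_blockers(pred_probs, paper_labels, threshold=0.5):
    """Count agreement between thresholded predictions and reference labels.

    pred_probs: predicted blocker probabilities (floats in [0, 1]).
    paper_labels: reference labels (1 = blocker, 0 = non-blocker).
    """
    tp = fp = fn = tn = 0
    for prob, label in zip(pred_probs, paper_labels):
        predicted = 1 if prob > threshold else 0  # "greater than 0.5" rule
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Toy data for illustration only (not taken from Table S6).
toy_probs = [0.91, 0.62, 0.48, 0.10, 0.55, 0.30]
toy_labels = [1, 0, 1, 0, 1, 0]
toy_counts = compare_blockers(toy_probs, toy_labels)
print(toy_counts)

# Counts reported above: 410 predicted blockers, 526 in the paper, 324 shared.
predicted_blockers, paper_blockers, shared = 410, 526, 324
precision = shared / predicted_blockers  # predicted blockers confirmed by the paper
recall = shared / paper_blockers         # paper blockers recovered by the model
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Read this way, the reported counts say that about four in five predicted blockers agree with the paper, while a sizeable share of the paper's blockers is missed at this threshold.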

@GemmaTuron
Member

Hi @Adhivp

Thanks for your conclusions, which are right as there is a slight mismatch between the results in the paper and the model used in the ersilia implementation that we are currently fixing.

As we are in the last week of the contribution period, please go ahead and start preparing your final application since mentors will only be reviewing those this week.

@Adhivp
Author

Adhivp commented Mar 25, 2024

Thanks @GemmaTuron

@Adhivp
Author

Adhivp commented Mar 25, 2024

As I was told not to do task 3 and I had enough time, I built and deployed a Streamlit app showcasing all my work for the contribution period. It provides features such as fully interactive graphs (which the static notebooks on GitHub cannot show) and an easily navigable interface, along with a full summary of what I have done and background research on the model and the hERG gene. It took me some time to build this app, and I had many issues while deploying it, but after those hardships my hard work paid off, as I got a fully working app.

@Adhivp
Author

Adhivp commented Mar 25, 2024

I tried my best to make the app visually appealing and to make the graphs easy to access for mentors or anyone else using it. The minor issues I faced while building the app can be traced through the fix commits in my original repo.

@Adhivp
Author

Adhivp commented Mar 25, 2024

This is the link to my app - https://ersilia-contributions.onrender.com
(It is hosted on Render's free tier, so it may occasionally lag)

This is the link to the subfolder of my repo with app files - https://github.com/Adhivp/Ersilia_Contributions/tree/main/streamlit_app

@Adhivp
Author

Adhivp commented Mar 25, 2024

@DhanshreeA and @GemmaTuron Please review my final work before I submit the final application. Please give me your valuable feedback so that I can fix anything that is wrong. Your words are also an inspiration for me and help me work on new, innovative ideas like this.

@Adhivp
Author

Adhivp commented Mar 25, 2024

The graphs are fully interactive, so please feel free to play around with them and suggest anything else I could add to the app.

@DhanshreeA
Member

Hi @Adhivp what can I say, the app looks fun, I hope it was equally fun to build it. I am going to reiterate Gemma's words, please start working on your final application. You will not be penalized for not finishing task 3 due to delayed feedback.

@Adhivp
Author

Adhivp commented Mar 30, 2024

Submitted the Final Application

Thank you for the review done by @DhanshreeA before submitting the application

@Adhivp
Author

Adhivp commented Mar 31, 2024

Task 3

External dataset

  • https://www.nature.com/articles/s41598-019-47536-3#Sec18
  • filename - 41598_2019_47536_MOESM2_ESM.xlsx
  • Has 87,367 molecules; I will use a random 500 positive and 1000 negative molecules for testing (1500 in total)
  • After removing the molecules in common with the model's training data (to avoid leakage), the 1500 molecules become 1287, of which 360 are positive and 927 are negative
  • In the large dataset, the first 2967 molecules are positive and all the rest are negative
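
The sampling and leakage-removal steps listed above could look roughly like this. It is a sketch under stated assumptions, not the code used for this task: `sample_without_leakage` is a hypothetical helper, the identifiers are toy strings standing in for standardized SMILES or InChIKeys, and it assumes molecules have already been standardized so that identical structures compare equal.

```python
import random


def sample_without_leakage(positives, negatives, training_ids,
                           n_pos=500, n_neg=1000, seed=0):
    """Draw a random test set, then drop molecules already seen in training."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sampled_pos = rng.sample(positives, n_pos)
    sampled_neg = rng.sample(negatives, n_neg)
    seen = set(training_ids)
    kept_pos = [m for m in sampled_pos if m not in seen]
    kept_neg = [m for m in sampled_neg if m not in seen]
    return kept_pos, kept_neg


# Toy identifiers for illustration only (hypothetical, not from the dataset).
positives = [f"pos_{i}" for i in range(50)]
negatives = [f"neg_{i}" for i in range(200)]
training_ids = {"pos_0", "pos_1", "neg_0"}

kept_pos, kept_neg = sample_without_leakage(
    positives, negatives, training_ids, n_pos=10, n_neg=20, seed=0
)
print(len(kept_pos), len(kept_neg))
```

Filtering after sampling, as here, shrinks the test set (1500 down to 1287 in this task); filtering before sampling would keep the requested sizes instead.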

@Adhivp
Author

Adhivp commented Mar 31, 2024

Did the model evaluation in Google Colab

Screenshot 2024-03-31 at 2 52 45 AM
Screenshot 2024-03-21 at 6 53 54 PM

It took 3 hours to process 1287 molecules in Google Colab

Screenshot 2024-03-31 at 12 59 45 PM

@Adhivp
Author

Adhivp commented Mar 31, 2024

Conclusion of Task3

  1. Accuracy:

    • The accuracy of the model is 74.90%, indicating that it correctly predicts the class labels for nearly three-quarters of the observations.
  2. Sensitivity (True Positive Rate):

    • The sensitivity of the model is 64.72%, indicating that it correctly identifies 64.72% of the actual positive cases.
  3. Specificity (True Negative Rate):

    • The specificity of the model is 78.86%, indicating that it correctly identifies 78.86% of the actual negative cases.
  4. Precision (Positive Predictive Value):

    • The precision of the model is 54.31%, indicating that when it predicts a positive case, it is correct 54.31% of the time.
  5. Recall (Same as Sensitivity):

    • The recall of the model is 64.72%, indicating the same as sensitivity.
  6. Negative Predictive Value:

    • The negative predictive value of the model is 85.20%, indicating that when it predicts a negative case, it is correct 85.20% of the time.
  7. Balanced Accuracy:

    • The balanced accuracy of the model is 71.79%, which is the average of sensitivity and specificity, providing a balanced view of the model's performance.
  8. Matthews Correlation Coefficient:

    • The Matthews correlation coefficient of the model is 0.41, indicating a moderate level of correlation between the predicted and true binary classifications.
  9. F1 Score:

    • The F1 score of the model is 59.06%, which is the harmonic mean of precision and recall, providing a balance between the two metrics.
  10. AUROC (Area Under the Receiver Operating Characteristic Curve):

    • The AUROC of the model is 71.79%, indicating the model's ability to distinguish between the positive and negative classes across various threshold values.
  11. R2 Value:

    • The R-squared value of the model is -0.25, which is negative, indicating that the model performs worse than a horizontal line (a horizontal line would have an R2 value of 0), suggesting that the model does not fit the data well in the context of regression analysis.
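
As a cross-check, the metrics above can be recomputed from a single confusion matrix. The sketch below is not the notebook's code. The counts TP=233, FN=127, TN=731, FP=196 are an assumption: they are inferred from the 360 positives, the 927 negatives, and the reported sensitivity and specificity, and are not stated anywhere in this thread.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute binary classification metrics from confusion-matrix counts."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Total sum of squares of the 0/1 true labels: n * p * (1 - p).
    ss_tot = (tp + fn) * (tn + fp) / n
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        ),
        # With hard 0/1 predictions, AUROC reduces to balanced accuracy.
        "auroc_hard_labels": (sensitivity + specificity) / 2,
        # R2 on hard labels: each misclassification adds 1 to the residual sum.
        "r2_hard_labels": 1 - (fp + fn) / ss_tot,
    }


# Inferred counts (assumption, see the note above).
metrics = classification_metrics(tp=233, fn=127, tn=731, fp=196)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, every reported figure is reproduced to two decimals. The AUROC matching the balanced accuracy exactly, and the R2 of -0.25, are both consistent with those two metrics having been computed from thresholded 0/1 predictions rather than from the predicted probabilities; an AUROC computed from the probabilities would normally give a different value.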

@Adhivp
Author

Adhivp commented Mar 31, 2024

@Adhivp
Author

Adhivp commented Mar 31, 2024

As per your availability, please review my last task @DhanshreeA @GemmaTuron

@Adhivp
Author

Adhivp commented Mar 31, 2024

https://ersilia-contributions.onrender.com - Added Task 3 to my app
Please feel free to check all graphs and tables as everything is made interactive and easy to use.

@Adhivp
Author

Adhivp commented Mar 31, 2024

I am delighted to have completed all my tasks, done extra work, and made an interactive app to show my results. Thank you @DhanshreeA @GemmaTuron for your support.
Also a big thanks to the community, as I could help many people and get help from them.

This journey has been really memorable.
