DataSpeak, one of the industry's largest providers of predictive analytics solutions, needed a proof-of-concept machine learning model that could automatically generate answers to user-input questions.
Text2TextGeneration, Transformers, Tokenizers, PyTorch, Hugging Face, Flan-T5 LLM, spaCy, Streamlit, Render, GPU, BeautifulSoup, Google Colab
- Developed a generative language model using `google/flan-t5-base`, fine-tuned on Stack Overflow data.
- Conducted cosine semantic similarity analysis on a generated vector embeddings database to identify the top 5 most similar questions in the dataset for user-input questions.
- Developed a web application featuring a chatbot UI that provides generative answers from the model and generates 5 alternative answers based on cosine similarity, along with percent similarity scores.
- Improved training set quality by pre-processing and normalizing raw text data.
- Achieved a 19% ROUGE-1 score and an average perplexity of 1.96.
- Kept response times under 15 seconds.
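
The top-5 similarity lookup described in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `top_k_similar` and the toy 3-dimensional vectors are hypothetical, and real question embeddings would come from an embedding model rather than being typed in by hand.

```python
import numpy as np

def top_k_similar(query_vec, embeddings, k=5):
    """Return (row index, percent similarity) pairs for the k rows of
    `embeddings` closest to `query_vec` by cosine similarity."""
    query = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(embeddings, dtype=float)
    # Normalize to unit length so a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix @ query
    # Sort by descending score and keep the k best matches.
    best = np.argsort(-scores)[:k]
    return [(int(i), round(float(scores[i]) * 100, 1)) for i in best]

# Toy vectors standing in for stored question embeddings (assumption).
stored = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
    [0.5, 0.5, 0.5],
])
matches = top_k_similar([1.0, 0.0, 0.0], stored, k=5)
for idx, pct in matches:
    print(f"question {idx}: {pct}% similar")
```

Normalizing once up front keeps the lookup to a single matrix-vector product, which matters when the stored embeddings number in the thousands.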
![Screenshot 2023-10-30 at 8 45 15 PM](https://private-user-images.githubusercontent.com/97048468/279800681-bc380430-48af-44fb-969f-198ba69053ba.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkxMDE5ODMsIm5iZiI6MTcxOTEwMTY4MywicGF0aCI6Ii85NzA0ODQ2OC8yNzk4MDA2ODEtYmMzODA0MzAtNDhhZi00NGZiLTk2OWYtMTk4YmE2OTA1M2JhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIzVDAwMTQ0M1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTgxOWUyMjFlODVhOWExMTUyZDllNGNiNGFjZDA2ZTBkNzc3NDFhOWE3NGU3NjM5NDk0NjAyMjZkYjgxN2Y3ZDEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.OI_mI0DgdRK_nrrlh76Cd9y3rvf-RIjMPnf9nXxZr9g)
Python libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, nltk, transformers, spacy, torch
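
For readers unfamiliar with the two reported metrics, here is a simplified sketch of how ROUGE-1 and perplexity are defined. It is an assumption-laden illustration rather than the evaluation code behind the numbers above: it uses plain whitespace tokenization and the F1 variant of ROUGE-1, and the helper names `rouge1_f1` and `perplexity` are hypothetical. The source does not state which ROUGE implementation or variant was used.

```python
import math
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate answer."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

def perplexity(token_log_probs):
    """Perplexity is exp of the mean negative log-likelihood per token
    (natural-log probabilities assumed)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Made-up example texts and probabilities, for illustration only.
score = rouge1_f1(
    "use a list comprehension to filter the list",
    "filter the list with a comprehension",
)
ppl = perplexity([math.log(0.5)] * 4)
print(f"ROUGE-1 F1: {score:.3f}, perplexity: {ppl:.2f}")
```

Under this definition, a perplexity of 1.96 corresponds to a mean negative log-likelihood of roughly 0.67 nats per token.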