TF4ces Search Engine

An experiment-driven search engine project that indexes documents and retrieves the best matches for a query using an ensemble of models.

Architecture Diagram

System Design: Search Engine

[Image: img.png]

System Design: Ensemble Model

[Image: img.png]

Retrieval Models

  • Filter Models
    • BM25
    • TF-IDF
  • Voter Models
    • MPNet
    • RoBERTa
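
The filter/voter split above can be illustrated with a minimal two-stage sketch. This is not the repository's implementation: the function names (`bm25_scores`, `filter_then_vote`, `overlap_voter`), the BM25 parameters, and the score-averaging rule are all assumptions made for illustration. The toy `overlap_voter` stands in for an embedding model such as MPNet or RoBERTa, which would normally score a query/document pair by embedding similarity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_then_vote(query, docs, voters, keep=3):
    """Stage 1: keep the top `keep` docs by BM25. Stage 2: re-rank by voters."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    candidates = order[:keep]

    def vote(i):
        # Assumed combination rule: average the voters' scores.
        return sum(v(query, docs[i]) for v in voters) / len(voters)

    return sorted(candidates, key=vote, reverse=True)


def overlap_voter(query, doc):
    """Toy stand-in for an embedding model: Jaccard overlap of token sets."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d)


docs = [
    "bm25 is a ranking function used by search engines",
    "cats sleep most of the day",
    "search engines rank documents for a query",
    "an ensemble combines several ranking models",
]
ranking = filter_then_vote(
    "ranking documents in search engines", docs, [overlap_voter], keep=3
)
print(ranking)  # indices of the surviving documents, best first
```

The cheap lexical filter narrows the candidate set so that the more expensive voters only have to score a handful of documents per query.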

Project Plan

  • Phase 1
    • Data Analysis & Pipeline
    • Model Pipeline
    • Evaluation Pipeline
  • Phase 2
    • BM25 Model + MPNet Model
    • Hyperparameter tuning
    • Ensemble Pipeline
  • Phase 3
    • RoBERTa Model
    • Ensemble enhancement
    • Experimentation

Future work

  • Fine-tune ColBERT
  • Implement clustering of documents

How to run the project

Note: The project was tested on Linux and macOS. Windows has dependency issues; see Troubleshooting.

  1. Clone repository

    $ git clone https://github.com/TF4ces/TF4ces-search-engine.git
  2. Set up the environment

    $ python3 -m venv venv
    $ source venv/bin/activate                [LINUX/MAC]
    $ .\venv\Scripts\activate                 [WINDOWS]
    $ pip install -r src/requirements.txt 
  3. Download the pre-loaded embeddings from GDrive to this path: ./dataset/embeddings_test

    Note: To generate the embeddings from scratch, run the ./tests/test_evaluate_model.py script twice: once with MODEL set to all-mpnet-base-v2 and once with MODEL set to all-roberta-large-v1.

    WARNING: Use a GPU machine; generation is expected to take 1 hour.

  4. Run TF4ces Search Engine [install Jupyter with $ pip install jupyter notebook, then start it with $ jupyter notebook]

    1. Run the evaluation pipeline from the ./tests/notebooks/TF4ces_Search_Eval.ipynb notebook.
    2. Run the prediction demo pipeline from the ./tests/notebooks/TF4ces_Search_Demo.ipynb notebook.

Troubleshooting

  1. Windows systems have been seen to hit an issue while reading data with ir-datasets==0.4.1.

    On Windows, doc.iter might throw a decoding error while reading the TSV file. You would need to change the encoding in the dependency's source files, as described in this issue.

    Issue: allenai/ir_datasets#208 (comment)
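
For context, the sketch below reproduces this kind of decoding error in isolation. It assumes the failure is the usual mismatch between UTF-8 data and a Windows default codec such as cp1252; the sample row and the helper `read_tsv_bytes` are hypothetical, and the exact line to change inside ir_datasets is described in the linked issue, not here.

```python
# A UTF-8 encoded TSV row containing non-ASCII characters (hypothetical data).
raw = "doc1\tŁódź is a city in Poland\n".encode("utf-8")


def read_tsv_bytes(data, encoding):
    """Decode TSV bytes and split each line into its tab-separated fields."""
    text = data.decode(encoding)
    return [line.split("\t") for line in text.splitlines()]


# Decoding with cp1252 (a common Windows default) fails on this row,
# because the UTF-8 bytes include a value that cp1252 does not define.
try:
    read_tsv_bytes(raw, "cp1252")
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the encoding the data was written in works as expected.
rows = read_tsv_bytes(raw, "utf-8")
print(failed, rows)
```

When reading from a file rather than from bytes, the equivalent is to pass the encoding explicitly, as in `open(path, encoding="utf-8")`, instead of relying on the platform default.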
