This repository provides a retrieval system for searching COVID-19 papers, built on the CORD-19 dataset.

wangcongcong123/covidsearch

Covid19Search

Contributions welcome License: MIT

This repository contains source code for searching COVID-19 papers in the COVID-19 Open Research Dataset (CORD-19). It also provides a solution to the tasks in the COVID-19 Open Research Dataset Challenge (CORD-19) on Kaggle. Update: 2020-04-14.

Features

  • Supports multiple bag-of-words models (count, tf-idf, bm25).
  • Supports semantic search models such as fasttext and glove.
  • Allows the two types of models above to be combined.
  • Provides a live web application in which end users can customize the models.
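One common way to combine a bag-of-words ranker with a semantic one is to rescale both score sets to a shared range and take a weighted sum. The sketch below illustrates that idea only; it is not taken from this repository, and the names `min_max_normalise`, `combine_scores`, `top_k`, the `alpha` parameter, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine_scores(lexical: Dict[str, float],
                   semantic: Dict[str, float],
                   alpha: float = 0.5) -> Dict[str, float]:
    """Weighted sum of normalised scores; alpha is the weight of the lexical model."""
    lex = min_max_normalise(lexical)
    sem = min_max_normalise(semantic)
    doc_ids = set(lex) | set(sem)
    # A document missing from one model contributes 0.0 for that model.
    return {d: alpha * lex.get(d, 0.0) + (1 - alpha) * sem.get(d, 0.0)
            for d in doc_ids}


def top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy scores (made up for illustration): raw BM25 scores and cosine similarities.
bm25_scores = {"paper_a": 12.0, "paper_b": 4.0, "paper_c": 8.0}
embedding_scores = {"paper_a": 0.2, "paper_b": 0.9, "paper_c": 0.6}

ranked = top_k(combine_scores(bm25_scores, embedding_scores, alpha=0.7), k=2)
print(ranked)
```

Normalising first matters because BM25 scores and cosine similarities live on different scales; without it, the larger-valued model would dominate the sum regardless of `alpha`.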

Quick Start

git clone https://github.com/wangcongcong123/covidsearch.git
cd covidsearch
pip install -e .

import time

from cord import *

# make sure to put the paper collections (four .tar.gz files) and the metadata csv file under dataset_folder
dataset_folder = "dataset/"
# load metadata and full texts of papers
metadata = load_metadata_papers(dataset_folder, "metadata.csv")
full_papers = load_full_papers(dataset_folder)
# full_input_instances include title, abstract, body text
full_input_instances = [(id_, metadata[id_]["title"], metadata[id_]["abstract"], body) for id_, body in
                        full_papers.items() if id_ in metadata]
tfidf_model = FullTextModel(full_input_instances, weights=[3, 2, 1], vectorizer_type="tfidf")
query = "covid-19 transmission characteristics"
top_k = 10
start = time.time()
results = tfidf_model.query(query, top_k=top_k)
print("Query time: ", time.time() - start)
# around 0.3 s on re-runs (the first run takes longer because of object serialisation)
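The `weights=[3, 2, 1]` argument above appears to weight the title, abstract, and body fields in that order. How `FullTextModel` applies the weights internally is not described here, so the following is only a toy sketch of one possible scheme, in which each token's count is multiplied by the weight of the field it occurs in before a smoothed idf is applied. The helpers `tokenize`, `weighted_counts`, and `rank`, and the three toy papers, are hypothetical and not part of the `cord` package.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenisation; real systems do much more."""
    return text.lower().split()


def weighted_counts(fields: Sequence[str], weights: Sequence[float]) -> Counter:
    """Count tokens across fields, scaling each occurrence by its field weight."""
    counts: Counter = Counter()
    for text, weight in zip(fields, weights):
        for token in tokenize(text):
            counts[token] += weight
    return counts


def rank(docs: Dict[str, Tuple[str, str, str]],
         query: str,
         weights: Sequence[float] = (3, 2, 1)) -> List[Tuple[str, float]]:
    """Score each document as the sum of weighted tf * smoothed idf over query terms."""
    doc_counts = {doc_id: weighted_counts(fields, weights)
                  for doc_id, fields in docs.items()}
    n_docs = len(doc_counts)

    def idf(term: str) -> float:
        df = sum(1 for counts in doc_counts.values() if term in counts)
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    query_terms = tokenize(query)
    scores = {doc_id: sum(counts[t] * idf(t) for t in query_terms)
              for doc_id, counts in doc_counts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy papers as (title, abstract, body) tuples, made up for illustration.
toy_docs = {
    "p1": ("covid-19 transmission dynamics",
           "we model how the virus spreads",
           "methods and results"),
    "p2": ("protein structure prediction",
           "a deep learning approach",
           "covid-19 transmission is mentioned once"),
    "p3": ("influenza vaccines",
           "seasonal vaccine efficacy",
           "clinical trial data"),
}

ranking = rank(toy_docs, "covid-19 transmission")
print(ranking)
```

Under this scheme a query term found in a title counts three times as much as the same term found in the body, so `p1` outranks `p2` even though both contain the full query.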

Try running python examples/insight_extract.py, which loads and presents a pre-trained insights file. If you do not want to use the pre-trained insights, you can generate them from scratch with python examples/insight_from_scratch.py (have a look at this file to customize the pre-training process).

Start as a web server

The web server demonstrates the pre-trained insights as an example. To customise it (for example, to add query search), edit app.py and templates/layout.html. Make sure you first download metadata.csv from the CORD-19 dataset and put it under ./dataset, then run:

python app.py

Then open http://127.0.0.1:5000 in a browser to use the web application.

Server as a service

  • The server also accepts cross-origin requests.
  • Send a GET or POST request to obtain insights by task name.
  • A GET request example: http://127.0.0.1:5000/kaggle_task?task_name=task1.
  • A POST request example: curl -i -X POST -H "Content-Type: application/json" -d "{\"task_name\":\"task1\"}" http://127.0.0.1:5000/kaggle_task.
  • Adapt these to Ajax GET/POST requests if you want to embed the service in your own front-end HTML pages.
  • Try the live one: https://www.thinkingso.cf/kaggle_task?task_name=task1
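The same GET and POST requests can be issued from Python with the standard library alone. This is a minimal sketch that assumes the server from python app.py is running locally on port 5000; the helper names `build_get_request`, `build_post_request`, and `fetch` are hypothetical, and because the response format is not documented here, the body is returned as raw text rather than parsed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:5000"  # assumes the local server started by app.py


def build_get_request(task_name: str, base_url: str = BASE_URL) -> Request:
    """GET /kaggle_task?task_name=<task_name>"""
    query = urlencode({"task_name": task_name})
    return Request(f"{base_url}/kaggle_task?{query}", method="GET")


def build_post_request(task_name: str, base_url: str = BASE_URL) -> Request:
    """POST /kaggle_task with a JSON body, mirroring the curl example."""
    payload = json.dumps({"task_name": task_name}).encode("utf-8")
    return Request(
        f"{base_url}/kaggle_task",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(request: Request, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage, once the server is running:
#   print(fetch(build_get_request("task1")))
#   print(fetch(build_post_request("task1")))
```

Building the request separately from sending it keeps the URL and payload easy to inspect before any network call is made.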

Contributions

Feedback and pull requests are welcome to help improve the project.
