A web service that exposes semantic similarity search via a web GUI and a RESTful API.

fredriko/metacurate-lexicon

Metacurate Lexicon

tl;dr

The metacurate lexicon and its accompanying API are the result of an investigation into the feasibility of deploying a web service that uses a reasonably large set of word embeddings to the platform-as-a-service Heroku.

Verbose

The metacurate lexicon is available at https://metacurate-lexicon.herokuapp.com/ (it runs on a free dyno, so it takes 30 seconds or so to spin up). It is a Python/Flask web application that exposes two interfaces, a web GUI and a RESTful API, for looking up semantically similar (multi-word) terms in a lexicon, and for pre-processing raw text into sentences and term tokens. The word embeddings in the lexicon are generated by the gensim word2vec implementation, and the recognition of multi-word terms is based on gensim Phrasers.
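To make the idea of a similarity lookup concrete, here is a minimal sketch in plain Python. It is not the code used by this service, which relies on gensim; it only illustrates ranking terms by the cosine similarity of their vectors. The vectors, the terms, and the underscore-joined spelling of multi-word terms are all made-up assumptions for illustration.

```python
import math

# Toy 3-dimensional vectors; a real lexicon would hold word2vec vectors
# with far more dimensions. All terms and numbers here are invented.
TOY_VECTORS = {
    "word_embedding": [0.9, 0.1, 0.0],
    "word_vector": [0.8, 0.2, 0.1],
    "neural_network": [0.5, 0.5, 0.2],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(term, vectors, topn=2):
    """Return the topn (term, score) pairs closest to `term`, best first."""
    query = vectors[term]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != term  # skip the query term itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


if __name__ == "__main__":
    for neighbour, score in most_similar("word_embedding", TOY_VECTORS):
        print(f"{neighbour}\t{score:.3f}")
```

With these toy vectors, the terms pointing in roughly the same direction as the query rank highest, which is the behaviour the lookup interfaces expose over the real embeddings.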

Here's a screenshot of looking up the term "word embedding" in the lexicon:

first page of metacurate lexicon

Here's a screenshot of the automatically generated API documentation:

the api documentation

Why?

Upcoming features at metacurate.io require access to a lexicon of semantically similar multi-word terms. Since metacurate.io is hosted on Heroku, I wanted to find out whether the required semantic lexicon functionality could be deployed to Heroku too, without violating its application size constraints.

The answer is yes.

How to run the web service locally

To install and run this program you need git, Python, and pip.

I also suggest you use, e.g., virtualenv to create a virtual environment in which you install the requirements of this program.

Once the above requirements are in place, at a command line prompt, do the following:

$ git clone https://github.com/fredriko/metacurate-lexicon.git
$ cd metacurate-lexicon

to clone this repository to your local machine, and

$ virtualenv ~/venv/mcl
$ source ~/venv/mcl/bin/activate

to set up and activate a virtual environment called mcl. To install the Python dependencies of metacurate-lexicon, type:

$ pip install -r requirements.txt

You're done installing the metacurate-lexicon. Let's run the server. Still at the top level of the directory to which you cloned the repository, type:

$ python -m src.run

After a little while, a message similar to the following should be printed to the screen:

 * Serving Flask app "src.app" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:4990/ (Press CTRL+C to quit)

The server is up and running! Point your browser to the address in the message and play around with it.
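If you would rather check from a script that the server has come up, something along these lines may help. It is a sketch using only the Python standard library; the function names, the retry count, and the delay are assumptions made for illustration and are not part of this repository.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def wait_until_up(url, attempts=10, delay=3.0, fetch=fetch_status):
    """Poll `url` until it answers with HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; try again after a pause
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_up("http://127.0.0.1:4990/")` once the startup message has appeared should return `True`. The `fetch` parameter exists only so that the polling logic can be exercised without a running server.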

Todo

  • Instructions for how to deploy the service to Heroku.
  • Write-up regarding data collection, cleaning, collocation extraction, and the training of word2vec/fastText models.