
Tool to extract the text from web article URLs and get word frequencies, entity recognition, automatic summaries, and more


gdamdam/sumo


Sumo 0.1

Sumo is a tool for the semantic analysis of web articles. It extracts the content from an article web page, analyzes it, and returns word frequencies, recognized entities, and an automatic summary. It also returns related articles that were previously analyzed, using the term vector distance.
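To make "term vector distance" concrete, here is a minimal sketch of one common choice, cosine distance over word-frequency vectors. This README does not say which distance or tokenization Sumo uses, so the tokenizer and the `term_vector` and `cosine_distance` helpers are assumptions for illustration only. The sketch is written for Python 3 as a standalone example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map each lowercase word in the text to its frequency (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty document is treated as unrelated
    return 1.0 - dot / (norm_a * norm_b)
```

With a measure like this, documents that share many terms have a distance near 0 and documents with no terms in common have a distance of 1.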

Main requirements

MongoDB >= 2.6.5
Python >= 2.7.5

For Debian and Ubuntu:

apt-get install mongodb python python-dev python-virtualenv libxml2-dev libxslt-dev zlib1g-dev libjpeg-dev gcc

Using Docker

We provide a Dockerfile to run a dockerized Sumo server.

docker build -t sumoserver .
docker run -p 5000:5000 sumoserver

Basic Installation

git clone https://github.com/gdamdam/sumo.git
cd sumo
virtualenv ./venv
source venv/bin/activate
pip install -r requirements.txt
python requirements_nltk.py

Start

Just launch the server:

sudo service mongodb start
python ./sumo_server.py -s IP

For help and the full list of options, run:

python ./sumo_server.py --help

The server provides a REST resource to analyze a web document and store the analysis data.

API Usage

The following command returns the list of all the stored documents:

curl http://host:5000/sumo

The stored documents are labeled with an ID_DOC, in which each / character of the URL is replaced with __ (double underscore).

e.g.:

 TARGET_URL: www.google.com/test
     ID_DOC: www.google.com__test
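The mapping above can be sketched as a small helper for building an ID_DOC on the client side. The function name `url_to_doc_id` is hypothetical and not part of Sumo. It only applies the slash substitution described here; whether the server normalizes the URL in other ways (for example, stripping a scheme) is not stated in this README.

```python
def url_to_doc_id(target_url):
    """Replace every '/' in the target URL with a double underscore."""
    return target_url.replace("/", "__")


print(url_to_doc_id("www.google.com/test"))  # prints www.google.com__test
```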

To analyze a document and store it in the database:

curl http://host:5000/sumo -X POST -d 'url=TARGET_URL'

HTTP Status returned:

	201:	Created		- the document at TARGET_URL was successfully analyzed and stored
	409:	Conflict	- the TARGET_URL already exists in the storage
	415:	Unsupported	- the TARGET_URL is malformed
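The same POST request can be issued from Python using only the standard library. This is a sketch under assumptions: the helper names `build_analyze_request` and `analyze` and the `STATUS_MEANING` table are illustrative, and `base_url` stands for wherever your server is listening (for example `http://localhost:5000`). It is written for Python 3.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Status codes documented above for POST /sumo.
STATUS_MEANING = {
    201: "created",
    409: "conflict: document already stored",
    415: "unsupported: malformed TARGET_URL",
}


def build_analyze_request(base_url, target_url):
    """Build the form-encoded POST that mirrors the curl command above."""
    body = urlencode({"url": target_url}).encode("ascii")
    return Request(base_url + "/sumo", data=body, method="POST")


def analyze(base_url, target_url):
    """Send the request and return (status code, short description)."""
    try:
        with urlopen(build_analyze_request(base_url, target_url)) as resp:
            status = resp.status
    except HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still available.
        status = err.code
    return status, STATUS_MEANING.get(status, "unexpected status")
```

Calling `analyze("http://localhost:5000", "www.google.com/test")` against a running server returns the status code together with its description.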

To retrieve a stored document analysis:

curl http://host:5000/sumo/ID_DOC

HTTP Status returned:

	200:	OK			
	404:	Not Found 	- the document does not exist

To delete a stored document:

curl http://host:5000/sumo/ID_DOC -X DELETE

HTTP Status returned:

	204:	No Content	- document deleted 
	404:	Not Found 	- the document does not exist

You can retrieve the cluster of similar documents using the cluster resource:

curl http://host:5000/sumo/cluster/ID_DOC

HTTP Status returned:

	200:	OK
	404:	Not Found 	- the document does not exist
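The retrieve, delete, and cluster calls differ only in path and HTTP method, so a client can build them from one pair of helpers. The sketch below is an illustration, not part of Sumo: `document_request` and `cluster_request` are hypothetical names, and `base_url` is whatever host and port your server uses. It is written for Python 3.

```python
from urllib.request import Request


def document_request(base_url, doc_id, method="GET"):
    """Request for /sumo/ID_DOC; use method='DELETE' to remove the document."""
    return Request("{}/sumo/{}".format(base_url, doc_id), method=method)


def cluster_request(base_url, doc_id):
    """Request for /sumo/cluster/ID_DOC, the cluster of similar documents."""
    return Request("{}/sumo/cluster/{}".format(base_url, doc_id), method="GET")
```

Pass the resulting object to `urllib.request.urlopen` to send it. A 404 response is raised as `urllib.error.HTTPError`, whose `code` attribute holds the status.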

Web Interface

The running server also provides a minimal JavaScript web interface for interacting with the API. The interface is reachable at:

http://host:5000

Tips:

  • Single-click an ID_DOC in the index to fill the form, then click analyze to retrieve the analysis.
  • Double-click an ID_DOC in the index to delete it.
