Madhav Sharan edited this page Mar 20, 2016 · 1 revision

Memex-GeoParser

About GeoParser:

GeoParser is an open source tool that can process information from any file, extract geographic coordinates, and visualize locations on a map. Once the information is parsed and the points are plotted on the map, users can filter their results by density, or by searching for a keyword and applying a "facet" to the parsed information. Clicking a location point on the map reveals more information about the location and how it relates to the search.

Motivation behind creating GeoParser:

While the GeoParser can be used to parse any type of data/information, the focus for the DARPA Memex project was on analyzing weapon ads on the dark web. With the GeoParser, users can extract geographical locations from weapons data crawled from the dark web. This can be particularly helpful for members of law enforcement who are trying to track down possible trade locations, manufacturing areas, or areas of high density for arms trafficking.

With the GeoParser, we intend to analyze data crawled from weapons websites active in the United States. The extracted locations are plotted on a world map using density clusters, which show which continent, country, and state are referenced most in the crawled data.
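As a rough illustration of what a density cluster summarizes, the sketch below counts how often each state or country appears in a set of extracted locations. The record fields (`state`, `country`, `continent`) and the sample data are assumptions made for this example, not GeoParser's actual schema.

```python
from collections import Counter


def reference_density(locations, level):
    """Count how often each value of `level` (e.g. "country") is referenced."""
    return Counter(loc[level] for loc in locations if loc.get(level))


# Hypothetical records; the real GeoParser schema may differ.
locations = [
    {"name": "Los Angeles", "state": "California",
     "country": "United States", "continent": "North America"},
    {"name": "San Diego", "state": "California",
     "country": "United States", "continent": "North America"},
    {"name": "Austin", "state": "Texas",
     "country": "United States", "continent": "North America"},
    {"name": "Toronto", "state": "Ontario",
     "country": "Canada", "continent": "North America"},
]

by_state = reference_density(locations, "state")
print(by_state.most_common(1))  # [('California', 2)]
```

The same function can be called with `"country"` or `"continent"` to get the coarser levels of the hierarchy.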

Introduction to functionalities

Upload files - Users can upload any file. GeoParser takes it through all the stages of geotagging and then draws markers on the map based on the locations retrieved from the file.

Queue a Solr index - Users can register a Solr core with GeoParser. This starts a background process that queries the Solr core in batches and stores the locations retrieved from it.

Introduction to technologies used

Tika - The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Tika GeoTopicParser - The GeoTopicParser combines a gazetteer (a lookup dictionary that maps place names to latitudes and longitudes) with a Named Entity Recognition (NER) model that identifies names and places in text. Together they provide a way to geotag documents and text, i.e., to identify places in the text and then look up the latitude/longitude pairs for those places.
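The two-step idea (recognize candidate place names, then look them up) can be sketched in a few lines. This is a toy illustration only: the regular expression stands in for a real NER model such as OpenNLP, the three-entry dictionary stands in for a Geonames-backed gazetteer, and the coordinates are approximate. None of the names below come from Tika.

```python
import re

# Toy gazetteer: place name -> (latitude, longitude). Values are approximate.
GAZETTEER = {
    "Pasadena": (34.1478, -118.1445),
    "Los Angeles": (34.0522, -118.2437),
    "Texas": (31.0, -100.0),
}


def find_place_candidates(text):
    """Stand-in for NER: return runs of capitalized words."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


def geotag(text):
    """Keep only the candidates the gazetteer knows, with their coordinates."""
    results = []
    for candidate in find_place_candidates(text):
        if candidate in GAZETTEER:
            lat, lon = GAZETTEER[candidate]
            results.append({"name": candidate, "lat": lat, "lon": lon})
    return results


tagged = geotag("The seller ships from Pasadena to buyers in Los Angeles.")
print([place["name"] for place in tagged])  # ['Pasadena', 'Los Angeles']
```

Note that the candidate step also picks up "The", which the gazetteer lookup then discards. A real NER model avoids most such false candidates before the lookup happens.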

lucene-geo-gazetteer - A command line and REST gazetteer built around the Geonames.org dataset that uses the Apache Lucene library to create a searchable gazetteer.

High Level Architecture of GeoParser

File Input

  1. Allow the user to upload any file
  2. Extract text from the file using tika-python and save it in Apache Solr
  3. Run NER on the text to find locations, using Apache OpenNLP integrated in Apache Tika
  4. Query the Lucene Geo Gazetteer integrated within Apache Tika to get coordinates for each location
  5. Save the location details in Apache Solr
  6. Project the locations on a world map using OpenLayers 3 and the GeoParser REST API
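For the final mapping step, one common way to hand points to a web map is GeoJSON, which OpenLayers can read. The sketch below converts location records into a GeoJSON FeatureCollection. Whether the GeoParser REST API actually returns GeoJSON is an assumption made for illustration, as are the `name`, `lat`, and `lon` field names.

```python
import json


def to_geojson(locations):
    """Build a GeoJSON FeatureCollection of points from location records."""
    features = []
    for loc in locations:
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point",
                         "coordinates": [loc["lon"], loc["lat"]]},
            "properties": {"name": loc["name"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical record with approximate coordinates.
collection = to_geojson([{"name": "Pasadena", "lat": 34.1478, "lon": -118.1445}])
print(json.dumps(collection, indent=2))
```

The longitude-first ordering is easy to get wrong when the stored records are latitude-first, which is why the conversion is kept in one place.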

Crawled data

  1. Allow the user to queue a Solr index to be geotagged
  2. Build a simplified text bundle from the Solr records, one after another
  3. Extract locations the same way as for an uploaded file, using Apache OpenNLP and the Lucene Geo Gazetteer integrated within Apache Tika
  4. Save progress in Apache Solr
  5. Continue until the whole index is tagged
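The batch loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not GeoParser's implementation: `fetch_batch` is a hypothetical callable whose arguments mirror Solr's `start` and `rows` query parameters, `extract_locations` stands in for the Tika-based extraction, and a plain dictionary stands in for the progress that GeoParser saves in Solr.

```python
def tag_index(fetch_batch, extract_locations, progress, batch_size=100):
    """Geotag a whole index in batches, resuming from progress["start"]."""
    start = progress.get("start", 0)
    while True:
        docs = fetch_batch(start, batch_size)
        if not docs:
            break  # no more records: the whole index has been tagged
        for doc in docs:
            # Simplified text bundle: all field values joined into one string.
            text = " ".join(str(value) for value in doc.values())
            progress.setdefault("locations", {})[doc["id"]] = extract_locations(text)
        start += len(docs)
        progress["start"] = start  # save progress after every batch
    return progress


# In-memory stand-ins so the sketch runs without a Solr server.
index = [{"id": str(i), "title": "Ad %d" % i, "body": "Ships from Texas"}
         for i in range(5)]


def fetch_batch(start, rows):
    return index[start:start + rows]


def extract_locations(text):
    return ["Texas"] if "Texas" in text else []


progress = tag_index(fetch_batch, extract_locations, {}, batch_size=2)
print(progress["start"])  # 5
```

Because the offset is stored after each batch, an interrupted run can be restarted by passing the saved `progress` back in, and it picks up at the first untagged record.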

References -

GeoParser: https://github.com/MBoustani/GeoParser
Memex: http://memex.jpl.nasa.gov/
Apache Tika: https://tika.apache.org/
Lucene Geo Gazetteer: https://github.com/chrismattmann/lucene-geo-gazetteer
Apache OpenNLP: https://opennlp.apache.org/
Apache Solr: http://lucene.apache.org/solr/