HFTS

This is a small library which consumes NIF datasets and produces meta information about them. It is designed to be integrated into other tools like GERBIL and to be easily extensible with further meta information.

Background

Entity linking is the task of connecting entities in a natural language text with their formal counterparts in a knowledge base such as DBpedia. Benchmarking and evaluation of such annotation systems is done with well-known benchmarking datasets, e.g. KORE50, MSNBC, WES2015. These datasets provide a rich set of texts with annotations, but they lack a rich description of themselves. Since a dataset may be unbalanced or inappropriate for some tools (e.g. a person-centric dataset used with a geo-information annotator), this meta information can guide researchers in selecting the appropriate dataset for their tool. In addition, the meta information can be used to assemble new datasets out of the existing ones with special characteristics, such as person-only, low-popularity, or hard-to-disambiguate organisations, cf. remix.

NIF

NIF stands for Natural Language Processing Interchange Format and is a de facto standard for exchanging datasets between NLP tools.
The NIF core ontology already provides good definitions for documents (text with annotations and some further information), but it lacks a class for meta information about collections of documents as well as properties to store meta information in documents.

We provide a small new ontology which can be found here. The only new base class is hfts:Dataset, whose instances are represented as hfts:{Dataset Name}. The properties describing the meta information are attached to this object and are explained further alongside the metrics.

Core API

The core library consumes text files or strings in Turtle-NIF format and enriches these datasets with meta information. A command-line tool is also provided.

Metrics

To gain deeper insight into the datasets used for evaluating and benchmarking entity linking tools, researchers have proposed multiple metrics. A quick excerpt:

  • Density
  • PageRank and HIT Score
  • Ambiguity

The metrics provided by this library, as well as an explanation of each metric, can be found here.

Installation

Clone the repository and then

cd hfts
mvn clean install -DskipTests

The library is installed into your local .m2 repository. Now add the following to your pom.xml:

<dependency>
    <groupId>org.santifa</groupId>
    <artifactId>hfts-core</artifactId>
    <version>1.0</version>
</dependency>

To get all measures working you will also need the dictionary data for ambiguity, diversity and popularity. This package contains the source data, the scripts that produce the format required by the library, and production-ready dictionaries.

Get the dictionaries from here. In order to run the tests, place the dictionary data in the hfts/data directory; otherwise, provide the paths programmatically.

Usage

This library provides a fluent interface for programming.

/* Obtain a new api object */
HftsApi api = new HftsApi()
    .withMetric(new Density(), new NotAnnotated());

/* load datasets */
for (Path p : nifFiles) {
    api.withDataset(p, p.getFileName().toString());
}

/* run metrics against the datasets and print */
List<NifDataset> datasets = api.run();
for (NifDataset ds : datasets) {
    System.out.println(ds.write());
}
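
To persist the enriched datasets instead of printing them, a minimal sketch using plain Java NIO (assuming, as suggested by the example above, that write() returns the enriched Turtle-NIF as a String; the output file names are placeholders):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/* Persist each enriched dataset; this sketch simply numbers the output files. */
int i = 0;
for (NifDataset ds : datasets) {
    Files.write(Paths.get("enriched-" + (i++) + ".ttl"),
                ds.write().getBytes(StandardCharsets.UTF_8));
}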

Since the ambiguity and diversity metrics are memory intensive, use them with care:

Dictionary<Integer> connectorEntity = AmbiguityDictionary.getDefaultEntityConnector();
Dictionary<Integer> connectorSf = AmbiguityDictionary.getDefaultSFConnector();
HftsApi api = new HftsApi()
    .withMetric(new Ambiguity(connectorEntity, connectorSf), 
                new Diversity(connectorEntity, connectorSf));

owl:sameAs retrieval via http://sameas.org/ is also provided:

api.withSameAsRetrival();

Extension

The library can be extended with arbitrary metrics. See the metrics document for further information; a rough sketch is shown below.
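
As a purely illustrative sketch of what a custom metric might look like: the Metric interface, its calculate method, and the NifDataset/Document accessors used below are assumptions made for this example and may not match the library's actual API, which is described in the metrics document.

/* Hypothetical custom metric; the interface name, method signature and
 * accessors are assumptions, not the library's documented API. */
public class AverageDocumentLength implements Metric {

    @Override
    public NifDataset calculate(NifDataset dataset) {
        /* Average the plain-text length over all documents in the dataset
           (getDocuments() and getText() are assumed accessors). */
        double total = 0;
        int count = 0;
        for (Document d : dataset.getDocuments()) {
            total += d.getText().length();
            count++;
        }
        double average = count == 0 ? 0 : total / count;

        /* Attach the result as meta information on the dataset
           (addMetaInformation is likewise hypothetical). */
        dataset.addMetaInformation("averageDocumentLength", String.valueOf(average));
        return dataset;
    }
}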

Remarks

Some drawbacks and remarks:

  • Only plain NIF datasets can be parsed; the enhanced ones can NOT be parsed back in.
  • The diversity and ambiguity metrics use an external dictionary which takes a long time to load and a lot of memory, so be careful.

CLI

We also provide a basic command-line interface which takes a list of NIF documents and writes the enriched results to files with the same names in the cli directory.

cd cli
./hfts <arguments> <datasets>

Arguments:

  • -v: Enable verbose mode
  • --macro: Only calculate macro metrics
  • --micro: Only calculate micro metrics
  • --sameAs: Do owl:sameAs retrieval for the entity URIs
  • -m: Provide a comma-separated list of metrics. Available are notannotated, density, hits, pagerank, type, diversity, ambiguity
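
For example, an illustrative invocation that computes only the micro notannotated and density metrics with owl:sameAs retrieval for two datasets (the dataset file names are placeholders):

./hfts --micro --sameAs -m notannotated,density dataset1.ttl dataset2.ttl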

Contributions

Feel free to file a bug report, propose a new metric, or open a pull request.

TODO

  • Extensible reading and writing of NIF + an easier data structure
