This is a small library which uses NIF datasets to produce meta-information. It is designed to be integrated into other tools like GERBIL and to be easily extended with further meta-information.
Entity linking is the task of connecting entities in a natural language text with their formal counterparts from a knowledge base like DBpedia. Benchmarking and evaluation of such annotation systems is done with well-known benchmarking datasets, e.g. KORE50, MSNBC, WES2015. These datasets provide a rich set of texts with annotations, but they lack a rich description of themselves. Since these datasets can be unbalanced or inappropriate for some tools (e.g. a person-centric dataset used with a geo-information annotator), this meta-information can guide researchers in selecting the appropriate dataset for their tool. In addition, the meta-information can be used to assemble new datasets out of all existing ones with special features, such as person-only, low-popularity, or hard-to-disambiguate organisations, cf. remix.
NIF stands for NLP Interchange Format and is a de-facto standard for exchanging datasets between NLP tools.
The NIF core ontology already provides good definitions for documents (text with annotations and some further information),
but it is missing a class for meta-information about collections of documents as well as
properties to store meta-information in documents.
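For illustration, a minimal NIF document in Turtle might look like this (the document URI, text, and offsets are example values, not taken from a real benchmark dataset):

```turtle
@prefix nif:    <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# The context holding the full document text.
<http://example.org/doc1#char=0,21>
    a nif:Context, nif:String ;
    nif:isString "Berlin is in Germany." ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex   "21"^^xsd:nonNegativeInteger .

# One annotated entity mention, linked to its DBpedia entity.
<http://example.org/doc1#char=0,6>
    a nif:String ;
    nif:referenceContext <http://example.org/doc1#char=0,21> ;
    nif:anchorOf "Berlin" ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex   "6"^^xsd:nonNegativeInteger ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin> .
```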
We provide a new small ontology which can be found here.
The only new base class is hfts:Dataset,
and a concrete dataset is represented as hfts:{Dataset Name}.
The properties describing the meta-information are attached to this object and are
explained further alongside the metrics.
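As a sketch of how an enriched dataset description might look, assuming a hypothetical dataset named KORE50 (the namespace URI and property name below are placeholders for illustration, not the ontology's actual vocabulary; consult the ontology for the real terms):

```turtle
@prefix hfts: <http://example.org/hfts#> .   # placeholder namespace
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

hfts:KORE50 a hfts:Dataset ;
    hfts:density "0.5"^^xsd:double .         # hypothetical metric property
```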
The core library consumes text files or strings in Turtle NIF format and enriches these datasets. A command-line tool is also provided.
To get a deeper insight into the datasets used for evaluating and benchmarking entity linking tools, researchers have proposed multiple metrics. A quick excerpt:
- Density
- PageRank and HIT Score
- Ambiguity
The metrics provided by this library, together with an explanation of each metric, can be found here
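As a rough, library-independent illustration of one such metric: density relates the number of entity annotations to the length of the text. The following is our own toy sketch of that idea, not code from this library:

```java
// Toy sketch of the density idea: entity annotations per token.
// Illustration only; the library's actual definition is in the metrics document.
public class DensityExample {
    public static void main(String[] args) {
        String text = "Barack Obama visited Berlin";
        int tokens = text.split("\\s+").length;        // 4 tokens
        int annotations = 2;                           // "Barack Obama", "Berlin"
        double density = (double) annotations / tokens;
        System.out.println(density);                   // prints 0.5
    }
}
```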
Clone the repository and then
cd hfts
mvn clean install -DskipTests
The library is installed into your local m2
folder. Now add to your pom.xml:
<dependency>
  <groupId>org.santifa</groupId>
  <artifactId>hfts-core</artifactId>
  <version>1.0</version>
</dependency>
To get all measures working, you'll also need the dictionary data for the ambiguity, diversity, and popularity metrics. This package contains the source data, the scripts to produce the format required by the library, and production-ready dictionaries.
Get the dictionaries from here.
In order to run the tests, place the dictionary data
in the hfts/data
directory; otherwise, provide the paths programmatically.
This library provides a fluent interface for programming.
/* Obtain a new api object */
HftsApi api = new HftsApi()
.withMetric(new Density(), new NotAnnotated());
/* load datasets */
for (Path p : nifFiles) {
    api.withDataset(p, p.getFileName().toString()); // use the file name as the dataset name
}
/* run metrics against the datasets and print */
List<NifDataset> datasets = api.run();
for (NifDataset ds : datasets) {
System.out.println(ds.write());
}
Since the ambiguity and diversity metrics are memory-intensive, use them with care:
Dictionary<Integer> connectorEntity = AmbiguityDictionary.getDefaultEntityConnector();
Dictionary<Integer> connectorSf = AmbiguityDictionary.getDefaultSFConnector();
HftsApi api = new HftsApi()
.withMetric(new Ambiguity(connectorEntity, connectorSf),
new Diversity(connectorEntity, connectorSf));
owl:sameAs retrieval via http://sameas.org/ is also provided:
api.withSameAsRetrival();
The library can be extended with every possible metric. See the metrics document for further information.
Some drawbacks and remarks:
- Only plain NIF datasets can be parsed, not the enhanced ones.
- The diversity and ambiguity metrics use an external dictionary, which takes a long time to load and a lot of memory, so be careful.
We also provide a basic command-line interface which
takes a list of NIF documents and, in return, writes each enriched dataset to a
file with the same name in the cli
directory.
cd cli
./hfts <arguments> <datasets>
Arguments:
- -v : Enable verbose mode
- --macro : Only calculate macro metrics
- --micro : Only calculate micro metrics
- --sameAs : Do owl:sameAs retrieval for the entity URIs
- -m : Provide a comma-separated list of metrics. Available are notannotated, density, hits, pagerank, type, diversity, ambiguity
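For example, a hypothetical invocation (the dataset file names are placeholders) that computes only the micro variants of the density and not-annotated metrics could look like:

```shell
cd cli
./hfts --micro -m density,notannotated kore50.ttl msnbc.ttl
```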
Feel free to file a bug report, propose a new metric, or make a pull request.
- extensible reading and writing of NIF + an easier data structure