IR-SMART contains the code generated for a university project.
Given a query formatted in natural language, the code predicts the expected answer type from a set of candidate entities in the target ontology. In this project, the target ontology is taken from the DBpedia 2016 dump.
The project makes extensive use of the following tools and libraries:
To get a local copy up and running, follow these simple steps. It is assumed that you have Jupyter Notebook available, and it is recommended to use a Conda distribution (Anaconda/Miniconda).
Install the necessary Python libraries (if Conda is not used):
pip install --upgrade elasticsearch gensim numpy scipy scikit-learn
Other dependencies may exist, but in our setup they were installed through the Conda distribution.
Due to the overall size of the datasets, these have to be downloaded separately:
- DBpedia long_abstract_en.ttl
- DBpedia instance_types_en.ttl
- SeMantic AnsweR Type dataset
- GloVe Wikipedia 2014 + Gigaword 5 pretrained embeddings
Once all the files have been downloaded, extract them and place them so that the directory structure is as follows (the files highlighted with ## are the ones you need to download and place yourself):
📦IR-SMART
┣ 📂datasets
┃ ┣ 📂DBpedia
┃ ┃ ┣ 📜instance_types_en.ttl ##
┃ ┃ ┣ 📜long_abstracts_en.ttl ##
┃ ┃ ┣ 📜smarttask_dbpedia_test_questions.json ##
┃ ┃ ┗ 📜smarttask_dbpedia_train.json ##
┃ ┣ 📂gensim
┃ ┃ ┗ 📜...
┃ ┗ 📂glove
┃ ┣ 📜glove.6B.100d.txt ##
┃ ┣ 📜glove.6B.200d.txt ##
┃ ┣ 📜glove.6B.300d.txt ##
┃ ┗ 📜glove.6B.50d.txt ##
┣ 📂results
┃ ┣ 📜advanced.csv
┃ ┣ 📜advanced_word2vec.csv
┃ ┣ 📜baseline.csv
┃ ┗ 📜test_type_predictions.csv
┣ 📜.gitignore
┣ 📜baseline_variable_test.ipynb
┣ 📜evaluation.ipynb
┣ 📜indexer.ipynb
┣ 📜indexer_compact.ipynb
┣ 📜LICENSE
┣ 📜README.md
┗ 📜trial_and_error.ipynb
The necessary code to execute is located in `indexer_compact.ipynb` and `evaluation.ipynb`.
The other notebooks contain an alternative, larger index (`indexer.ipynb`), tests of how varying parameter values affected the score (`baseline_variable_test.ipynb`), and a failed early attempt to make the ES indexing more efficient by first loading all data files into memory and then initializing the ES indexing (`trial_and_error.ipynb`; not recommended to run).
- Execute all cells within `indexer_compact.ipynb`; this will generate the Elasticsearch index necessary for all subsequent steps.
  - PS: Ensure that Elasticsearch is running, either as a systemd process (Linux) or via the bat file (Windows).
  - PS: You will have to uncomment the function call `createTheIndex()` in cell 5 to generate the index, and `indexData(10000)` near the bottom of the file.
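The preprocessing behind the indexing step can be sketched as follows: each line of a DBpedia `.ttl` file is an RDF triple, from which the entity URI and the English literal are extracted and turned into Elasticsearch bulk actions. This is an illustrative sketch only, not the notebooks' code; the function names and the regex-based parsing are assumptions (a robust solution would use a proper RDF library such as rdflib), and the resulting generator would typically be fed to `elasticsearch.helpers.bulk`.

```python
import re

def parse_abstract_line(line):
    """Extract (entity URI, English literal) from one DBpedia .ttl triple.

    Hypothetical minimal parser for illustration; comments, blank lines,
    and non-English literals yield None.
    """
    match = re.match(r'<([^>]+)>\s+<[^>]+>\s+"(.*)"@en\s*\.\s*$', line)
    if match is None:
        return None
    return match.group(1), match.group(2)

def to_bulk_actions(lines, index_name="dbpedia"):
    """Turn parsed triples into Elasticsearch bulk-index actions."""
    for line in lines:
        parsed = parse_abstract_line(line)
        if parsed:
            uri, abstract = parsed
            yield {"_index": index_name, "_id": uri, "abstract": abstract}
```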
- Execute all cells within `evaluation.ipynb`; this will perform the evaluation using both the baseline and the advanced implementation.
  - PS: Uncomment the `convertGlovetoGensim()` function call in cell 5; this is necessary to allow Gensim to parse the GloVe embedding file.
The achieved accuracy scores have been summarized in the table below:
Method | Accuracy | NDCG@5 | NDCG@10 |
---|---|---|---|
Strict Baseline | 0.492 | 0.237 | 0.323 |
Lenient Baseline | 0.492 | 0.312 | 0.414 |
Strict Word2Vec | 0.522 | 0.280 | 0.367 |
Lenient Word2Vec | 0.522 | 0.364 | 0.455 |
Strict LTR (pointwise) | 0.776 | 0.731 | 0.754 |
Lenient LTR (pointwise) | 0.776 | 0.753 | 0.780 |
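The NDCG@5 and NDCG@10 columns follow the standard definition of the metric. As a generic sketch (not the notebooks' evaluation code), where `rels` is the list of graded relevance values of the predicted types in ranked order:

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k ranked relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (descending-sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```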
Distributed under the GPL-3.0 License. See `LICENSE` for more information.
- e-mail: [email protected]
- GitHub: @BerntA
- e-mail: [email protected]
- GitHub: @Chrystallic