
Neural Coreference Resolution for Arabic

Introduction

This repository contains code introduced in the following paper:

Neural Coreference Resolution for Arabic
Abdulrahman Aloraini*, Juntao Yu* and Massimo Poesio (*equal contribution)
In Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC@COLING), 2020.

Setup Environments

  • The code is written in Python 2; compatibility with Python 3 is not guaranteed.
  • Before starting, install all the required packages listed in requirements.txt with pip install -r requirements.txt.
  • After that, run setup.sh to download the fastText embeddings required by the system and to compile the TensorFlow custom kernels.

To use a pre-trained model

  • Pre-trained models can be downloaded from this link. We provide two pre-trained models:

    • One (arabic_cleaned_arabert) trained in the style of Lee et al. (2018).
    • A second (arabic_cleaned_arabert_e2e_annealing) that uses the predicted mention output of Yu et al. (2020); this is the best model from our paper.
    • We also include the predicted mentions used in our evaluation for all three datasets (train, dev and test sets).
    • The folder additionally contains a file called char_vocab.arabic.txt, the vocabulary file for the character-based embeddings used by our pre-trained models.
  • Put the downloaded models along with char_vocab.arabic.txt in the root folder of the code.

  • Modify the test_path and conll_test_path accordingly:

    • the test_path is the path to a .jsonlines file; each line of the .jsonlines file must be in the following format:
    {
    "clusters": [[[0,0],[5,5]],[[2,3],[7,8]]],
    "pred_mentions": [[0,0],[2,3],[5,5],[7,9]], #Optional
    "doc_key": "nw",
    "sentences": [["John", "has", "a", "car", "."], ["He", "washed", "the", "car", "yesterday", "."], ["Really", "?", "it", "was", "raining", "yesterday", "!"]],
    "speakers": [["sp1", "sp1", "sp1", "sp1", "sp1"], ["sp1", "sp1", "sp1", "sp1", "sp1", "sp1"], ["sp2", "sp2", "sp2", "sp2", "sp2", "sp2", "sp2"]]
    }
    
    • For "clusters" and "pred_mentions" the mentions contain two properties [start_index, end_index] the indices are counted in document level and both inclusive.
    • the conll_test_path is the path to the file of gold data in CoNLL format, see the CoNLL 2012 shared task page for more detail
    • For how to create the json and CoNLL files please follow the instractions from the Lee et al (2018).
    • You can preprocess the Arabic tokens by using python preprocess_arabic.py test.jsonlines test.cleaned.jsonlines.
  • Then run extract_bert_features.sh to compute the BERT embeddings for the test set.

  • Then use python evaluate.py config_name to start your evaluation.
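
The mention indices are easy to get wrong, so here is a minimal sketch (not part of the repository; the input file name is illustrative) that reads a .jsonlines file and recovers the text of each mention from the document-level, inclusive [start, end] indices:

    import json

    def load_docs(path):
        with open(path) as f:
            for line in f:
                yield json.loads(line)

    for doc in load_docs("test.cleaned.jsonlines"):
        # Mention indices are counted over the whole document, so flatten
        # the per-sentence token lists into one sequence first.
        tokens = [tok for sent in doc["sentences"] for tok in sent]
        for cluster in doc["clusters"]:
            # Both start and end are inclusive, hence end + 1 when slicing.
            print(doc["doc_key"], [" ".join(tokens[s:e + 1]) for s, e in cluster])

Run on the example document above, this prints "John"/"He" for the first cluster and "a car"/"the car" for the second.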

To train your own model

  • To train your own model, first create the character vocabulary using python get_char_vocab.py train.jsonlines dev.jsonlines.
  • Then run extract_bert_features.sh to compute the BERT embeddings for the training, development and test sets.
  • Finally, start training with python train.py config_name.
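
If you are creating train.jsonlines and dev.jsonlines yourself rather than converting OntoNotes with the Lee et al. (2018) scripts, here is a minimal sketch of writing one document in the format described above (field names follow the example; the uniform speaker labels and the output file name are placeholder assumptions):

    import json

    sentences = [["John", "has", "a", "car", "."],
                 ["He", "washed", "the", "car", "yesterday", "."]]

    doc = {
        "doc_key": "nw",
        "sentences": sentences,
        # One speaker label per token; "sp1" everywhere is a placeholder.
        "speakers": [["sp1"] * len(sent) for sent in sentences],
        # Gold clusters as document-level, inclusive [start, end] spans.
        "clusters": [[[0, 0], [5, 5]], [[2, 3], [7, 8]]],
    }

    # One JSON document per line, as expected by the .jsonlines format.
    with open("train.jsonlines", "a") as f:
        f.write(json.dumps(doc) + "\n")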

Training speed

The cluster ranking model takes about 40 hours to train (400k steps) on a GTX 1080Ti GPU.
