Data

We provide the preprocessing scripts and datasets used in our paper.

Tasks

We are currently preparing most of the (cross-lingual) task data for public release. If you would like to receive a preliminary (undocumented) version of the data, please send us an e-mail.

Cross-Lingual Word Embeddings

As part of our work, we trained word embeddings (BIVCD) and (re-)mapped others using the method described in the appendix of our paper.

The Fasttext 300K versions contain only the 300K most frequent tokens (of both languages). The full versions are mapped variants of the full pre-trained fastText embeddings. Use the full versions to reproduce our results.
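For illustration, restricting an embedding file to its most frequent tokens could look like the sketch below. This is a rough sketch, not necessarily how the released 300K files were produced. It assumes the word2vec-style text format (a `count dim` header line followed by one `token v1 v2 ...` row per token) with rows sorted by descending frequency. The function name `keep_most_frequent` is hypothetical.

```python
import io


def keep_most_frequent(src, dst, k):
    """Copy the first k vectors from src to dst (both text file objects).

    Assumptions: src starts with a "count dim" header that matches the
    number of rows, and rows are sorted by descending token frequency.
    """
    count, dim = (int(x) for x in src.readline().split())
    kept = min(k, count)
    dst.write(f"{kept} {dim}\n")  # header for the truncated file
    for _ in range(kept):
        dst.write(src.readline())
    return kept


# Toy usage with in-memory files instead of real embedding files.
src = io.StringIO("3 2\nthe 0.1 0.2\nof 0.3 0.4\ncat 0.5 0.6\n")
dst = io.StringIO()
kept = keep_most_frequent(src, dst, 2)
```

With real files, `src` and `dst` would be handles returned by `open(...)` in text mode, and `k` would be 300000.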

Translated SNLI

We trained our cross-lingual adaptations of InferSent on (machine-)translated cross-lingual variants of SNLI:

The above contain SNLI with all possible language combinations of the sentence pairs (en-en, en-de, de-en, de-de). Each dataset is therefore four times as large as the original.
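The four combinations described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing script. The record layout (a dict holding the premise and hypothesis in each language plus a label) and the function name `cross_lingual_pairs` are assumptions made for the example.

```python
from itertools import product


def cross_lingual_pairs(example, langs=("en", "de")):
    """Expand one example into every premise/hypothesis language pairing.

    `example` is assumed to look like:
    {"premise": {"en": ..., "de": ...},
     "hypothesis": {"en": ..., "de": ...},
     "label": ...}
    """
    pairs = []
    for p_lang, h_lang in product(langs, repeat=2):
        pairs.append({
            "pair": f"{p_lang}-{h_lang}",
            "premise": example["premise"][p_lang],
            "hypothesis": example["hypothesis"][h_lang],
            "label": example["label"],  # the label is shared by all pairings
        })
    return pairs


example = {
    "premise": {"en": "A man is sleeping.", "de": "Ein Mann schläft."},
    "hypothesis": {"en": "A person rests.", "de": "Eine Person ruht."},
    "label": "entailment",
}
pairs = cross_lingual_pairs(example)
```

With two languages, each original example yields four records, which is why the combined datasets are four times the size of the original.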

We plan to release translated SNLI corpora in further languages (de, fr, es, ar) soon.

Translated downstream tasks

MR, CR, etc.

Licenses

Please read LICENSE.txt and NOTICE.txt in the project root. We distribute derived data under the same license as the original data.