BiLex: A Bilingual Lexicon Inducer From Non-Parallel Data

This software learns a bilingual lexicon from non-parallel data with the help of a small seed lexicon. The technique is described in the following paper:

Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, and Maosong Sun. Bilingual Lexicon Induction From Non-Parallel Data With Minimal Supervision. In Proceedings of AAAI, 2017.

Runtime Environment

This software has been tested in the following environment, but should work in a compatible one.

64-bit Linux
Python 3.4
GCC 4.9.4

Usage

1. Compile the code.

./compile.sh

2. Specify the variables in the config file. For example, if config contains the following lines:

config=zh-en
lang1=zh
lang2=en

then the data should be located in data/zh-en with file extensions zh and en.

3. Prepare data according to Step 2. Toy non-parallel data is provided, along with a Chinese-English seed lexicon with 100 word translation pairs. If your seed lexicon has more than 10000 entries, you need to modify the code by redefining MAX_LEXICON_SIZE.

4. Train and obtain the bilingual lexicon.

./run.sh

5. The following files will be generated in data/zh-en (the folder specified in config):

word-vec.zh/en: Bilingual word embeddings in a human readable format. From these files vocab.zh/en and vec.zh/en are extracted.
vocab.zh/en: Vocabularies.
vec.zh/en: Bilingual word embeddings.
result: Translations of vocab.zh. For each source word, there will be at most 10 translations after the tab character, each in the format <translation candidate>:<cosine similarity to the source word vector>, separated by space and sorted in decreasing order of the cosine similarity. </s> is the sentence marker; its translations should be ignored.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data/zh-en		data/zh-en
scripts		scripts
src		src
LICENSE.txt		LICENSE.txt
README.md		README.md
compile.sh		compile.sh
config		config
run.sh		run.sh
train.sh		train.sh
translate.sh		translate.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BiLex: A Bilingual Lexicon Inducer From Non-Parallel Data

Runtime Environment

Usage

About

Releases

Packages

Languages

License

THUNLP-MT/BiLex

Folders and files

Latest commit

History

Repository files navigation

BiLex: A Bilingual Lexicon Inducer From Non-Parallel Data

Runtime Environment

Usage

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages