Where to obtain datasets for training? #62

chrisspen · 2020-08-05T18:47:54Z

In your README, you say you trained your model on the TED and Europarl datasets. Where did you obtain these? I can't find any public download links for anything matching those names.

I'd like to train my own model, using those as a starting point, but these datasets don't seem to exist anywhere.

ottokart · 2020-08-06T09:51:53Z

Hi,

Europarl can be downloaded from here: http://hltshare.fbk.eu/IWSLT2012/training-monolingual-europarl.tgz

The TED dataset was preprocessed by the authors of http://www.lrec-conf.org/proceedings/lrec2016/pdf/103_Paper.pdf and the resulting dataset is shared at: https://drive.google.com/file/d/0B13Cc1a7ebTuMElFWGlYcUlVZ0k/view
I used this simple script to convert the format of the files: https://drive.google.com/open?id=1sW23C4kqRJ6rDSBurco8_0lJ3VZJIkta

chrisspen · 2020-08-07T02:21:42Z

Thanks. However, how do you use that converter.py script on those archives? Each archive contains multiple files.

For example, the LREC archive contains files dev2012, test2011, test2011asr, and train2012. I'm not sure what the difference is between test2011 and test2011asr. The readme just says it's "for ASR output", which tells us nothing. Do I need to convert all of these files?

How do I combine this with the Europarl file? There only appears to be one, europarl-v7.en, and it seems to be in a very different format than the LREC files, as it contains full sentences, whereas the LREC files appear to contain pairs of tokens.
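For anyone following along, the gap between the two formats can be bridged with a small conversion step. This is a minimal sketch, not the converter.py script linked above. It assumes each line of the LREC files is a whitespace-separated word/label pair, and that the target is a stream of words with inline punctuation tokens. The label names (`COMMA`, `PERIOD`, `QUESTION`), the output tokens, and the function name `pairs_to_text` are all hypothetical and should be checked against the actual files.

```python
# Hypothetical mapping from per-word labels to inline punctuation tokens.
# Both the label names and the output tokens are assumptions; adjust them
# to match what your files and your training code actually use.
LABEL_TO_TOKEN = {
    "COMMA": ",COMMA",
    "PERIOD": ".PERIOD",
    "QUESTION": "?QUESTIONMARK",
}


def pairs_to_text(lines):
    """Join word/label lines into one space-separated token stream."""
    out = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, label = parts
        out.append(word.lower())
        token = LABEL_TO_TOKEN.get(label)
        if token is not None:
            # Labels without a mapping (e.g. "no punctuation") add nothing.
            out.append(token)
    return " ".join(out)
```

Running this over a file opened for reading produces a single line of text in which full sentences and punctuation markers are interleaved, which is closer in shape to the Europarl file.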

chrisspen · 2020-08-07T16:55:21Z

Never mind, I went through the scripts in ./examples and figured out how to preprocess the raw datasets.

I put the train/dev/test files for both the TED and Europarl datasets in the same directory, so that data.py would include them all. Is that copacetic?
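One way to sanity-check a combined directory before running data.py is to list which files fall into which split. The sketch below assumes the splits are identified by the filename suffixes `.train.txt`, `.dev.txt`, and `.test.txt`. That convention is an assumption here, and the helper names `group_by_split` and `list_splits` are hypothetical, so confirm the real matching rule in data.py.

```python
import os

# Assumed suffixes; check data.py for the ones it actually matches.
SUFFIXES = {"train": ".train.txt", "dev": ".dev.txt", "test": ".test.txt"}


def group_by_split(filenames):
    """Group filenames by split according to their assumed suffix."""
    groups = {split: [] for split in SUFFIXES}
    for name in sorted(filenames):
        for split, suffix in SUFFIXES.items():
            if name.endswith(suffix):
                groups[split].append(name)
    return groups


def list_splits(data_dir):
    """Apply group_by_split to the contents of a directory."""
    return group_by_split(os.listdir(data_dir))
```

If both the TED and Europarl files show up under the expected split, they would all be picked up by a loader that uses the same suffix rule.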

I'm now training a model using the recommended command, `time python main.py mymodel.pcl 256 0.02`, on a system without a GPU. Do you know how long that should take?
