Where to obtain datasets for training? #62

Open
chrisspen opened this issue Aug 5, 2020 · 3 comments

@chrisspen

In your README, you say you trained your model on the TED and Europarl datasets. Where did you obtain these? I can't find any public download links for anything matching those names.

I'd like to train my own model, using those as a starting point, but these datasets don't seem to exist anywhere.

@ottokart
Owner

ottokart commented Aug 6, 2020

Hi,

Europarl can be downloaded from here: http://hltshare.fbk.eu/IWSLT2012/training-monolingual-europarl.tgz

The TED dataset was preprocessed by the authors of http://www.lrec-conf.org/proceedings/lrec2016/pdf/103_Paper.pdf and the resulting dataset is shared at: https://drive.google.com/file/d/0B13Cc1a7ebTuMElFWGlYcUlVZ0k/view
I used this simple script to convert the format of the files: https://drive.google.com/open?id=1sW23C4kqRJ6rDSBurco8_0lJ3VZJIkta
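For anyone following along, fetching and unpacking the Europarl archive can be sketched as below. This is only a minimal sketch: the helper names `download` and `extract` are made up for illustration, and the only detail taken from this thread is the URL.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL as given in this thread; it may no longer be reachable.
EUROPARL_URL = "http://hltshare.fbk.eu/IWSLT2012/training-monolingual-europarl.tgz"


def download(url, dest):
    """Download url to dest unless dest already exists (hypothetical helper)."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract(archive, out_dir):
    """Unpack a gzip-compressed tar archive and return the extracted files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
    return sorted(p for p in out_dir.rglob("*") if p.is_file())


# Example usage (performs a network request, so it is left commented out):
# archive = download(EUROPARL_URL, "training-monolingual-europarl.tgz")
# for path in extract(archive, "europarl"):
#     print(path)
```

Listing the extracted files first is a cheap way to see what an archive contains before deciding which files to preprocess.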

@chrisspen
Author

Thanks. How do I run that converter.py script on those archives, though? Each archive contains multiple files.

For example, the LREC archive contains the files dev2012, test2011, test2011asr, and train2012. I'm not sure what the difference is between test2011 and test2011asr. The readme just says it's "for ASR output", which doesn't explain how the two differ. Do I need to convert all of these files?

How do I combine this with the Europarl file? There appears to be only one, europarl-v7.en, and it seems to be in a very different format from the LREC files: it contains full sentences, whereas the LREC files appear to contain pairs of tokens.
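To make the format mismatch concrete, here is a rough sketch of turning token/label lines into running text. Everything about the formats is an assumption, not something confirmed in this thread: it supposes each LREC line holds a word followed by a label (`O`, `COMMA`, `PERIOD`, `QUESTION`), and the output markers in `LABEL_TO_MARK` are hypothetical placeholders. It is not the converter.py script linked above.

```python
# Hypothetical mapping from per-token labels to inline punctuation markers.
# Both the label names and the marker strings are assumptions for illustration.
LABEL_TO_MARK = {
    "COMMA": ",COMMA",
    "PERIOD": ".PERIOD",
    "QUESTION": "?QUESTIONMARK",
}


def pairs_to_text(lines):
    """Join 'token label' lines into one space-separated string.

    Lines that do not split into exactly two fields are skipped, and
    unknown labels (including 'O') add no marker.
    """
    out = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        token, label = parts
        out.append(token)
        mark = LABEL_TO_MARK.get(label)
        if mark:
            out.append(mark)
    return " ".join(out)


sample = ["hello O", "world PERIOD", "how O", "are O", "you QUESTION"]
print(pairs_to_text(sample))
# prints: hello world .PERIOD how are you ?QUESTIONMARK
```

Once both corpora are in the same running-text form, they can be handled by the same preprocessing step.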

@chrisspen
Author

Never mind, I went through the scripts in ./examples and figured out how to preprocess the raw datasets.

I put the train/dev/test files for both the TED and Europarl datasets in the same directory so that data.py would include them all. Is that copacetic?
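A quick way to check what ended up in that shared directory is to group the files by split. This is a sketch under an assumed naming convention (`.train.txt`, `.dev.txt`, `.test.txt` suffixes); the suffixes and the `group_by_split` helper are hypothetical and should be adjusted to whatever data.py actually looks for.

```python
from pathlib import Path

# Assumed file-name suffixes for each split; not confirmed in this thread.
SPLIT_SUFFIXES = {
    "train": ".train.txt",
    "dev": ".dev.txt",
    "test": ".test.txt",
}


def group_by_split(data_dir):
    """Return {split: [file names]} for files in data_dir matching a suffix."""
    groups = {split: [] for split in SPLIT_SUFFIXES}
    for path in sorted(Path(data_dir).iterdir()):
        for split, suffix in SPLIT_SUFFIXES.items():
            if path.name.endswith(suffix):
                groups[split].append(path.name)
    return groups


# Example usage, with "data" as a placeholder directory name:
# for split, names in group_by_split("data").items():
#     print(split, names)
```

If one split comes back empty, or a corpus is missing from a split, the mix of TED and Europarl data is probably not what was intended.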

I'm now training a model using the recommended command, time python main.py mymodel.pcl 256 0.02, on a system without a GPU. Do you know roughly how long that should take?
