No punctuation in result - Train more lines or preprocess data.dev.txt? #76
Comments
I think you need to have each sentence on its own line, and many more samples. I used https://www.statmt.org/wmt14/training-monolingual-europarl-v7/europarl-v7.fr.gz and modified run.sh to use this file. Run it and see what the output .txt files look like.
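As a rough illustration of the "one sentence per line" suggestion, here is a minimal Python sketch that reflows punctuated training text so each sentence sits on its own line. The function name one_sentence_per_line and the splitting rule (break after ".", "!" or "?" followed by whitespace) are assumptions made for this example and are not part of the project. The rule is naive and will split wrongly after abbreviations, so treat it as a starting point only.

```python
import re

# Assumed rule: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def one_sentence_per_line(text: str) -> str:
    """Return text with one sentence per line (naive punctuation-based split)."""
    # Collapse existing newlines and repeated spaces into single spaces.
    normalized = " ".join(text.split())
    # Split after sentence-final punctuation and drop empty pieces.
    sentences = [s for s in _SENTENCE_END.split(normalized) if s]
    return "\n".join(sentences)


if __name__ == "__main__":
    sample = "Hello there. How are you?\nFine!"
    print(one_sentence_per_line(sample))
```

This only applies to text that already contains punctuation, such as a training corpus. It cannot segment unpunctuated speech-to-text output.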
I followed [...]. I used the europarl-v7.de-en.de dataset and took [...], with data.dev.txt being a long string on one line from Kaldi, a speech-to-text engine. It is all lowercase, sometimes contains wrong words, and has no punctuation. <model_output_path> is equal to data.dev.txt.
Is the solution to train on more lines, or do I have to preprocess data.dev.txt? If the latter, how?