No punctuation in result - Train more lines or preprocess data.dev.txt? #76

Open
ErfolgreichCharismatisch opened this issue Oct 31, 2021 · 2 comments

@ErfolgreichCharismatisch

I followed these steps:

python data.py <data_dir>
python main.py <model_name> 256 0.02
cat data.dev.txt | python punctuator.py <model_path> <model_output_path>

I used the europarl-v7.de-en.de dataset and took

1800 lines for ep.dev.txt
1800 lines for ep.test.txt
7200 lines for ep.train.txt

data.dev.txt is a single long line of output from Kaldi, a speech-to-text engine. It is all lowercase, contains some misrecognized words, and has no punctuation.

The file written to <model_output_path> is identical to data.dev.txt, so no punctuation was added.

Is the solution to train on more lines, or do I have to preprocess data.dev.txt? If the latter, how?

@ErfolgreichCharismatisch

Push

@ssabatier

ssabatier commented Apr 5, 2022

I think you need one sentence per line and many more training samples. I used https://www.statmt.org/wmt14/training-monolingual-europarl-v7/europarl-v7.fr.gz and modified run.sh to use this file. Run it and look at what the output .txt files contain.
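
As a rough illustration of the "one sentence per line, many more samples" advice, the Python sketch below reads a plain or gzipped corpus that already has one sentence per line, drops empty lines, and writes train/dev/test files. The function names (read_sentences, split_sentences, write_splits) are hypothetical and not part of the project; the ep.*.txt file names and the 1800-line dev/test sizes are taken from the question above. It only handles the line layout and the split. Any further formatting that data.py expects is not covered here.

```python
import gzip
from pathlib import Path


def read_sentences(path):
    """Return the non-empty lines of a plain or gzipped text file, stripped."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def split_sentences(sentences, dev_size, test_size):
    """Split into (train, dev, test); dev and test are taken from the end."""
    held_out = dev_size + test_size
    if held_out >= len(sentences):
        raise ValueError("corpus is too small for the requested dev/test sizes")
    train_end = len(sentences) - held_out
    train = sentences[:train_end]
    dev = sentences[train_end:train_end + dev_size]
    test = sentences[train_end + dev_size:]
    return train, dev, test


def write_splits(sentences, out_dir, dev_size=1800, test_size=1800):
    """Write ep.train.txt, ep.dev.txt and ep.test.txt, one sentence per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    train, dev, test = split_sentences(sentences, dev_size, test_size)
    for name, part in (("train", train), ("dev", dev), ("test", test)):
        text = "\n".join(part) + "\n"
        (out_dir / f"ep.{name}.txt").write_text(text, encoding="utf-8")
    return len(train), len(dev), len(test)


# Example call (the input path is an assumption; use the file you downloaded):
# sentences = read_sentences("europarl-v7.fr.gz")
# print(write_splits(sentences, "data_dir"))
```

With a full Europarl file, everything except the last 3600 sentences ends up in ep.train.txt, which gives far more training data than the 7200 lines used in the question.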
