Add tokenization for Tetun Dili (tdt) #144

Open
wants to merge 2 commits into
base: master


BLKSerene
Contributor

@BLKSerene BLKSerene commented Sep 25, 2023

This PR copies from and replaces #114, which seems to have been stale for more than 2 years, and also updates the nonbreaking prefixes for Tetun Dili.

@jelmervdl
Collaborator

Hi, thanks for this addition!

Do you have some example sentences that trigger the added regular expressions and (ideally) some of the non-breaking prefixes unique to this language? In the future, I'd like to add tests for all supported languages so we can make sure we don't break/change anything by accident.

@BLKSerene
Contributor Author

I don't speak Tetun Dili, so I hope these tests work as expected...

@jelmervdl
Collaborator

I noticed there's a test sentence in the original mosesdecoder pull request, but when I try it, the Perl and Python implementations yield different output. The original pull request (and what's currently in the Moses tokenizer) is also different.

I'll dig a bit deeper to see whether I can find out why #114 decided to implement it differently. I'm tempted to stick to what's in the old Moses repo unless there's a very good reason not to.

@BLKSerene
Contributor Author

BLKSerene commented Apr 24, 2024

Hi, any updates on this? Shall I close this PR? Alternatively, I can modify it to only update the nonbreaking prefixes.


2 participants