[BUG] Sentence input is not fully cleaned in "write", thus errors in "*_sentences.tsv" #4406
Comments
Thanks so much for raising this. I'll bring it to the team for further investigation, but you've given us a fantastic amount of context to start from.
For those who need to work on the released text corpora, here is my custom load script, which fixes ALL problems in v17.0, although it is bulky and slow.
Describe the bug
I found this when checking the new `validated_sentences.tsv`, where a row is split into two or more. This happens in many languages. It seems to be connected to the `<textarea>` tag in the new `write` page. No prior source (.txt files from GitHub, the old Sentence Collector, etc.) has them, as far as I can see.
Because the write page presents a `textarea`, people can enter CR, LF, and/or TAB characters across multiple lines. As these characters are not visible, the problem is not easy to pinpoint until you output the text to a file. So, why would anyone do this (multi-line entry)?
To Reproduce
Check your dataset and/or try pressing ENTER on the write page.
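One way to check a dataset for this, as a minimal sketch: compare the number of tab-separated fields on each physical line with the number in the header. The function name `find_broken_rows` and the sample rows are hypothetical, and the sketch assumes the first line is a header row.

```python
def find_broken_rows(lines):
    """Return (line_number, field_count) for every line whose number of
    tab-separated fields differs from the header's."""
    broken = []
    expected = None
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if expected is None:
            expected = len(fields)  # assumption: line 1 is the header
            continue
        if len(fields) != expected:
            broken.append((number, len(fields)))
    return broken


# Hypothetical sample: the second record contains a stray LF, so it
# spills over two physical lines (lines 3 and 4).
sample = [
    "sentence_id\tsentence\tsource\n",
    "a1\tFirst sentence.\tsrc\n",
    "b2\tSecond sen\n",
    "tence.\tsrc\n",
]
print(find_broken_rows(sample))  # [(3, 2), (4, 2)]
```

For a real file, the same function can be fed an open file object, for example `open(path, encoding="utf-8", newline="")`. A sentence with embedded TAB characters shows up the same way, with more fields than the header instead of fewer.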
Expected behavior
The invisible characters such as CR and LF should not enter the database.
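A minimal sketch of the kind of cleaning this implies, applied before a sentence is stored or exported. This is not the project's actual code: `clean_sentence` is a hypothetical name, and collapsing the characters into a single space (rather than rejecting the submission) is an assumption.

```python
import re

# Any run of whitespace, including CR, LF and TAB, becomes one space.
# Note that \s also matches other Unicode whitespace, such as no-break space.
_WHITESPACE_RUN = re.compile(r"\s+")


def clean_sentence(text: str) -> str:
    """Collapse whitespace runs to single spaces and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(clean_sentence("Merhaba\r\ndünya\t\t!  "))  # Merhaba dünya !
```

Whether a multi-line entry should be joined like this, split into separate sentences, or rejected outright is a design decision the sketch does not settle.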
Screenshots
Taken from Turkish v17.0 dataset.
![image](https://private-user-images.githubusercontent.com/8849617/316639305-f9be2c85-b1a9-4efd-b696-4c610ca1b0be.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgyNTk3NzIsIm5iZiI6MTcxODI1OTQ3MiwicGF0aCI6Ii84ODQ5NjE3LzMxNjYzOTMwNS1mOWJlMmM4NS1iMWE5LTRlZmQtYjY5Ni00YzYxMGNhMWIwYmUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTNUMDYxNzUyWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MzgyNjhlOTk4ZTdjOGZjZTg1OTNhZTNiMzRiZjMzMzBmM2Y3NWRjZTQ0YjY5ZjM3OGMyNmM4OTE3YzAxMTYwZiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.xg3ZeEW8kNvo_1pjQL-c8On7uhQAkNP5gR-9MsukEOk)
From `lv` locale, as an example of multi-sentence entry (id: 1abe332a32c15a932b18eb7f9a4548e93578b1b44d2b162ddccf829486c4de5c):
![image](https://private-user-images.githubusercontent.com/8849617/316638738-4b7267a6-8760-436b-a67c-a62ef4ed4a75.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgyNTk3NzIsIm5iZiI6MTcxODI1OTQ3MiwicGF0aCI6Ii84ODQ5NjE3LzMxNjYzODczOC00YjcyNjdhNi04NzYwLTQzNmItYTY3Yy1hNjJlZjRlZDRhNzUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTNUMDYxNzUyWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NGRlNTI1MzVhODkzNzY1YzJiOTk0YjVhMDJmZmQ2MGQ3ZGEzZTliYWFkMjU2NzMwMjg4OGU5NTljYzIyZWQyMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.d5fb82d4dxUXf5UKvk2bdI2MNxcH-YzAthGEdJC-3FY)
From `lg`, where the sentence contains many tab characters, which confuse .tsv readers/parsers (id: 62fb289d81ac03a947a4e46071fae5e29b73c6157dab536540c46b7f89e2821e):
Additional context
There are three points where you can correct this:
- Change the `textarea` tag to an `input` tag, so you cannot enter multi-line text.
- ... `*_sentences.tsv` files.
In my opinion, the correct course of action is:
- ... `input` tag for a single line, it is more intuitive
PS: We checked this with @moz-dfeller over PM, who could run a DB query. I am posting it here for completeness.