data preprocessing #177

Open
Lj4040 opened this issue Dec 27, 2022 · 7 comments

Comments

@Lj4040

Lj4040 commented Dec 27, 2022

What do SOURCE and TARGET stand for in data preprocessing? Could you explain them? Thank you for your reply.

@skurzhanskyi
Collaborator

As was mentioned, source is the original text and target is the corrected text.

@Lj4040
Author

Lj4040 commented Dec 27, 2022

For example, I downloaded the FCE dataset, which contains an M2 file and a JSON file. These files do not separate correct and incorrect sentences. How should I pass this data to the preprocessing script? I would greatly appreciate your guidance.

@Lj4040
Author

Lj4040 commented Dec 27, 2022

Only the downloaded synthetic dataset has separate correct and incorrect sentences. Do we have to use the synthetic data as input?

@skurzhanskyi
Collaborator

You can take a look at the M2scorer repository, specifically the edit_creator.py script.
To get the original (source) sentences, you can simply run `cat myfile.m2 | grep "^S " | cut -c3- > myfile.src`
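
The grep command above yields only the source side. As a rough sketch of how the target side could be rebuilt from the same M2 file, the Python below applies each annotation line to its source sentence. It is not the M2scorer's own code: the function name `m2_to_parallel`, the `annotator_id` parameter, and the set of skipped edit types are assumptions for illustration, and it assumes the edits of each sentence are listed in left-to-right order and do not overlap.

```python
def m2_to_parallel(m2_text, annotator_id=0):
    """Return (source, target) sentence pairs from M2-formatted text.

    Hypothetical helper: assumes "S ..." lines followed by
    "A start end|||type|||correction|||...|||annotator" lines,
    with blank lines between sentences.
    """
    skip_types = {"noop", "UNK", "Um"}  # assumed: edit types that carry no correction
    pairs = []
    for block in m2_text.strip().split("\n\n"):
        lines = [ln for ln in block.split("\n") if ln.strip()]
        if not lines or not lines[0].startswith("S "):
            continue
        source_tokens = lines[0].split()[1:]  # drop the leading "S"
        target_tokens = list(source_tokens)
        offset = 0  # how far earlier edits have shifted token positions
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            fields = line.split("|||")
            if fields[1] in skip_types:
                continue
            if int(fields[-1]) != annotator_id:
                continue  # keep a single annotator's corrections
            start, end = (int(x) for x in fields[0].split()[1:3])
            correction = fields[2].split()  # empty list means deletion
            target_tokens[start + offset:end + offset] = correction
            offset += len(correction) - (end - start)
        pairs.append((" ".join(source_tokens), " ".join(target_tokens)))
    return pairs


if __name__ == "__main__":
    sample = (
        "S This are a sentence .\n"
        "A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0\n"
    )
    for src, tgt in m2_to_parallel(sample):
        print(src)
        print(tgt)
```

Writing the first element of each pair to one file and the second to another, line by line, gives the parallel source and target files that the preprocessing step expects.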

@Lj4040
Author

Lj4040 commented Dec 27, 2022

Thank you sincerely for your answer. I will try it.

@Lj4040
Author

Lj4040 commented Dec 27, 2022

Dear author, after preprocessing, the dataset looks like the picture below. This format is quite different from the M2 file, so I'm not sure it is correct. Is a dataset in this format correct?
[Image: Picture 1]
Because I am a beginner in GEC, some of my questions may be a little naive. I hope you can understand. Thank you for your reply.

@skurzhanskyi
Collaborator

Yes, this is a training-specific format that stores only the input tokens and their corresponding tags.
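
To check such a file by eye, it can help to split each line back into tokens and tags. The sketch below assumes a layout in which every whitespace-separated item is a token joined to its tag(s) by a delimiter. The screenshot is not reproduced here, so the two delimiter strings, the tag names in the sample line, and the function name `parse_tagged_line` are all assumptions that should be checked against the actual preprocessed file.

```python
# Assumed delimiters; verify against your own preprocessed file.
TOKEN_LABEL_SEP = "SEPL|||SEPR"  # assumed: separates a token from its tags
LABEL_SEP = "SEPL__SEPR"         # assumed: separates multiple tags on one token


def parse_tagged_line(line):
    """Split one line of the training file into (token, [tags]) pairs."""
    pairs = []
    for item in line.split():
        token, _, labels = item.partition(TOKEN_LABEL_SEP)
        pairs.append((token, labels.split(LABEL_SEP) if labels else []))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the issue's screenshot.
    sample = "$STARTSEPL|||SEPR$KEEP ThisSEPL|||SEPR$KEEP areSEPL|||SEPR$REPLACE_is"
    for token, tags in parse_tagged_line(sample):
        print(token, tags)
```

If every line parses into the same tokens as the corresponding source sentence, the file is at least structurally consistent with the source text.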
