data preprocessing #177

Lj4040 · 2022-12-27T10:28:45Z

What do SOURCE and TARGET stand for in data preprocessing? Could you explain them? Thank you for your reply.

skurzhanskyi · 2022-12-27T10:32:27Z

As was mentioned, the source is the original text and the target is the corrected text.

Lj4040 · 2022-12-27T10:38:36Z

For example, I downloaded the FCE dataset, which contains an M2 file and a JSON file. These files do not separate correct and incorrect sentences. How should I pass this data to the preprocessing script? I would greatly appreciate your guidance.

Lj4040 · 2022-12-27T10:40:15Z

Only the synthetic dataset I downloaded has separate correct and incorrect sentences. Do we have to use the synthetic data as input?

skurzhanskyi · 2022-12-27T10:45:53Z

You can take a look at the M2scorer repository, specifically the edit_creator.py script.
To get the original (source) sentences, you can simply run: cat myfile.m2 | grep "^S " | cut -c3- > myfile.src
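The shell command above recovers only the source side. For the target side, the edits listed after each sentence have to be applied to it. A minimal sketch of that step in Python follows. It is not taken from the M2scorer scripts: the function name m2_to_parallel and its annotator_id parameter are hypothetical, and it assumes the common M2 layout (an "S " line, then "A start end|||type|||correction|||...|||annotator" lines in left-to-right order, with blocks separated by blank lines).

```python
import re


def m2_to_parallel(m2_text, annotator_id=0):
    """Return (source, target) sentence pairs from M2-formatted text.

    Hypothetical helper: assumes edits within a block are listed in
    left-to-right order and uses only the edits of one annotator.
    """
    pairs = []
    # Blocks (one sentence plus its edits) are separated by blank lines.
    for block in re.split(r"\n\s*\n", m2_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines or not lines[0].startswith("S "):
            continue
        source_tokens = lines[0][2:].split()
        target_tokens = list(source_tokens)
        offset = 0  # how far earlier edits have shifted token positions
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            fields = line[2:].split("|||")
            span, edit_type, correction = fields[0], fields[1], fields[2]
            if int(fields[-1]) != annotator_id:
                continue
            start, end = (int(x) for x in span.split())
            # "noop" edits (span -1 -1) mark sentences with no corrections.
            if edit_type == "noop" or start < 0:
                continue
            # An empty or "-NONE-" correction means the span is deleted.
            replacement = [] if correction == "-NONE-" else correction.split()
            target_tokens[start + offset:end + offset] = replacement
            offset += len(replacement) - (end - start)
        pairs.append((" ".join(source_tokens), " ".join(target_tokens)))
    return pairs
```

Writing the first element of each pair to one file and the second to another, line by line, gives the parallel source and target files. Datasets with several annotators need a choice of annotator_id, since mixing their edits would produce an inconsistent target.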

Lj4040 · 2022-12-27T10:50:34Z

Thank you sincerely for your answer. I will try it.

Lj4040 · 2022-12-27T11:32:24Z

Dear author, after data processing, the dataset looks like the following picture. This file format is quite different from the M2 file, so I'm not sure whether it is correct. Is a dataset in this format correct?

Because I am a beginner in GEC, some of my questions may be a little naive. I hope you can understand. Thank you for your reply.

skurzhanskyi · 2023-01-15T19:58:48Z

Yes, this is a training-specific format that stores only the input tokens and their corresponding tags.
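To sanity-check such a file, it can help to split each line back into tokens and tags. The sketch below is only an illustration: the function name parse_tagged_line is hypothetical, and the delimiter strings used as defaults are assumptions that should be checked against the preprocessed file itself before relying on them.

```python
def parse_tagged_line(line, token_sep=" ", label_sep="SEPL|||SEPR",
                      op_sep="SEPL__SEPR"):
    """Split one preprocessed line into (token, [tags]) pairs.

    The default separators are assumptions for illustration, not values
    confirmed in this thread; adjust them to match the actual file.
    """
    pairs = []
    for item in line.strip().split(token_sep):
        if not item:
            continue
        # Everything before the label separator is the token,
        # everything after it is one or more tags.
        token, _, tags = item.partition(label_sep)
        pairs.append((token, tags.split(op_sep) if tags else []))
    return pairs
```

If every line parses into tokens that each carry at least one tag, the file is at least structurally consistent with a "tokens plus tags" training format.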

Chunngai mentioned this issue Jan 10, 2023

Conversion from m2 to parallel #179

Closed
