Improving Seq2Seq Grammatical Error Correction via Decoding Interventions

Houquan Zhou, Yumeng Liu, Zhenghua Li✉️, Min Zhang, Bo Zhang, Chen Li, Ji Zhang, Fei Huang

[Cover image: created by DALL·E 3]

TL;DR

This repo contains the code for our EMNLP 2023 Findings paper: Improving Seq2Seq Grammatical Error Correction via Decoding Interventions.

We introduce a decoding intervention framework that uses critics to assess and guide token generation during decoding. We evaluate two types of critics: a pre-trained language model (the LM-critic) and an incremental target-side grammatical error detector (the GED-critic). Experiments on English and Chinese datasets show that our approach outperforms many existing methods and is competitive with state-of-the-art models.
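As a rough illustration only (not the repo's implementation), the sketch below rescores candidate tokens at one decoding step by adding a weighted critic score to the GEC model's log-probabilities. The function name, the simple log-linear combination, and the role of alpha here are assumptions; the paper defines the actual intervention rule and the precise meaning of the $\alpha$ and $\beta$ coefficients.

import torch

def intervened_scores(gec_log_probs: torch.Tensor,
                      critic_log_probs: torch.Tensor,
                      alpha: float = 0.8) -> torch.Tensor:
    # gec_log_probs:    [beam_size, vocab_size] log-probs from the Seq2Seq GEC model
    # critic_log_probs: [beam_size, vocab_size] scores from a critic, e.g. a
    #                   pre-trained LM or a target-side GED model
    # Illustrative log-linear combination only; the paper's intervention rule differs.
    return gec_log_probs + alpha * critic_log_probs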

Citation

@inproceedings{zhou-et-al-2023-improving,
  title     = {Improving Seq2Seq Grammatical Error Correction via Decoding Interventions},
  author    = {Zhou, Houquan  and
               Liu, Yumeng  and
               Li, Zhenghua  and
               Zhang, Min  and
               Zhang, Bo  and
               Li, Chen  and
               Zhang, Ji  and
               Huang, Fei},
  booktitle = {Findings of EMNLP},
  year      = {2023},
  address   = {Singapore}
}

Setup

Clone this repo recursively:

git clone https://github.com/Jacob-Zhou/gecdi.git --recursive

# The newest version of the parser is not compatible with the current code,
# so we need to check out a previous version
cd 3rdparty/parser/ && git checkout 6dc927b && cd -

Then you can use the following commands to create an environment and install the dependencies:

. scripts/set_environment.sh

# For ERRANT (v2.0.0) evaluation, a Python 3.6 environment is required.
# Make sure your system has Python 3.6 installed, then run:
. scripts/set_py36_environment.sh

You can follow this repo to obtain the 3-stage train/dev/test data for training an English GEC model. The multilingual datasets are available here.

Before running, you must preprocess each sentence pair into the following format:

S   [src]
T   [tgt]

S   [src]
T   [tgt]

Here, [src] and [tgt] are the source and target sentences, respectively. A \t separates the prefix (S or T) from the sentence, and sentence pairs are separated by blank lines. See data/toy.train for examples.
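For convenience, here is a minimal Python sketch (not part of the repo; the file names src.txt, tgt.txt, and data/my.train are hypothetical) that writes two parallel one-sentence-per-line files into this format:

# Hypothetical input/output paths; adjust to your data.
with open("src.txt") as fsrc, open("tgt.txt") as ftgt, open("data/my.train", "w") as fout:
    for src, tgt in zip(fsrc, ftgt):
        fout.write(f"S\t{src.strip()}\n")   # source sentence
        fout.write(f"T\t{tgt.strip()}\n")   # target (corrected) sentence
        fout.write("\n")                    # blank line between pairs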

Download Trained Models

The trained models are available on the Hugging Face model hub. You can download them by running:

# If you have not installed git-lfs, please install it first
# The installation guide can be found here: https://git-lfs.github.com/
# Most installation methods require root permission.
# However, you can install it locally using conda:
# https://anaconda.org/anaconda/git-lfs

# Create directory for storing the trained models
mkdir -p models
cd models

# Download the trained models
# First, clone the small files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HQZhou/bart-large-gec
# Then use git-lfs to download the large files
cd bart-large-gec
git lfs pull

# Return to the models directory
cd -

# The download process is the same for the GED model
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HQZhou/bart-large-ged
cd bart-large-ged
git lfs pull

Run

English experiments:

# Baseline (vanilla decoding)
bash pred.sh  \
    devices=0  \
    gec_path=models/bart-large-gec/model  \
    dataset=bea19.dev

# w/ LM-critic
bash pred.sh  \
    devices=0  \
    gec_path=models/bart-large-gec/model  \
    lm_alpha=0.8 lm_beta=10  \
    dataset=bea19.dev

# w/ GED-critic
bash pred.sh  \
    devices=0  \
    gec_path=models/bart-large-gec/model  \
    ged_path=models/bart-large-ged/model  \
    ged_alpha=0.8 ged_beta=1  \
    batch=500  \
    dataset=bea19.dev

# w/ both LM-critic and GED-critic
bash pred.sh  \
    devices=0  \
    gec_path=models/bart-large-gec/model  \
    ged_path=models/bart-large-ged/model  \
    lm_alpha=0.8 lm_beta=10  \
    ged_alpha=0.8 ged_beta=1  \
    batch=250  \
    dataset=bea19.dev

Chinese experiments:

# Baseline (vanilla decoding)
bash pred.sh  \
    devices=0  \
    dataset=mucgec.dev

# w/ LM-critic
bash pred.sh  \
    devices=0  \
    lm_alpha=0.3  \
    lm_beta=0.1  \
    dataset=mucgec.dev

# w/ GED-critic
bash pred.sh  \
    devices=0  \
    ged_alpha=0.6 ged_beta=10  \
    dataset=mucgec.dev

# w/ both LM-critic and GED-critic
bash pred.sh  \
    devices=0  \
    lm_alpha=0.3 lm_beta=0.1  \
    ged_alpha=0.6 ged_beta=10  \
    dataset=mucgec.dev

Recommended Hyperparameters

We search for the coefficients $\alpha$ and $\beta$ on the development set (a minimal sketch of such a search appears after the tables below).

The optimal coefficients vary across datasets.

Hyperparameters for LM-critic:

Dataset     $\alpha$   $\beta$
CoNLL-14    0.8        10.0
BEA-19      0.8        10.0
GMEG-Wiki   1.0        10.0
MuCGEC      0.3        0.1

Hyperparameters for GED-critic:

Dataset     $\alpha$   $\beta$
CoNLL-14    0.8        1.0
BEA-19      0.8        1.0
GMEG-Wiki   0.9        1.0
MuCGEC      0.6        10.0
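A minimal sketch of such a grid search (the grid values and the choice of bea19.dev are only examples, not the exact search used in the paper), repeatedly invoking pred.sh over LM-critic coefficients:

import subprocess

# Hypothetical grid; adjust the ranges and the dataset to your setup.
for alpha in (0.3, 0.6, 0.8, 1.0):
    for beta in (0.1, 1, 10):
        subprocess.run(
            ["bash", "pred.sh",
             "devices=0",
             "gec_path=models/bart-large-gec/model",
             f"lm_alpha={alpha}", f"lm_beta={beta}",
             "dataset=bea19.dev"],
            check=True,
        )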
