Improving Seq2Seq Grammatical Error Correction via Decoding Interventions

Houquan Zhou, Yumeng Liu, Zhenghua Li✉️, Min Zhang, Bo Zhang, Chen Li, Ji Zhang, Fei Huang

[Cover image: created by DALL·E 3]

TL;DR

This repo contains the code for our EMNLP 2023 Findings paper: Improving Seq2Seq Grammatical Error Correction via Decoding Interventions.

We introduce a decoding intervention framework that uses critics to assess and guide token generation during decoding. We evaluate two types of critics: a pre-trained language model (the LM-critic) and an incremental target-side grammatical error detector (the GED-critic). Experiments on English and Chinese datasets show that our approach outperforms many existing methods and is competitive with state-of-the-art models.
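As a rough illustration only (not the repo's implementation), the sketch below rescores candidate tokens at one decoding step by adding a weighted critic score to the GEC model's log-probabilities. The function name, the simple log-linear combination, and the role of alpha here are assumptions; the paper defines the actual intervention rule and the precise meaning of the $\alpha$ and $\beta$ coefficients.

import torch

def intervened_scores(gec_log_probs: torch.Tensor,
                      critic_log_probs: torch.Tensor,
                      alpha: float = 0.8) -> torch.Tensor:
    # gec_log_probs:    [beam_size, vocab_size] log-probs from the Seq2Seq GEC model
    # critic_log_probs: [beam_size, vocab_size] scores from a critic, e.g. a
    #                   pre-trained LM or a target-side GED model
    # Illustrative log-linear combination only; the paper's intervention rule differs.
    return gec_log_probs + alpha * critic_log_probs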

Citation

@inproceedings{zhou-et-al-2023-improving,
  title     = {Improving Seq2Seq Grammatical Error Correction via Decoding Interventions},
  author    = {Zhou, Houquan  and
               Liu, Yumeng  and
               Li, Zhenghua  and
               Zhang, Min  and
               Zhang, Bo  and
               Li, Chen  and
               Zhang, Ji  and
               Huang, Fei},
  booktitle = {Findings of EMNLP},
  year      = {2023},
  address   = {Singapore}
}

Setup

Clone this repo recursively:

git clone https://github.com/Jacob-Zhou/gecdi.git --recursive

# The newest version of the parser is not compatible with the current code,
# so we need to check out a previous version
cd 3rdparty/parser/ && git checkout 6dc927b && cd -

Then you can use the following commands to create an environment and install the dependencies:

. scripts/set_environment.sh

# For ERRANT (v2.0.0) evaluation, a Python 3.6 environment is required.
# Make sure your system has Python 3.6 installed, then run:
. scripts/set_py36_environment.sh

You can follow this repo to obtain the 3-stage train/dev/test data for training an English GEC model. The multilingual datasets are available here.

Before running, you must preprocess each sentence pair into the following format:

S   [src]
T   [tgt]

S   [src]
T   [tgt]

Here, [src] and [tgt] are the source and target sentences, respectively. A \t separates the prefix (S or T) from the sentence, and sentence pairs are separated by blank lines. See data/toy.train for examples.
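For convenience, here is a minimal Python sketch (not part of the repo; the file names src.txt, tgt.txt, and data/my.train are hypothetical) that writes two parallel one-sentence-per-line files into this format:

# Hypothetical input/output paths; adjust to your data.
with open("src.txt") as fsrc, open("tgt.txt") as ftgt, open("data/my.train", "w") as fout:
    for src, tgt in zip(fsrc, ftgt):
        fout.write(f"S\t{src.strip()}\n")   # source sentence
        fout.write(f"T\t{tgt.strip()}\n")   # target (corrected) sentence
        fout.write("\n")                    # blank line between pairs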

Download Trained Models

The trained models are available on the Hugging Face model hub. You can download them by running:

# If you have not installed git-lfs, please install it first
# The installation guide can be found here: https://git-lfs.github.com/
# Most installation methods require root permission.
# However, you can install it locally using conda:
# https://anaconda.org/anaconda/git-lfs

# Create directory for storing the trained models
mkdir -p models
cd models

# Download the trained models
# First, clone the small files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HQZhou/bart-large-gec
# Then use git-lfs to download the large files
cd bart-large-gec
git lfs pull

# Return to the models directory
cd -

# The download process is the same for the GED model
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HQZhou/bart-large-ged
cd bart-large-ged
git lfs pull

Run

English experiments:

# Baseline (vanilla decoding)
bash pred.sh  \
    devices=0  \
    gec_path=models/bart-large-gec/model  \
    dataset=bea19.dev

# w/ LM-critic
bash pred.sh  \
    devices=0  \
    gec_path=models/bart-large-gec/model  \
    lm_alpha=0.8 lm_beta=10  \
    dataset=bea19.dev

# w/ GED-critic
bash pred.sh  \
    devices=0  \
    gec_path=models/bart-large-gec/model  \
    ged_path=models/bart-large-ged/model  \
    ged_alpha=0.8 ged_beta=1  \
    batch=500  \
    dataset=bea19.dev

# w/ both LM-critic and GED-critic
bash pred.sh  \
    devices=0  \
    gec_path=models/bart-large-gec/model  \
    ged_path=models/bart-large-ged/model  \
    lm_alpha=0.8 lm_beta=10  \
    ged_alpha=0.8 ged_beta=1  \
    batch=250  \
    dataset=bea19.dev

Chinese experiments:

# Baseline (vanilla decoding)
bash pred.sh  \
    devices=0  \
    dataset=mucgec.dev

# w/ LM-critic
bash pred.sh  \
    devices=0  \
    lm_alpha=0.3  \
    lm_beta=0.1  \
    dataset=mucgec.dev

# w/ GED-critic
bash pred.sh  \
    devices=0  \
    ged_alpha=0.6 ged_beta=10  \
    dataset=mucgec.dev

# w/ both LM-critic and GED-critic
bash pred.sh  \
    devices=0  \
    lm_alpha=0.3 lm_beta=0.1  \
    ged_alpha=0.6 ged_beta=10  \
    dataset=mucgec.dev

Recommended Hyperparameters

We search for the coefficients $\alpha$ and $\beta$ on the development set (a minimal sketch of such a search appears after the tables below).

The optimal coefficients vary across datasets.

Hyperparameters for LM-critic:

Dataset     $\alpha$   $\beta$
CoNLL-14    0.8        10.0
BEA-19      0.8        10.0
GMEG-Wiki   1.0        10.0
MuCGEC      0.3        0.1

Hyperparameters for GED-critic:

Dataset     $\alpha$   $\beta$
CoNLL-14    0.8        1.0
BEA-19      0.8        1.0
GMEG-Wiki   0.9        1.0
MuCGEC      0.6        10.0
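A minimal sketch of such a grid search (the grid values and the choice of bea19.dev are only examples, not the exact search used in the paper), repeatedly invoking pred.sh over LM-critic coefficients:

import subprocess

# Hypothetical grid; adjust the ranges and the dataset to your setup.
for alpha in (0.3, 0.6, 0.8, 1.0):
    for beta in (0.1, 1, 10):
        subprocess.run(
            ["bash", "pred.sh",
             "devices=0",
             "gec_path=models/bart-large-gec/model",
             f"lm_alpha={alpha}", f"lm_beta={beta}",
             "dataset=bea19.dev"],
            check=True,
        )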
