
CCF BDCI2021 Corrupted_Short_Message_Reconstruction

zhezhaoa edited this page Aug 15, 2023 · 4 revisions

This page briefly summarizes our solution to CCF-BDCI2021-Corrupted_Short_Message_Reconstruction. A seq2seq model generates the clean text from the corrupted text. The pre-trained models used below can be obtained from the Modelzoo section:

BART-base

An example of fine-tuning Chinese BART-base and running inference with it:

CUDA_VISIBLE_DEVICES=0,1 python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_bart_base_seq512_model.bin-1000000 \
                                                           --vocab_path models/google_zh_vocab.txt \
                                                           --config_path models/bart/base_config.json \
                                                           --train_path datasets/corrupted_short_message_reconstruction/train.tsv \
                                                           --dev_path datasets/corrupted_short_message_reconstruction/dev.tsv \
                                                           --seq_length 192 --tgt_seq_length 192 --learning_rate 5e-5 --epochs_num 3 --batch_size 16
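
The fine-tuning command above reads train.tsv and dev.tsv, whose layout is not described on this page. The sketch below shows one way to write (corrupted, clean) pairs to such a file. It assumes a tab-separated layout with a header row naming a source column `text_a` and a target column `label`; check those column names against finetune/run_text2text.py before relying on them. The helper names and the sample pair are hypothetical.

```python
# Sketch only: the column names "text_a" and "label" are assumptions,
# not confirmed by this page.

def clean_field(text):
    """Replace tabs and line breaks so each pair stays on one TSV line."""
    for ch in ("\t", "\r", "\n"):
        text = text.replace(ch, " ")
    return text.strip()


def format_text2text_tsv(pairs):
    """Return TSV content for an iterable of (corrupted, clean) string pairs."""
    lines = ["text_a\tlabel"]  # assumed header row
    for corrupted, clean in pairs:
        lines.append(clean_field(corrupted) + "\t" + clean_field(clean))
    return "\n".join(lines) + "\n"


# Made-up sample pair, for illustration only.
sample_pairs = [("您的验正码是1234", "您的验证码是1234")]
tsv_text = format_text2text_tsv(sample_pairs)
print(tsv_text, end="")
```

To create a real file, write the returned string with `open(path, "w", encoding="utf-8")`.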

CUDA_VISIBLE_DEVICES=0,1 python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                                                  --vocab_path models/google_zh_vocab.txt \
                                                                  --config_path models/bart/base_config.json \
                                                                  --test_path datasets/corrupted_short_message_reconstruction/test.tsv \
                                                                  --prediction_path datasets/corrupted_short_message_reconstruction/prediction.tsv \
                                                                  --seq_length 192 --tgt_seq_length 192 --batch_size 256
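
After inference, a quick sanity check is to compare the generated texts with known clean texts, for example by running the inference command on dev.tsv instead of test.tsv. The sketch below computes an exact-match rate. It assumes the prediction file starts with a header line `label` followed by one generated text per line, and that the reference file has a `label` column; both are assumptions, and exact match is not necessarily the competition metric. All function names are hypothetical.

```python
# Sketch only: file layouts and the "label" column name are assumptions.

def read_column(lines, column):
    """Extract one named column from an iterable of TSV lines with a header."""
    rows = iter(lines)
    header = next(rows).rstrip("\r\n").split("\t")
    index = header.index(column)  # raises ValueError if the column is missing
    return [row.rstrip("\r\n").split("\t")[index] for row in rows]


def exact_match(predictions, references):
    """Fraction of predictions identical to their reference after stripping."""
    if len(predictions) != len(references):
        raise ValueError("prediction and reference counts differ")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# In-memory stand-ins for prediction.tsv and dev.tsv (made-up content).
prediction_lines = ["label\n", "您的验证码是1234\n", "请查收快递\n"]
reference_lines = ["text_a\tlabel\n", "您的验正码是1234\t您的验证码是1234\n", "请查收块递\t请查收包裹\n"]

score = exact_match(read_column(prediction_lines, "label"),
                    read_column(reference_lines, "label"))
print(score)
```

With real files, pass open file handles (opened with `encoding="utf-8"`) to `read_column` in place of the in-memory lists.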

BART-large

An example of fine-tuning Chinese BART-large and running inference with it:

CUDA_VISIBLE_DEVICES=0,1 python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_bart_large_seq512_model.bin-1000000 \
                                                           --vocab_path models/google_zh_vocab.txt \
                                                           --config_path models/bart/large_config.json \
                                                           --train_path datasets/corrupted_short_message_reconstruction/train.tsv \
                                                           --dev_path datasets/corrupted_short_message_reconstruction/dev.tsv \
                                                           --seq_length 192 --tgt_seq_length 192 --learning_rate 5e-5 --epochs_num 3 --batch_size 16

CUDA_VISIBLE_DEVICES=0,1 python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                                                  --vocab_path models/google_zh_vocab.txt \
                                                                  --config_path models/bart/large_config.json \
                                                                  --test_path datasets/corrupted_short_message_reconstruction/test.tsv \
                                                                  --prediction_path datasets/corrupted_short_message_reconstruction/prediction.tsv \
                                                                  --seq_length 192 --tgt_seq_length 192 --batch_size 256

PEGASUS-base

An example of fine-tuning Chinese PEGASUS-base and running inference with it:

CUDA_VISIBLE_DEVICES=0,1 python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_pegasus_base_seq512_model.bin-1000000 \
                                                           --vocab_path models/google_zh_vocab.txt \
                                                           --config_path models/pegasus/base_config.json \
                                                           --train_path datasets/corrupted_short_message_reconstruction/train.tsv \
                                                           --dev_path datasets/corrupted_short_message_reconstruction/dev.tsv \
                                                           --seq_length 192 --tgt_seq_length 192 --learning_rate 5e-5 --epochs_num 3 --batch_size 16

CUDA_VISIBLE_DEVICES=0,1 python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                                                  --vocab_path models/google_zh_vocab.txt \
                                                                  --config_path models/pegasus/base_config.json \
                                                                  --test_path datasets/corrupted_short_message_reconstruction/test.tsv \
                                                                  --prediction_path datasets/corrupted_short_message_reconstruction/prediction.tsv \
                                                                  --seq_length 192 --tgt_seq_length 192 --batch_size 256

PEGASUS-large

An example of fine-tuning Chinese PEGASUS-large and running inference with it:

CUDA_VISIBLE_DEVICES=0,1 python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_pegasus_large_seq512_model.bin-1000000 \
                                                           --vocab_path models/google_zh_vocab.txt \
                                                           --config_path models/pegasus/large_config.json \
                                                           --train_path datasets/corrupted_short_message_reconstruction/train.tsv \
                                                           --dev_path datasets/corrupted_short_message_reconstruction/dev.tsv \
                                                           --seq_length 192 --tgt_seq_length 192 --learning_rate 5e-5 --epochs_num 3 --batch_size 16

CUDA_VISIBLE_DEVICES=0,1 python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                                                  --vocab_path models/google_zh_vocab.txt \
                                                                  --config_path models/pegasus/large_config.json \
                                                                  --test_path datasets/corrupted_short_message_reconstruction/test.tsv \
                                                                  --prediction_path datasets/corrupted_short_message_reconstruction/prediction.tsv \
                                                                  --seq_length 192 --tgt_seq_length 192 --batch_size 256