Data Augmentation using Pre-trained Transformer Models

This code was originally released in the amazon-research package (https://github.com/amazon-research/transformers-data-augmentation). In the paper, we mentioned the URL https://github.com/varinf/TransformersDataAugmentation, so we are providing a copy of the same code here.

Code associated with the Data Augmentation using Pre-trained Transformer Models paper

The code contains implementations of the following data augmentation methods:

EDA (Baseline)
Backtranslation (Baseline)
CBERT (Baseline)
BERT Prepend (Our paper)
GPT-2 Prepend (Our paper)
BART Prepend (Our paper)

Datasets

In the paper, we use three datasets: STSA-2, TREC, and SNIPS.

Low-data regime experiment setup

Run the src/utils/download_and_prepare_datasets.sh file to prepare all datasets.
download_and_prepare_datasets.sh performs the following steps:

Download the data from GitHub
Replace numeric labels with text for the STSA-2 and TREC datasets
For each dataset, create 15 random splits of train and dev data
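
The split-creation step above can be sketched roughly as follows. This is a minimal illustration and not the logic of download_and_prepare_datasets.sh itself: the function name make_random_splits, the per-split sizes, and the seeding scheme are all assumptions. Only the count of 15 splits comes from the description above.

```python
import random


def make_random_splits(examples, num_splits=15, train_size=10, dev_size=10, base_seed=0):
    """Return num_splits (train, dev) pairs sampled from examples.

    Hypothetical helper: sizes and seeding are illustrative assumptions,
    not values taken from the repository's script.
    """
    if train_size + dev_size > len(examples):
        raise ValueError("not enough examples for the requested split sizes")
    splits = []
    for split_id in range(num_splits):
        # A separate seeded generator per split keeps each split reproducible.
        rng = random.Random(base_seed + split_id)
        # Sampling without replacement keeps train and dev disjoint.
        sampled = rng.sample(examples, train_size + dev_size)
        splits.append((sampled[:train_size], sampled[train_size:]))
    return splits


if __name__ == "__main__":
    # Toy (label, text) pairs standing in for a real dataset.
    data = [("label_%d" % (i % 2), "sentence %d" % i) for i in range(100)]
    splits = make_random_splits(data)
    print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Seeding each split separately is one way to make a single split reproducible without regenerating the others; the actual script may handle randomness differently.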

Dependencies

To run this code, you need the following dependencies:

PyTorch 1.5
fairseq 0.9
transformers 2.9
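
One way to confirm that an environment matches the versions listed above is a small check like the one below. It is a sketch under assumptions: the distribution names (torch, fairseq, transformers) are assumed to be the names the packages are installed under, and the helper names matches_required and check_dependencies are hypothetical, not part of this repository.

```python
from importlib import metadata

# Version prefixes from the dependency list above; the distribution
# names (torch, fairseq, transformers) are assumptions.
REQUIRED = {"torch": "1.5", "fairseq": "0.9", "transformers": "2.9"}


def matches_required(installed, required_prefix):
    """True if installed equals the prefix or extends it (e.g. 1.5.1 for 1.5)."""
    return installed == required_prefix or installed.startswith(required_prefix + ".")


def check_dependencies(required=REQUIRED):
    """Return a dict mapping package name to a short status string."""
    report = {}
    for name, prefix in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if matches_required(installed, prefix) else "found " + installed
    return report


if __name__ == "__main__":
    for name, status in check_dependencies().items():
        print(name, status)
```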

How to run

To run a data augmentation experiment for a given dataset, run the corresponding bash script in the scripts folder. For example, to run data augmentation on the snips dataset:

Run scripts/bart_snips_lower.sh for the BART experiment
Run scripts/bert_snips_lower.sh for the rest of the data augmentation methods
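
The choice between the two snips scripts can also be wrapped in a small launcher, sketched below. The helper names build_command and run_experiment and the method strings passed to them are hypothetical. The sketch also assumes it is started from the directory that contains the scripts folder.

```python
import subprocess

# Script paths as given above for the snips dataset.
BART_SCRIPT = "scripts/bart_snips_lower.sh"
OTHER_SCRIPT = "scripts/bert_snips_lower.sh"


def build_command(method):
    """Return the bash command for the given augmentation method on snips."""
    # BART has its own script; every other method shares the second one.
    script = BART_SCRIPT if method.lower() == "bart" else OTHER_SCRIPT
    return ["bash", script]


def run_experiment(method):
    """Run the script; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(method), check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_experiment to launch one.
    print(build_command("bart"))
    print(build_command("cbert"))
```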

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reach out to [email protected] with any questions related to this code.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 license.
