gutenberg-dialog ·

Code for downloading and building your own version of the Gutenberg Dialog Dataset. Easily extendable with new languages. Try trained chatbots in various languages here: https://ricsinaruto.github.io/chatbot.html.

Download datasets

Download link	Number of utterances	Average utterance length	Number of dialogues	Average dialogue length
English	14 773 741	22.17	2 526 877	5.85
German	226 015	24.44	43 440	5.20
Dutch	129 471	24.26	23 541	5.50
Spanish	58 174	18.62	6 912	8.42
Italian	41 388	19.47	6 664	6.21
Hungarian	18 816	14.68	2 826	6.66
Portuguese	16 228	21.40	2 233	7.27

Download resources for The Gutenberg Dialogue Dataset paper

Download responses from GPT2 trainings here

Download data used in the paper here

Download trained models here

The gpt2_training_scripts folder contains code for running the trainings from the paper. Code adapted from here.

Features

🔀 Generate your own dataset by tuning parameters affecting the size-quality trade-off of the dataset
🚀 The modular interface makes it easy to extend the dataset to other languages
💾 You can easily exclude books manually when building the dataset

Setup

Run setup.py which installs required packages.

python setup.py

Usage

The main file should be called from the root of the repo. The command below runs the dataset building pipeline for the comma-separated languages given as argument. Currently English, German, Dutch, Spanish, Portuguese, Italian, and Hungarian are supported.

python code/main.py -l=en,de,nl,es,pt,it,hu -a

All settable arguments can bee seen below:

Pipeline steps

The -a flag controls whether to run the whole pipeline automatically. If -a is omitted a subset of steps have to be specified using flags (see help above). Once a step is finished its output can be used in subsequent steps and it only has run again if parameters or code related to that step is changed. All steps run separately for each language.

1. Download (-d)

Download books for given languages.

Note: if all books fail to download with the error "Could not download book", a likely cause is that the default mirror used by the gutenberg package has become inaccessible. In the event that this occurs, it is possible to use any of the alternate mirrors listed at https://www.gutenberg.org/MIRRORS.ALL via the GUTENBERG_MIRROR environment variable. For example:

export GUTENBERG_MIRROR="https://gutenberg.pglaf.org"
python code/main.py ...

2. Pre-filter (-f1)

Pre-filtering removes some old books and noise.

3. Extract (-e)

Dialogs are extracted from books. When extending the dataset to new languages (see section below), this is the step that can be modified, thus previous steps can be skipped once finished.

4. Post-filter (-f2)

A second filtering step removing some dialogs based on vocabulary.

5. Create dataset and manual filtering (-c)

Putting together the final dataset and splitting into train/dev/test data. The final step creates the author_and_title.txt file in the output directory containing all books (plus titles and authors) used to extract the final dataset. Users can manually copy lines from this file to banned_books.txt corresponding to books which should not be allowed in the dataset. In subsequent runs of any steps, books in this file will not be taken into account.

Extending to other languages

The code can be easily extended to process other languages. A file named <language code>.py has to be created in the languages folder. Here a class should be defined named the upper-case language code (e.g. En for English), with LANG or any of the other subclasses as parent. With self.cfg config parameters can be accessed. Inside this class the 3 functions below have to be defined. Please see it.py for an example.

Languages statistics

delimiters

This function should return a dictionary where the keys are potential delimiters. For each delimiter a function should be defined (values in dictionary), which takes as input a line and returns a number. This number can be for example the count of delimiters, a flag whether there is a delimiter in the line, etc. Usually a weighted count is advisable, depending on the importance of different delimiters. The values will be used to determine the delimiter that should be used in the respective book (passed to the function below), and for filtering books which contain a low amount of delimiters. en.py contains examples of multiple delimiters.

process_file(paragraph_list, delimiter)

This function should extract the dialogs from a book and append them to self.dialogs, which is a list of dialogs, and each dialog is a list of consecutive utterances. paragraph_list contains the book as a list of consecutive paragraphs. delimiter is the most common delimiter in this file which should be used to extract dialogs.

clean_line(line)

This function is used for post-processing dialogs (e.g. remove certain characters). It takes as input an utterance. Please note that nltk word tokenization is run automatically.

Authors

Richard Csaky (If you need any help with running the code: [email protected])

License

This project is licensed under the MIT License - see the LICENSE file for details.
Please include a link to this repo if you use any of the dataset or code in your work and consider citing the following paper:

@inproceedings{Csaky:2021,
    title = "The Gutenberg Dialogue Dataset",
    author = "Cs{\'a}ky, Rich{\'a}rd and Recski, G{\'a}bor",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics",
    month = apr,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.12752",
}

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
code		code
docs		docs
gpt2_trainings_scripts		gpt2_trainings_scripts
.DS_Store		.DS_Store
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

License

ricsinaruto/gutenberg-dialog

Folders and files

Latest commit

History

Repository files navigation

gutenberg-dialog ·

Download datasets

Download resources for The Gutenberg Dialogue Dataset paper

Download responses from GPT2 trainings here

Download data used in the paper here

Download trained models here

The gpt2_training_scripts folder contains code for running the trainings from the paper. Code adapted from here.

Features

Setup

Usage

Pipeline steps

1. Download (-d)

2. Pre-filter (-f1)

3. Extract (-e)

4. Post-filter (-f2)

5. Create dataset and manual filtering (-c)

Extending to other languages

delimiters

process_file(paragraph_list, delimiter)

clean_line(line)

Authors

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages