# Seq2seqChatbots

This repository contains the code I have written for the experiments described in this paper. I registered my own problems, hparams and models with the tensor2tensor library in order to try out different datasets with the Transformer model for training dialog agents. The folders in the repository contain the following content:

- `docs`: LaTeX files and figures needed to generate my paper. Also check my research proposal for a detailed description of my current research interests.
- `t2t_csaky`: This folder contains all the code that I have written; a more detailed description can be found below.
- `train_dir`: Inference outputs from the various trainings that I have run.
- `wiki_images`: Images used for the wiki, where I write about more than 100 publications related to chatbots.

## Quick Guide

First, install all required packages into your Python environment:

```
pip install -r requirements.txt
```
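
Optionally, you can do this inside an isolated virtual environment (a common setup, not something the repository requires):

```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```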

## Data

To download and preprocess the data and generate source and target pairs, run the following command from this directory:

```
t2t-datagen --t2t_usr_dir=t2t_csaky --data_dir=$Path-to-data-dir --problem=$Name-of-problem
```
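
For example, a filled-in invocation for the Cornell problem might look like this (the data directory path is illustrative; the placeholders are explained below):

```
t2t-datagen --t2t_usr_dir=t2t_csaky --data_dir=~/t2t_data --problem=cornell_chatbot_basic
```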

Where:

- `$Path-to-data-dir`: The path to the directory where you want to generate the source and target pairs and other data. The dataset will be downloaded one level above this directory, into a `raw_data` folder.
- `$Name-of-problem`: The name of a registered problem that tensor2tensor needs. Currently there are four registered problems:
  - `opensubtitles_chatbot`: This problem can be used to work with the OpenSubtitles dataset. Since there are several versions of this dataset, you can specify the year of the version that you want to download with the `dataset_version` property inside the class.
  - `cornell_chatbot_basic`: This problem implements the Cornell Movie-Dialog Corpus.
  - `cornell_chatbot_separate_names`: This problem uses the same Cornell corpus, but the names of the speakers and addressees of each utterance are appended, resulting in source utterances like the one below. The size of the vocabulary containing these names can be set through the `targeted_name_vocab_size` property inside the `CornellChatbotSeparateNames` class.

    `BIANCA_m0 what good stuff ? CAMERON_m0`

  - `character_chatbot`: This is a general character-based problem that works with any dataset. Before using it, the `.txt` files generated by any of the problems above have to be placed inside the data directory; after that, this problem can be used to generate character-based tensor2tensor data files.
- Further properties that affect data generation and can be set in each of the classes (a sketch follows this list):
  - `targeted_vocab_size`: Size of the vocabulary that we want to use for the problem. Words outside this vocabulary will be replaced with an out-of-vocabulary token.
  - `targeted_dataset_size`: Number of utterance pairs to use, if we don't want to use the full dataset.
  - `dataset_split`: Specifies a train-validation-test split for the problem.
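
As a rough illustration of how these properties are overridden, below is a minimal sketch of registering a new problem. The module path, base-class name and the exact `dataset_split` format are assumptions made for the example; check the actual classes in `t2t_csaky` for the real definitions:

```python
from tensor2tensor.utils import registry

# Assumed import path; adjust to wherever the problem classes live in t2t_csaky.
from t2t_csaky.problems import cornell_chatbots


@registry.register_problem
class CornellChatbotSmall(cornell_chatbots.CornellChatbotBasic):  # assumed base class
  """Cornell problem with a smaller vocabulary and a truncated dataset."""

  @property
  def targeted_vocab_size(self):
    return 16384  # words outside the top 16k become the out-of-vocabulary token

  @property
  def targeted_dataset_size(self):
    return 100000  # only use the first 100k utterance pairs

  @property
  def dataset_split(self):
    # Assumed format: relative sizes of the train/val/test portions.
    return {"train": 80, "val": 10, "test": 10}
```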

## Training

Once you have generated the data files, you can train any model offered by tensor2tensor using this command:

```
t2t-trainer \
  --t2t_usr_dir=t2t_csaky \
  --generate_data=False \
  --data_dir=$Path-to-data-dir \
  --problems=$Name-of-problem \
  --model=$Name-of-model \
  --hparams_set=$Name-of-hparams \
  --output_dir=$Path-to-train-dir \
  --train_steps=$Number-of-training-steps
```
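
As a concrete example, a Transformer run on the Cornell problem could look like the following (the paths, hparams set and step count are illustrative; `transformer_base` is one of the official tensor2tensor hparams sets, and the placeholders are explained below):

```
t2t-trainer \
  --t2t_usr_dir=t2t_csaky \
  --generate_data=False \
  --data_dir=~/t2t_data \
  --problems=cornell_chatbot_basic \
  --model=transformer \
  --hparams_set=transformer_base \
  --output_dir=~/t2t_train/cornell \
  --train_steps=250000
```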

Where:

- `$Name-of-model`: The name of a registered model, e.g. `transformer`, which I used for most of my trainings. I also subclassed some models and made my own registrations with small modifications:
  - `roulette_transformer`: The original Transformer model with a modified beam search, where roulette-wheel selection can be used to select among the top beams instead of argmax.
  - `own_hparams_seq2seq`: A small modification of the LSTM-based seq2seq model so that my own hparams can be used entirely. The hparams set named `chatbot_lstm_hparams` has to be used with this model.
- `$Name-of-hparams`: The name of an hparams set. You can use the official tensor2tensor hparams sets, or my own definitions found here, which include different batch-size and dropout variations (a registration sketch follows this list).
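
To give a flavor of what such a definition looks like, here is a minimal sketch of registering a batch-size/dropout variation on top of the official Transformer defaults (the function name and the values are made up for this example; the repository's own sets differ):

```python
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_chatbot_dropout_example():
  """Transformer base hparams with a smaller batch and more dropout."""
  hparams = transformer.transformer_base()  # start from the official defaults
  hparams.batch_size = 2048
  hparams.layer_prepostprocess_dropout = 0.3
  return hparams
```

Once registered through `--t2t_usr_dir`, it can be selected with `--hparams_set=transformer_chatbot_dropout_example`.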

## Decoding

You can decode from the trained models interactively using the command below. For all four trainings that I ran, I also uploaded the checkpoint files here, so you can try them out without having to train. Just copy the checkpoint files into your train_dir folder and pass that folder to the `--output_dir` flag.

```
t2t-decoder \
  --t2t_usr_dir=t2t_csaky \
  --data_dir=$Path-to-data-dir \
  --problems=$Name-of-problem \
  --model=transformer \
  --hparams_set=$Name-of-hparams \
  --output_dir=$Path-to-train-dir \
  --decode_interactive
```
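
`t2t-decoder` also supports decoding from a file instead of the interactive prompt, which is handy for batch inference (the file names here are illustrative):

```
t2t-decoder \
  --t2t_usr_dir=t2t_csaky \
  --data_dir=$Path-to-data-dir \
  --problems=$Name-of-problem \
  --model=transformer \
  --hparams_set=$Name-of-hparams \
  --output_dir=$Path-to-train-dir \
  --decode_from_file=inputs.txt \
  --decode_to_file=outputs.txt
```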

## Sample conversations from the various trainings

S2S is a baseline seq2seq model from this paper; Cornell is the Transformer model trained on Cornell data; Cornell S is similar, but trained with speaker-addressee annotations; OpenSubtitles is the Transformer trained on OpenSubtitles data; and OpenSubtitles F is the previous training fine-tuned (further trained) on the speaker-annotated Cornell data.

If you need any help running the code, or if you want the trained model files, just contact me via e-mail ([email protected]) and I will make them available.