Multimodal Dialogue State Tracking

This is the official code for the paper Multimodal Dialogue State Tracking (NAACL 2022, Oral).

Authors: Hung Le, Nancy F. Chen, Steven C.H. Hoi

Contents: Overview, Installation, Dataset, Model, Processes, Citation, License

Overview


Multimodal Dialogue State Tracking (MM-DST): We propose to extend traditional DST from unimodal to multimodal settings. Compared to traditional DST, MM-DST defines dialogue states, consisting of slots and slot values, for visual objects that are mentioned in the dialogues.


Video-Dialogue Transformer Network (VDTN): For the MM-DST task, we propose a strong baseline, VDTN. The model has four key components: (a) Visual Perception and Encoder, (b) Dialogue Encoder, (c) Transformer Network, (d1) State Decoder, and (d2) Visual Decoder.

Installation

The code was created with PyTorch 1.9.0. Please follow the official PyTorch installation instructions to install the appropriate libraries.
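As a minimal sketch, assuming a pip-based environment (pick the CUDA build that matches your machine per the official PyTorch instructions; the torchvision version below is an assumption, chosen as the release paired with torch 1.9.0):

```bash
# Install the PyTorch version the code was developed with.
pip install torch==1.9.0 torchvision==0.10.0
```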

Dataset

Note: We will release the URL to download the DVD-DST benchmark, including the dialogue data and the videos with extracted bounding box and ResNeXt features.

DVD-DST

We propose a new dataset, DVD-DST, which was developed based on DVD, our prior work on response prediction for synthetic video-grounded dialogues. Compared to DVD, DVD-DST includes annotations for dialogue state tracking and ground-truth bounding box labels from an extended split of the CATER video dataset.

Bounding Box Feature Extraction

To extract bounding boxes of visual objects, we adopt the learned Faster R-CNN model published here. The model was finetuned to predict object bounding boxes and object classes. The object classes are derived from object appearance, based on four attributes: size, color, material, and shape. In total, there are 193 object classes.

ResNeXt Feature Extraction

For segment embeddings, we adopt the ResNeXt-101 model pretrained here, which was finetuned on the Kinetics dataset.

Download all files and unzip them into the data folder. This should result in two folders: dvd for the dialogue data and cater for the video data.
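A minimal sketch of the expected setup; the archive names below are placeholders until the download URL is announced:

```bash
# Placeholder archive names; replace with the released file names once available.
mkdir -p data
unzip dvd_dst_dialogues.zip -d data/   # should produce data/dvd   (dialogue data)
unzip dvd_dst_videos.zip -d data/      # should produce data/cater (video features)
```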

Model

We will release our pretrained VDTN model checkpoint which was finetuned on the DVD-DST benchmark. When evaluated on the test split of the DVD-DST benchmark, this model should achieve the following results:

| Obj Identity F1 | Obj Slot F1 | Obj State F1 | Joint Acc | Acc (IoU@0.5) | Acc (IoU@0.7) |
|---|---|---|---|---|---|
| 84.5 | 72.8 | 60.4 | 28.0 | 15.3 | 13.1 |

Processes

Preprocessing Data

We created scripts/run_preprocess.sh to preprocess the data of the DVD-DST benchmark. The preprocessing steps extract dialogue state annotations from the DVD-style annotations and connect the object IDs (classes) from the bounding box features to create dialogue state labels.

You can directly run this file by configuring the following parameters:

| Parameters | Description | Example Values |
|---|---|---|
| results_dir | Path to save preprocessed data | data/preprocessed |
| config | Path to the preprocessing configuration file | configs/preprocess_dial_config.json |

To preprocess a different data split, modify the fields dials_dir and video_dir in the configuration file, e.g. configs/preprocess_dial_config.json. For instance, the current configuration file preprocesses the validation split of the DVD-DST benchmark.
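A minimal sketch of running the preprocessing step, assuming the two parameters above are set inside the script (whether they are script variables or command-line flags depends on scripts/run_preprocess.sh itself):

```bash
# Assumed usage: edit the documented parameters in the script, then run it.
#   results_dir=data/preprocessed
#   config=configs/preprocess_dial_config.json
bash scripts/run_preprocess.sh
```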

Training Models

We created scripts/run_train.sh to train a VDTN model on the DVD-DST benchmark. You can directly run this file by configuring the following parameters:

| Parameters | Description | Example Values |
|---|---|---|
| model_config | Path to the model configuration file | configs/vdtn_model_config.json |
| training_config | Path to the training configuration file | configs/training_vdtn_config.json |
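A minimal sketch of launching training, assuming the two parameters above are configured inside the script:

```bash
# Assumed usage: point the script to the model and training configuration files, then run it.
#   model_config=configs/vdtn_model_config.json
#   training_config=configs/training_vdtn_config.json
bash scripts/run_train.sh
```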

For more fine-grained parameters, please refer to the corresponding configuration files. Some important parameters are defined below:

| Parameters | Description | Example Values |
|---|---|---|
| prior_state | Whether to use the dialogue states of previous turns as part of the input sequence | 1: use prior state; 0: do not use |
| max_turns | Maximum number of past dialogue turns to use; usually set to a small value (e.g. 1) if prior_state is set to 1 | 0 to 10 (0: do not use dialogue history; max 10 turns in the DVD universe) |
| frame_rate | Sampling rate of video features: one feature vector is selected per frame_rate frames; the same applies to segment-based features (e.g. ResNeXt features) | 1 to 300 (1: sample all possible frames; max 300 frames in the CATER universe) |
| max_objects | Maximum number of object-based features (bounding boxes) per frame | 1, 2, 3, ... |
| mask_bb | Whether to randomly mask bounding box features for self-supervised learning tasks | 1: mask; 0: do not mask |
| mask_resnext | Whether to randomly mask ResNeXt features for self-supervised learning tasks | 1: mask; 0: do not mask |
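As an illustration only, a hedged sketch of adjusting these fields, assuming they are top-level keys of configs/training_vdtn_config.json (the actual JSON structure may nest them differently); requires the jq tool:

```bash
# Assumption: the documented parameters are top-level keys of the training config JSON.
jq '.prior_state = 1 | .max_turns = 1 | .frame_rate = 12
    | .max_objects = 10 | .mask_bb = 1 | .mask_resnext = 1' \
  configs/training_vdtn_config.json > /tmp/training_vdtn_config.json \
  && mv /tmp/training_vdtn_config.json configs/training_vdtn_config.json
```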

Running the training script will initialize a VDTN model, load the preprocessed data and pre-extracted features, and start the training process. Model checkpoints are saved whenever the validation loss improves. All checkpoints and loss logs (log.csv) are saved to a folder (e.g. exps/mmdst_dvd_vdtn/) specified by the checkpoints_path parameter of the training configurations.
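To keep an eye on a running job, a small sketch assuming the example checkpoints_path of exps/mmdst_dvd_vdtn/:

```bash
# Inspect the loss log and list saved checkpoints (paths assume the example checkpoints_path above).
tail -n 5 exps/mmdst_dvd_vdtn/log.csv
ls exps/mmdst_dvd_vdtn/*.pth
```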

Generating Dialogue States

We created scripts/run_test.sh to generate dialogue states using a learned VDTN model on the DVD-DST benchmark. You can directly run this file by configuring the following parameters:

| Parameters | Description | Example Values |
|---|---|---|
| model_path | Path to the saved model checkpoint file | exps/mmdst_dvd_vdtn/model_checkpoint.pth |
| inference_config | Path to the generation/inference configuration file | configs/inference_vdtn_config.json |
| inference_style | Decoding style used to generate tokens of dialogue state sequences | greedy: greedy decoding; beam_search: beam search decoding |
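A minimal sketch of generating states with a trained checkpoint, assuming the parameters above are configured inside the script:

```bash
# Assumed usage: set the documented parameters in the script, then run it.
#   model_path=exps/mmdst_dvd_vdtn/model_checkpoint.pth
#   inference_config=configs/inference_vdtn_config.json
#   inference_style=greedy    # or beam_search
bash scripts/run_test.sh
```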

For more fine-grained parameters, e.g. state_maxlen (the maximum number of tokens in the output sequences), please refer to the corresponding configuration file.

Evaluating Dialogue States

To evaluate generated dialogue states, we adopt automatic metrics from conventional unimodal DST and include additional metrics such as Object Identity F1, Slot F1, IoU@k (for time-based slots), etc.

We created compute_acc.py to calculate these metrics. To run this file, specify the following parameters:

| Parameters | Description | Example Values |
|---|---|---|
| results | Path to the generated dialogue states | exps/mmdst_dvd_vdtn/all_preds.json |
| frame_rate | The frame_rate used during training/test time | 12 |
| by_turns | Whether to report automatic metrics by turn position | include --by_turns in the command to output turn-specific results |
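A minimal sketch of scoring the generated states, assuming the parameters above are exposed as command-line flags of compute_acc.py (only --by_turns is confirmed as a flag in the table above; the other flag names are assumptions matching the parameter names):

```bash
# --results and --frame_rate are assumed flag names matching the documented parameters.
python compute_acc.py \
  --results exps/mmdst_dvd_vdtn/all_preds.json \
  --frame_rate 12 \
  --by_turns
```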

Citation

If you find the paper or the source code useful for your projects, please cite the following BibTeX:

@article{le2022multimodal,
  title={Multimodal Dialogue State Tracking},
  author={Le, Hung and Chen, Nancy F and Hoi, Steven CH},
  journal={arXiv preprint arXiv:2206.07898},
  year={2022}
}

License

The code is released under the MIT License - see LICENSE.txt for details.

This code is developed from other open-source projects, including our prior work DVD, CATER, and related work on object tracking. We thank the original contributors of these works for open-sourcing their valuable source code.