
lancopku/Prime


News

2019/12/10: We have changed the model name from MUSE (parallel MUlti-Scale attEntion) to PRIME (PaRallel Intersected Multi-scale AttEntion).

Introduction

Core Code:

Relevant links:

About the paper:

TL;DR: A simple module consistently outperforms self-attention and the Transformer on the main NMT datasets, with SoTA performance.

We ask three questions:

  • Is attention alone good enough?
  • Is parallel representation learning applicable to sequence data and tasks?
  • How can we design a module that combines the inductive biases of both convolution and self-attention?

We find that stand-alone self-attention has shortcomings, so we present a new module that maps the input to a hidden space and performs self-attention, convolution, and a nonlinearity in parallel. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on the main NMT tasks under the standard setting.
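
To make the idea concrete, here is a minimal PyTorch sketch of such a parallel block. It is for illustration only: it is not the repository's actual implementation, and the class and parameter names are ours, not Prime's API.

import torch
import torch.nn as nn

class ParallelMultiScaleBlock(nn.Module):
    """Illustrative sketch only: a shared projection feeds three parallel
    branches (self-attention, depthwise convolution, pointwise nonlinearity)
    whose outputs are summed, projected back, and added to a residual."""

    def __init__(self, d_model=512, n_heads=8, kernel_size=3):
        super().__init__()
        # shared projection into the hidden space used by all branches
        self.shared_proj = nn.Linear(d_model, d_model)
        # branch 1: global context via self-attention
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        # branch 2: local context via a depthwise convolution
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        # branch 3: pointwise nonlinearity
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        self.out_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (seq_len, batch, d_model), the usual fairseq layout
        h = self.shared_proj(x)
        attn_out, _ = self.self_attn(h, h, h)
        conv_out = self.conv(h.permute(1, 2, 0)).permute(2, 0, 1)
        ffn_out = self.ffn(h)
        # combine the three branches and add a residual connection
        return self.norm(x + self.out_proj(attn_out + conv_out + ffn_out))

block = ParallelMultiScaleBlock()
x = torch.randn(20, 2, 512)   # (seq_len, batch, d_model)
print(block(x).shape)         # torch.Size([20, 2, 512])

The point of the sketch is the shared projection: all three branches read the same projected hidden states, which is what lets convolution and self-attention live in one module rather than in separate layers.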

Key features:

  • Designs a multi-branch schema that evolves self-attention and, through the proposed shared projection, is the first to successfully combine convolution and self-attention in one module for sequence tasks.
  • Achieves SoTA on three main translation datasets: WMT14 En-Fr, WMT14 En-De, and IWSLT14 De-En.
  • Learns sequence representations in parallel and thus has potential for acceleration.

Results:

  1. Outperforms previous models on large NMT datasets, and scales to small datasets and the base-model setting.
  2. The shared projection is key to combining convolution and self-attention; the model generates better long sequences and has potential for acceleration.
Task            Model size   Test BLEU
IWSLT14 De-En   Base         36.3
WMT14 En-De     Large        29.9
WMT14 En-Fr     Large        43.5

Requirements and Installation

  • PyTorch version >= 1.0.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • torch==1.3.1 with CUDA 10.0

Installing from source

To install from source and develop locally:

pip install --editable . --user
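
If you are starting from scratch, a typical sequence (assuming you clone this repository from its default GitHub URL) would be:

git clone https://github.com/lancopku/Prime.git
cd Prime
pip install --editable . --user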

We provide pre-trained models, along with detailed example training and evaluation instructions, in examples/parallel_intersected_multi-scale_attention(Prime)/README.md.
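
Because the code is based on fairseq-0.6.2, evaluation presumably follows the standard fairseq generation workflow. The command below is only an illustrative sketch with placeholder paths (the binarized data directory and checkpoint name are assumptions); the exact commands are in the example README above.

# placeholder data path and checkpoint; see the example README for the real ones
python generate.py data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe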

Citation

Please cite as:

@article{zhao2019muse,
  title={MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning},
  author={Zhao, Guangxiang and Sun, Xu and Xu, Jingjing and Zhang, Zhiyuan and Luo, Liangchen},
  journal={arXiv preprint arXiv:1911.09483},
  year={2019}
}

Notes

The code is based on fairseq-0.6.2.
