MaskFormer

Per-Pixel Classification is Not All You Need for Semantic Segmentation

Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
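The abstract's core idea — N binary masks, each paired with one class-probability vector, combined into a per-pixel semantic map at inference — can be sketched in plain Python. This is a minimal illustration of the mask-classification inference step, not code from the MaskFormer repository; the function name and the toy 2x2 example are made up for demonstration, and probabilities are assumed already normalized.

```python
def semantic_inference(cls_probs, mask_probs):
    """Combine per-mask class probabilities with per-pixel mask
    probabilities into a per-pixel semantic label map.

    cls_probs:  N x C list -- probability that mask n has class c.
    mask_probs: N x H x W list -- probability that pixel (y, x) belongs to mask n.
    Returns an H x W list of argmax class indices.
    """
    n_masks = len(cls_probs)
    n_classes = len(cls_probs[0])
    h, w = len(mask_probs[0]), len(mask_probs[0][0])
    seg = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # score[c] = sum over masks of P(class c | mask n) * P(pixel in mask n)
            scores = [
                sum(cls_probs[n][c] * mask_probs[n][y][x] for n in range(n_masks))
                for c in range(n_classes)
            ]
            seg[y][x] = max(range(n_classes), key=scores.__getitem__)
    return seg

# Two mask queries, two classes, a 2x2 image.
cls_probs = [[0.9, 0.1],   # mask 0 is almost surely class 0
             [0.2, 0.8]]   # mask 1 is almost surely class 1
mask_probs = [
    [[1.0, 0.0], [0.0, 0.0]],  # mask 0 covers only the top-left pixel
    [[0.0, 1.0], [1.0, 1.0]],  # mask 1 covers the other three pixels
]
print(semantic_inference(cls_probs, mask_probs))  # [[0, 1], [1, 1]]
```

Because the same set of (mask, class) pairs can instead be kept separate per mask, the identical model output also serves instance-level and panoptic segmentation — this is the unification the abstract describes.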

Introduction

MaskFormer requires the COCO and COCO-panoptic datasets for training and evaluation. Download and extract them into the COCO dataset path. The directory structure should look like this:

mmdetection
├── mmdet
├── tools
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── panoptic_train2017.json
│   │   │   ├── panoptic_train2017
│   │   │   ├── panoptic_val2017.json
│   │   │   ├── panoptic_val2017
│   │   ├── train2017
│   │   ├── val2017
│   │   ├── test2017
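The panoptic annotations in the tree above can be fetched from the official COCO download links. A sketch of the steps, assuming you run it from the mmdetection root and the image sets (`train2017`, `val2017`, `test2017`) are already in place:

```shell
cd data/coco
# Official COCO panoptic annotation archive.
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip
unzip panoptic_annotations_trainval2017.zip  # creates annotations/
# The archive also ships the PNG folders as nested zips; extract them too.
cd annotations
unzip panoptic_train2017.zip
unzip panoptic_val2017.zip
```

After this, `annotations/` should contain both `panoptic_*2017.json` files and the matching `panoptic_*2017` PNG directories shown above.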

Results and Models

| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st | Config | Download |
| :------: | :---: | :-----: | :------: | :------------: | :-: | :-: | :-: | :---: | :---: | :---: | :---: | :---: | :---: | :----: | :------: |
| R-50 | pytorch | 75e | 16.2 | - | 46.757 | 80.297 | 57.176 | 50.829 | 81.125 | 61.798 | 40.610 | 79.048 | 50.199 | config | model \| log |
| Swin-L | pytorch | 300e | 27.2 | - | 53.249 | 81.704 | 64.231 | 58.798 | 82.923 | 70.282 | 44.874 | 79.863 | 55.097 | config | model \| log |
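The models above can be trained or evaluated with mmdetection's standard entry points. The config filename below is an assumption — check `configs/maskformer/` in your checkout for the exact name, and substitute the checkpoint path you downloaded:

```shell
# Single-GPU training from the mmdetection root (assumed config name):
python tools/train.py configs/maskformer/maskformer_r50_ms-16xb1-75e_coco.py

# Evaluation with a downloaded checkpoint (hypothetical checkpoint filename):
python tools/test.py configs/maskformer/maskformer_r50_ms-16xb1-75e_coco.py \
    maskformer_r50_checkpoint.pth
```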

Note

  1. The R-50 results are reported in Table XI of the paper Masked-attention Mask Transformer for Universal Image Segmentation.
  2. The models were trained with mmdet 2.x and have been converted for mmdet 3.x.

Citation

@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}