Swin Transformer for Object Detection


This directory contains the configs and results of Swin Transformer. Most configs and results are based on the official repository.

When training new models, please consider using mmdet's own configs instead.

Results and Models


| Backbone | Pretrain | Lr schd | box AP | config | model |
| --- | --- | --- | --- | --- | --- |
| Swin-T | ImageNet-1K | 1x | 43.7 | config | github |

Mask R-CNN

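| Backbone | Pretrain | Lr schd | box AP | mask AP | #params | FLOPs | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | ImageNet-1K | 1x | 43.7 | 39.8 | 48M | 267G | config | github/baidu | github/baidu |
| Swin-T | ImageNet-1K | 3x | 46.0 | 41.6 | 48M | 267G | config | github/baidu | github/baidu |
| Swin-S | ImageNet-1K | 3x | 48.5 | 43.3 | 69M | 359G | config | github/baidu | github/baidu |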

Cascade Mask R-CNN

| Backbone | Pretrain | Lr schd | box AP | mask AP | #params | FLOPs | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | ImageNet-1K | 1x | 48.1 | 41.7 | 86M | 745G | config | github/baidu | github/baidu |
| Swin-T | ImageNet-1K | 3x | 50.4 | 43.7 | 86M | 745G | config | github/baidu | github/baidu |
| Swin-S | ImageNet-1K | 3x | 51.9 | 45.0 | 107M | 838G | config | github/baidu | github/baidu |
| Swin-B | ImageNet-1K | 3x | 51.9 | 45.0 | 145M | 982G | config | github/baidu | github/baidu |




# single-gpu testing
python tools/ <CONFIG_FILE> <DET_CHECKPOINT_FILE> --eval bbox segm

# multi-gpu testing
tools/ <CONFIG_FILE> <DET_CHECKPOINT_FILE> <GPU_NUM> --eval bbox segm


To train a detector with pre-trained models, run:

# single-gpu training
python tools/ <CONFIG_FILE> --cfg-options model.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
tools/ <CONFIG_FILE> <GPU_NUM> --cfg-options model.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
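
The `--cfg-options` arguments above use dotted keys to reach into the nested config. As a rough illustration of that idea only (this is not MMCV's implementation, and `apply_override` is a hypothetical helper), a dotted key can be mapped onto a nested dictionary like this:

```python
from typing import Any, Dict


def apply_override(cfg: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set cfg[a][b][c] = value for a dotted key 'a.b.c', creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        # Walk down one level, creating an empty dict if the key is missing.
        node = node.setdefault(name, {})
    node[leaf] = value


# Toy config standing in for a loaded <CONFIG_FILE>.
cfg: Dict[str, Any] = {"model": {"backbone": {}}}

# Equivalent in spirit to:
#   --cfg-options model.pretrained=<PRETRAIN_MODEL> model.backbone.use_checkpoint=True
apply_override(cfg, "model.pretrained", "<PRETRAIN_MODEL>")
apply_override(cfg, "model.backbone.use_checkpoint", True)
```

After these two calls, `cfg["model"]` carries both the `pretrained` entry and the `use_checkpoint` flag under `backbone`, which is the effect the command-line overrides are meant to have on the real config.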

For example, to train a Cascade Mask R-CNN model with a Swin-T backbone on 8 GPUs, run:

tools/ configs/swin_original/ 8 --cfg-options model.pretrained=<PRETRAIN_MODEL>

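Note: use_checkpoint enables gradient checkpointing in the backbone, which saves GPU memory at the cost of extra computation during the backward pass. Please refer to this page for more details.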

Mixed Precision Training

The current configs use mixed precision training via MMCV by default. Please install PyTorch >= 1.6.0 to use torch.cuda.amp.
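
Before launching a run, it can help to confirm that the installed PyTorch meets the 1.6.0 requirement. The following is a minimal sketch using only the standard library; `parse_version` and `supports_native_amp` are hypothetical helper names, and the parsing deliberately ignores local suffixes such as `+cu110`.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '1.7.1+cu110' -> (1, 7, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component is treated as 0, so '1.6' becomes (1, 6, 0).
    return tuple(int(part) for part in match.groups(default="0"))


def supports_native_amp(version: str) -> bool:
    """Return True if the version is at least 1.6.0, where torch.cuda.amp is available."""
    return parse_version(version) >= (1, 6, 0)
```

In practice you would call `supports_native_amp(torch.__version__)` after `import torch`. Comparing integer tuples rather than raw strings avoids the pitfall where `"1.10.0"` sorts before `"1.6.0"` lexicographically.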

If you observe a performance difference compared with apex (used by the original authors), please raise an issue. Otherwise, we will remove the apex-specific code.

Using apex

To install apex, run:

git clone
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Modify configs with the following code:

runner = dict(type='EpochBasedRunnerAmp', max_epochs=36)
fp16 = None
optimizer_config = dict(

Citing Swin Transformer

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Other Links

Image Classification: See Swin Transformer for Image Classification.

Semantic Segmentation: See Swin Transformer for Semantic Segmentation.