DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

News | Main Results | Usage | Citation | Acknowledgement

This is the official repo for the papers:

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting

News

2024.04.25 Update models finetuned on BOVText and DSText video datasets.

2023.06.02 Update the pre-trained and fine-tuned Chinese scene text spotting model (78.3% 1-NED on ICDAR 2019 ReCTS).

2023.05.31 The extension paper (DeepSolo++) has been submitted to arXiv. The code and models will be released soon.

2023.02.28 DeepSolo is accepted by CVPR 2023. πŸŽ‰πŸŽ‰


Relevant Projects:

✨ Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation | Code

✨ GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching | Code

DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer | Code

Other applications of ViTAE include: ViTPose | Remote Sensing | Matting | VSA | Video Object Segmentation

Main Results

Total-Text

| Backbone | External Data | Det-P | Det-R | Det-F1 | E2E-None | E2E-Full | Weights |
|---|---|---|---|---|---|---|---|
| Res-50 | Synth150K | 93.9 | 82.1 | 87.6 | 78.8 | 86.2 | OneDrive |
| Res-50 | Synth150K+MLT17+IC13+IC15 | 93.1 | 82.1 | 87.3 | 79.7 | 87.0 | OneDrive |
| Res-50 | Synth150K+MLT17+IC13+IC15+TextOCR | 93.2 | 84.6 | 88.7 | $\underline{\text{82.5}}$ | $\underline{\text{88.7}}$ | OneDrive |
| Res-101 | Synth150K+MLT17+IC13+IC15 | 93.2 | 83.5 | 88.1 | 80.1 | 87.1 | OneDrive |
| Swin-T | Synth150K+MLT17+IC13+IC15 | 92.8 | 83.5 | 87.9 | 79.7 | 87.1 | OneDrive |
| Swin-S | Synth150K+MLT17+IC13+IC15 | 93.7 | 84.2 | 88.7 | 81.3 | 87.8 | OneDrive |
| ViTAEv2-S | Synth150K+MLT17+IC13+IC15 | 92.6 | 85.5 | $\underline{\text{88.9}}$ | 81.8 | 88.4 | OneDrive |
| ViTAEv2-S | Synth150K+MLT17+IC13+IC15+TextOCR | 92.9 | 87.4 | 90.0 | 83.6 | 89.6 | OneDrive |

ICDAR 2015 (IC15)

| Backbone | External Data | Det-P | Det-R | Det-F1 | E2E-S | E2E-W | E2E-G | Weights |
|---|---|---|---|---|---|---|---|---|
| Res-50 | Synth150K+Total-Text+MLT17+IC13 | 92.8 | 87.4 | 90.0 | 86.8 | 81.9 | 76.9 | OneDrive |
| Res-50 | Synth150K+Total-Text+MLT17+IC13+TextOCR | 92.5 | 87.2 | 89.8 | $\underline{\text{88.0}}$ | $\underline{\text{83.5}}$ | $\underline{\text{79.1}}$ | OneDrive |
| ViTAEv2-S | Synth150K+Total-Text+MLT17+IC13 | 93.7 | 87.3 | 90.4 | 87.5 | 82.8 | 77.7 | OneDrive |
| ViTAEv2-S | Synth150K+Total-Text+MLT17+IC13+TextOCR | 92.4 | 87.9 | $\underline{\text{90.1}}$ | 88.1 | 83.9 | 79.5 | OneDrive |

CTW1500

| Backbone | External Data | Det-P | Det-R | Det-F1 | E2E-None | E2E-Full | Weights |
|---|---|---|---|---|---|---|---|
| Res-50 | Synth150K+Total-Text+MLT17+IC13+IC15 | 93.2 | 85.0 | 88.9 | 64.2 | 81.4 | OneDrive |

ICDAR 2019 ReCTS

| Backbone | External Data | Det-P | Det-R | Det-H | 1-NED | Weights |
|---|---|---|---|---|---|---|
| Res-50 | SynChinese130K+ArT+LSVT | 92.6 | 89.0 | 90.7 | 78.3 | OneDrive |
| ViTAEv2-S | SynChinese130K+ArT+LSVT | 92.6 | 89.9 | 91.2 | 79.6 | OneDrive |

Pre-trained Models for Total-Text & ICDAR 2015

| Backbone | Training Data | Weights |
|---|---|---|
| Res-50 | Synth150K+Total-Text | OneDrive |
| Res-50 | Synth150K+Total-Text+MLT17+IC13+IC15 | OneDrive |
| Res-50 | Synth150K+Total-Text+MLT17+IC13+IC15+TextOCR | OneDrive |
| Res-101 | Synth150K+Total-Text+MLT17+IC13+IC15 | OneDrive |
| Swin-T | Synth150K+Total-Text+MLT17+IC13+IC15 | OneDrive |
| Swin-S | Synth150K+Total-Text+MLT17+IC13+IC15 | OneDrive |
| ViTAEv2-S | Synth150K+Total-Text+MLT17+IC13+IC15 | OneDrive |
| ViTAEv2-S | Synth150K+Total-Text+MLT17+IC13+IC15+TextOCR | OneDrive |

Pre-trained Model for CTW1500

| Backbone | Training Data | Weights |
|---|---|---|
| Res-50 | Synth150K+Total-Text+MLT17+IC13+IC15 | OneDrive |

Pre-trained Model for ReCTS

| Backbone | Training Data | Weights |
|---|---|---|
| Res-50 | SynChinese130K+ArT+LSVT+ReCTS | OneDrive |
| ViTAEv2-S | SynChinese130K+ArT+LSVT+ReCTS | OneDrive |

For Video Datasets

Model finetuned on BOVText:

| Backbone | Config | External Data | Weights |
|---|---|---|---|
| Res-50 | NUM_QUERIES: 100, NUM_POINTS: 25, VOC_SIZE: 5462 | SynChinese130K+ArT+LSVT+ReCTS | OneDrive |

Model finetuned on DSText:

| Backbone | Config | External Data | Weights |
|---|---|---|---|
| Res-50 | NUM_QUERIES: 300, NUM_POINTS: 25, VOC_SIZE: 37 | Synth150K+Total-Text+MLT17+IC13+IC15+TextOCR | OneDrive |

Pre-trained Model for DSText

| Backbone | Config | Training Data | Weights |
|---|---|---|---|
| Res-50 | NUM_QUERIES: 300, NUM_POINTS: 25, VOC_SIZE: 37 | Synth150K+Total-Text+MLT17+IC13+IC15+TextOCR | OneDrive |

Usage

  • Installation

Python 3.8 + PyTorch 1.9.0 + CUDA 11.1 + Detectron2 (v0.6)

```bash
git clone https://github.com/ViTAE-Transformer/DeepSolo.git
cd DeepSolo
conda create -n deepsolo python=3.8 -y
conda activate deepsolo
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html
python setup.py build develop
```
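
If the environment is set up correctly, a quick import check should report the versions listed above. A minimal sketch (the exact version strings depend on the wheels that were installed):

```python
# Minimal environment check: confirms PyTorch sees CUDA and Detectron2 imports.
# Expected versions follow the setup above, but may differ slightly.
import torch
import detectron2

print("torch:", torch.__version__)            # e.g. 1.9.0+cu111
print("cuda available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)  # e.g. 0.6
```
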
  • Preparation

Datasets

[SynthText150K (CurvedSynText150K)] images | annotations(Part1) | annotations(Part2)

[MLT] images | annotations

[ICDAR2013] images | annotations

[ICDAR2015] images | annotations

[Total-Text] images | annotations

[CTW1500] images | annotations

[TextOCR] images | annotations

[Inverse-Text] images | annotations

[SynChinese130K] images | annotations

[ArT] images | annotations

[LSVT] images | annotations

[ReCTS] images | annotations

[Evaluation ground-truth] Link

Some image files need to be renamed. Organize them as follows (lexicon files are not listed here):

```
|- ./datasets
   |- syntext1
   |  |- train_images
   |  └  annotations
   |       |- train_37voc.json
   |       └  train_96voc.json
   |- syntext2
   |  |- train_images
   |  └  annotations
   |       |- train_37voc.json
   |       └  train_96voc.json
   |- mlt2017
   |  |- train_images
   |  |- train_37voc.json
   |  └  train_96voc.json
   |- totaltext
   |  |- train_images
   |  |- test_images
   |  |- train_37voc.json
   |  |- train_96voc.json
   |  └  test.json
   |- ic13
   |  |- train_images
   |  |- train_37voc.json
   |  └  train_96voc.json
   |- ic15
   |  |- train_images
   |  |- test_images
   |  |- train_37voc.json
   |  |- train_96voc.json
   |  └  test.json
   |- ctw1500
   |  |- train_images
   |  |- test_images
   |  |- train_96voc.json
   |  └  test.json
   |- textocr
   |  |- train_images
   |  |- train_37voc_1.json
   |  └  train_37voc_2.json
   |- inversetext
   |  |- test_images
   |  └  test.json
   |- chnsyntext
   |  |- syn_130k_images
   |  └  chn_syntext.json
   |- ArT
   |  |- rename_artimg_train
   |  └  art_train.json
   |- LSVT
   |  |- rename_lsvtimg_train
   |  └  lsvt_train.json
   |- ReCTS
   |  |- ReCTS_train_images  # 18,000 images
   |  |- ReCTS_val_images  # 2,000 images
   |  |- ReCTS_test_images  # 5,000 images
   |  |- rects_train.json
   |  |- rects_val.json
   |  └  rects_test.json
   |- evaluation
   |  |- gt_*.zip
```
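
As a quick sanity check before training, a small script like the following (purely illustrative; extend the list with whichever datasets your config requires) can confirm that the folders above are in place:

```python
# Illustrative helper: verify that the dataset folders from the layout above exist.
from pathlib import Path

expected = [
    "datasets/syntext1/train_images",
    "datasets/mlt2017/train_images",
    "datasets/totaltext/train_images",
    "datasets/totaltext/test_images",
    "datasets/evaluation",
]

missing = [p for p in expected if not Path(p).exists()]
print("missing folders:", missing if missing else "none")
```
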
ImageNet Pre-trained Backbone

If you want to pre-train DeepSolo with ResNet-101, ViTAEv2-S, or Swin Transformer, you can download the converted backbone weights and put them under pretrained_backbone for initialization: Swin-T | ViTAEv2-S | Res101 | Swin-S. You can also refer to the Python files in pretrained_backbone and convert the backbones yourself.
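
For reference, such a conversion typically just loads the original checkpoint, filters or remaps the parameter names, and re-saves the state dict. The snippet below is only a rough illustration with placeholder file names; the real, backbone-specific remapping lives in the scripts under pretrained_backbone.

```python
# Rough illustration of a backbone conversion step (placeholder file names).
# The actual key remapping is backbone-specific; see pretrained_backbone/.
import os
import torch

ckpt = torch.load("swin_tiny_patch4_window7_224.pth", map_location="cpu")
state = ckpt.get("model", ckpt.get("state_dict", ckpt))

# Example filtering step: drop the ImageNet classification head.
converted = {k: v for k, v in state.items() if not k.startswith("head.")}

os.makedirs("pretrained_backbone", exist_ok=True)
torch.save(converted, "pretrained_backbone/swin_tiny_converted.pth")
```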

If you want to use the model trained on Chinese data, please download the font (simsun.ttc) and Chinese character list (chn_cls_list, a binary file) first.

```bash
wget https://drive.google.com/file/d/1dcR__ZgV_JOfpp8Vde4FR3bSR-QnrHVo/view?usp=sharing -O simsun.ttc
wget https://drive.google.com/file/d/1wqkX2VAy48yte19q1Yn5IVjdMVpLzYVo/view?usp=sharing -O chn_cls_list
```
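
Note that plain wget on a Google Drive "view" link may return an HTML page rather than the file itself. If that happens, one alternative (an assumption here, not part of this repo) is the gdown package, installed via pip and recent enough to accept file IDs:

```python
# Alternative download via gdown (not part of this repo).
# The IDs below are taken from the Google Drive links above.
import gdown

gdown.download(id="1dcR__ZgV_JOfpp8Vde4FR3bSR-QnrHVo", output="simsun.ttc")
gdown.download(id="1wqkX2VAy48yte19q1Yn5IVjdMVpLzYVo", output="chn_cls_list")
```
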
  • Training

Total-Text & ICDAR2015

1. Pre-train

For example, pre-train DeepSolo with Synth150K+Total-Text+MLT17+IC13+IC15:

```bash
python tools/train_net.py --config-file configs/R_50/pretrain/150k_tt_mlt_13_15.yaml --num-gpus 4
```

2. Fine-tune

Fine-tune on Total-Text or ICDAR2015:

```bash
python tools/train_net.py --config-file configs/R_50/TotalText/finetune_150k_tt_mlt_13_15.yaml --num-gpus 4
python tools/train_net.py --config-file configs/R_50/IC15/finetune_150k_tt_mlt_13_15.yaml --num-gpus 4
```

CTW1500

1. Pre-train

```bash
python tools/train_net.py --config-file configs/R_50/CTW1500/pretrain_96voc_50maxlen.yaml --num-gpus 4
```

2. Fine-tune

```bash
python tools/train_net.py --config-file configs/R_50/CTW1500/finetune_96voc_50maxlen.yaml --num-gpus 4
```

ReCTS

1. Pre-train

```bash
python tools/train_net.py --config-file configs/R_50/ReCTS/pretrain.yaml --num-gpus 8
```

2. Fine-tune

```bash
python tools/train_net.py --config-file configs/R_50/ReCTS/finetune.yaml --num-gpus 8
```

  • Evaluation

```bash
python tools/train_net.py --config-file ${CONFIG_FILE} --eval-only MODEL.WEIGHTS ${MODEL_PATH}
```

Note: To conduct evaluation on ICDAR 2019 ReCTS, you can directly submit the saved file output/R50/rects/finetune/inference/rects_submit.txt to the official website for evaluation.

  • Visualization Demo

```bash
python demo/demo.py --config-file ${CONFIG_FILE} --input ${IMAGES_FOLDER_OR_ONE_IMAGE_PATH} --output ${OUTPUT_PATH} --opts MODEL.WEIGHTS ${MODEL_PATH}
```
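
Besides the demo script, inference can also be run programmatically through Detectron2's DefaultPredictor. The sketch below assumes an AdelaiDet-style config entry point (from adet.config import get_cfg), which this codebase builds on; the weight and image paths are placeholders.

```python
# Programmatic inference sketch (hedged): assumes an AdelaiDet-style layout,
# where importing adet registers the DeepSolo meta-architecture.
import cv2
from adet.config import get_cfg               # assumption: adet package layout
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file("configs/R_50/TotalText/finetune_150k_tt_mlt_13_15.yaml")
cfg.MODEL.WEIGHTS = "path/to/totaltext_model.pth"     # placeholder weights

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("path/to/image.jpg"))  # placeholder image path
print(outputs)
```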

Citation

If you find DeepSolo helpful, please consider giving this repo a star ⭐ and citing:

```bibtex
@inproceedings{ye2023deepsolo,
  title={DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting},
  author={Ye, Maoyuan and Zhang, Jing and Zhao, Shanshan and Liu, Juhua and Liu, Tongliang and Du, Bo and Tao, Dacheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={19348--19357},
  year={2023}
}

@article{ye2023deepsolo++,
  title={DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting},
  author={Ye, Maoyuan and Zhang, Jing and Zhao, Shanshan and Liu, Juhua and Liu, Tongliang and Du, Bo and Tao, Dacheng},
  journal={arXiv preprint arXiv:2305.19957},
  year={2023}
}
```

Acknowledgement

This project is based on AdelaiDet. For academic use, this project is licensed under the 2-clause BSD License.
