中文说明 | English

Accelerate Chinese-CLIP with FlashAttention

Chinese-CLIP now supports accelerating the training process with FlashAttention.

Environmental Preparation

  • Nvidia GPUs with Turing, Ampere, Ada, or Hopper architecture (such as H100, A100, RTX 3090, T4, and RTX 2080). Please refer to this document for the GPUs corresponding to each Nvidia architecture.
  • CUDA 11.4 and above.
  • PyTorch 1.12 and above.
  • FlashAttention: install it by running `pip install flash-attn`.

Please refer to the FlashAttention project repository for more information.
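
Before installing, you can run a quick sanity check of the requirements above. This is a minimal sketch assuming a standard Python environment; the exact commands are not part of the official setup instructions.

```bash
# Verify the PyTorch and CUDA versions and the GPU compute capability
# (Turing and newer GPUs report compute capability >= 7.5).
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability())"

# Install FlashAttention and confirm that it imports.
pip install flash-attn
python -c "import flash_attn; print('flash-attn is available')"
```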

Use it in Chinese-CLIP!

Applying FlashAttention to Chinese-CLIP finetuning is very simple: just add `--use-flash-attention` to your finetune shell script. We provide the sample script `run_scripts/muge_finetune_vit-b-16_rbt-base_flashattn.sh`, as sketched below.
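
For reference, here is a minimal sketch of what this looks like in practice. Only `--use-flash-attention` and the sample script name come from this document; the launcher command, `DATAPATH`, `GPUS_PER_NODE`, and the remaining arguments are placeholders modeled on a typical Chinese-CLIP finetune script.

```bash
# Option 1: run the provided sample script (the DATAPATH layout is assumed
# to follow the standard Chinese-CLIP finetune setup).
bash run_scripts/muge_finetune_vit-b-16_rbt-base_flashattn.sh ${DATAPATH}

# Option 2: add the flag to your own finetune script; everything except
# --use-flash-attention stands in for the arguments you already pass.
python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} \
    cn_clip/training/main.py \
    ${YOUR_EXISTING_FINETUNE_ARGS} \
    --use-flash-attention
```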

Training Speed and Memory Usage Comparison

Enabling FlashAttention significantly speeds up Chinese-CLIP finetuning and reduces memory usage without affecting precision. Our experiments were conducted on a machine with 8 A100 GPUs (80GB memory), FlashAttention 0.2.8, and PyTorch 1.10.1.

Below we compare the batch time and memory usage of FP16-precision finetuning for each model scale. The improvement in training speed and the reduction in memory usage are more significant for larger models.

Batch Time
| Unit: s/it | Batch size | w/o FlashAttention | w/ FlashAttention | Speedup |
| :---- | :---- | :---- | :---- | :---- |
| CN-CLIP<sub>RN50</sub> | 1200*8 | 1.710 | 1.680 | 1.02× |
| CN-CLIP<sub>ViT-B/16</sub> | 450*8 | 1.477 | 0.960 | 1.54× |
| CN-CLIP<sub>ViT-L/14</sub> | 128*8 | 1.293 | 0.785 | 1.65× |
| CN-CLIP<sub>ViT-L/14@336px</sub> | 40*8 | 1.397 | 0.587 | 2.38× |
| CN-CLIP<sub>ViT-H/14</sub> | 64*8 | 1.265 | 0.845 | 1.50× |

Memory
| Unit: GB | Batch size | w/o FlashAttention | w/ FlashAttention |
| :---- | :---- | :---- | :---- |
| CN-CLIP<sub>RN50</sub> | 1200*8 | 79 | 75 |
| CN-CLIP<sub>ViT-B/16</sub> | 450*8 | 80 | 56 |
| CN-CLIP<sub>ViT-L/14</sub> | 128*8 | 77 | 50 |
| CN-CLIP<sub>ViT-L/14@336px</sub> | 40*8 | 78 | 37 |
| CN-CLIP<sub>ViT-H/14</sub> | 64*8 | 76 | 57 |
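
If you want to reproduce a comparison like the above on your own hardware, one simple approach (a sketch, not part of the official tooling) is to poll per-GPU memory with nvidia-smi while the finetune job runs, and to read the batch time (s/it) from the training logs of runs with and without `--use-flash-attention`.

```bash
# Poll per-GPU memory usage every 5 seconds from a second terminal while
# finetuning runs; compare the peak values between the two runs.
nvidia-smi --query-gpu=index,memory.used --format=csv -l 5
```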