[Usage] Deepspeed Zero Stage 3 not able to shard the model #1481

Open
shubhamagarwal92 opened this issue May 2, 2024 · 0 comments

Hi @haotian-liu!

Interesting work on LLaVA!

Issue:

I am trying to finetune LLaVA on 8 x H100 GPUs.

When I try to use DeepSpeed ZeRO Stage 3, the model seems to get replicated on every GPU instead of being sharded, and I run into OOM errors while finetuning. I am using a context length of 2048 and the 336-resolution ViT.

Could you please suggest what I might be doing wrong here?
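For context, my understanding (a hedged sketch below, not the repo's actual code) is that ZeRO Stage 3 only shards the weights at load time if the Hugging Face side can see a stage-3 config before `from_pretrained` is called, e.g. via `TrainingArguments(deepspeed=...)` or `HfDeepSpeedConfig`; otherwise every rank first materializes a full copy of the model, which looks like what I am seeing:

```python
# Hedged sketch, not LLaVA's code: how transformers decides to shard at load time.
# "scripts/zero3.json" and the checkpoint path are placeholders matching the command below.
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig, is_deepspeed_zero3_enabled
# (older transformers versions expose these under transformers.deepspeed instead)

dschf = HfDeepSpeedConfig("scripts/zero3.json")  # must be created, and kept alive, before from_pretrained
assert is_deepspeed_zero3_enabled()              # from_pretrained now builds params under deepspeed.zero.Init

model = AutoModelForCausalLM.from_pretrained("path/to/base-llm")  # placeholder for ../$MODEL_VERSION
```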

Command:

```bash
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ../$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path ./finetune_data/cleaned_finetune_data.json \
    --image_folder ./finetune_data/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-$MODEL_VERSION-pretrain/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-$MODEL_VERSION-finetune \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True
```

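For reference, here is a hedged sketch (written out in Python just to show the fields; the actual `./scripts/zero3.json` in the repo may differ) of what I expect a stage-3 config to contain when driven by the HF Trainer. With any stage lower than 3, only gradients and optimizer states are partitioned and the parameters stay replicated on every rank:

```python
# Hedged sketch of the fields a ZeRO stage-3 config typically carries when driven
# by the HF Trainer ("auto" values get filled in from the TrainingArguments).
# The repo's actual ./scripts/zero3.json may differ.
import json

zero3 = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,                                   # partition params, grads and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
print(json.dumps(zero3, indent=2))                    # compare against scripts/zero3.json
```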
When I run the script on a single GPU with `CUDA_VISIBLE_DEVICES=0 bash ./scripts/sample_stage3.sh`, the memory usage before training is:

[Screenshot (2024-05-02, 6:26 PM): GPU memory usage before training, single-GPU run]

However, when I use DeepSpeed ZeRO Stage 3 across all 8 GPUs, the GPU usage before training is:

[Screenshot (2024-05-02, 5:50 PM): GPU memory usage before training, ZeRO Stage 3 run]

The model then goes OOM once training starts. Could you please suggest which flag we might need to change?

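In case it helps with debugging, this is a small check I could add after the DeepSpeed engine is built (my own sketch; it relies on the `ds_tensor`/`ds_numel` attributes that DeepSpeed attaches to ZeRO-3 partitioned parameters) to print how many parameter elements each rank actually holds:

```python
# Hedged diagnostic sketch (not part of LLaVA): report how many parameter elements
# are resident on this rank. With ZeRO-3 sharding working across 8 GPUs, each rank
# should hold roughly 1/8 of the elements; ~100% on every rank means replication.
import torch.distributed as dist

def report_param_residency(model):
    local = 0   # elements actually stored on this rank
    full = 0    # elements of the full, unsharded model
    for p in model.parameters():
        if hasattr(p, "ds_tensor"):        # ZeRO-3 partitioned parameter
            local += p.ds_tensor.numel()
            full += p.ds_numel
        else:                              # ordinary, unpartitioned parameter
            local += p.numel()
            full += p.numel()
    rank = dist.get_rank() if dist.is_initialized() else 0
    world = dist.get_world_size() if dist.is_initialized() else 1
    print(f"rank {rank}: {local / 1e9:.2f}B of {full / 1e9:.2f}B elements resident "
          f"({100.0 * local / max(full, 1):.1f}%, expect ~{100.0 / world:.1f}% with ZeRO-3)")
```

With sharding working on 8 ranks I would expect each rank to report roughly 12-13% of the elements; ~100% everywhere would confirm the weights really are replicated.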