
[Post 1.0] Multimodal distributed training support #3687

Open

tonyhoo wants to merge 13 commits into master

Conversation

@tonyhoo (Collaborator) commented Nov 10, 2023

Issue #, if available:

Description of changes:

  • Add an optional sync_path argument to fit; it must be provided for distributed synchronization
  • Implement the synchronization logic (can be improved in the future to upload only checkpoint files; a rough sketch follows this list)
  • Add DeepSpeed ZeRO and CPU offloading support for multi-GPU training
  • Update the save-path logic for distributed training
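For illustration, a minimal sketch of what the synchronization step could look like, assuming a boto3-based upload of the whole save directory to the S3 sync_path (hypothetical helper; the actual logic in this PR may differ, e.g., by uploading only checkpoint files):

import os

import boto3  # assumed available; not necessarily what this PR uses


def sync_to_s3(local_dir: str, sync_path: str) -> None:
    """Upload every file under local_dir to the s3://bucket/prefix in sync_path.

    Hypothetical illustration of the sync step, not the PR's implementation.
    """
    bucket, _, prefix = sync_path.removeprefix("s3://").partition("/")
    s3 = boto3.client("s3")
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_file = os.path.join(root, name)
            key = "/".join([prefix, os.path.relpath(local_file, local_dir)]).strip("/")
            s3.upload_file(local_file, bucket, key)

In practice, presumably only rank 0 would perform the upload while the other ranks wait on a barrier.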

Sample training script

import os
import warnings

import numpy as np
import time

warnings.filterwarnings("ignore")
np.random.seed(123)

from autogluon.core.utils.loaders import load_pd

train_data = load_pd.load("https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet")
test_data = load_pd.load("https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet")
train_data = train_data.sample(1000)

print("train data loaded!")


from autogluon.multimodal import MultiModalPredictor

if __name__ == "__main__":
    model_path = f"Multimodal_distributed-{time.time()}"
    predictor = MultiModalPredictor(
        label="label",
        eval_metric="acc",
        path=model_path,
        hyperparameters={
             "model.hf_text.checkpoint_name": "google/flan-t5-xl",
            "optimization.top_k_average_method": "best",
            "env.num_nodes": 1,
            "env.strategy": "deepspeed_stage_3_offload",
        },
    )
    print("predictor created")
    # sync_path points to the S3 location used to synchronize artifacts across ranks/nodes
    predictor.fit(train_data, time_limit=180, sync_path="s3://tonyhu-autogluon/multimodal_distributed")
    new_predictor = MultiModalPredictor.load(path=model_path)
    print(new_predictor.predict(test_data[0:2]))
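For context, the env.strategy value above appears to map onto PyTorch Lightning's registered DeepSpeed strategy names. A rough standalone Lightning equivalent of this configuration (illustrative only, assuming Lightning 2.x and the deepspeed package are installed and mirroring this run's 8-GPU environment; AutoGluon constructs its own Trainer internally) might look like:

import lightning.pytorch as pl

# ZeRO stage 3 with parameter/optimizer CPU offload, mirroring
# "env.strategy": "deepspeed_stage_3_offload" in the script above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,              # matches the 8 detected GPUs in the log below
    num_nodes=1,            # matches "env.num_nodes": 1
    strategy="deepspeed_stage_3_offload",
    precision="16-mixed",   # the log below shows "Enabling DeepSpeed FP16"
)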

Log output:


8 GPUs are detected, and 8 GPUs will be used.
   - GPU 0 name: Tesla V100-SXM2-32GB
   - GPU 0 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 1 name: Tesla V100-SXM2-32GB
   - GPU 1 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 2 name: Tesla V100-SXM2-32GB
   - GPU 2 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 3 name: Tesla V100-SXM2-32GB
   - GPU 3 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 4 name: Tesla V100-SXM2-32GB
   - GPU 4 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 5 name: Tesla V100-SXM2-32GB
   - GPU 5 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 6 name: Tesla V100-SXM2-32GB
   - GPU 6 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 7 name: Tesla V100-SXM2-32GB
   - GPU 7 memory: 0.26GB/32.0GB (Used/Total)
CUDA version is 11.7.

Enabling DeepSpeed FP16.
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Parameter Offload: Total persistent parameters: 105474 in 52 params

  | Name              | Type                         | Params | Params per Device
---------------------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 1.2 B  | 152 M
1 | validation_metric | MulticlassAccuracy           | 0      | 0
2 | loss_func         | CrossEntropyLoss             | 0      | 0
---------------------------------------------------------------------------------------
1.2 B     Trainable params
0         Non-trainable params
1.2 B     Total params
4,894.126 Total estimated model params size (MB)
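Note: the per-device parameter count (152 M versus 1.2 B total, i.e., roughly 1.2 B / 8) is consistent with ZeRO stage 3 partitioning the model parameters across the 8 GPUs.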

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@hohoCode commented Nov 11, 2023

Could you please also try testing bigger models like Flan-T5-XXL or XL? Currently the behavior with DeepSpeed on seems odd; this looks like a perfect candidate for the DeepSpeed trainer. Thanks.


Job PR-3687-f97a876 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3687/f97a876/index.html

@tonyhoo (Collaborator, Author) commented Nov 13, 2023

Could you please also try testing bigger models like Flan-T5-XXL or XL? Currently the behavior with DeepSpeed on seems odd; this looks like a perfect candidate for the DeepSpeed trainer. Thanks.

Good idea. I updated the description.

@tonyhoo tonyhoo changed the title Multimodal distributed training support [Post 1.0] Multimodal distributed training support Nov 14, 2023
@hohoCode commented Nov 15, 2023

Could you please also try testing bigger models like Flan-T5-XXL or XL? Currently the behavior with DeepSpeed on seems odd; this looks like a perfect candidate for the DeepSpeed trainer. Thanks.

Good idea. I updated the description.

Thanks a lot!

  1. Last week, I tried AG's version 1102 (a build from the beginning of this month, without your commit) with env.strategy: "deepspeed". Since I have 4 GPUs, AG created 4 separate 'deepspeed' folders during training; because there is no documentation on the 'deepspeed' setting, I'm not sure whether that is the expected behavior. Hopefully your 'deepspeed_stage_3_offload' addresses this.
  2. Hopefully 'deepspeed_stage_3_offload' can also support LoRA/IA3, etc.
  3. Also wondering about the strict requirement that the sync path be on S3. Can we also use shared folders such as '/nas' instead? Relaxing this would be better, since many users are on other cloud providers, or simply use NAS for distributed data sharing (a possible fsspec-based sketch follows this comment).
  4. Any possibility of supporting 8-bit with 'deepspeed_stage_3_offload'? If so, that would be awesome; it would enable training bigger LLMs (30B+), a whole new front for AutoGluon.

Thanks!
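Regarding point 3, one hedged possibility for relaxing the S3 requirement would be to route the sync through fsspec, which handles s3://, other cloud schemes, and plain shared-filesystem paths uniformly (illustrative sketch only, not part of this PR; s3:// targets would additionally need s3fs installed):

import fsspec


def sync_artifacts(local_dir: str, sync_path: str) -> None:
    """Copy local_dir to sync_path, which may be 's3://bucket/prefix',
    another fsspec-supported URL, or a shared path such as '/nas/exp1'.

    Hypothetical helper, shown only to illustrate the suggestion above.
    """
    protocol = sync_path.split("://", 1)[0] if "://" in sync_path else "file"
    fs = fsspec.filesystem(protocol)
    fs.put(local_dir, sync_path, recursive=True)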

@hohoCode commented Jan 3, 2024

BTW, I tested running the code with 'deepspeed_stage_3_offload' using Flan-T5-XL and "bf16" as the data type. DeepSpeed raises a "must have the same dtype" error:

File "deepspeed/runtime/zero/linear.py", line 111, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight, bias)
File "torch/cuda/amp/autocast_mode.py", line 98, in decorate_fwd
    return fwd(*args, **kwargs)
File "deepspeed/runtime/zero/linear.py", line 55, in forward
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 and mat2 must have the same dtype

So it seems DeepSpeed cannot handle the mixed dtypes here; mixed-precision training probably needs to be enabled properly for this pull request.

Also, the "bf16", "16-mixed", and "bf16-true" precision settings all produce the same dtype-mismatch error.
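For what it's worth, the failure can be reproduced outside DeepSpeed with a bare torch.addmm on mismatched dtypes, which is the call shown in the traceback; the exact fp16/bf16 pairing below is an assumption about what "Enabling DeepSpeed FP16" combined with a bf16 precision setting would produce:

import torch

# Weights kept in fp16 (as under DeepSpeed FP16) and activations in bf16
# (as under a "bf16" precision setting) -- an assumed combination for illustration.
weight = torch.randn(8, 4, dtype=torch.float16)
bias = torch.zeros(8, dtype=torch.float16)
x = torch.randn(2, 4, dtype=torch.bfloat16)

try:
    torch.addmm(bias, x, weight.t())  # same call as LinearFunctionForZeroStage3.forward
except RuntimeError as err:
    print(err)  # "mat1 and mat2 must have the same dtype"

# A single consistent dtype fixes it; fp32 is used here so the example also runs on CPU.
out = torch.addmm(bias.float(), x.float(), weight.float().t())
print(out.dtype)  # torch.float32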
