
[Post 1.0] Multimodal distributed training support #3687

Open

tonyhoo wants to merge 13 commits into master

Conversation

@tonyhoo (Collaborator) commented Nov 10, 2023

Issue #, if available:

Description of changes:

  • Add an optional sync_path argument to fit; it must be provided for distributed synchronization
  • Implement the synchronization logic (can be improved in the future to upload only checkpoint files; a rough sketch follows this list)
  • Add DeepSpeed ZeRO and CPU offloading support for multi-GPU training
  • Update the save-path logic for distributed training
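For illustration, a minimal sketch of what the synchronization step could look like, assuming a boto3-based upload of the whole save directory to the S3 sync_path (hypothetical helper; the actual logic in this PR may differ, e.g., by uploading only checkpoint files):

import os

import boto3  # assumed available; not necessarily what this PR uses


def sync_to_s3(local_dir: str, sync_path: str) -> None:
    """Upload every file under local_dir to the s3://bucket/prefix in sync_path.

    Hypothetical illustration of the sync step, not the PR's implementation.
    """
    bucket, _, prefix = sync_path.removeprefix("s3://").partition("/")
    s3 = boto3.client("s3")
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_file = os.path.join(root, name)
            key = "/".join([prefix, os.path.relpath(local_file, local_dir)]).strip("/")
            s3.upload_file(local_file, bucket, key)

In practice, presumably only rank 0 would perform the upload while the other ranks wait on a barrier.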

Sample training script

import os
import warnings

import numpy as np
import time

warnings.filterwarnings("ignore")
np.random.seed(123)

from autogluon.core.utils.loaders import load_pd

train_data = load_pd.load("https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet")
test_data = load_pd.load("https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet")
train_data = train_data.sample(1000)

print("train data loaded!")


from autogluon.multimodal import MultiModalPredictor

if __name__ == "__main__":
    model_path = f"Multimodal_distributed-{time.time()}"
    predictor = MultiModalPredictor(
        label="label",
        eval_metric="acc",
        path=model_path,
        hyperparameters={
             "model.hf_text.checkpoint_name": "google/flan-t5-xl",
            "optimization.top_k_average_method": "best",
            "env.num_nodes": 1,
            "env.strategy": "deepspeed_stage_3_offload",
        },
    )
    print("predictor created")
    # sync_path points to the S3 location used to synchronize artifacts across ranks/nodes
    predictor.fit(train_data, time_limit=180, sync_path="s3://tonyhu-autogluon/multimodal_distributed")
    new_predictor = MultiModalPredictor.load(path=model_path)
    print(new_predictor.predict(test_data[0:2]))
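For context, the env.strategy value above appears to map onto PyTorch Lightning's registered DeepSpeed strategy names. A rough standalone Lightning equivalent of this configuration (illustrative only, assuming Lightning 2.x and the deepspeed package are installed and mirroring this run's 8-GPU environment; AutoGluon constructs its own Trainer internally) might look like:

import lightning.pytorch as pl

# ZeRO stage 3 with parameter/optimizer CPU offload, mirroring
# "env.strategy": "deepspeed_stage_3_offload" in the script above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,              # matches the 8 detected GPUs in the log below
    num_nodes=1,            # matches "env.num_nodes": 1
    strategy="deepspeed_stage_3_offload",
    precision="16-mixed",   # the log below shows "Enabling DeepSpeed FP16"
)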

Log output:


8 GPUs are detected, and 8 GPUs will be used.
   - GPU 0 name: Tesla V100-SXM2-32GB
   - GPU 0 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 1 name: Tesla V100-SXM2-32GB
   - GPU 1 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 2 name: Tesla V100-SXM2-32GB
   - GPU 2 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 3 name: Tesla V100-SXM2-32GB
   - GPU 3 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 4 name: Tesla V100-SXM2-32GB
   - GPU 4 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 5 name: Tesla V100-SXM2-32GB
   - GPU 5 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 6 name: Tesla V100-SXM2-32GB
   - GPU 6 memory: 0.26GB/32.0GB (Used/Total)
   - GPU 7 name: Tesla V100-SXM2-32GB
   - GPU 7 memory: 0.26GB/32.0GB (Used/Total)
CUDA version is 11.7.

Enabling DeepSpeed FP16.
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Parameter Offload: Total persistent parameters: 105474 in 52 params

  | Name              | Type                         | Params | Params per Device
---------------------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 1.2 B  | 152 M
1 | validation_metric | MulticlassAccuracy           | 0      | 0
2 | loss_func         | CrossEntropyLoss             | 0      | 0
---------------------------------------------------------------------------------------
1.2 B     Trainable params
0         Non-trainable params
1.2 B     Total params
4,894.126 Total estimated model params size (MB)
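Note: the per-device parameter count (152 M versus 1.2 B total, i.e., roughly 1.2 B / 8) is consistent with ZeRO stage 3 partitioning the model parameters across the 8 GPUs.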

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@hohoCode commented Nov 11, 2023

Could you please also try testing bigger models like Flan-T5-XXL or XL? Currently the behavior with DeepSpeed on seems odd; this looks like a perfect candidate for the DeepSpeed trainer. Thanks.


Job PR-3687-f97a876 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3687/f97a876/index.html

@tonyhoo (Collaborator, Author) commented Nov 13, 2023

Could you please also try testing bigger models like Flan-T5-XXL or XL? Currently the behavior with DeepSpeed on seems odd; this looks like a perfect candidate for the DeepSpeed trainer. Thanks.

Good idea. I updated the description.

@tonyhoo tonyhoo changed the title Multimodal distributed training support [Post 1.0] Multimodal distributed training support Nov 14, 2023
@hohoCode commented Nov 15, 2023

Could you please also try testing bigger models like Flan-T5-XXL or XL? Currently the behavior with DeepSpeed on seems odd; this looks like a perfect candidate for the DeepSpeed trainer. Thanks.

Good idea. I updated the description.

Thanks a lot!

  1. Last week, I tried AG's version 1102 (a build from the beginning of this month, without your commit) with env.strategy: "deepspeed". Since I have 4 GPUs, AG created 4 separate 'deepspeed' folders during training; because there is no documentation on the 'deepspeed' setting, I'm not sure whether that is the expected behavior. Hopefully your 'deepspeed_stage_3_offload' addresses this.
  2. Hopefully 'deepspeed_stage_3_offload' can also support LoRA/IA3, etc.
  3. Also wondering about the strict requirement that the sync path be on S3. Can we also use shared folders such as '/nas' instead? Relaxing this would be better, since many users are on other cloud providers, or simply use NAS for distributed data sharing (a possible fsspec-based sketch follows this comment).
  4. Any possibility of supporting 8-bit with 'deepspeed_stage_3_offload'? If so, that would be awesome; it would enable training bigger LLMs (30B+), a whole new front for AutoGluon.

Thanks!
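Regarding point 3, one hedged possibility for relaxing the S3 requirement would be to route the sync through fsspec, which handles s3://, other cloud schemes, and plain shared-filesystem paths uniformly (illustrative sketch only, not part of this PR; s3:// targets would additionally need s3fs installed):

import fsspec


def sync_artifacts(local_dir: str, sync_path: str) -> None:
    """Copy local_dir to sync_path, which may be 's3://bucket/prefix',
    another fsspec-supported URL, or a shared path such as '/nas/exp1'.

    Hypothetical helper, shown only to illustrate the suggestion above.
    """
    protocol = sync_path.split("://", 1)[0] if "://" in sync_path else "file"
    fs = fsspec.filesystem(protocol)
    fs.put(local_dir, sync_path, recursive=True)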

@hohoCode commented Jan 3, 2024

BTW, I tested running the code with 'deepspeed_stage_3_offload' using Flan-T5-XL and "bf16" as the data type. DeepSpeed raises a "must have the same dtype" error:

File "deepspeed/runtime/zero/linear.py", line 111, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight, bias)
File "torch/cuda/amp/autocast_mode.py", line 98, in decorate_fwd
    return fwd(*args, **kwargs)
File "deepspeed/runtime/zero/linear.py", line 55, in forward
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 and mat2 must have the same dtype

So it seems DeepSpeed cannot handle the mixed dtypes here; mixed-precision training probably needs to be enabled properly for this pull request.

Also, the "bf16", "16-mixed", and "bf16-true" precision settings all produce the same dtype-mismatch error.
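For what it's worth, the failure can be reproduced outside DeepSpeed with a bare torch.addmm on mismatched dtypes, which is the call shown in the traceback; the exact fp16/bf16 pairing below is an assumption about what "Enabling DeepSpeed FP16" combined with a bf16 precision setting would produce:

import torch

# Weights kept in fp16 (as under DeepSpeed FP16) and activations in bf16
# (as under a "bf16" precision setting) -- an assumed combination for illustration.
weight = torch.randn(8, 4, dtype=torch.float16)
bias = torch.zeros(8, dtype=torch.float16)
x = torch.randn(2, 4, dtype=torch.bfloat16)

try:
    torch.addmm(bias, x, weight.t())  # same call as LinearFunctionForZeroStage3.forward
except RuntimeError as err:
    print(err)  # "mat1 and mat2 must have the same dtype"

# A single consistent dtype fixes it; fp32 is used here so the example also runs on CPU.
out = torch.addmm(bias.float(), x.float(), weight.float().t())
print(out.dtype)  # torch.float32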
