
[BUG] Exporting / downloading model larger than VRAM available (trained with DeepSpeed) fails #670

Open
AZ777xx opened this issue Apr 15, 2024 · 8 comments
Labels
type/bug Bug in code

Comments

@AZ777xx

AZ777xx commented Apr 15, 2024

I trained a 33b model with DeepSpeed on 40GB cards. Based on the traceback, the model seems to be too large to fit into one GPU. Is it possible to fall back on the CPU for cases like this?

The .pth file is ~67 GB, so obviously it won't fit on a single GPU.

[screenshot of the error report]

script_sources: ['/_f/b7768783-3906-4c38-8849-ca80666c7f7b/tmpif8apv9l.min.js']
initialized: True
version: 1.5.0-dev
name: H2O LLM Studio
heap_mode: False
wave_utils_stack_trace_str: ### stacktrace
Traceback (most recent call last):

  File "/workspace/./llm_studio/app_utils/handlers.py", line 332, in handle
    await experiment_download_model(q)

  File "/workspace/./llm_studio/app_utils/sections/experiment.py", line 1627, in experiment_download_model
    cfg, model, tokenizer = load_cfg_model_tokenizer(

  File "/workspace/./llm_studio/app_utils/sections/chat.py", line 197, in load_cfg_model_tokenizer
    model = cfg.architecture.model_class(cfg)

  File "/workspace/./llm_studio/src/models/text_causal_language_modeling_model.py", line 32, in __init__
    self.backbone, self.backbone_config = create_nlp_backbone(

  File "/workspace/./llm_studio/src/utils/modeling_utils.py", line 784, in create_nlp_backbone
    backbone = model_class.from_config(config, **kwargs)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 435, in from_config
    return model_class._from_config(config, **kwargs)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1307, in _from_config
    model = cls(config, **kwargs)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1103, in __init__
    self.model = LlamaModel(config)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 699, in __init__
    self.self_attn = LLAMA_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 288, in __init__
    self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 98, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU 0 has a total capacity of 39.38 GiB of which 3.81 MiB is free. Process 268834 has 414.00 MiB memory in use. Process 274612 has 38.96 GiB memory in use. Of the allocated memory 38.27 GiB is allocated by PyTorch, and 219.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

q.user
q.client
app_db: <llm_studio.app_utils.db.Database object at 0x7fb9245d5d20>
client_initialized: True
mode_curr: error
theme_dark: True
credential_saver: .env File
default_aws_bucket_name: bucket_name
default_azure_conn_string: 
default_azure_container: 
default_kaggle_username: 
set_max_epochs: 50
set_max_batch_size: 256
set_max_gradient_clip: 10
set_max_lora_r: 256
set_max_lora_alpha: 256
gpu_used_for_chat: 1
default_number_of_workers: 8
default_logger: None
default_neptune_project: 
default_openai_azure: False
default_openai_api_base: https://example-endpoint.openai.azure.com
default_openai_api_deployment_id: deployment-name
default_openai_api_version: 2023-05-15
default_gpt_eval_max: 100
default_safe_serialization: True
delete_dialogs: True
chart_plot_max_points: 1000
init_interface: True
notification_bar: None
nav/active: experiment/list
experiment/list/mode: train
dataset/list/df_datasets:    id                          name  ... validation rows                    labels
5   7  FIXSynthetic_v1_withSeatable  ...              28                    answer
4   6     Synthetiv_v1_withSeatable  ...              28                    answer
3   5            SeatableValidation  ...              28                      ideal answer
2   4             synthetic_full_v1  ...            None                    answer
1   2                           dpo  ...            None                    chosen
0   1                         oasst  ...            None                    output

[6 rows x 10 columns]
experiment/list/df_experiments:      id  ...               info
31  109  ...  Runtime: 06:37:55
30  108  ...          OOM error
29  107  ...  Runtime: 02:44:26
28  106  ...  Runtime: 00:42:06
27  105  ...  Runtime: 01:23:17
26  104  ...  Runtime: 01:23:22
25  103  ...  Runtime: 01:01:08
24  102  ...  Runtime: 00:30:30
23   92  ...                   
22   88  ...  Runtime: 00:41:06
21   87  ...           See logs
20   85  ...  Runtime: 03:39:11
19   83  ...  Runtime: 01:42:29
18   82  ...  Runtime: 00:06:36
17   81  ...  Runtime: 01:42:20
16   66  ...  Runtime: 04:53:16
15   56  ...  Runtime: 00:35:02
14   55  ...  Runtime: 00:17:31
13   48  ...  Runtime: 00:28:52
12   43  ...  Runtime: 00:02:51
11   42  ...  Runtime: 00:25:08
10   41  ...  Runtime: 00:38:22
9    37  ...  Runtime: 00:40:52
8    21  ...  Runtime: 06:52:30
7    18  ...  Runtime: 00:40:25
6    17  ...  Runtime: 00:39:37
5    16  ...  Runtime: 02:01:39
4    13  ...  Runtime: 01:17:46
3    12  ...  Runtime: 02:37:27
2    11  ...  Runtime: 00:35:32
1    10  ...  Runtime: 02:21:06
0     8  ...  Runtime: 00:40:47

[32 rows x 16 columns]
expander: True
dataset/list: False
dataset/list/table: []
experiment/list: True
experiment/list/table: ['0']
__wave_submission_name__: report_error
experiment/list/refresh: False
experiment/list/compare: False
experiment/list/stop: False
experiment/list/delete: False
experiment/list/new: False
experiment/list/rename: False
experiment/list/stop/table: False
experiment/list/delete/table/dialog: False
experiment/display/id: 0
experiment/display/logs_path: None
experiment/display/preds_path: None
experiment/display/tab: experiment/display/charts
experiment/display/experiment_id: 109
experiment/display/experiment: <llm_studio.app_utils.db.Experiment object at 0x7fb923bcd1e0>
experiment/display/experiment_path: /workspace/output/user/wrong_prompt_dividers_33b.1/
experiment/display/charts: {'cfg': {'experiment_name': 'wrong_prompt_dividers_33b.1', 'llm_backbone': 'deepseek-ai/deepseek-coder-33b-instruct', 'personalize': False, 'chatbot_name': 'h2oGPT', 'chatbot_author': 'H2O.ai', 'train_dataframe': '/workspace/data/user/synthetic_full_v1/synthetic_full_v1.csv', 'validation_strategy': 'automatic', 'validation_dataframe': 'None', 'validation_size': 0.2, 'data_sample': 1.0, 'data_sample_choice': ['Train', 'Validation'], 'system_column': 'system', 'prompt_column': ('full user_prompt',), 'answer_column': 'answer', 'parent_id_column': 'None', 'text_system_start': '<|begin▁of▁sentence|>', 'text_prompt_start': 'Instruction:', 'text_answer_separator': 'Response:', 'limit_chained_samples': False, 'add_eos_token_to_system': False, 'add_eos_token_to_prompt': False, 'add_eos_token_to_answer': True, 'mask_prompt_labels': False, 'max_length_prompt': 16384, 'max_length_answer': 16384, 'max_length': 16384, 'add_prompt_answer_tokens': True, 'padding_quantile': 1.0, 'use_fast': False, 'backbone_dtype': 'float16', 'gradient_checkpointing': True, 'force_embedding_gradients': False, 'intermediate_dropout': 0.0, 'pretrained_weights': '', 'loss_function': 'TokenAveragedCrossEntropy', 'optimizer': 'Adam', 'learning_rate': 0.0001, 'differential_learning_rate_layers': [], 'differential_learning_rate': 1e-05, 'use_flash_attention_2': False, 'batch_size': 1, 'epochs': 5, 'schedule': 'Cosine', 'warmup_epochs': 0.0, 'weight_decay': 0.0, 'gradient_clip': 0.0, 'grad_accumulation': 1, 'lora': True, 'lora_r': 128, 'lora_alpha': 128, 'lora_dropout': 0.1, 'lora_target_modules': '', 'save_best_checkpoint': False, 'evaluation_epochs': 1.0, 'evaluate_before_training': False, 'train_validation_data': False, 'token_mask_probability': 0.0, 'skip_parent_probability': 0.0, 'random_parent_probability': 0.0, 'neftune_noise_alpha': 0.0, 'metric': 'BLEU', 'metric_gpt_model': 'gpt-3.5-turbo-0301', 'metric_gpt_template': 'general', 'min_length_inference': 2, 'max_length_inference': 256, 'max_time': 120.0, 'batch_size_inference': 0, 'do_sample': False, 'num_beams': 1, 'temperature': 0.0, 'repetition_penalty': 1.0, 'stop_tokens': '', 'top_k': 0, 'top_p': 1.0, 'gpus': ['0', '1', '2', '3'], 'mixed_precision': True, 'compile_model': False, 'use_deepspeed': True, 'deepspeed_method': 'ZeRO3', 'deepspeed_allgather_bucket_size': 1000000, 'deepspeed_reduce_bucket_size': 1000000, 'deepspeed_stage3_prefetch_bucket_size': 1000000, 'deepspeed_stage3_param_persistence_threshold': 1000000, 'find_unused_parameters': False, 'trust_remote_code': True, 'huggingface_branch': 'main', 'number_of_workers': 8, 'seed': -1, 'logger': 'None', 'neptune_project': ''}, 'validation': {'BLEU': {'steps': [436, 876, 1316, 1756, 2196], 'values': [5.185628121873809, 4.8247619005813105, 5.378364072682938, 5.021293141909062, 4.948287839356161]}}, 'df': {'train_data': '/workspace/output/user/wrong_prompt_dividers_33b.1/batch_viz.parquet', 'validation_predictions': '/workspace/output/user/wrong_prompt_dividers_33b.1/validation_viz.parquet'}, 'train': {'loss': {'steps': [], 'values': [1.0, 2.0, 3.0, 4.0, 5.0]}}}
experiment/display/refresh: False
experiment/display/download_logs: False
experiment/display/download_predictions: False
experiment/display/download_model: True
experiment/display/push_to_huggingface: False
experiment/list/current: False
home: False
report_error: True
q.events
q.args
report_error: True
__wave_submission_name__: report_error

Error
None

Git Version
fatal: not a git repository (or any of the parent directories): .git
AZ777xx added the type/bug label on Apr 15, 2024
AZ777xx changed the title from "[BUG] Exporting / downloading model larger, then VRAM available (trained with DeepSpeed) fails" to "[BUG] Exporting / downloading model larger than VRAM available (trained with DeepSpeed) fails" on Apr 15, 2024
@AZ777xx
Author

AZ777xx commented Apr 15, 2024

This could be solved fairly easily with something like the following highlighted pseudocode in llm_studio/app_utils/sections/experiment.py:
[screenshot: highlighted pseudocode in experiment.py]
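
The screenshot is not preserved here; a rough sketch of what such a fallback could look like follows (fits_on_gpu and checkpoint_path are hypothetical names; load_cfg_model_tokenizer, experiment_path, and merge=True come from the snippet quoted later in this thread). The idea is to compare the checkpoint size against the free VRAM and fall back to the CPU when it cannot fit:

```python
import os
import torch

def fits_on_gpu(checkpoint_path: str, device_index: int = 0) -> bool:
    # Rough check: can the saved .pth checkpoint fit into the currently free VRAM?
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device_index)
    return os.path.getsize(checkpoint_path) < free_bytes

# In experiment_download_model, fall back to CPU when the weights are too large
# for a single GPU; checkpoint_path is assumed to point at the trained .pth file.
device = "cuda:0" if fits_on_gpu(checkpoint_path) else "cpu"
cfg, model, tokenizer = load_cfg_model_tokenizer(
    experiment_path, merge=True, device=device
)
```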

@pascal-pfeiffer
Collaborator

Thank you for reporting. When pushing the model to the Hugging Face Hub or downloading it from the UI, the weights are automatically sharded into smaller chunks (by default safetensors shards of, I believe, 5 GB each).
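
For reference, this is roughly what such a sharded export looks like with the Hugging Face save_pretrained API; the output directory is made up, and whether LLM Studio calls it on model.backbone exactly like this is an assumption (the backbone attribute appears in the traceback above):

```python
# Export the merged backbone as ~5 GB safetensors shards plus an index file.
model.backbone.save_pretrained(
    "./exported_model",
    max_shard_size="5GB",      # split weights into shards of at most ~5 GB
    safe_serialization=True,   # write .safetensors instead of pickle .bin
)
tokenizer.save_pretrained("./exported_model")
```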

Is this happening only when using the weights from an old experiment to continue training with "Use previous experiment weights"?
[screenshot: "Use previous experiment weights" setting]

@AZ777xx
Author

AZ777xx commented Apr 16, 2024

Is this happening only when using the weights from an old experiment to continue training with "Use previous experiment weights"?

No, this is happening with a "freshly" trained model; I haven't tried the "Use previous experiment weights" option yet. 'pretrained_weights': '' in this experiment, which I suspect is what that option sets.
It happens as soon as I click "Download model".

I'm pretty sure it is because it tries to load the whole model onto a single device, which is GPU 0.

Forcing this if statement to evaluate to True (it sets the device to CPU), either by forking the code or by running an experiment in the background, lets me download the model:

if num_running_queued > 0 or (
    cfg.training.lora and cfg.architecture.backbone_dtype in ("int4", "int8")
):
    logger.info("Preparing model on CPU. This might slow down the progress.")
    device = "cpu"
with set_env(HUGGINGFACE_TOKEN=q.client["default_huggingface_api_token"]):
    cfg, model, tokenizer = load_cfg_model_tokenizer(
        experiment_path, merge=True, device=device
    )

@AZ777xx
Author

AZ777xx commented Apr 16, 2024

Thank you for reporting. When pushing the model to the Hugging Face Hub or downloading it from the UI, the weights are automatically sharded into smaller chunks (by default safetensors shards of, I believe, 5 GB each).

Just double-checked: it "crashes" before reaching the sharding step.

@pascal-pfeiffer
Collaborator

Sorry, I can't fully follow. So are you using a local model or a model from Hugging Face to start your experiment?

And what is the next step? You can load a sharded model into multiple GPUs using Deepspeed. Training entirely on CPU is too slow, so we will not be supporting this in H2O LLM Studio.

@psinger
Collaborator

psinger commented Apr 16, 2024

When you are pushing a finished model to HF, you can choose the device:
[screenshot: device selection when pushing to Hugging Face]

@AZ777xx
Author

AZ777xx commented Apr 16, 2024

Sorry, I can't fully follow. So are you using a local model or a model from Hugging Face to start your experiment?

starting with a model from HF

And what is the next step? You can load a sharded model into multiple GPUs using Deepspeed. Training entirely on CPU is too slow, so we will not be supporting this in H2O LLM Studio.

This happens after training the model.

The issue happens when downloading / exporting the model using this button:
[screenshot: the "Download model" button]

When you are pushing a finished model to HF, you can choose the device: [screenshot]

Is it possible to choose the device when downloading the model (not pushing to HF)?

@psinger
Collaborator

psinger commented Apr 24, 2024

We would then also need to add a device selection there. Need to see how easily that's doable.

For now, the workaround is to hardcode it in the code.
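
A sketch of that hardcoded workaround, based on the snippet quoted earlier from experiment_download_model in llm_studio/app_utils/sections/experiment.py (forcing the CPU path unconditionally; an illustration rather than an official patch):

```python
# Always prepare the model on CPU so a checkpoint larger than a single GPU's
# VRAM (e.g. the ~67 GB 33B .pth here) can still be merged and exported.
device = "cpu"
with set_env(HUGGINGFACE_TOKEN=q.client["default_huggingface_api_token"]):
    cfg, model, tokenizer = load_cfg_model_tokenizer(
        experiment_path, merge=True, device=device
    )
```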
