
[BUG] HuggingFace export does not preserve bfloat16 weights but converts to float16 silently when using CPU for upload #702

Closed
tmostak opened this issue May 9, 2024 · 9 comments · Fixed by #707

tmostak commented May 9, 2024

🐛 Bug

Native bfloat16 model fine-tuned with bfloat16 gets pushed to HuggingFace as float16

To Reproduce

  1. Choose an HF model such as Llama-3 whose weights are natively bfloat16
  2. Fine-tune it using a dtype of bfloat16
  3. Export it to HuggingFace
  4. Note that the exported config.json specifies the weights of the fine-tuned model as float16, not bfloat16 as you'd expect (a quick way to verify this is sketched below)
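A minimal sketch for checking the exported dtype (the repo id is a placeholder, not the model from this report):

```python
# Sketch: inspect the torch_dtype recorded in an exported repo's config.json.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="your-org/your-finetuned-model",  # hypothetical repo id
    filename="config.json",
)
with open(config_path) as f:
    torch_dtype = json.load(f).get("torch_dtype")

print(torch_dtype)  # prints "float16" here instead of the expected "bfloat16"
```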
@tmostak tmostak added the type/bug Bug in code label May 9, 2024
@pascal-pfeiffer (Collaborator)

Could you please share a config to reproduce the issue on the default dataset?
A quick check showed bfloat16 for me when uploading a fine-tune of danube2 to huggingface:
[screenshot: exported config.json showing torch_dtype bfloat16]

A known limitation is uploading with the CPU device: the weights are automatically converted to float16, as PyTorch bfloat16 isn't generally well supported on CPU.
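For illustration only, a minimal sketch (not LLM Studio's actual export code) of what that CPU path effectively does to the weights:

```python
# Minimal sketch of the silent down-cast on the CPU upload path (illustration only).
import torch
import torch.nn as nn

layer = nn.Linear(8, 8).to(torch.bfloat16)  # fine-tuned weights arrive as bfloat16
print(layer.weight.dtype)                   # torch.bfloat16

# The CPU export path casts to float16, so both the pushed weights and the
# exported config.json end up reporting float16.
layer = layer.to(torch.float16)
print(layer.weight.dtype)                   # torch.float16
```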

tmostak (Author) commented May 9, 2024

Ah, that's exactly it then; I've been using CPU to upload. I'll try using GPU.

pascal-pfeiffer (Collaborator) commented May 9, 2024

Thanks, I'll change the title of the issue to reflect that the conversion happens silently.
We probably want to raise a warning.
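A rough sketch of what such a warning could look like (hypothetical helper; not the change that landed in #707):

```python
import logging

logger = logging.getLogger(__name__)

def resolve_upload_dtype(requested_dtype: str, device: str) -> str:
    """Hypothetical helper: warn instead of silently down-casting on CPU."""
    if device == "cpu" and requested_dtype == "bfloat16":
        logger.warning(
            "Uploading on CPU converts bfloat16 weights to float16; "
            "select a GPU (or cpu_shard) to keep bfloat16."
        )
        return "float16"
    return requested_dtype
```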

@pascal-pfeiffer pascal-pfeiffer changed the title [BUG] HuggingFace export does not preserve bfloat16 weights but converts to float16 [BUG] HuggingFace export does not preserve bfloat16 weights but converts to float16 silently when using CPU for upload May 9, 2024
tmostak (Author) commented May 9, 2024

Actually @pascal-pfeiffer, I've found that unfortunately I don't have enough memory on any single GPU of an 8xA100 80GB cluster to push Llama-3 70B to HF using bfloat16; I get the following OOM error. Any ideas for a workaround, or a way this could be done multi-GPU?

INFO: 127.0.0.1:56582 - "POST / HTTP/1.1" 200 OK
2024-05-09 17:59:46,609 - INFO: Initializing client True
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-09 17:59:47,245 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-09 17:59:48,122 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29], device='cuda:0')]
2024-05-09 17:59:48,137 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
2024-05-09 17:59:48,137 - INFO: Setting pretraining_tp of model config to 1.
2024-05-09 17:59:48,159 - INFO: Using bfloat16 for backbone
2024-05-09 17:59:48,159 - INFO: Using Flash Attention 2.
2024-05-09 17:59:48,379 - ERROR: Unknown exception
Traceback (most recent call last):
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/handlers.py", line 337, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/sections/experiment.py", line 1829, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/hugging_face_utils.py", line 108, in publish_model_to_hugging_face
    cfg, model, tokenizer = load_cfg_model_tokenizer(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/sections/chat.py", line 219, in load_cfg_model_tokenizer
    model = cfg.architecture.model_class(cfg)
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/src/models/text_causal_language_modeling_model.py", line 32, in __init__
    self.backbone, self.backbone_config = create_nlp_backbone(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/src/utils/modeling_utils.py", line 804, in create_nlp_backbone
    backbone = model_class.from_config(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 437, in from_config
    return model_class._from_config(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1401, in _from_config
    model = cls(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1135, in __init__
    self.model = LlamaModel(config)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 927, in __init__
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 927, in <listcomp>
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 702, in __init__
    self.mlp = LlamaMLP(config)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 219, in __init__
    self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 98, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 433.31 MiB is free. Including non-PyTorch memory, this process has 78.71 GiB memory in use. Of the allocated memory 78.21 GiB is allocated by PyTorch, and 28.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@pascal-pfeiffer (Collaborator)

Right, for very large models that don't fit on a single GPU, we added a workaround that loads the full weights to CPU first and then shards them across your GPUs before uploading. Can you try uploading the weights with cpu_shard in the device selection?
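Conceptually, it amounts to something like the following sketch using plain transformers (not LLM Studio internals; the checkpoint path is a placeholder): let accelerate spread the layers across all visible devices while keeping the original bfloat16.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/fine-tuned-checkpoint",  # placeholder path
    torch_dtype=torch.bfloat16,       # keep the original dtype
    device_map="auto",                # shard layers across available GPUs (and CPU if needed)
    low_cpu_mem_usage=True,
)
```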

@pascal-pfeiffer (Collaborator)

And actually, I just tried removing our forced cast to float32 and back to float16 when using CPU. It might no longer be needed with recent dependency upgrades.

We should at least improve the description here to reflect everything that is done under the hood.
[screenshot: push-to-HuggingFace dialog with device selection and description]

@pascal-pfeiffer pascal-pfeiffer self-assigned this May 9, 2024
tmostak (Author) commented May 9, 2024

Ah I didn't realize that's what cpu_shard did. It sounds like it will support bfloat16 then?

@pascal-pfeiffer (Collaborator)

Yes, cpu_shard supports bfloat16.

tmostak (Author) commented May 12, 2024

Confirmed I can export to HF with bfloat16 when using the cpu_shard setting.
