
[BUG] HuggingFace export does not preserve bfloat16 weights but converts to float16 silently when using CPU for upload #702

Closed
tmostak opened this issue May 9, 2024 · 9 comments · Fixed by #707

tmostak commented May 9, 2024

🐛 Bug

Native bfloat16 model fine-tuned with bfloat16 gets pushed to HuggingFace as float16

To Reproduce

  1. Choose an HF model such as Llama-3 whose weights are natively bfloat16
  2. Fine-tune it using a dtype of bfloat16
  3. Export it to HuggingFace
  4. Note that the exported config.json specifies the weights of the fine-tuned model as float16, not bfloat16 as you'd expect (a quick way to verify this is sketched below)
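A minimal sketch for checking the exported dtype (the repo id is a placeholder, not the model from this report):

```python
# Sketch: inspect the torch_dtype recorded in an exported repo's config.json.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="your-org/your-finetuned-model",  # hypothetical repo id
    filename="config.json",
)
with open(config_path) as f:
    torch_dtype = json.load(f).get("torch_dtype")

print(torch_dtype)  # prints "float16" here instead of the expected "bfloat16"
```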
@tmostak tmostak added the type/bug Bug in code label May 9, 2024
@pascal-pfeiffer (Collaborator)

Could you please share a config to reproduce the issue on the default dataset?
A quick check showed bfloat16 for me when uploading a fine-tune of danube2 to huggingface:
[screenshot: exported config.json showing torch_dtype bfloat16]

A known limitation is uploading with the CPU device: the weights are automatically converted to float16, as PyTorch bfloat16 isn't generally well supported on CPU.
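For illustration only, a minimal sketch (not LLM Studio's actual export code) of what that CPU path effectively does to the weights:

```python
# Minimal sketch of the silent down-cast on the CPU upload path (illustration only).
import torch
import torch.nn as nn

layer = nn.Linear(8, 8).to(torch.bfloat16)  # fine-tuned weights arrive as bfloat16
print(layer.weight.dtype)                   # torch.bfloat16

# The CPU export path casts to float16, so both the pushed weights and the
# exported config.json end up reporting float16.
layer = layer.to(torch.float16)
print(layer.weight.dtype)                   # torch.float16
```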

tmostak (Author) commented May 9, 2024

Ah, that's exactly it then; I've been using CPU to upload. I'll try using GPU.

pascal-pfeiffer (Collaborator) commented May 9, 2024

Thanks, I'll change the title of the issue to reflect that the conversion happens silently.
We probably want to raise a warning.
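A rough sketch of what such a warning could look like (hypothetical helper; not the change that landed in #707):

```python
import logging

logger = logging.getLogger(__name__)

def resolve_upload_dtype(requested_dtype: str, device: str) -> str:
    """Hypothetical helper: warn instead of silently down-casting on CPU."""
    if device == "cpu" and requested_dtype == "bfloat16":
        logger.warning(
            "Uploading on CPU converts bfloat16 weights to float16; "
            "select a GPU (or cpu_shard) to keep bfloat16."
        )
        return "float16"
    return requested_dtype
```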

@pascal-pfeiffer pascal-pfeiffer changed the title [BUG] HuggingFace export does not preserve bfloat16 weights but converts to float16 [BUG] HuggingFace export does not preserve bfloat16 weights but converts to float16 silently when using CPU for upload May 9, 2024
tmostak (Author) commented May 9, 2024

Actually @pascal-pfeiffer, I've found that unfortunately I don't have enough memory on any single GPU of an 8xA100 80GB cluster to push Llama-3 70B to HF using bfloat16; I get the following OOM error. Any ideas for a workaround, or a way this could be done multi-GPU?

INFO: 127.0.0.1:56582 - "POST / HTTP/1.1" 200 OK
2024-05-09 17:59:46,609 - INFO: Initializing client True
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-09 17:59:47,245 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-09 17:59:48,122 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29], device='cuda:0')]
2024-05-09 17:59:48,137 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
2024-05-09 17:59:48,137 - INFO: Setting pretraining_tp of model config to 1.
2024-05-09 17:59:48,159 - INFO: Using bfloat16 for backbone
2024-05-09 17:59:48,159 - INFO: Using Flash Attention 2.
2024-05-09 17:59:48,379 - ERROR: Unknown exception
Traceback (most recent call last):
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/handlers.py", line 337, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/sections/experiment.py", line 1829, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/hugging_face_utils.py", line 108, in publish_model_to_hugging_face
    cfg, model, tokenizer = load_cfg_model_tokenizer(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/sections/chat.py", line 219, in load_cfg_model_tokenizer
    model = cfg.architecture.model_class(cfg)
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/src/models/text_causal_language_modeling_model.py", line 32, in __init__
    self.backbone, self.backbone_config = create_nlp_backbone(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/src/utils/modeling_utils.py", line 804, in create_nlp_backbone
    backbone = model_class.from_config(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 437, in from_config
    return model_class._from_config(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1401, in _from_config
    model = cls(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1135, in __init__
    self.model = LlamaModel(config)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 927, in __init__
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 927, in <listcomp>
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 702, in __init__
    self.mlp = LlamaMLP(config)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 219, in __init__
    self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 98, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 433.31 MiB is free. Including non-PyTorch memory, this process has 78.71 GiB memory in use. Of the allocated memory 78.21 GiB is allocated by PyTorch, and 28.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@pascal-pfeiffer (Collaborator)

Right, for very large models that don't fit on a single GPU, we added a workaround that loads the full weights to CPU first and then shards them across your GPUs before uploading. Can you try uploading the weights with cpu_shard in the device selection?
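Conceptually, it amounts to something like the following sketch using plain transformers (not LLM Studio internals; the checkpoint path is a placeholder): let accelerate spread the layers across all visible devices while keeping the original bfloat16.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/fine-tuned-checkpoint",  # placeholder path
    torch_dtype=torch.bfloat16,       # keep the original dtype
    device_map="auto",                # shard layers across available GPUs (and CPU if needed)
    low_cpu_mem_usage=True,
)
```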

@pascal-pfeiffer (Collaborator)

And actually, I just tried removing our forced cast to float32 and back to float16 when using CPU. It might no longer be needed with recent dependency upgrades.

We should at least improve the description here to reflect everything that is done under the hood.
[screenshot: push-to-HuggingFace dialog with device selection and description]

@pascal-pfeiffer pascal-pfeiffer self-assigned this May 9, 2024
tmostak (Author) commented May 9, 2024

Ah I didn't realize that's what cpu_shard did. It sounds like it will support bfloat16 then?

@pascal-pfeiffer (Collaborator)

Yes, cpu_shard supports bfloat16.

tmostak (Author) commented May 12, 2024

Confirmed I can export to HF with bfloat16 when using the cpu_shard setting.
