ONNX export of integer weights with large models #872

Open
Giuseppe5 opened this issue Feb 22, 2024 · 5 comments

@Giuseppe5
Collaborator

When trying to export large models, we are currently forced to export the QDQ pattern for weights, instead of simply exporting integer weights -> DQ.

The error seems to be caused by the fact that adding a new node with integer weights to the graph confuses the model-size calculation during torch export, which then triggers the >2GB error.
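For reference, the two patterns being compared look roughly like this. This is a minimal sketch with illustrative names, built with the onnx helper API; it is not Brevitas code:

    import numpy as np
    from onnx import helper, numpy_helper

    scale = numpy_helper.from_array(np.array(0.1, dtype=np.float32), name="w_scale")
    zero_point = numpy_helper.from_array(np.array(0, dtype=np.int8), name="w_zp")

    # QDQ pattern: fp32 weights are stored in the model and quantized in-graph.
    w_fp32 = numpy_helper.from_array(np.zeros((4, 4), dtype=np.float32), name="w_fp32")
    qdq_nodes = [
        helper.make_node("QuantizeLinear", ["w_fp32", "w_scale", "w_zp"], ["w_q"]),
        helper.make_node("DequantizeLinear", ["w_q", "w_scale", "w_zp"], ["w_dq"]),
    ]

    # Integer weights -> DQ pattern: the int8 tensor is stored directly and only a
    # DequantizeLinear remains in the graph, so the serialized weights are ~4x
    # smaller (int8 vs fp32).
    w_int8 = numpy_helper.from_array(np.zeros((4, 4), dtype=np.int8), name="w_int8")
    dq_nodes = [
        helper.make_node("DequantizeLinear", ["w_int8", "w_scale", "w_zp"], ["w_dq"]),
    ]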

To reproduce, using the optimum-amd flow:

CUDA_VISIBLE_DEVICES=0 python quantize_llm.py --model mistralai/Mistral-7B-v0.1

with onnx==1.15.0, torch==2.2.0, brevitas==0.10.2, optimum==1.17.1, optimum-amd from main

Thanks @fxmarty

@fxmarty

fxmarty commented Feb 22, 2024

@costigt-dev @Giuseppe5 Brevitas seems to be using Constant for the int8 weights in ONNX, while the PyTorch ONNX export / ORT quantizer use Initializer. I'm not sure whether this difference matters, but just noting it.
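For reference, a quick way to check where the int8 weights end up after export (a sketch; "model.onnx" is a placeholder path):

    import onnx

    # Skip loading external data so large exports can still be inspected.
    model = onnx.load("model.onnx", load_external_data=False)
    num_constants = sum(1 for node in model.graph.node if node.op_type == "Constant")
    num_initializers = len(model.graph.initializer)
    print(f"Constant nodes: {num_constants}, initializers: {num_initializers}")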

Can also be reproduced with daryl149/llama-2-7b-chat-hf & transformers==4.38.1

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.82s/it]
Computing perplexity...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:16<00:00,  7.58it/s]
Perplexity (original model): 14.506609916687012
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/torch/_tensor.py:1394: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1908.)
  return super().rename(names)
Computing perplexity...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [01:00<00:00,  2.13it/s]
Perplexity (quantized model): 34.405277252197266
Exporting the model to ONNX...
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.2.0+cu121
Overriding 1 configuration item(s)
        - use_cache -> True
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py:1057: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_length > self.causal_mask.shape[-1]:
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/brevitas/quant_tensor/__init__.py:68: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  training = torch.tensor(training, dtype=torch.bool)
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/brevitas/export/common/handler/qcdq.py:52: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert bools
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/brevitas/quant_tensor/__init__.py:66: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  signed = torch.tensor(signed, dtype=torch.bool)
Saving external data to one file...
Traceback (most recent call last):
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/onnx/serialization.py", line 100, in serialize_proto
    result = proto.SerializeToString()
ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 6642034969

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/felix/optimum-amd/examples/quantization/brevitas/quantize_llm.py", line 163, in <module>
    main(args)
  File "/home/felix/optimum-amd/examples/quantization/brevitas/quantize_llm.py", line 82, in main
    onnx_export_from_model(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/optimum/exporters/onnx/convert.py", line 1152, in onnx_export_from_model
    _, onnx_outputs = export_models(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/optimum/exporters/onnx/convert.py", line 763, in export_models
    export(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/optimum/exporters/onnx/convert.py", line 868, in export
    export_output = export_pytorch(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/optimum/exporters/onnx/convert.py", line 607, in export_pytorch
    onnx.save(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/onnx/__init__.py", line 326, in save_model
    serialized = _get_serializer(format, model_filepath).serialize_proto(proto)
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/onnx/serialization.py", line 103, in serialize_proto
    raise ValueError(
ValueError: The proto size is larger than the 2 GB limit. Please use save_as_external_data to save tensors separately from the model file.
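For context, the final error points at onnx's external-data path. A minimal sketch of that call (model_proto and the file names are placeholders). By default only initializers are externalized; convert_attribute=True asks onnx to also externalize tensors held inside node attributes such as Constant, though whether that is enough for this export path is untested:

    import onnx

    onnx.save(
        model_proto,               # the exported onnx.ModelProto (placeholder name)
        "model.onnx",
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location="model.onnx_data",
        size_threshold=1024,       # externalize tensors larger than this many bytes
        convert_attribute=True,    # also externalize tensors stored in node attributes
    )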

@fxmarty

fxmarty commented Feb 22, 2024

Note: doing the export with

    export_manager = StdQCDQONNXManager
    export_manager.change_weight_export(export_weight_q_node=True)
    with torch.no_grad(), brevitas_proxy_export_mode(quantized_model, export_manager=export_manager):

instead of simply

    with torch.no_grad(), brevitas_proxy_export_mode(quantized_model, export_manager=StdQCDQONNXManager):

fixes the issue. But this is not a good long-term fix, as the serialized model is then ~4x bigger (the weights are stored in full precision and only quantized at load time by the QuantizeLinear node).

@Giuseppe5
Collaborator Author

Maybe this could be relevant:
onnx/onnx#5949

@Giuseppe5
Collaborator Author

PyTorch 2.2 has partially fixed this issue: pytorch/pytorch#111097

The problem in PyTorch <2.2 seems to be that constants are not accounted for in the model-size computation.
It would be worth investigating how to mark a value as an Initializer rather than a Constant when exporting from PyTorch to ONNX.
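One possible direction is a post-export pass that moves the tensors out of Constant nodes into graph initializers. A rough sketch (not existing Brevitas/optimum code; it rewrites the graph after the fact rather than fixing the export itself, and only handles Constant nodes carrying a plain "value" tensor attribute):

    import onnx
    from onnx import helper

    def constants_to_initializers(model: onnx.ModelProto) -> onnx.ModelProto:
        # Move tensors held by Constant nodes into graph initializers.
        to_remove = []
        for node in model.graph.node:
            if node.op_type == "Constant" and node.attribute and node.attribute[0].name == "value":
                tensor = helper.get_attribute_value(node.attribute[0])
                # Initializers are matched by name, so reuse the Constant's output name.
                tensor.name = node.output[0]
                model.graph.initializer.append(tensor)  # append copies the tensor
                to_remove.append(node)
        for node in to_remove:
            model.graph.node.remove(node)
        return model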

cc @costigt-dev

@costigt-dev
Collaborator

From my investigations there doesn't appear to be any straightforward way to work around this issue in PyTorch 2.1 or below.
