ONNX export of integer weights with large models #872

Open
Giuseppe5 opened this issue Feb 22, 2024 · 5 comments

@Giuseppe5
Collaborator

When trying to export large models, we are currently forced to export the QDQ pattern for weights, instead of simply exporting integer weights -> DQ.

The error seems to be caused by the fact that adding a new node with integer weights to the graph confuses the model-size calculation during torch export, which then triggers the >2GB error.
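For reference, the two patterns being compared look roughly like this. This is a minimal sketch with illustrative names, built with the onnx helper API; it is not Brevitas code:

    import numpy as np
    from onnx import helper, numpy_helper

    scale = numpy_helper.from_array(np.array(0.1, dtype=np.float32), name="w_scale")
    zero_point = numpy_helper.from_array(np.array(0, dtype=np.int8), name="w_zp")

    # QDQ pattern: fp32 weights are stored in the model and quantized in-graph.
    w_fp32 = numpy_helper.from_array(np.zeros((4, 4), dtype=np.float32), name="w_fp32")
    qdq_nodes = [
        helper.make_node("QuantizeLinear", ["w_fp32", "w_scale", "w_zp"], ["w_q"]),
        helper.make_node("DequantizeLinear", ["w_q", "w_scale", "w_zp"], ["w_dq"]),
    ]

    # Integer weights -> DQ pattern: the int8 tensor is stored directly and only a
    # DequantizeLinear remains in the graph, so the serialized weights are ~4x
    # smaller (int8 vs fp32).
    w_int8 = numpy_helper.from_array(np.zeros((4, 4), dtype=np.int8), name="w_int8")
    dq_nodes = [
        helper.make_node("DequantizeLinear", ["w_int8", "w_scale", "w_zp"], ["w_dq"]),
    ]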

To reproduce, using the optimum-amd flow:

CUDA_VISIBLE_DEVICES=0 python quantize_llm.py --model mistralai/Mistral-7B-v0.1

with onnx==1.15.0, torch==2.2.0, brevitas==0.10.2, optimum==1.17.1, optimum-amd from main

Thanks @fxmarty

@fxmarty

fxmarty commented Feb 22, 2024

@costigt-dev @Giuseppe5 Brevitas seems to be using Constant for the int8 weights in ONNX, while the PyTorch ONNX export / ORT quantizer use Initializer. I'm not sure whether this difference matters, but just noting it.
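For reference, a quick way to check where the int8 weights end up after export (a sketch; "model.onnx" is a placeholder path):

    import onnx

    # Skip loading external data so large exports can still be inspected.
    model = onnx.load("model.onnx", load_external_data=False)
    num_constants = sum(1 for node in model.graph.node if node.op_type == "Constant")
    num_initializers = len(model.graph.initializer)
    print(f"Constant nodes: {num_constants}, initializers: {num_initializers}")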

Can also be reproduced with daryl149/llama-2-7b-chat-hf & transformers==4.38.1

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.82s/it]
Computing perplexity...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:16<00:00,  7.58it/s]
Perplexity (original model): 14.506609916687012
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/torch/_tensor.py:1394: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1908.)
  return super().rename(names)
Computing perplexity...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [01:00<00:00,  2.13it/s]
Perplexity (quantized model): 34.405277252197266
Exporting the model to ONNX...
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.2.0+cu121
Overriding 1 configuration item(s)
        - use_cache -> True
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py:1057: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_length > self.causal_mask.shape[-1]:
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/brevitas/quant_tensor/__init__.py:68: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  training = torch.tensor(training, dtype=torch.bool)
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/brevitas/export/common/handler/qcdq.py:52: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert bools
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/brevitas/quant_tensor/__init__.py:66: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  signed = torch.tensor(signed, dtype=torch.bool)
Saving external data to one file...
Traceback (most recent call last):
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/onnx/serialization.py", line 100, in serialize_proto
    result = proto.SerializeToString()
ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 6642034969

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/felix/optimum-amd/examples/quantization/brevitas/quantize_llm.py", line 163, in <module>
    main(args)
  File "/home/felix/optimum-amd/examples/quantization/brevitas/quantize_llm.py", line 82, in main
    onnx_export_from_model(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/optimum/exporters/onnx/convert.py", line 1152, in onnx_export_from_model
    _, onnx_outputs = export_models(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/optimum/exporters/onnx/convert.py", line 763, in export_models
    export(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/optimum/exporters/onnx/convert.py", line 868, in export
    export_output = export_pytorch(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/optimum/exporters/onnx/convert.py", line 607, in export_pytorch
    onnx.save(
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/onnx/__init__.py", line 326, in save_model
    serialized = _get_serializer(format, model_filepath).serialize_proto(proto)
  File "/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/onnx/serialization.py", line 103, in serialize_proto
    raise ValueError(
ValueError: The proto size is larger than the 2 GB limit. Please use save_as_external_data to save tensors separately from the model file.
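For context, the final error points at onnx's external-data path. A minimal sketch of that call (model_proto and the file names are placeholders). By default only initializers are externalized; convert_attribute=True asks onnx to also externalize tensors held inside node attributes such as Constant, though whether that is enough for this export path is untested:

    import onnx

    onnx.save(
        model_proto,               # the exported onnx.ModelProto (placeholder name)
        "model.onnx",
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location="model.onnx_data",
        size_threshold=1024,       # externalize tensors larger than this many bytes
        convert_attribute=True,    # also externalize tensors stored in node attributes
    )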

@fxmarty

fxmarty commented Feb 22, 2024

Note: doing the export with

    export_manager = StdQCDQONNXManager
    export_manager.change_weight_export(export_weight_q_node=True)
    with torch.no_grad(), brevitas_proxy_export_mode(quantized_model, export_manager=export_manager):

instead of simply

    with torch.no_grad(), brevitas_proxy_export_mode(quantized_model, export_manager=StdQCDQONNXManager):

fixes the issue. But this is not a good long-term fix, as the serialized model is then ~4x bigger (the weights are stored in full precision and only quantized at load time by the QuantizeLinear node).

@Giuseppe5
Collaborator Author

Maybe this could be relevant:
onnx/onnx#5949

@Giuseppe5
Collaborator Author

PyTorch 2.2 has partially fixed this issue: pytorch/pytorch#111097

The problem in PyTorch <2.2 seems to be that constants are not accounted for in the model-size computation.
It would be worth investigating how to mark a value as an Initializer rather than a Constant when exporting from PyTorch to ONNX.
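One possible direction is a post-export pass that moves the tensors out of Constant nodes into graph initializers. A rough sketch (not existing Brevitas/optimum code; it rewrites the graph after the fact rather than fixing the export itself, and only handles Constant nodes carrying a plain "value" tensor attribute):

    import onnx
    from onnx import helper

    def constants_to_initializers(model: onnx.ModelProto) -> onnx.ModelProto:
        # Move tensors held by Constant nodes into graph initializers.
        to_remove = []
        for node in model.graph.node:
            if node.op_type == "Constant" and node.attribute and node.attribute[0].name == "value":
                tensor = helper.get_attribute_value(node.attribute[0])
                # Initializers are matched by name, so reuse the Constant's output name.
                tensor.name = node.output[0]
                model.graph.initializer.append(tensor)  # append copies the tensor
                to_remove.append(node)
        for node in to_remove:
            model.graph.node.remove(node)
        return model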

cc @costigt-dev

@costigt-dev
Collaborator

From my investigations there doesn't appear to be any straightforward way to work around this issue in PyTorch 2.1 or below.
