Olive workflow for mistral model optimization does not work #1075

Open
jojo1899 opened this issue Apr 11, 2024 · 16 comments
Comments

@jojo1899

jojo1899 commented Apr 11, 2024

Describe the bug
Following the instructions in examples/mistral does not result in a quantized ONNX model. After running the workflow, the output_model folder within the cache directory contains an ONNX model that is 27 GB on disk, and the models folder does not contain a quantized model.

To Reproduce
Follow the instructions in examples/mistral to run the optimization on CPU using: python mistral.py --optimize --config mistral_int4_optimize.json

Expected behavior
Expected to obtain an output model that is around 3.5 GB in the models directory.

Olive config
Available here

{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "mistralai/Mistral-7B-v0.1",
                "model_class": "MistralForCausalLM"
            }
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "cpu",
                        "execution_providers": [
                            "CPUExecutionProvider"
                        ]
                    }
                ]
            }
        }
    },
    "evaluators": {
        "common_evaluator": {
            "metrics": [
                {
                    "name": "latency",
                    "type": "latency",
                    "sub_types": [
                        {
                            "name": "avg",
                            "priority": 1
                        }
                    ],
                    "user_config": {
                        "user_script": "user_script.py",
                        "dataloader_func": "create_dataloader",
                        "batch_size": 1,
                        "inference_settings" : {
                            "onnx": {
                                "session_options": {
                                    "enable_profiling": false
                                }
                            }
                        }
                    }
                }
            ]
        }
    },
    "passes": {
        "convert": {
            "type": "OptimumConversion",
            "config": {
                "target_opset": 14,
                "extra_args": {
                    "legacy": false,
                    "no_post_process": false
                }
            }
        },
        "optimize": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "gpt2",
                "use_gpu": false,
                "keep_io_types": true,
                "optimization_options": {
                    "use_multi_head_attention": false
                },
                "save_as_external_data": true,
                "all_tensors_to_one_file": true
            }
        },
        "quantization": {
            "type": "IncStaticQuantization",
            "config": {
                "user_script": "user_script.py",
                "approach": "weight_only",
                "weight_only_config": {
                    "bits": 4,
                    "algorithm": "GPTQ"
                },
                "recipes":{
                    "gptq_args": {
                        "accuracy_level": 0
                    }
                },
                "dataloader_func": "calib_dataloader",
                "calibration_sampling_size": [
                    8
                ],
                "save_as_external_data": true,
                "all_tensors_to_one_file": true,
                "diagnosis": false
            }
        }
    },
    "pass_flows": [
        [
            "convert",
            "optimize",
            "quantization"
        ]
    ],
    "engine": {
        "evaluate_input_model": false,
        "evaluator": "common_evaluator",
        "host": "local_system",
        "target": "local_system",
        "cache_dir": "cache",
        "output_dir": "models",
        "output_name": "mistral_int4"
    }
}

Olive logs
C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-11 15:14:42,927] [INFO] [run.py:243:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 15:14:42,933] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2024-04-11 15:14:42,934] [INFO] [run.py:196:run_engine] Importing pass module OptimumConversion
[2024-04-11 15:14:42,934] [INFO] [run.py:196:run_engine] Importing pass module OrtTransformersOptimization
[2024-04-11 15:14:42,935] [INFO] [run.py:196:run_engine] Importing pass module IncStaticQuantization
[2024-04-11 15:14:42,936] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-11 15:14:42,937] [INFO] [engine.py:262:run] Running Olive on accelerator: cpu-cpu
[2024-04-11 15:14:43,817] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.22s/it]
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
Using the export variant default. Available variants are:
- default: The default ONNX variant.
Using framework PyTorch: 2.2.1+cu121
Overriding 1 configuration item(s)
- use_cache -> True
C:\MiniConda3\envs\myonnxrt\lib\site-packages\transformers\modeling_attn_mask_utils.py:114: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal:
C:\MiniConda3\envs\myonnxrt\lib\site-packages\optimum\exporters\onnx\model_patcher.py:301: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
C:\MiniConda3\envs\myonnxrt\lib\site-packages\transformers\models\mistral\modeling_mistral.py:120: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_len > self.max_seq_len_cached:
C:\MiniConda3\envs\myonnxrt\lib\site-packages\transformers\models\mistral\modeling_mistral.py:676: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
(the previous warning was repeated 32 times)
Saving external data to one file...
Post-processing the exported models...
Deduplicating shared (tied) weights...
Validating ONNX model cache/models/0_OptimumConversion-d3eae021dc4ad3d4cdbc16eba52ef561-ad904e90276e2793a36f3373323e91e1/output_model/model.onnx...
-[✓] ONNX model output names match reference model (present.31.key, present.18.key, present.13.value, present.0.value, present.7.key, present.20.value, present.15.key, present.3.key, present.18.value, present.29.value, present.14.value, present.4.value, present.9.value, present.26.key, present.24.value, present.27.key, present.23.value, present.10.value, present.6.value, present.28.key, present.4.key, present.8.key, present.17.key, present.1.key, present.27.value, present.16.value, present.11.key, present.15.value, present.23.key, present.21.key, present.5.key, present.7.value, present.21.value, present.26.value, present.30.key, present.0.key, present.2.value, present.11.value, present.9.key, present.16.key, present.17.value, present.19.value, present.10.key, present.20.key, present.25.value, present.31.value, present.29.key, present.2.key, present.25.key, present.28.value, present.8.value, present.24.key, present.30.value, present.12.value, present.13.key, present.22.key, present.22.value, present.12.key, present.19.key, present.14.key, present.1.value, present.6.key, logits, present.3.value, present.5.value)
- Validating ONNX Model output "logits":
-[✓] (2, 16, 32000) matches (2, 16, 32000)
-[x] values not close enough, max diff: 3.62396240234375e-05 (atol: 1e-05)
- Validating ONNX Model output "present.0.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.0.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.1.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.1.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.2.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.2.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.3.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.3.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.4.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.4.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.5.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.5.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.6.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.6.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.7.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.7.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.8.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.8.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.9.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.9.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.10.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.10.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.11.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.11.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.12.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.12.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.13.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.5854835510253906e-05 (atol: 1e-05)
- Validating ONNX Model output "present.13.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.14.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.14.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.15.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.7523765563964844e-05 (atol: 1e-05)
- Validating ONNX Model output "present.15.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.16.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 2.0742416381835938e-05 (atol: 1e-05)
- Validating ONNX Model output "present.16.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.17.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 2.6702880859375e-05 (atol: 1e-05)
- Validating ONNX Model output "present.17.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.18.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 3.0279159545898438e-05 (atol: 1e-05)
- Validating ONNX Model output "present.18.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.19.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 4.1961669921875e-05 (atol: 1e-05)
- Validating ONNX Model output "present.19.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.20.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 4.935264587402344e-05 (atol: 1e-05)
- Validating ONNX Model output "present.20.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.21.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 5.6743621826171875e-05 (atol: 1e-05)
- Validating ONNX Model output "present.21.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.22.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 5.91278076171875e-05 (atol: 1e-05)
- Validating ONNX Model output "present.22.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.23.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 5.5789947509765625e-05 (atol: 1e-05)
- Validating ONNX Model output "present.23.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.24.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 4.0531158447265625e-05 (atol: 1e-05)
- Validating ONNX Model output "present.24.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.25.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 3.4809112548828125e-05 (atol: 1e-05)
- Validating ONNX Model output "present.25.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.26.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 3.814697265625e-05 (atol: 1e-05)
- Validating ONNX Model output "present.26.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.1026859283447266e-05 (atol: 1e-05)
- Validating ONNX Model output "present.27.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 2.956390380859375e-05 (atol: 1e-05)
- Validating ONNX Model output "present.27.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.28.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 3.0040740966796875e-05 (atol: 1e-05)
- Validating ONNX Model output "present.28.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.2159347534179688e-05 (atol: 1e-05)
- Validating ONNX Model output "present.29.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.7642974853515625e-05 (atol: 1e-05)
- Validating ONNX Model output "present.29.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.9088387489318848e-05 (atol: 1e-05)
- Validating ONNX Model output "present.30.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.9550323486328125e-05 (atol: 1e-05)
- Validating ONNX Model output "present.30.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.519918441772461e-05 (atol: 1e-05)
- Validating ONNX Model output "present.31.key":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.31.value":
-[✓] (2, 8, 32, 128) matches (2, 8, 32, 128)
-[x] values not close enough, max diff: 1.52587890625e-05 (atol: 1e-05)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:

  • logits: max diff = 3.62396240234375e-05
  • present.13.key: max diff = 1.5854835510253906e-05
  • present.15.key: max diff = 1.7523765563964844e-05
  • present.16.key: max diff = 2.0742416381835938e-05
  • present.17.key: max diff = 2.6702880859375e-05
  • present.18.key: max diff = 3.0279159545898438e-05
  • present.19.key: max diff = 4.1961669921875e-05
  • present.20.key: max diff = 4.935264587402344e-05
  • present.21.key: max diff = 5.6743621826171875e-05
  • present.22.key: max diff = 5.91278076171875e-05
  • present.23.key: max diff = 5.5789947509765625e-05
  • present.24.key: max diff = 4.0531158447265625e-05
  • present.25.key: max diff = 3.4809112548828125e-05
  • present.26.key: max diff = 3.814697265625e-05
  • present.26.value: max diff = 1.1026859283447266e-05
  • present.27.key: max diff = 2.956390380859375e-05
  • present.28.key: max diff = 3.0040740966796875e-05
  • present.28.value: max diff = 1.2159347534179688e-05
  • present.29.key: max diff = 1.7642974853515625e-05
  • present.29.value: max diff = 1.9088387489318848e-05
  • present.30.key: max diff = 1.9550323486328125e-05
  • present.30.value: max diff = 1.519918441772461e-05
  • present.31.value: max diff = 1.52587890625e-05.
    The exported model was saved at: cache/models/0_OptimumConversion-d3eae021dc4ad3d4cdbc16eba52ef561-ad904e90276e2793a36f3373323e91e1/output_model
    [2024-04-11 15:23:26,254] [INFO] [engine.py:951:_run_pass] Pass convert:OptimumConversion finished in 522.433565 seconds
    [2024-04-11 15:23:26,296] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization

Other information

  • OS: Windows 11 Pro
  • Olive version: 0.6.0
  • ONNXRuntime package and version: onnxruntime-gpu 1.17.1

Additional context
It appears that the quantization is not being performed at all. I am looking into what the issue is.

@guotuofeng
Collaborator

What is the full log? It seems the cache folder contains the converted (not yet quantized) model.

@jojo1899
Author

jojo1899 commented Apr 12, 2024

That is the full log. I figured out there was some issue in optimizing the converted model, so I changed the mistral_int4_optimize.json config file by removing "optimize" from the pass_flows and updating it as follows:

"pass_flows": [
        [
            "convert",
            "quantization"
        ]

When I ran the script again, it seemed to produce a quantized model, but that model is only 1.45 GB on disk. I tried running the model using the CPUExecutionProvider and then the CUDAExecutionProvider, but both give a runtime error:

Traceback (most recent call last):
  File "c:\onnxrt\mainonnx.py", line 42, in <module>
    sess = InferenceSession(hf_model_path + "/model.onnx",
  File "C:\MiniConda3\envs\myonnxrt\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\MiniConda3\envs\myonnxrt\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Deserialize tensor onnx::MatMul_10759_Q4G32 failed.tensorprotoutils.cc:904 onnxruntime::utils::GetExtDataFromTensorProto External initializer: onnx::MatMul_10759_Q4G32 offset: 3208609792 size to read: 8388608 given file_length: 1559232512 are out of bounds or can not be read in full.
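
That error indicates that an external-data initializer in model.onnx points past the end of the weights file on disk, i.e. the saved external-data file is smaller than the recorded offsets. A minimal sketch for confirming the mismatch, assuming the standard onnx Python package and a placeholder model directory (not part of the Olive output layout):

import os
import onnx
from onnx.external_data_helper import uses_external_data

# Placeholder: directory containing model.onnx and its external-data file.
model_dir = "models"
model = onnx.load(os.path.join(model_dir, "model.onnx"), load_external_data=False)

for init in model.graph.initializer:
    if not uses_external_data(init):
        continue
    # external_data entries record the location, offset, and length of each tensor.
    info = {entry.key: entry.value for entry in init.external_data}
    data_file = os.path.join(model_dir, info["location"])
    offset, length = int(info.get("offset", 0)), int(info.get("length", 0))
    if offset + length > os.path.getsize(data_file):
        print(f"{init.name}: needs bytes up to {offset + length}, "
              f"but {info['location']} is only {os.path.getsize(data_file)} bytes")

Any initializer flagged by this check would reproduce the "out of bounds" failure above, which would be consistent with the 1.45 GB file being an incomplete dump of the quantized weights.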

@guotuofeng
Collaborator

If the above is the full log, I am guessing you hit an out-of-memory condition while optimizing the converted ONNX model; on OOM, the OS kills the Python process.

Could you try a machine with more memory and retry?

@jojo1899
Author

jojo1899 commented Apr 12, 2024

Yes, I am on it. The quantization took around 3.5 hours on my Intel i9-13980HX CPU, so it is time-consuming to test mistral_int4_optimize.json on different systems. How can mistral_fp16_optimize.json be modified so that I can try INT4 GPTQ quantization on my GPU with the CUDAExecutionProvider?

@guotuofeng
Collaborator

Would you try changing the accelerators, as in https://github.com/microsoft/Olive/blob/main/examples/mistral/mistral_fp16_optimize.json#L15-L21
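
For reference, a sketch of what that accelerator change might look like in the config (mirroring the linked fp16 example; only the device and execution provider differ from the CPU config above):

"systems": {
    "local_system": {
        "type": "LocalSystem",
        "config": {
            "accelerators": [
                {
                    "device": "gpu",
                    "execution_providers": [
                        "CUDAExecutionProvider"
                    ]
                }
            ]
        }
    }
}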

@guotuofeng
Collaborator

For more info, please refer to https://microsoft.github.io/Olive/tutorials/configure_systems.html

@jojo1899
Author

Thank you @guotuofeng. I can confirm that the examples mistral_int4_optimize.json and mistral_fp16_optimize.json work.

If anyone faces similar issues, make sure that you have sufficient disk space (around 100 GB or more). Disk space seemed to be the bottleneck for me, not the RAM. I tested it on two computers with 64 GB RAM and it worked well. Here are some details for the mistral_int4_optimize.json workflow:

  1. The mistral_int4_optimize.json workflow took me around 3.5 hr to run on high-end CPUs.
  2. The quantized model is 4.76 GB on disk.

I faced some other issues such as the resulting quantized model's responses being very poor and the CUDAExecutionProvider not working with a recent Nvidia SUPER graphics card that I am using. I will try to fix them and get back if needed.
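
Since disk space rather than RAM was the bottleneck here, a quick pre-flight check before launching the workflow can save a long failed run. A minimal sketch; the 100 GB threshold is just the rough figure mentioned above:

import shutil

# Free space on the drive holding the Olive working directory (where cache/ and models/ are written).
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")
if free_gb < 100:
    print("Warning: the Mistral INT4 workflow may run out of disk space mid-run.")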

@jojo1899
Author

Using mistral.py, we can carry out inference using the CUDAExecutionProvider or on the CPU. How can we perform inference on the GPU using DmlExecutionProvider?

onnxruntime-genai seemed to be an option, but it does not yet have support for DmlExecutionProvider.
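
With plain onnxruntime (outside onnxruntime-genai), a session can at least be requested with the DirectML provider. A minimal sketch, assuming the onnxruntime-directml package is installed and a placeholder path to the Olive output; whether the INT4 MatMul operators actually execute on DML is a separate question:

import onnxruntime as ort

# Placeholder path to the optimized/quantized model produced by Olive.
sess = ort.InferenceSession(
    "models/model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # shows which providers the session actually selected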

@jojo1899 jojo1899 reopened this Apr 17, 2024
@guotuofeng
Collaborator

guotuofeng commented Apr 18, 2024

Could you try https://microsoft.github.io/Olive/api/passes.html#cmdoption-arg-115 with backend onnxrt_dml_ep? I am not sure whether the INT4 quantization works against DML or not.

@jojo1899
Author

jojo1899 commented Apr 18, 2024

I suppose you meant onnxrt_dml_ep and not onnxrt_dnnl_ep. Anyway, I tried both.

TRIAL 1: I updated the mistral_int4_optimize.json as follows. I added "backend": "onnxrt_dnnl_ep" for IncStaticQuantization. While running the workflow, it warns as follows: Specified provider 'DnnlExecutionProvider' is not in available provider names. Fallback to available providers: 'DmlExecutionProvider, CPUExecutionProvider'. The workflow finished relatively quickly and the resulting 'quantized' model is 27 GB on disk.
Olive configuration:

{
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "hf_config": {
                "model_name": "mistralai/Mistral-7B-Instruct-v0.1",
                "model_class": "MistralForCausalLM"
            }
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "gpu",
                        "execution_providers": [
                            "DmlExecutionProvider"
                        ]
                    }
                ]
            }
        }
    },
    "evaluators": {
        "common_evaluator": {
            "metrics": [
                {
                    "name": "latency",
                    "type": "latency",
                    "sub_types": [
                        {
                            "name": "avg",
                            "priority": 1
                        }
                    ],
                    "user_config": {
                        "user_script": "user_script.py",
                        "dataloader_func": "create_dataloader",
                        "batch_size": 1,
                        "inference_settings" : {
                            "onnx": {
                                "session_options": {
                                    "enable_profiling": false
                                }
                            }
                        }
                    }
                }
            ]
        }
    },
    "passes": {
        "convert": {
            "type": "OptimumConversion",
            "config": {
                "target_opset": 14,
                "extra_args": {
                    "legacy": false,
                    "no_post_process": false
                }
            }
        },
        "optimize": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "gpt2",
                "use_gpu": false,
                "keep_io_types": true,
                "optimization_options": {
                    "use_multi_head_attention": false
                },
                "save_as_external_data": true,
                "all_tensors_to_one_file": true
            }
        },
        "quantization": {
            "type": "IncStaticQuantization",
            "config": {
                "backend": "onnxrt_dnnl_ep",
                "user_script": "user_script.py",
                "approach": "weight_only",
                "weight_only_config": {
                    "bits": 4,
                    "algorithm": "GPTQ"
                },
                "recipes":{
                    "gptq_args": {
                        "accuracy_level": 0
                    }
                },
                "dataloader_func": "calib_dataloader",
                "calibration_sampling_size": [
                    8
                ],
                "save_as_external_data": true,
                "all_tensors_to_one_file": true,
                "diagnosis": false
            }
        }
    },
    "pass_flows": [
        [
            "convert",
            "optimize",
            "quantization"
        ]
    ],
    "engine": {
        "evaluate_input_model": false,
        "evaluator": "common_evaluator",
        "host": "local_system",
        "target": "local_system",
        "cache_dir": "cache",
        "output_dir": "models",
        "output_name": "mistral_int4_dml"
    }
}

Here is the full log:

C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-18 11:07:03,572] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-18 11:07:03,588] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-18 11:07:03,588] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-18 11:07:03,588] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-18 11:07:04,825] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
[2024-04-18 11:07:04,825] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 from cache\runs
[2024-04-18 11:07:04,825] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-18 11:35:59,179] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 1734.338259 seconds
[2024-04-18 11:35:59,187] [INFO] [engine.py:864:_run_pass] Running pass quantization:IncStaticQuantization
[2024-04-18 11:36:05,418] [WARNING] [inc_quantization.py:440:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-04-18 11:37:22 [INFO] Start auto tuning.
2024-04-18 11:37:22 [INFO] Quantize model without tuning!
2024-04-18 11:37:22 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-04-18 11:37:22 [INFO] Adaptor has 5 recipes.
2024-04-18 11:37:22 [INFO] 0 recipes specified by user.
2024-04-18 11:37:22 [INFO] 3 recipes require future tuning.
2024-04-18 11:37:22 [WARNING] Specified provider 'DnnlExecutionProvider' is not in available provider names. Fallback to available providers: 'DmlExecutionProvider, CPUExecutionProvider'
2024-04-18 11:37:22 [INFO] *** Initialize auto tuning
Exception in thread Thread-4:
2024-04-18 11:37:22 [INFO] {
Traceback (most recent call last):
2024-04-18 11:37:22 [INFO]     'PostTrainingQuantConfig': {
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 980, in _bootstrap_inner
2024-04-18 11:37:22 [INFO]         'AccuracyCriterion': {
2024-04-18 11:37:22 [INFO]             'criterion': 'relative',
2024-04-18 11:37:22 [INFO]             'higher_is_better': True,
2024-04-18 11:37:22 [INFO]             'tolerable_loss': 0.01,
2024-04-18 11:37:22 [INFO]             'absolute': None,
2024-04-18 11:37:22 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000001A64EFA7550>>,
2024-04-18 11:37:22 [INFO]             'relative': 0.01
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'approach': 'post_training_weight_only',
2024-04-18 11:37:22 [INFO]         'backend': 'onnxrt_dnnl_ep',
2024-04-18 11:37:22 [INFO]         'calibration_sampling_size': [
2024-04-18 11:37:22 [INFO]             8
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'device': 'cpu',
2024-04-18 11:37:22 [INFO]         'diagnosis': False,
2024-04-18 11:37:22 [INFO]         'domain': 'auto',
2024-04-18 11:37:22 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-04-18 11:37:22 [INFO]         'excluded_precisions': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'framework': 'onnxruntime',
2024-04-18 11:37:22 [INFO]         'inputs': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'model_name': '',
2024-04-18 11:37:22 [INFO]         'ni_workload_name': 'quantization',
2024-04-18 11:37:22 [INFO]         'op_name_dict': None,
2024-04-18 11:37:22 [INFO]         'op_type_dict': {
2024-04-18 11:37:22 [INFO]             '.*': {
2024-04-18 11:37:22 [INFO]                 'weight': {
2024-04-18 11:37:22 [INFO]                     'bits': [
2024-04-18 11:37:22 [INFO]                         4
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'group_size': [
2024-04-18 11:37:22 [INFO]                         32
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'scheme': [
2024-04-18 11:37:22 [INFO]                         'asym'
2024-04-18 11:37:22 [INFO]                     ],
2024-04-18 11:37:22 [INFO]                     'algorithm': [
2024-04-18 11:37:22 [INFO]                         'GPTQ'
2024-04-18 11:37:22 [INFO]                     ]
2024-04-18 11:37:22 [INFO]                 }
2024-04-18 11:37:22 [INFO]             }
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'outputs': [
2024-04-18 11:37:22 [INFO]         ],
2024-04-18 11:37:22 [INFO]         'quant_format': 'QOperator',
    self.run()
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 1304, in run
2024-04-18 11:37:22 [INFO]         'quant_level': 'auto',
2024-04-18 11:37:22 [INFO]         'recipes': {
2024-04-18 11:37:22 [INFO]             'smooth_quant': False,
2024-04-18 11:37:22 [INFO]             'smooth_quant_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'layer_wise_quant': False,
2024-04-18 11:37:22 [INFO]             'layer_wise_quant_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'fast_bias_correction': False,
    self.finished.wait(self.interval)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 581, in wait
2024-04-18 11:37:22 [INFO]             'weight_correction': False,
2024-04-18 11:37:22 [INFO]             'gemm_to_matmul': True,
2024-04-18 11:37:22 [INFO]             'graph_optimization_level': None,
2024-04-18 11:37:22 [INFO]             'first_conv_or_matmul_quantization': True,
2024-04-18 11:37:22 [INFO]             'last_conv_or_matmul_quantization': True,
2024-04-18 11:37:22 [INFO]             'pre_post_process_quantization': True,
2024-04-18 11:37:22 [INFO]             'add_qdq_pair_to_weight': False,
    signaled = self._cond.wait(timeout)
2024-04-18 11:37:22 [INFO]             'optypes_to_exclude_output_quant': [
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 316, in wait
2024-04-18 11:37:22 [INFO]             ],
2024-04-18 11:37:22 [INFO]             'dedicated_qdq_pair': False,
2024-04-18 11:37:22 [INFO]             'rtn_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'awq_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'gptq_args': {
2024-04-18 11:37:22 [INFO]                 'accuracy_level': 0
    gotit = waiter.acquire(True, timeout)
2024-04-18 11:37:22 [INFO]             },
OverflowError: timeout value is too large
2024-04-18 11:37:22 [INFO]             'teq_args': {
2024-04-18 11:37:22 [INFO]             },
2024-04-18 11:37:22 [INFO]             'autoround_args': {
2024-04-18 11:37:22 [INFO]             }
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'reduce_range': False,
2024-04-18 11:37:22 [INFO]         'TuningCriterion': {
2024-04-18 11:37:22 [INFO]             'max_trials': 100,
2024-04-18 11:37:22 [INFO]             'objective': [
2024-04-18 11:37:22 [INFO]                 'performance'
2024-04-18 11:37:22 [INFO]             ],
2024-04-18 11:37:22 [INFO]             'strategy': 'basic',
2024-04-18 11:37:22 [INFO]             'strategy_kwargs': None,
2024-04-18 11:37:22 [INFO]             'timeout': 0
2024-04-18 11:37:22 [INFO]         },
2024-04-18 11:37:22 [INFO]         'use_bf16': True
2024-04-18 11:37:22 [INFO]     }
2024-04-18 11:37:22 [INFO] }
2024-04-18 11:37:22 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-04-18 11:37:22 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-04-18 11:37:22 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'DnnlExecutionProvider' is not in available provider names.Available providers: 'DmlExecutionProvider, CPUExecutionProvider'
  warnings.warn(
2024-04-18 11:38:05 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-04-18 11:38:05 [INFO] Quantize the model with default config.
2024-04-18 11:38:07 [INFO] |******Mixed Precision Statistics******|
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] |       Op Type       |     Total      |
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] +---------------------+----------------+
2024-04-18 11:38:07 [INFO] Pass quantize model elapsed time: 1917.88 ms
2024-04-18 11:38:07 [INFO] Save tuning history to C:\Olive\examples\mistral\nc_workspace\2024-04-18_11-35-59\./history.snapshot.
2024-04-18 11:38:07 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-04-18 11:38:07 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-04-18 11:38:07 [INFO] Save deploy yaml to C:\Olive\examples\mistral\nc_workspace\2024-04-18_11-35-59\deploy.yaml
[2024-04-18 11:38:28,125] [INFO] [engine.py:951:_run_pass] Pass quantization:IncStaticQuantization finished in 148.929999 seconds
[2024-04-18 11:38:28,125] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
[2024-04-18 11:39:55,095] [INFO] [engine.py:361:run_accelerator] Save footprint to models\mistral_int4_dml_gpu-dml_footprints.json.
[2024-04-18 11:39:55,099] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-18 11:39:55,117] [INFO] [engine.py:567:dump_run_history] run history:
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| model_id                                                                               | parent_model_id                                                                        | from_pass                   |   duration_sec | metrics                    |
+========================================================================================+========================================================================================+=============================+================+============================+
| 5af0f1a930787dedd19fa4814997b8a4                                                       |                                                                                        |                             |                |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | 5af0f1a930787dedd19fa4814997b8a4                                                       | OptimumConversion           |        378.937 |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | OrtTransformersOptimization |       1734.34  |                            |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
| 22_IncStaticQuantization-21-d038bb1662e6b6fb8eec0b99098940cb-gpu-dml                   | 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | IncStaticQuantization       |        148.93  | {                          |
|                                                                                        |                                                                                        |                             |                |   "latency-avg": 520.68174 |
|                                                                                        |                                                                                        |                             |                | }                          |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+----------------------------+
[2024-04-18 11:39:55,119] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts

TRIAL 2: I changed "backend": "onnxrt_dnnl_ep" to "backend": "onnxrt_dml_ep" and ran the workflow again. It resulted in a few warnings and errors. A couple of noteworthy warnings from the log are as follows:

  • [WARNING] Backend onnxrt_dml_ep requires a NPU device. Reset device to 'npu'.
  • [WARNING] [engine.py:357:run_accelerator] Failed to run Olive on gpu-dml.

Here is the full log:

C:\Olive\examples\mistral>python mistral.py --optimize --config mistral_int4_optimize.json
Optimizing mistralai/Mistral-7B-v0.1
[2024-04-18 12:18:02,387] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-18 12:18:02,406] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-18 12:18:02,406] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-18 12:18:02,406] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass convert:OptimumConversion
[2024-04-18 12:18:04,096] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 from cache\runs
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-18 12:18:04,096] [INFO] [engine.py:898:_run_pass] Loaded model from cache: 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml from cache\runs
[2024-04-18 12:18:04,096] [INFO] [engine.py:864:_run_pass] Running pass quantization:IncStaticQuantization
[2024-04-18 12:18:07,192] [WARNING] [inc_quantization.py:440:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-04-18 12:19:06 [INFO] Start auto tuning.
2024-04-18 12:19:06 [INFO] Quantize model without tuning!
2024-04-18 12:19:06 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-04-18 12:19:06 [INFO] Adaptor has 5 recipes.
2024-04-18 12:19:06 [INFO] 0 recipes specified by user.
2024-04-18 12:19:06 [INFO] 3 recipes require future tuning.
2024-04-18 12:19:06 [WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
2024-04-18 12:19:06 [INFO] *** Initialize auto tuning
Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 980, in _bootstrap_inner
2024-04-18 12:19:06 [INFO] {
2024-04-18 12:19:06 [INFO]     'PostTrainingQuantConfig': {
2024-04-18 12:19:06 [INFO]         'AccuracyCriterion': {
2024-04-18 12:19:06 [INFO]             'criterion': 'relative',
2024-04-18 12:19:06 [INFO]             'higher_is_better': True,
2024-04-18 12:19:06 [INFO]             'tolerable_loss': 0.01,
2024-04-18 12:19:06 [INFO]             'absolute': None,
2024-04-18 12:19:06 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000001E3B51BBF70>>,
2024-04-18 12:19:06 [INFO]             'relative': 0.01
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'approach': 'post_training_weight_only',
2024-04-18 12:19:06 [INFO]         'backend': 'onnxrt_dml_ep',
2024-04-18 12:19:06 [INFO]         'calibration_sampling_size': [
2024-04-18 12:19:06 [INFO]             8
    self.run()
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 1304, in run
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'device': 'cpu',
2024-04-18 12:19:06 [INFO]         'diagnosis': False,
2024-04-18 12:19:06 [INFO]         'domain': 'auto',
2024-04-18 12:19:06 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-04-18 12:19:06 [INFO]         'excluded_precisions': [
2024-04-18 12:19:06 [INFO]         ],
    self.finished.wait(self.interval)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 581, in wait
2024-04-18 12:19:06 [INFO]         'framework': 'onnxruntime',
2024-04-18 12:19:06 [INFO]         'inputs': [
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'model_name': '',
2024-04-18 12:19:06 [INFO]         'ni_workload_name': 'quantization',
2024-04-18 12:19:06 [INFO]         'op_name_dict': None,
2024-04-18 12:19:06 [INFO]         'op_type_dict': {
2024-04-18 12:19:06 [INFO]             '.*': {
2024-04-18 12:19:06 [INFO]                 'weight': {
    signaled = self._cond.wait(timeout)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 316, in wait
2024-04-18 12:19:06 [INFO]                     'bits': [
2024-04-18 12:19:06 [INFO]                         4
2024-04-18 12:19:06 [INFO]                     ],
2024-04-18 12:19:06 [INFO]                     'group_size': [
2024-04-18 12:19:06 [INFO]                         32
2024-04-18 12:19:06 [INFO]                     ],
2024-04-18 12:19:06 [INFO]                     'scheme': [
2024-04-18 12:19:06 [INFO]                         'asym'
    gotit = waiter.acquire(True, timeout)
2024-04-18 12:19:06 [INFO]                     ],
OverflowError: timeout value is too large
2024-04-18 12:19:06 [INFO]                     'algorithm': [
2024-04-18 12:19:06 [INFO]                         'GPTQ'
2024-04-18 12:19:06 [INFO]                     ]
2024-04-18 12:19:06 [INFO]                 }
2024-04-18 12:19:06 [INFO]             }
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'outputs': [
2024-04-18 12:19:06 [INFO]         ],
2024-04-18 12:19:06 [INFO]         'quant_format': 'QOperator',
2024-04-18 12:19:06 [INFO]         'quant_level': 'auto',
2024-04-18 12:19:06 [INFO]         'recipes': {
2024-04-18 12:19:06 [INFO]             'smooth_quant': False,
2024-04-18 12:19:06 [INFO]             'smooth_quant_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'layer_wise_quant': False,
2024-04-18 12:19:06 [INFO]             'layer_wise_quant_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'fast_bias_correction': False,
2024-04-18 12:19:06 [INFO]             'weight_correction': False,
2024-04-18 12:19:06 [INFO]             'gemm_to_matmul': True,
2024-04-18 12:19:06 [INFO]             'graph_optimization_level': None,
2024-04-18 12:19:06 [INFO]             'first_conv_or_matmul_quantization': True,
2024-04-18 12:19:06 [INFO]             'last_conv_or_matmul_quantization': True,
2024-04-18 12:19:06 [INFO]             'pre_post_process_quantization': True,
2024-04-18 12:19:06 [INFO]             'add_qdq_pair_to_weight': False,
2024-04-18 12:19:06 [INFO]             'optypes_to_exclude_output_quant': [
2024-04-18 12:19:06 [INFO]             ],
2024-04-18 12:19:06 [INFO]             'dedicated_qdq_pair': False,
2024-04-18 12:19:06 [INFO]             'rtn_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'awq_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'gptq_args': {
2024-04-18 12:19:06 [INFO]                 'accuracy_level': 0
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'teq_args': {
2024-04-18 12:19:06 [INFO]             },
2024-04-18 12:19:06 [INFO]             'autoround_args': {
2024-04-18 12:19:06 [INFO]             }
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'reduce_range': False,
2024-04-18 12:19:06 [INFO]         'TuningCriterion': {
2024-04-18 12:19:06 [INFO]             'max_trials': 100,
2024-04-18 12:19:06 [INFO]             'objective': [
2024-04-18 12:19:06 [INFO]                 'performance'
2024-04-18 12:19:06 [INFO]             ],
2024-04-18 12:19:06 [INFO]             'strategy': 'basic',
2024-04-18 12:19:06 [INFO]             'strategy_kwargs': None,
2024-04-18 12:19:06 [INFO]             'timeout': 0
2024-04-18 12:19:06 [INFO]         },
2024-04-18 12:19:06 [INFO]         'use_bf16': True
2024-04-18 12:19:06 [INFO]     }
2024-04-18 12:19:06 [INFO] }
2024-04-18 12:19:06 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-04-18 12:19:06 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-04-18 12:19:06 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2024-04-18 12:19:40 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-04-18 12:19:40 [INFO] Quantize the model with default config.
2024-04-18 12:19:41 [INFO] |******Mixed Precision Statistics******|
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] |       Op Type       |     Total      |
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] +---------------------+----------------+
2024-04-18 12:19:41 [INFO] Pass quantize model elapsed time: 843.77 ms
2024-04-18 12:19:41 [INFO] Save tuning history to C:\Olive\examples\mistral\nc_workspace\2024-04-18_12-18-04\./history.snapshot.
2024-04-18 12:19:41 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-04-18 12:19:41 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-04-18 12:19:41 [INFO] Save deploy yaml to C:\Olive\examples\mistral\nc_workspace\2024-04-18_12-18-04\deploy.yaml
[2024-04-18 12:20:01,362] [INFO] [engine.py:951:_run_pass] Pass quantization:IncStaticQuantization finished in 117.265407 seconds
[2024-04-18 12:20:01,374] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
2024-04-18 12:20:02.4692837 [E:onnxruntime:, inference_session.cc:1997 onnxruntime::InferenceSession::Initialize::<lambda_80060d29f848598faaecbd5242ad430a>::operator ()] Exception during initialization: invalid unordered_map<K, T> key
[2024-04-18 12:20:02,468] [WARNING] [engine.py:357:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
  File "C:\Olive\olive\engine\engine.py", line 336, in run_accelerator
    output_footprint = self.run_no_search(
  File "C:\Olive\olive\engine\engine.py", line 428, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "C:\Olive\olive\engine\engine.py", line 843, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "C:\Olive\olive\engine\engine.py", line 1041, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "C:\Olive\olive\systems\local.py", line 46, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 214, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 132, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 767, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 540, in _evaluate_onnx_latency
    session, inference_settings = OnnxEvaluator.get_session_wrapper(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 435, in get_session_wrapper
    session = model.prepare_session(
  File "C:\Olive\olive\model\handler\onnx.py", line 114, in prepare_session
    return get_ort_inference_session(
  File "C:\Olive\olive\common\ort_inference.py", line 118, in get_ort_inference_session
    session = ort.InferenceSession(
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: invalid unordered_map<K, T> key
[2024-04-18 12:20:02,515] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-18 12:20:02,531] [INFO] [engine.py:567:dump_run_history] run history:
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| model_id                                                                               | parent_model_id                                                                        | from_pass                   |   duration_sec | metrics   |
+========================================================================================+========================================================================================+=============================+================+===========+
| 5af0f1a930787dedd19fa4814997b8a4                                                       |                                                                                        |                             |                |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | 5af0f1a930787dedd19fa4814997b8a4                                                       | OptimumConversion           |        378.937 |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | 17_OptimumConversion-5af0f1a930787dedd19fa4814997b8a4-ad904e90276e2793a36f3373323e91e1 | OrtTransformersOptimization |       1734.34  |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 23_IncStaticQuantization-21-b76f1bb364ef9dc8aca22db9c5b3ee30-gpu-dml                   | 21_OrtTransformersOptimization-17-7ee24d7faf207d244aa16596fc4f536c-gpu-dml             | IncStaticQuantization       |        117.265 |           |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
[2024-04-18 12:20:02,531] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts

@guotuofeng
Collaborator

Yes, I mean the DML EP. As for the error, we might need to ask the DML EP team. @PatriceVignola, do you have any insight into this error?

@guotuofeng
Collaborator

#852 (comment)

@jojo1899
Author

@guotuofeng The following code snippet works like a charm with the INT4 model created using the scripts in examples/mistral

# Imports implied by the snippet (assuming onnxruntime, optimum, and transformers);
# hfmodelpath points to the optimized model directory and is defined elsewhere.
import time

import onnxruntime as ort
from onnxruntime import InferenceSession
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained(hfmodelpath + "/config.json")
tokenizer = AutoTokenizer.from_pretrained(hfmodelpath)

options = ort.SessionOptions()

sess = InferenceSession(hfmodelpath + "/model.onnx",
                        load_external_data=True, 
                        sess_options=options,
                        provider = "CUDAExecutionProvider")

inputs = tokenizer("The lightest element is", return_tensors="pt")

model = ORTModelForCausalLM(sess, config, use_cache=True)    
model = model.to("cuda")
inputs = inputs.to('cuda')
starttime = time.time()
outputs = model.generate(**inputs, max_new_tokens=512)
endtime = time.time()
print(f"Latency = {endtime-starttime} seconds")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I simply want to use DmlExecutionProvider instead of CUDAExecutionProvider. I tried the following, but it results in an error.

# Same imports as the snippet above, plus torch_directml:
import torch_directml

config = AutoConfig.from_pretrained(hfmodelpath + "/config.json")
tokenizer = AutoTokenizer.from_pretrained(hfmodelpath)

options = ort.SessionOptions()

sess = InferenceSession(hfmodelpath + "/model.onnx",
                        load_external_data=True, 
                        sess_options=options,
                        provider = "DmlExecutionProvider")

inputs = tokenizer("The lightest element is", return_tensors="pt")

model = ORTModelForCausalLM(sess, config, use_cache=True)    
device = torch_directml.device(0) 
model = model.to(device)
inputs = inputs.to(device)
starttime = time.time()
outputs = model.generate(**inputs, max_new_tokens=512)
endtime = time.time()
print(f"Latency = {endtime-starttime} seconds")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

RuntimeError: Cannot access data pointer of Tensor that doesn't have storage

Do you know if I can fix this error, or is it not possible to use DmlExecutionProvider in this case?
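
For reference, here is a minimal sanity check I would also consider (a sketch under assumptions, not a verified fix): it loads the same model.onnx with plain onnxruntime and the DML EP, builds dummy numpy feeds from the session's declared inputs, and runs a single forward pass, so it avoids torch_directml tensors entirely. hfmodelpath is the same path as in the snippets above.

import numpy as np
import onnxruntime as ort

# Load the optimized model directly with the DML execution provider.
sess = ort.InferenceSession(
    hfmodelpath + "/model.onnx",
    providers=["DmlExecutionProvider"],
)

# Build dummy feeds: symbolic dims (batch, sequence, past length) become 1,
# and dtypes follow what the session reports for each input.
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    if "int64" in inp.type:
        dtype = np.int64
    elif "float16" in inp.type:
        dtype = np.float16
    else:
        dtype = np.float32
    feeds[inp.name] = np.ones(shape, dtype=dtype)

# A single forward pass; if this succeeds, the DML EP itself can run the model
# and the failure is more likely in the torch_directml tensor handoff.
outputs = sess.run(None, feeds)
print([out.shape for out in outputs])
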

@guotuofeng
Collaborator

I am not sure; I haven't tried DML before since we don't have a DML GPU.

@jojo1899
Author

@guotuofeng Thank you for the responses.

I am now trying out LLM Optimization with DirectML, which was updated yesterday.

@guotuofeng
Collaborator

guotuofeng commented Apr 19, 2024

Actually, some OPs are still pending to be merged in that example.
