
I don't have models/optimized/llama_v2 folder after I've run python llama_v2.py --optimize #905

Open
KarpovVolodymyr opened this issue Jan 28, 2024 · 8 comments

Comments

@KarpovVolodymyr

KarpovVolodymyr commented Jan 28, 2024

Describe the bug
Hello. I was following the steps from this guide https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190

At the end of step 2, when I run the python llama_v2.py --optimize command, the run just stops, and it stops at the same place each time I re-run it.
It generates a model.onnx file at Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4\output_model\model.onnx,
but there is no models/optimized/llama_v2 folder.

As I understand it, these files should be there:
[Screenshot 2024-01-28 193022, showing the expected files]

I don't get any errors, so it is difficult to tell whether something is wrong or not.

Could you please give me some advice on what I am doing wrong?

To Reproduce
Steps to reproduce the behavior.

Expected behavior
From the description, I expect this:
Once the script successfully completes, the optimized ONNX pipeline will be stored under models/optimized/llama_v2.

Olive config
Add Olive configurations here.

Olive logs
(llama2_Optimize) C:\Users\proxi\Olive\examples\directml\llama_v2>python llama_v2.py --model_type=7b-chat

Optimizing argmax_sampling
[2024-01-28 19:05:45,570] [INFO] [accelerator.py:205:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-01-28 19:05:45,578] [INFO] [engine.py:851:_run_pass] Running pass convert:OnnxConversion
[2024-01-28 19:05:45,590] [INFO] [footprint.py:101:create_pareto_frontier] Output all 2 models
[2024-01-28 19:05:45,590] [INFO] [footprint.py:120:_create_pareto_frontier_from_nodes] pareto frontier points: 0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4
{
"latency-avg": 0.23591
}
[2024-01-28 19:05:45,591] [INFO] [engine.py:282:run] Run history for gpu-dml:
[2024-01-28 19:05:45,595] [INFO] [engine.py:557:dump_run_history] run history:
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+--------------------------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+====================================================================================+==================================+================+================+==========================+
| 14dc7b7c3125d3ad1222f0b9e2e5b807 | | | | |
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+--------------------------+
| 0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4 | 14dc7b7c3125d3ad1222f0b9e2e5b807 | OnnxConversion | 0.135661 | { |
| | | | | "latency-avg": 0.23591 |
| | | | | } |
+------------------------------------------------------------------------------------+----------------------------------+----------------+----------------+--------------------------+
[2024-01-28 19:05:45,595] [INFO] [engine.py:296:run] No packaging config provided, skip packaging artifacts
Optimized Model : C:\Users\proxi\Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-dc5fbbbe422d406cc8fcef71d99251a4\output_model\model.onnx

Optimizing llama_v2
[2024-01-28 19:05:45,607] [INFO] [accelerator.py:205:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-01-28 19:05:45,635] [INFO] [engine.py:851:_run_pass] Running pass convert:OnnxConversion
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\proxi\miniconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_graph_shape_type_inference(

Other information

  • OS: Windows, AMD GPU
  • Olive version: [e.g. 0.4.0 or main]
  • ONNXRuntime package and version: ONNXRuntime-directml 1.16.3

Additional context
I have an AMD videocard. So I was looking for ways how to run llama-2 with AMD GPU. I found this guide and was following the steps. https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190

@trajepl
Contributor

trajepl commented Jan 29, 2024

[image]
Check this; it seems the optimization did not run successfully.
In your case, only argmax_sampling got optimized and llama_v2 did not, right?

From the log I cannot tell why the optimization failed. Could you please attach the complete log?
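
If you want to see how far the run actually got, here is a minimal sketch (the paths are taken from your log; adjust the example directory to your checkout):

from pathlib import Path

# Path taken from the log above; adjust to your own checkout.
example_dir = Path(r"C:\Users\proxi\Olive\examples\directml\llama_v2")

# llama_v2.py reads this file at the end of a successful run, so if it is
# missing, the llama_v2 workflow never finished.
footprints = example_dir / "footprints" / "llama_v2_gpu-dml_footprints.json"
print("llama_v2 footprints written:", footprints.exists())

# Every pass that finishes leaves a folder under cache\models
# (e.g. 0_OnnxConversion-...), so the listing shows where the run stopped.
for model_dir in sorted((example_dir / "cache" / "models").iterdir()):
    print(model_dir.name)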

@KarpovVolodymyr KarpovVolodymyr changed the title I don't have models/optimized/llama_v2 folder after I've ran python llama_v2.py --optimize I don't have models/optimized/llama_v2 folder after I've run python llama_v2.py --optimize Jan 29, 2024
@Karjhan

Karjhan commented Mar 7, 2024

I also have an AMD GPU and I have a similar issue following the same steps: https://community.amd.com/t5/ai/how-to-running-optimized-llama2-with-microsoft-directml-on-amd/ba-p/645190
Was this issue solved?
If yes, how?

My Olive Logs:
Optimizing llama_v2
[2024-03-07 15:56:48,646] [INFO] [accelerator.py:208:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-03-07 15:56:48,647] [INFO] [engine.py:116:initialize] Using cache directory: cache
[2024-03-07 15:56:48,647] [INFO] [engine.py:272:run] Running Olive on accelerator: gpu-dml
[2024-03-07 15:56:48,669] [INFO] [engine.py:862:_run_pass] Running pass convert:OnnxConversion
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\anaconda3\envs\llama2_Optimize\lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_graph_shape_type_inference(
[2024-03-07 16:02:23,506] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.
[2024-03-07 16:06:51,058] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.
[2024-03-07 16:06:51,127] [INFO] [engine.py:952:_run_pass] Pass convert:OnnxConversion finished in 602.456794 seconds
[2024-03-07 16:06:51,139] [INFO] [engine.py:862:_run_pass] Running pass optimize:OrtTransformersOptimization

@jamesalster

I have the same issue, following the same instructions as Karjhan.

Windows 11, AMD CPU, AMD 6700 XT. Using olive-ai-0.50.

Olive logs:
Optimizing llama_v2
[2024-03-07 19:56:03,993] [INFO] [accelerator.py:208:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-03-07 19:56:03,993] [INFO] [engine.py:116:initialize] Using cache directory: cache
[2024-03-07 19:56:03,993] [INFO] [engine.py:272:run] Running Olive on accelerator: gpu-dml
[2024-03-07 19:56:04,026] [INFO] [engine.py:862:_run_pass] Running pass convert:OnnxConversion
[2024-03-07 19:56:04,034] [INFO] [engine.py:896:_run_pass] Loaded model from cache: 1_OnnxConversion-9c3612d31e59051b1903b377f456134d-dc5fbbbe422d406cc8fcef71d99251a4 from cache\runs
[2024-03-07 19:56:04,034] [INFO] [engine.py:862:_run_pass] Running pass optimize:OrtTransformersOptimization

Sometimes, I get the following message, as well:

[2024-03-07 19:36:13,073] [WARNING] [common.py:108:model_proto_to_file] Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.

@trajepl
Contributor

trajepl commented Mar 11, 2024

@PatriceVignola Could you help take a look?

@PatriceVignola
Contributor

These are classic OOM symptoms when running the script. Unfortunately, the ORT optimizer that Olive uses needs way more memory than it should, which results in those OOM crashes without error messages.

Usually, I would recommend between 200 GB and 300 GB of RAM (which can include your pagefile). On my machine, I have 128 GB of RAM and a pagefile of about 150 GB, and it takes around 30 minutes to go through the conversion and optimization process. It can also be done with less physical memory (and a bigger pagefile), but it might take longer.
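
If you want to check how much headroom you have before launching the script, a minimal sketch (assuming the psutil package is installed) is:

import psutil

# Physical RAM plus pagefile/swap, in GB; the sum is roughly how much the
# optimization can use before it starts failing with silent OOM crashes.
ram_gb = psutil.virtual_memory().total / 1024**3
pagefile_gb = psutil.swap_memory().total / 1024**3
print(f"RAM: {ram_gb:.0f} GB, pagefile: {pagefile_gb:.0f} GB, total: {ram_gb + pagefile_gb:.0f} GB")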

@jamesalster

That makes sense, thank you. I don't have that much memory + pagefile available. I'll keep an eye on the repository for any future updates that reduce the memory required.

@andresmejia10-int

Tested on two different state-of-the-art systems and I am stuck on this error:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from
C:\Olive\examples\directml\llama_v2\cache\models\3_OptimumMerging-2-29928aa56000b48c6135423fa1102e45-gpu-dml\output_model\decoder_model_merged.onnx failed:Protobuf parsing failed.

I wonder if this is the cause?

[2024-04-11 10:11:58,163] [INFO] [engine.py:873:_run_pass] Running pass merge:OptimumMerging
Merged ONNX model exceeds 2GB, the model will not be checked without save_path given.
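
One way to narrow it down (a sketch, assuming the onnx Python package is installed; the path is the one from the error) is to look at what the merge pass actually wrote:

import onnx
from pathlib import Path

# Path copied from the error message above.
merged = Path(r"C:\Olive\examples\directml\llama_v2\cache\models\3_OptimumMerging-2-29928aa56000b48c6135423fa1102e45-gpu-dml\output_model\decoder_model_merged.onnx")

# A single .onnx protobuf cannot exceed 2GB. If decoder_model_merged.onnx is
# larger than that with no external-data files beside it, or if it was only
# partially written (e.g. the process died mid-save), ONNX Runtime will fail
# to parse it exactly as in the traceback.
for f in sorted(merged.parent.iterdir()):
    print(f.name, f.stat().st_size, "bytes")

# onnx.load resolves external data relative to the model's directory and will
# raise a similar protobuf error if the file itself is broken.
model = onnx.load(str(merged))
print("parsed OK, graph:", model.graph.name)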

LOG:

C:\Olive\examples\directml\llama_v2>python llama_v2.py

Optimizing argmax_sampling
[2024-04-11 09:06:46,186] [INFO] [run.py:246:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 09:06:46,186] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-11 09:06:46,186] [INFO] [run.py:199:run_engine] Importing pass module OnnxConversion
[2024-04-11 09:06:46,197] [INFO] [engine.py:115:initialize] Using cache directory: cache
[2024-04-11 09:06:46,197] [INFO] [engine.py:271:run] Running Olive on accelerator: gpu-dml
[2024-04-11 09:06:46,202] [INFO] [engine.py:873:_run_pass] Running pass convert:OnnxConversion
[2024-04-11 09:06:47,720] [INFO] [engine.py:960:_run_pass] Pass convert:OnnxConversion finished in 1.518627 seconds
[2024-04-11 09:06:47,727] [INFO] [engine.py:851:_run_passes] Run model evaluation for the final model...
[2024-04-11 09:06:48,244] [INFO] [engine.py:370:run_accelerator] Save footprint to footprints\argmax_sampling_gpu-dml_footprints.json.
[2024-04-11 09:06:48,244] [INFO] [engine.py:288:run] Run history for gpu-dml:
[2024-04-11 09:06:48,244] [INFO] [engine.py:576:dump_run_history] run history:
+------------------------------------------------------------------------------------+----------------------------------+------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+====================================================================================+==================================+======+
| 14dc | | | | |
+------------------------------------------------------------------------------------+----------------------------------+------+
| 0_OnnxConversion-14dc-818 | 14dc | OnnxConversion | 1.51863 | "latency-avg": 0.4114|
+------------------------------------------------------------------------------------+----------------------------------+------+
[2024-04-11 09:06:48,244] [INFO] [engine.py:303:run] No packaging config provided, skip packaging artifacts
Optimized Model : C:\Olive\examples\directml\llama_v2\cache\models\0_OnnxConversion-14dc7b7c3125d3ad1222f0b9e2e5b807-8183e3a10c90bb4a9507d579143be30e\output_model\model.onnx

Optimizing llama_v2
[2024-04-11 09:06:48,260] [INFO] [run.py:246:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-11 09:06:48,260] [INFO] [accelerator.py:324:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OnnxConversion
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OrtTransformersOptimization
[2024-04-11 09:06:48,260] [INFO] [run.py:199:run_engine] Importing pass module OptimumMerging
[2024-04-11 09:06:48,260] [INFO] [engine.py:115:initialize] Using cache directory: cache
[2024-04-11 09:06:48,260] [INFO] [engine.py:271:run] Running Olive on accelerator: gpu-dml
[2024-04-11 09:06:48,292] [INFO] [engine.py:873:_run_pass] Running pass convert:OnnxConversion
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx_internal\jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_graph_shape_type_inference(
C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ..\torch\csrc\jit\passes\onnx\constant_fold.cpp:181.)
_C._jit_pass_onnx_graph_shape_type_inference(
[2024-04-11 09:39:36,710] [INFO] [engine.py:960:_run_pass] Pass convert:OnnxConversion finished in 1968.418042 seconds
[2024-04-11 09:39:36,757] [INFO] [engine.py:873:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-04-11 10:11:58,085] [INFO] [engine.py:960:_run_pass] Pass optimize:OrtTransformersOptimization finished in 1941.281035 seconds

[2024-04-11 10:11:58,163] [INFO] [engine.py:873:_run_pass] Running pass merge:OptimumMerging
Merged ONNX model exceeds 2GB, the model will not be checked without save_path given.
[2024-04-11 10:17:38,660] [ERROR] [engine.py:955:_run_pass] Pass run failed.
Traceback (most recent call last):
File "C:\Olive\olive\engine\engine.py", line 943, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Olive\olive\systems\local.py", line 31, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Olive\olive\passes\olive_pass.py", line 216, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Olive\olive\passes\onnx\optimum_merging.py", line 85, in _run_for_config
onnxruntime.InferenceSession(output_model_path, sess_options, providers=[execution_provider])
File "C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Users\t\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 472, in _create_inference_session
sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from
C:\Olive\examples\directml\llama_v2\cache\models\3_OptimumMerging-2-29928aa56000b48c6135423fa1102e45-gpu-dml\output_model\decoder_model_merged.onnx failed:Protobuf parsing failed.

@josephyuzb

I ran python llama_v2.py.
The error message is here:

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
[2024-04-25 19:58:01,158] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-25 19:58:01,159] [INFO] [engine.py:567:dump_run_history] run history:
+----------------------------------+-------------------+-------------+----------------+-----------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+==================================+===================+=============+================+===========+
| 9c3612d31e59051b1903b377f456134d | | | | |
+----------------------------------+-------------------+-------------+----------------+-----------+
[2024-04-25 19:58:01,159] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
File "C:\Users\Administrator\olive\examples\directml\llama_v2\llama_v2.py", line 217, in <module>
optimize(optimized_model_dir, args.model_type)
File "C:\Users\Administrator\olive\examples\directml\llama_v2\llama_v2.py", line 76, in optimize
with footprints_file_path.open("r") as footprint_file:
File "C:\ProgramData\Anaconda3\envs\llama2_Optimize\lib\pathlib.py", line 1252, in open
return io.open(self, mode, buffering, encoding, errors, newline,
File "C:\ProgramData\Anaconda3\envs\llama2_Optimize\lib\pathlib.py", line 1120, in _opener
return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Administrator\olive\examples\directml\llama_v2\footprints\llama_v2_gpu-dml_footprints.json'

(C:\ProgramData\Anaconda3\envs\llama2_Optimize) C:\Users\Administrator\olive\examples\directml\llama_v2>
