
[Question] Cannot get chat CLI working, throws an error after cloning the model #2339

Closed
BeytoA opened this issue May 14, 2024 · 4 comments
Labels: question (Question about the usage)

BeytoA commented May 14, 2024

❓ General Questions

Windows 10 64-bit
Intel Xeon W-2123 @ 3.60 GHz
24.0 GB RAM @ 2666 MHz
NVIDIA Quadro P2000

I just installed conda and created a new environment. According to the quick start guide, the installation was successful. But when I try to launch the chat CLI, I get the error below.

(base) C:\Users\mypc>conda activate llmENV

(llmENV) C:\Users\mypc>python -c "import mlc_llm; print(mlc_llm)"

<module 'mlc_llm' from 'C:\\Users\\mypc\\AppData\\Local\\miniconda3\\envs\\llmENV\\Lib\\site-packages\\mlc_llm\\__init__.py'>

(llmENV) C:\Users\mypc>mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-14 13:42:58] INFO auto_device.py:88: Not found device: cuda:0
[2024-05-14 13:43:00] INFO auto_device.py:88: Not found device: rocm:0
[2024-05-14 13:43:03] INFO auto_device.py:88: Not found device: metal:0
[2024-05-14 13:43:08] INFO auto_device.py:79: Found device: vulkan:0
[2024-05-14 13:43:10] INFO auto_device.py:88: Not found device: opencl:0
[2024-05-14 13:43:10] INFO auto_device.py:35: Using device: vulkan:0
[2024-05-14 13:43:10] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-14 13:43:10] INFO download.py:42: [Git] Cloning https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC.git to C:\Users\mypc\AppData\Local\Temp\tmpd999_qpt\tmp
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Scripts\mlc_llm.exe\__main__.py", line 7, in <module>
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 37, in main
    cli.main(sys.argv[2:])
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\cli\chat.py", line 42, in main
    chat(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\chat.py", line 134, in chat
    cm = ChatModule(model, device, chat_config=config, model_lib=model_lib)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\chat_module.py", line 765, in __init__
    self.model_path, self.config_file_path = _get_model_path(model)
                                             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\chat_module.py", line 363, in _get_model_path
    mlc_dir = download_mlc_weights(model)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\support\download.py", line 138, in download_mlc_weights
    git_clone(git_url, tmp_dir, ignore_lfs=True)
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\support\download.py", line 43, in git_clone
    subprocess.run(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified

(llmENV) C:\Users\mypc>
BeytoA added the question label May 14, 2024
tqchen (Contributor) commented May 14, 2024

This seems to be a download error. Can you check whether git and git-lfs are installed properly in your environment?
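
For reference, one possible way (an illustrative sketch, not the only route) to install both tools into the same conda environment is from conda-forge:

(llmENV) C:\Users\mypc>conda install -c conda-forge git git-lfs
(llmENV) C:\Users\mypc>git lfs install
(llmENV) C:\Users\mypc>git --version
(llmENV) C:\Users\mypc>git lfs version

The last two commands are only a sanity check; each should print a version string once the install has succeeded.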

BeytoA (Author) commented May 16, 2024

@tqchen Thanks for your help; installing git and git-lfs solved that problem! Now I have another one.

(llmENV) C:\Users\mypc>mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

[2024-05-16 10:27:35] INFO auto_device.py:88: Not found device: cuda:0
[2024-05-16 10:27:37] INFO auto_device.py:88: Not found device: rocm:0
[2024-05-16 10:27:39] INFO auto_device.py:88: Not found device: metal:0
[2024-05-16 10:27:44] INFO auto_device.py:79: Found device: vulkan:0
[2024-05-16 10:27:47] INFO auto_device.py:88: Not found device: opencl:0
[2024-05-16 10:27:47] INFO auto_device.py:35: Using device: vulkan:0
[2024-05-16 10:27:47] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-16 10:27:47] INFO download.py:133: Weights already downloaded: C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC
[2024-05-16 10:27:47] INFO chat_module.py:781: Now compiling model lib on device...
[2024-05-16 10:27:47] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-05-16 10:27:47] INFO jit.py:120: Compiling using commands below:
[2024-05-16 10:27:47] INFO jit.py:121: 'C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\python.exe' -m mlc_llm compile 'C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC' --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'context_window_size=8192;prefill_chunk_size=1024;tensor_parallel_shards=1' --device vulkan:0 --output 'C:\Users\mypc\AppData\Local\Temp\tmpin09k7zj\lib.dll'
[2024-05-16 10:27:50] INFO auto_config.py:69: Found model configuration: C:\Users\mypc\AppData\Local\mlc_llm\model_weights\mlc-ai\Llama-3-8B-Instruct-q4f16_1-MLC\mlc-chat-config.json
[2024-05-16 10:27:50] INFO auto_target.py:84: Detecting target device: vulkan:0
[2024-05-16 10:27:50] INFO auto_target.py:86: Found target: {"thread_warp_size": 1, "supports_float32": T.bool(True), "supports_int16": 1, "supports_int32": T.bool(True), "max_threads_per_block": 1536, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 0}
[2024-05-16 10:27:50] INFO auto_target.py:103: Found host LLVM triple: x86_64-pc-windows-msvc
[2024-05-16 10:27:50] INFO auto_target.py:104: Found host LLVM CPU: skylake-avx512
[2024-05-16 10:27:50] INFO auto_config.py:153: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=1024, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      llama
  --target          {"thread_warp_size": 1, "host": {"mtriple": "x86_64-pc-windows-msvc", "tag": "", "kind": "llvm", "mcpu": "skylake-avx512", "keys": ["cpu"]}, "supports_int16": 1, "supports_float32": T.bool(True), "supports_int32": T.bool(True), "max_threads_per_block": 1536, "supports_int8": 1, "supports_int64": 1, "max_num_threads": 256, "kind": "vulkan", "max_shared_memory_per_block": 49152, "supports_16bit_buffer": 1, "tag": "", "keys": ["vulkan", "gpu"], "supports_float16": 0}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          C:\Users\mypc\AppData\Local\Temp\tmpin09k7zj\lib.dll
  --overrides       context_window_size=8192;sliding_window_size=None;prefill_chunk_size=1024;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-05-16 10:27:50] INFO config.py:106: Overriding context_window_size from 8192 to 8192
[2024-05-16 10:27:50] INFO config.py:106: Overriding prefill_chunk_size from 1024 to 1024
[2024-05-16 10:27:50] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
[2024-05-16 10:27:50] INFO compile.py:138: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=1024, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-05-16 10:27:50] INFO compile.py:157: Exporting the model to TVM Unity compiler
[2024-05-16 10:27:57] INFO compile.py:163: Running optimizations using TVM Unity
[2024-05-16 10:27:57] INFO compile.py:182: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-05-16 10:27:59] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
[2024-05-16 10:29:49] INFO pipeline.py:52: Lowering to TVM TIR kernels
[2024-05-16 10:29:59] INFO pipeline.py:52: Running TVM TIR-level optimizations
[2024-05-16 10:30:25] INFO pipeline.py:52: Running TVM Dlight low-level optimizations
[2024-05-16 10:30:28] INFO pipeline.py:52: Lowering to VM bytecode
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 8.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 11.56 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode_to_last_hidden_states`: 12.19 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 148.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:33] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_select_last_hidden_states`: 0.62 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 148.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.14 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 8.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `get_logits`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 148.01 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill_to_last_hidden_states`: 156.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-05-16 10:30:34] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-05-16 10:30:36] INFO pipeline.py:52: Compiling external modules
[2024-05-16 10:30:36] INFO pipeline.py:52: Compilation complete! Exporting to disk
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 56, in <module>    main()
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 25, in main
    cli.main(sys.argv[2:])
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\cli\compile.py", line 128, in main    compile(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\compile.py", line 240, in compile
    _compile(args, model_config)
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\compile.py", line 185, in _compile
    args.build_func(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\support\auto_target.py", line 284, in build
    relax.build(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\relax\vm_build.py", line 341, in build    return _vmlink(
           ^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\relax\vm_build.py", line 247, in _vmlink
    lib = tvm.build(
          ^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\driver\build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\_ffi\_ctypes\packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\tvm\_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "D:\a\package\package\tvm\src\target\spirv\ir_builder.cc", line 566
InternalError: Check failed: (spirv_support_.supports_float16) is false: Vulkan target does not support Float16 capability.  If your device supports 16-bit float operations, please either add -supports_float16=1 to the target, or query all device parameters by adding -from_device=0.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Scripts\mlc_llm.exe\__main__.py", line 7, in <module>
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\__main__.py", line 37, in main
    cli.main(sys.argv[2:])
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\cli\chat.py", line 42, in main
    chat(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\chat.py", line 134, in chat
    cm = ChatModule(model, device, chat_config=config, model_lib=model_lib)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\chat_module.py", line 784, in __init__
    self.model_lib = jit.jit(
                     ^^^^^^^^
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\jit.py", line 166, in jit
    _run_jit(
  File "C:\Users\mypc\AppData\Local\miniconda3\envs\llmENV\Lib\site-packages\mlc_llm\interface\jit.py", line 126, in _run_jit
    raise RuntimeError("Cannot find compilation output, compilation failed")
RuntimeError: Cannot find compilation output, compilation failed

(llmENV) C:\Users\mypc>

I suspect it has to do with the capabilities of the video card. Is there any way to bypass this or manually enable the float16 feature? Or is it related to something else?

(Attached screenshot: screenshotVulkan)

tqchen (Contributor) commented May 16, 2024

The error message says your device does not support f16, so please try a q4f32 variant of the model.

tqchen (Contributor) commented May 16, 2024

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f32_1-MLC
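
For context, in the q4f32_1 variant the weights are 4-bit quantized but compute runs in float32, so compiling it should not require the Vulkan Float16 capability that the earlier build rejected (the detected target above reported "supports_float16": 0). The error message's alternative of adding -supports_float16=1 to the target only makes sense if the GPU actually exposes 16-bit float support in Vulkan.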

tqchen closed this as completed May 25, 2024