[Bug] Llama-3 doesn't work #2281

Open
chongkuiqi opened this issue May 6, 2024 · 5 comments
Labels
bug Confirmed bugs

Comments

@chongkuiqi

🐛 Bug

Thanks for your work! I downloaded the compiled/quantized Llama-3 weights from https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC, but when I run it, it outputs the following:

Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
  12: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:326
  11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
  10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
  9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
  8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
  7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
  3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
  1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device

So I compiled llama-3-cuda.so myself with mlc_llm compile, but it outputs:

Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  3: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_6detail17PackFuncVoidAddr_ILi8ENS0_15CUDAWrappedFuncEEENS0_10PackedFuncET0_RKSt6vectorINS4_1
  2: tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const [clone .isra.0]
  1: tvm::runtime::CUDAModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/cuda/cuda_module.cc", line 110
CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU

To Reproduce

from mlc_llm import MLCEngine

model = "Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(
    model=model,
    model_lib="llama-3-cuda.so",
    device="cuda",
)
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)

Environment

  • Platform: CUDA
  • Operating system: Ubuntu 20.04
  • Device: RTX 6000 24GB
  • How you installed MLC-LLM: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
  • How you installed TVM-Unity: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
  • Python version: 3.10
  • GPU driver version: 535.171.04
  • CUDA/cuDNN version: CUDA 12.1, cuDNN 8.9.4

Could you please provide some help?

@chongkuiqi chongkuiqi added the bug Confirmed bugs label May 6, 2024
@vinx13
Member

vinx13 commented May 7, 2024

The error CUDA_ERROR_NO_BINARY_FOR_GPU is likely due to a mismatch of the CUDA arch; you can try specifying the arch in the target.
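
For illustration, a minimal sketch of such a recompile (assumptions: the mlc-chat-config.json path mirrors the one used later in this thread, sm_75 matches the Quadro RTX 6000 reported below, and whether --device accepts a full TVM target string like this should be verified against mlc_llm compile --help):

# Hypothetical recompile with an explicit CUDA arch. The "cuda -arch=sm_75" target
# string is an assumption; check `mlc_llm compile --help` for the exact accepted forms.
import subprocess

subprocess.run(
    [
        "mlc_llm", "compile",
        "./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json",
        "--device", "cuda -arch=sm_75",  # compute capability 7.5 (Quadro RTX 6000)
        "-o", "llama-3-cuda.so",
    ],
    check=True,
)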

@chongkuiqi
Author

The error CUDA_ERROR_NO_BINARY_FOR_GPU is likely due to a mismatch of the CUDA arch; you can try specifying the arch in the target.

Thanks for the reply! But when I specify the arch, it still outputs:

[10:40:22] /workspace/mlc-llm/cpp/serve/config.cc:683: Estimated total single GPU memory usage: 5736.325 MB (Parameters: 4308.133 MB. KVCache: 1092.268 MB. Temporary buffer: 335.925 MB). The actual usage might be slightly larger than the estimated number.
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/home/haige/miniconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/haige/miniconda3/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haige/miniconda3/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haige/miniconda3/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
  12: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:326
  11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
  10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
  9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
  8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
  7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
  3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
  1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device

Could you please provide some help?

@NSTiwari

@chongkuiqi could you please share the build files generated after running prepare_libs.sh for Llama3? Let me try.

@chongkuiqi
Author

@chongkuiqi could you please share the build files generated after running prepare_libs.sh for Llama3? Let me try.

I didn't use prepare_libs.sh; I just used mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json ... to generate llama3-cuda.so.
I think the above problem is probably due to my GPU, a Quadro RTX 6000 (SM75), lacking Flash kernels.
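
As a quick sanity check on the SM75 hypothesis, the compute capability can be read from the PyTorch install already present in the torch222 environment (a minimal sketch; any equivalent query, such as nvidia-smi, works too):

# Print the GPU's compute capability via PyTorch; a Quadro RTX 6000 should report (7, 5), i.e. sm_75.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")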

@tqchen
Contributor

tqchen commented May 11, 2024

Likely this was due to an older variant of the GPU; you can try to build TVM and MLC from source without FlashInfer/Thrust: https://llm.mlc.ai/docs/install/tvm.html
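
A minimal sketch of what that from-source configuration might look like, assuming the standard config.cmake workflow from the linked docs; the option names USE_THRUST and USE_FLASHINFER are assumptions taken from the TVM / MLC-LLM build templates and should be verified against the config.cmake in your checkout:

# Hypothetical helper: append build flags to an existing build/config.cmake so the
# from-source build skips the Thrust and FlashInfer kernels that fail on SM75 above.
# The option names are assumptions; verify them against your checkout's config.cmake template.
from pathlib import Path

config = Path("build/config.cmake")
with config.open("a") as f:
    f.write("set(USE_CUDA ON)\n")
    f.write("set(USE_THRUST OFF)\n")
    f.write("set(USE_FLASHINFER OFF)\n")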
