[Bug] Llama-3 doesn't work #2281

Open
chongkuiqi opened this issue May 6, 2024 · 5 comments
Labels
bug Confirmed bugs

Comments

@chongkuiqi

🐛 Bug

Thanks for your work! I downloaded the compiled/quantized Llama-3 weights from https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC, but when I run it, it outputs the following:

Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
  12: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:326
  11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
  10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
  9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
  8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
  7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
  3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
  1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device

So I compiled llama-3-cuda.so myself with mlc_llm compile, but it outputs:

Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  3: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_6detail17PackFuncVoidAddr_ILi8ENS0_15CUDAWrappedFuncEEENS0_10PackedFuncET0_RKSt6vectorINS4_1
  2: tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const [clone .isra.0]
  1: tvm::runtime::CUDAModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/cuda/cuda_module.cc", line 110
CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU

To Reproduce

from mlc_llm import MLCEngine

model = "Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(
    model=model,
    model_lib="llama-3-cuda.so",
    device="cuda",
)
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)

Environment

  • Platform: CUDA
  • Operating system: Ubuntu 20.04
  • Device: RTX 6000 24GB
  • How you installed MLC-LLM: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
  • How you installed TVM-Unity: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
  • Python version: 3.10
  • GPU driver version: 535.171.04
  • CUDA/cuDNN version: CUDA 12.1, cuDNN 8.9.4

Could you please provide some help?

@chongkuiqi chongkuiqi added the bug Confirmed bugs label May 6, 2024
@vinx13
Member

vinx13 commented May 7, 2024

The error CUDA_ERROR_NO_BINARY_FOR_GPU is likely due to a mismatch of the CUDA arch; you can try specifying the arch in the target.
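
For illustration, a minimal sketch of such a recompile (assumptions: the mlc-chat-config.json path mirrors the one used later in this thread, sm_75 matches the Quadro RTX 6000 reported below, and whether --device accepts a full TVM target string like this should be verified against mlc_llm compile --help):

# Hypothetical recompile with an explicit CUDA arch. The "cuda -arch=sm_75" target
# string is an assumption; check `mlc_llm compile --help` for the exact accepted forms.
import subprocess

subprocess.run(
    [
        "mlc_llm", "compile",
        "./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json",
        "--device", "cuda -arch=sm_75",  # compute capability 7.5 (Quadro RTX 6000)
        "-o", "llama-3-cuda.so",
    ],
    check=True,
)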

@chongkuiqi
Author

The error CUDA_ERROR_NO_BINARY_FOR_GPU is likely due to a mismatch of the CUDA arch; you can try specifying the arch in the target.

Thanks for the reply! But when I specify the arch, it still outputs:

[10:40:22] /workspace/mlc-llm/cpp/serve/config.cc:683: Estimated total single GPU memory usage: 5736.325 MB (Parameters: 4308.133 MB. KVCache: 1092.268 MB. Temporary buffer: 335.925 MB). The actual usage might be slightly larger than the estimated number.
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/home/haige/miniconda3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/haige/miniconda3/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haige/miniconda3/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/haige/miniconda3/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
  12: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:326
  11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
  10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
  9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
  8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
  7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
  3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
  1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device

Could you please provide some help?

@NSTiwari

@chongkuiqi could you please share the build files generated after running prepare_libs.sh for Llama3? Let me try.

@chongkuiqi
Author

@chongkuiqi could you please share the build files generated after running prepare_libs.sh for Llama3? Let me try.

I didn't use prepare_libs.sh; I just used mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json ... to generate llama3-cuda.so.
I think the above problem is probably due to my GPU, a Quadro RTX 6000 (SM75), lacking Flash kernels.
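
As a quick sanity check on the SM75 hypothesis, the compute capability can be read from the PyTorch install already present in the torch222 environment (a minimal sketch; any equivalent query, such as nvidia-smi, works too):

# Print the GPU's compute capability via PyTorch; a Quadro RTX 6000 should report (7, 5), i.e. sm_75.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")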

@tqchen
Contributor

tqchen commented May 11, 2024

Likely this was due to an older variant of the GPU; you can try to build TVM and MLC from source without FlashInfer/Thrust: https://llm.mlc.ai/docs/install/tvm.html
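
A minimal sketch of what that from-source configuration might look like, assuming the standard config.cmake workflow from the linked docs; the option names USE_THRUST and USE_FLASHINFER are assumptions taken from the TVM / MLC-LLM build templates and should be verified against the config.cmake in your checkout:

# Hypothetical helper: append build flags to an existing build/config.cmake so the
# from-source build skips the Thrust and FlashInfer kernels that fail on SM75 above.
# The option names are assumptions; verify them against your checkout's config.cmake template.
from pathlib import Path

config = Path("build/config.cmake")
with config.open("a") as f:
    f.write("set(USE_CUDA ON)\n")
    f.write("set(USE_THRUST OFF)\n")
    f.write("set(USE_FLASHINFER OFF)\n")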
