[Bug] Llama-3 doesn't work #2281
Comments
The error
Thanks for the reply! But when I specify the arch, it still outputs:
Could you please provide some help?
@chongkuiqi could you please share the build files generated after running prepare_libs.sh?
I didn't use prepare_libs.sh; I just used mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json ... to generate llama3-cuda.so.
Likely this was due to an older variant of the GPU. You can try to build TVM and MLC from source without flashinfer/thrust: https://llm.mlc.ai/docs/install/tvm.html
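For reference, a quick way to check whether the GPU really is an older architecture (a minimal sketch, assuming PyTorch from the reporter's torch222 environment is available) is to print its compute capability; cudaErrorNoKernelImageForDevice usually means the prebuilt kernels target newer SM versions than the device supports:

import torch

# Print the compute capability of the first CUDA device; compare it against
# the architectures the prebuilt cu121 wheels were compiled for.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0))
print(f"compute capability: sm_{major}{minor}")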
🐛 Bug
Thanks for your work! I downloaded the compiled/quantized Llama-3 weights from https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC, but when I run it, it outputs the following:
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
self._ffi"run_background_loop"
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
12: mlc::llm::serve::EngineImpl::Step()
at /workspace/mlc-llm/cpp/serve/engine.cc:326
11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
So I used mlc_llm compile to build llama-3-cuda.so myself, but it outputs:
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
self._ffi"run_background_loop"
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.call
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/haige/miniconda3/envs/torch222/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
3: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_6detail17PackFuncVoidAddr_ILi8ENS0_15CUDAWrappedFuncEEENS0_10PackedFuncET0_RKSt6vectorINS4_1
2: tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const [clone .isra.0]
1: tvm::runtime::CUDAModuleNode::GetFunc(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
0: ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/cuda/cuda_module.cc", line 110
CUDAError: cuModuleLoadData(&(module[device_id]), data.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU
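Both errors point to the same root cause: the CUDA kernels baked into the binaries do not cover this GPU's architecture. As a minimal sketch (assuming the installed mlc-ai-nightly-cu121 TVM wheel is importable), one can check what TVM reports for the local device and compare it with the architecture the library was compiled for:

import tvm

# Query the first CUDA device as TVM sees it. The reported compute version
# must be among the SM architectures compiled into the wheel or into the
# generated llama-3-cuda.so, otherwise cuModuleLoadData fails as above.
dev = tvm.cuda(0)
print("device exists:", dev.exist)
print("compute version:", dev.compute_version)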
To Reproduce
from mlc_llm import MLCEngine

model = "Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(
    model=model,
    model_lib="llama-3-cuda.so",
    device="cuda",
)
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)
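To isolate whether the failure comes from the locally compiled llama-3-cuda.so, a variant of the reproduction (a sketch; the HF:// form asks MLC to download the prebuilt weights and JIT-compile a model library for the local GPU instead of loading a precompiled .so) would be:

from mlc_llm import MLCEngine

# Sketch: omit model_lib so MLC JIT-compiles the library for this GPU.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model=model, device="cuda")
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)
engine.terminate()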
Environment
Platform: CUDA
Operating system: Ubuntu 20.04
Device: RTX 6000 24GB
How you installed MLC-LLM: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
How you installed TVM-Unity: python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
Python version: 3.10
GPU driver version: 535.171.04
CUDA/cuDNN version: CUDA 12.1, cuDNN 8.9.4
Could you please provide some help?