llama3 model cannot answer #189

Open · hyperbolic-c opened this issue May 14, 2024 · 1 comment

@hyperbolic-c

When I run the llama3 MNN model:

(py_llama) st@server03:~/mnn-llm$ ./build/cli_demo ./models/llama3/
model path is ./models/llama3/
### model name : Llama3_8b
The device support i8sdot:0, support fp16:0, support i8mm: 0
load tokenizer
load tokenizer Done
### disk embedding is 1
[ 10% ] load ./models/llama3//lm.mnn model ... Done!
[ 15% ] load ./models/llama3//block_0.mnn model ... Done!
[ 18% ] load ./models/llama3//block_1.mnn model ... Done!
[ 21% ] load ./models/llama3//block_2.mnn model ... Done!
[ 23% ] load ./models/llama3//block_3.mnn model ... Done!
[ 26% ] load ./models/llama3//block_4.mnn model ... Done!
[ 29% ] load ./models/llama3//block_5.mnn model ... Done!
[ 31% ] load ./models/llama3//block_6.mnn model ... Done!
[ 34% ] load ./models/llama3//block_7.mnn model ... Done!
[ 36% ] load ./models/llama3//block_8.mnn model ... Done!
[ 39% ] load ./models/llama3//block_9.mnn model ... Done!
[ 42% ] load ./models/llama3//block_10.mnn model ... Done!
[ 44% ] load ./models/llama3//block_11.mnn model ... Done!
[ 47% ] load ./models/llama3//block_12.mnn model ... Done!
[ 50% ] load ./models/llama3//block_13.mnn model ... Done!
[ 52% ] load ./models/llama3//block_14.mnn model ... Done!
[ 55% ] load ./models/llama3//block_15.mnn model ... Done!
[ 58% ] load ./models/llama3//block_16.mnn model ... Done!
[ 60% ] load ./models/llama3//block_17.mnn model ... Done!
[ 63% ] load ./models/llama3//block_18.mnn model ... Done!
[ 66% ] load ./models/llama3//block_19.mnn model ... Done!
[ 68% ] load ./models/llama3//block_20.mnn model ... Done!
[ 71% ] load ./models/llama3//block_21.mnn model ... Done!
[ 74% ] load ./models/llama3//block_22.mnn model ... Done!
[ 76% ] load ./models/llama3//block_23.mnn model ... Done!
[ 79% ] load ./models/llama3//block_24.mnn model ... Done!
[ 81% ] load ./models/llama3//block_25.mnn model ... Done!
[ 84% ] load ./models/llama3//block_26.mnn model ... Done!
[ 87% ] load ./models/llama3//block_27.mnn model ... Done!
[ 89% ] load ./models/llama3//block_28.mnn model ... Done!
[ 92% ] load ./models/llama3//block_29.mnn model ... Done!
[ 95% ] load ./models/llama3//block_30.mnn model ... Done!
[ 97% ] load ./models/llama3//block_31.mnn model ... Done!

Then when I ask it a question, it returns:

Q: who are you

A: You're asking "who"?

#################################
 total tokens num  = 20
prompt tokens num  = 13
output tokens num  = 7
  total time = 2.59 s
prefill time = 1.31 s
 decode time = 1.28 s
  total speed = 7.73 tok/s
prefill speed = 9.92 tok/s
 decode speed = 5.48 tok/s
   chat speed = 2.71 tok/s
##################################


Q:
A: You're asking "are"?

#################################
 total tokens num  = 39
prompt tokens num  = 32
output tokens num  = 7
  total time = 4.21 s
prefill time = 2.81 s
 decode time = 1.41 s
  total speed = 9.26 tok/s
prefill speed = 11.40 tok/s
 decode speed = 4.98 tok/s
   chat speed = 1.66 tok/s
##################################


Q:
A: You're asking "you"?

#################################
 total tokens num  = 58
prompt tokens num  = 51
output tokens num  = 7
  total time = 4.82 s
prefill time = 3.48 s
 decode time = 1.34 s
  total speed = 12.04 tok/s
prefill speed = 14.64 tok/s
 decode speed = 5.24 tok/s
   chat speed = 1.45 tok/s
##################################


Q: introduce Beijing

A: You're asking "introduce"?

#################################
 total tokens num  = 84
prompt tokens num  = 76
output tokens num  = 8
  total time = 6.32 s
prefill time = 5.19 s
 decode time = 1.14 s
  total speed = 13.29 tok/s
prefill speed = 14.66 tok/s
 decode speed = 7.04 tok/s
   chat speed = 1.27 tok/s
##################################


Q:
A: You're asking "Beijing"?

#################################
 total tokens num  = 108
prompt tokens num  = 100
output tokens num  = 8
  total time = 7.68 s
prefill time = 6.51 s
 decode time = 1.17 s
  total speed = 14.06 tok/s
prefill speed = 15.37 tok/s
 decode speed = 6.81 tok/s
   chat speed = 1.04 tok/s
##################################

Is there any solution? Thanks!
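
Looking at the answers, every whitespace-separated word of my question seems to be handled as its own turn: "who", "are", and "you" each get a separate reply, and the prompt token count keeps growing as the history accumulates. I wonder whether the chat loop is reading stdin word by word instead of line by line. A minimal sketch of the difference (hypothetical, not the actual cli_demo code):

#include <iostream>
#include <string>

int main() {
    // Word-by-word reading: "who are you" arrives as three separate
    // prompts ("who", "are", "you"), matching the log above.
    // std::string word;
    // while (std::cin >> word) { /* each word sent as its own prompt */ }

    // Line-by-line reading: the whole question arrives as one prompt.
    std::string line;
    while (std::getline(std::cin, line)) {
        std::cout << "prompt: " << line << std::endl;
    }
    return 0;
}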

@hyperbolic-c (Author)

When I use the benchmark, it responds correctly:

[ 92% ] load ./models/llama3//block_29.mnn model ... Done!
[ 95% ] load ./models/llama3//block_30.mnn model ... Done!
[ 97% ] load ./models/llama3//block_31.mnn model ... Done!
prompt file is ./resource/prompt.txt
### warmup ... Done
It's great to chat with you! How are you doing today?
Haha! I'm ChatGPT, an AI language model!
I'm just an AI, I don't have access to real-time weather information. However, you can check the weather forecast online or on your local weather app to get an idea of the current weather conditions.

#################################
prompt tokens num  = 54
decode tokens num  = 77
prefill time = 3.85 s
 decode time = 12.91 s
prefill speed = 14.02 tok/s
 decode speed = 5.96 tok/s
##################################

It looks like llama3 can only respond via llm->response(prompts[i]), not chat interactively via llm->chat()?
@wangzhaode do you have any suggestions, please?
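
For now, a possible workaround is a small driver that reads one full line at a time and feeds it to response(), which is the path the benchmark exercises. This is only a sketch: Llm::createLLM() and load() are assumed from mnn-llm's llm.hpp, and response()/chat() are the calls mentioned above.

// Hypothetical workaround sketch (not the actual cli_demo source):
// drive the model with response() on each full input line instead of
// the built-in chat() loop.
#include <iostream>
#include <memory>
#include <string>
#include "llm.hpp"

int main(int argc, const char* argv[]) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <model_dir>" << std::endl;
        return 1;
    }
    std::string model_dir = argv[1];
    std::unique_ptr<Llm> llm(Llm::createLLM(model_dir)); // assumed factory
    llm->load(model_dir);                                // assumed loader

    std::string line;
    std::cout << "Q: " << std::flush;
    while (std::getline(std::cin, line)) {   // read the whole question
        if (line == "/exit") break;
        llm->response(line);                 // the path that works in benchmark
        std::cout << "\nQ: " << std::flush;
    }
    return 0;
}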
