
Add chat template support for llama-cli #8068

Merged
merged 11 commits into ggerganov:master on Jun 25, 2024

Conversation

ngxson (Collaborator) commented Jun 22, 2024

This PR brings the same chat template logic from server to main (llama-cli).

Goals

  • Keep modifications minimal by reusing the existing llama_chat_apply_template function
  • Support both the auto-detected template and a custom --chat-template argument
  • Do not introduce a new list to maintain ==> some past PRs added a separate list of prefixes/postfixes, which duplicates llama_chat_apply_template and requires additional maintenance
  • Simplify the implementation in both server & main

How it works

  • Newly added C++ wrapper for llama_chat_apply_template that supports std::string ==> simplifies the code
  • Newly added llama_chat_format_single ==> it formats the history twice, once without and once with the added message, then returns the diff (see the sketch after this list)
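
For illustration, a minimal self-contained sketch of the "format twice and diff" idea behind llama_chat_format_single. The names, the apply_template stand-in, and the hard-coded ChatML-style template are illustrative assumptions, not the actual llama.cpp signatures:

```cpp
#include <string>
#include <vector>

struct chat_msg { std::string role; std::string content; };

// stand-in for llama_chat_apply_template: a hard-coded ChatML-style template
static std::string apply_template(const std::vector<chat_msg> & msgs, bool add_assistant_prompt) {
    std::string out;
    for (const auto & m : msgs) {
        out += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    if (add_assistant_prompt) {
        out += "<|im_start|>assistant\n";
    }
    return out;
}

// format only the newly added message by diffing the two renderings
static std::string format_single(std::vector<chat_msg> history, const chat_msg & new_msg) {
    const std::string before = apply_template(history, /*add_assistant_prompt=*/false);
    history.push_back(new_msg);
    const std::string after  = apply_template(history, /*add_assistant_prompt=*/new_msg.role == "user");
    // the rendered conversation grows append-only, so the new turn is just the suffix
    return after.substr(before.size());
}
```

Because the rendered conversation only grows at the end, the suffix contains exactly the template's prefix/postfix markers plus the new message content.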

Demo

make llama-cli && ./llama-cli -m ../Meta-Llama-3-8B-Instruct-abliterated-v3_q4.gguf -p "You are an AI" -cnv
system

You are an AI


> hi
Hello! I'm a language model AI. It's nice to meet you! Is there something I can help you with or would you like to chat?

> what is your name
I'm an AI, so I don't have a personal name in the classical sense. I'm often referred to as "Assistant" or "AI" by users, but I don't have a specific name like a human would. However, I can be addressed as "AI" or "Assistant" if you'd like!

> who made you
I was created by a team of researcher at Meta AI. They are a group of scientists who specialize in natural language processing and machine learning. They trained me on a massive dataset of text from various sources, including books, articles, and websites, to enable me to understand and generate human-like language.

Fixes #8053, #6391

Replaces #6810


@ngxson ngxson requested a review from ggerganov June 22, 2024 18:39
@github-actions github-actions bot added the testing (Everything test related) and examples labels on Jun 22, 2024
@ngxson ngxson added the Review Complexity: Low label (trivial changes to code that most beginner devs can tackle, e.g. a UI fix) on Jun 22, 2024
Resolved review threads (outdated): common/common.cpp (×2), examples/main/main.cpp

Review thread on examples/main/main.cpp:
std::string user_inp = params.conversation
? chat_add_and_format("user", buffer)
: buffer;
// TODO: one inconvenient of current chat template implementation is that we can't distinguish between user input and special tokens (prefix/postfix)
ggerganov (Owner):

When params.conversation == false there is an extra string copy here that should be avoided.

Regarding the comment: can you illustrate with an example? I'm not sure what the issue is.

ngxson (Collaborator, Author):

An example would be a prompt like this: Which one is a correct HTML tag? <s> or <a>?

Models that use <s> as the BOS token will see the prompt as: Which one is a correct HTML tag? BOS or <a>?

Leaving special == false would fix that, but it would also break the chat template, since we are now adding special tokens around the user's text. This could be avoided with some more code, but IMO it's not a big deal, since special tokens are unlikely to appear in user text by accident.
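
For illustration, a hedged sketch of the trade-off described above. The llama_tokenize wrapper signature used here, (ctx, text, add_special, parse_special), is an assumption based on common/common.h of that era; adjust to the actual helper:

```cpp
#include <string>
#include <vector>
#include "common.h" // assumed: provides the std::string llama_tokenize wrapper

static std::vector<llama_token> tokenize_turn(llama_context * ctx, const std::string & formatted) {
    // parse_special == true: template markers (e.g. <|im_end|>) become single special
    // tokens, but a literal "<s>" typed by the user also collapses into the BOS token.
    // parse_special == false: the user's "<s>" stays plain text, but the template's
    // own special markers are no longer recognized either.
    return llama_tokenize(ctx, formatted, /*add_special=*/false, /*parse_special=*/true);
}
```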

ngxson (Collaborator, Author):

I added a std::move(buffer) since we no longer use buffer after this line. Is it OK to do so?

ggerganov (Owner):

Aha, got it. Yes, for now let's go with the simple solution.
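
For reference, a hedged sketch of the resulting pattern in examples/main/main.cpp (the exact chat_add_and_format signature in the merged code may differ):

```cpp
// move the buffer instead of copying it: buffer is no longer used after this line,
// so the params.conversation == false path avoids an extra std::string copy
std::string user_inp = params.conversation
    ? chat_add_and_format("user", std::move(buffer))
    : std::move(buffer);
```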

Resolved review thread (outdated): examples/main/main.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
@ngxson ngxson added the merge ready label (indicates that this may be ready to merge soon and is just holding out in case of objections) on Jun 25, 2024
@mofosyne mofosyne merged commit 48e6b92 into ggerganov:master Jun 25, 2024
63 checks passed
fairydreaming (Collaborator):

It looks like this broke some models; here is the llama-cli output and a brief gdb inspection from DeepSeek-V2-Lite:

./llama-cli --numa distribute -s 42 -t 32 --temp 0.01 -m /mnt/md0/models/deepseek-v2-lite-chat-2.gguf -f ../prompt-deepseek.txt
Log start
main: build = 3248 (f675b20a)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 42
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /mnt/md0/models/deepseek-v2-lite-chat-2.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = be9443d5eec410d7045ba7dcbe2e0f189f5dda9e
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type  f16:  269 tensors
llm_load_vocab: special tokens cache size = 2400
llm_load_vocab: token to piece cache size = 0.6659 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 102400
llm_load_print_meta: n_merges         = 99757
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 27
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10944
llm_load_print_meta: n_expert         = 64
llm_load_print_meta: n_expert_used    = 6
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 16B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 15.71 B
llm_load_print_meta: model size       = 29.26 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = be9443d5eec410d7045ba7dcbe2e0f189f5dda9e
llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 126 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 1
llm_load_print_meta: n_lora_q             = 0
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 1408
llm_load_print_meta: n_expert_shared      = 2
llm_load_print_meta: expert_weights_scale = 1.0
llm_load_print_meta: rope_yarn_log_mul    = 0.0707
llm_load_tensors: ggml ctx size =    0.16 MiB
llm_load_tensors:        CPU buffer size = 29964.48 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 163840
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:        CPU KV buffer size = 43200.00 MiB
llama_new_context_with_model: KV self size  = 43200.00 MiB, K (f16): 25920.00 MiB, V (f16): 17280.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.39 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 5464.01 MiB
llama_new_context_with_model:        CPU compute buffer size =  5464.01 MiB
llama_new_context_with_model: graph nodes  = 1924
llama_new_context_with_model: graph splits = 1
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped)
Thread 1 "llama-cli" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737347880896) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737347880896) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737347880896) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737347880896, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7a4f476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7a357f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7e29b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7e3520c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7e35277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7e354d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7e2c449 in std::__throw_length_error(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00005555556f4867 in std::vector<char, std::allocator<char> >::_M_check_len (this=0x7fffffffb910, __n=18446744073709551522, 
    __s=0x555555899c33 "vector::_M_default_append") at /usr/include/c++/11/bits/stl_vector.h:1759
#11 0x00005555556da3cb in std::vector<char, std::allocator<char> >::_M_default_append (this=0x7fffffffb910, __n=18446744073709551522)
    at /usr/include/c++/11/bits/vector.tcc:634
#12 0x00005555556c588d in std::vector<char, std::allocator<char> >::resize (this=0x7fffffffb910, __new_size=18446744073709551615)
    at /usr/include/c++/11/bits/stl_vector.h:940
#13 0x00005555557c154a in llama_chat_apply_template (model=0x555555a8acf0, tmpl="", msgs=std::vector of length 4, capacity 4 = {...}, 
    add_ass=true) at common/common.cpp:2635
#14 0x00005555557c1b51 in llama_chat_format_example (model=0x555555a8acf0, tmpl="") at common/common.cpp:2664
#15 0x000055555586de70 in main (argc=13, argv=0x7fffffffe0e8) at examples/main/main.cpp:227
(gdb) up
#1  __pthread_kill_internal (signo=6, threadid=140737347880896) at ./nptl/pthread_kill.c:78
78	in ./nptl/pthread_kill.c
(gdb) 
#2  __GI___pthread_kill (threadid=140737347880896, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
89	in ./nptl/pthread_kill.c
(gdb) 
#3  0x00007ffff7a4f476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26	../sysdeps/posix/raise.c: No such file or directory.
(gdb) 
#4  0x00007ffff7a357f3 in __GI_abort () at ./stdlib/abort.c:79
79	./stdlib/abort.c: No such file or directory.
(gdb) 
#5  0x00007ffff7e29b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#6  0x00007ffff7e3520c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#7  0x00007ffff7e35277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#8  0x00007ffff7e354d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#9  0x00007ffff7e2c449 in std::__throw_length_error(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) 
#10 0x00005555556f4867 in std::vector<char, std::allocator<char> >::_M_check_len (this=0x7fffffffb910, __n=18446744073709551522, 
    __s=0x555555899c33 "vector::_M_default_append") at /usr/include/c++/11/bits/stl_vector.h:1759
1759		  __throw_length_error(__N(__s));
(gdb) 
#11 0x00005555556da3cb in std::vector<char, std::allocator<char> >::_M_default_append (this=0x7fffffffb910, __n=18446744073709551522)
    at /usr/include/c++/11/bits/vector.tcc:634
634			_M_check_len(__n, "vector::_M_default_append");
(gdb) 
#12 0x00005555556c588d in std::vector<char, std::allocator<char> >::resize (this=0x7fffffffb910, __new_size=18446744073709551615)
    at /usr/include/c++/11/bits/stl_vector.h:940
940		  _M_default_append(__new_size - size());
(gdb) 
#13 0x00005555557c154a in llama_chat_apply_template (model=0x555555a8acf0, tmpl="", msgs=std::vector of length 4, capacity 4 = {...}, 
    add_ass=true) at common/common.cpp:2635
2635	        buf.resize(res);
(gdb) 
#14 0x00005555557c1b51 in llama_chat_format_example (model=0x555555a8acf0, tmpl="") at common/common.cpp:2664
2664	    return llama_chat_apply_template(model, tmpl, msgs, true);
(gdb) print tmpl
$1 = ""
(gdb) down
#13 0x00005555557c154a in llama_chat_apply_template (model=0x555555a8acf0, tmpl="", msgs=std::vector of length 4, capacity 4 = {...}, 
    add_ass=true) at common/common.cpp:2635
2635	        buf.resize(res);
(gdb) print res
$2 = -1
(gdb)

ngxson (Collaborator, Author) commented Jun 27, 2024

@fairydreaming The intended default behavior is: if the built-in template is not supported, fall back to chatml.

It turns out that's not what happens here (I missed something). I'll push a fix for this.
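
For reference, a hedged sketch of what such a fallback could look like inside the common wrapper that appears in the backtrace above (common/common.cpp, around buf.resize(res)). The C API call shape follows llama.h's llama_chat_apply_template, but the surrounding variable names and error handling are assumptions, not the actual fix:

```cpp
// a negative return value from llama_chat_apply_template means the template is
// unsupported; fall back to "chatml" instead of calling buf.resize(res) with that
// negative value (which is what produced the std::length_error above)
const char * ptr_tmpl = tmpl.empty() ? nullptr : tmpl.c_str(); // nullptr => model's built-in template
int32_t res = llama_chat_apply_template(model, ptr_tmpl, chat.data(), chat.size(),
                                        add_ass, buf.data(), buf.size());
if (res < 0) {
    // assumption: retry with the chatml template as a fallback
    ptr_tmpl = "chatml";
    res = llama_chat_apply_template(nullptr, ptr_tmpl, chat.data(), chat.size(),
                                    add_ass, buf.data(), buf.size());
}
if (res >= 0 && (size_t) res > buf.size()) {
    buf.resize(res); // res is now a non-negative required length
    res = llama_chat_apply_template(model, ptr_tmpl, chat.data(), chat.size(),
                                    add_ass, buf.data(), buf.size());
}
// (a real fix should also handle res still being negative after the fallback)
```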

Labels: examples, merge ready (indicates that this may be ready to merge soon and is just holding out in case of objections), Review Complexity: Low (trivial changes to code that most beginner devs can tackle, e.g. a UI fix), server, testing (Everything test related)
Projects: None yet
Development
Successfully merging this pull request may close these issues:
Bug: --chat-template seems to be broken now, no way to truly chat from the llama-cli
4 participants