Commit
* Sequence mode prototype

  This is a prototype of sequence mode.

  Load model ... 1.318s
  Serial mode to process 30 tokens ... 2.116s
  Sequence mode to process 30 tokens ... 0.509s
  Logits total diff = 0.00000
  Logits identical = TRUE

  This is only for testing. It runs into precision and capacity limits at large lengths. The goal is to support sequences of up to 25k tokens. It is also likely that the dedicated single token functions should be brought back. Again, only prototype.

* Move out rwkv_att_inner
* Move out more graph functions
* Print system info in sequence.c
* Small single-token optimizations
* Add function to estimate graph work size
* Avoid allocating new sequence graph every rwkv_eval_sequence

  we still build one, but that seems necessary for ggml.

* Remove sequence capability from ops that do not need it
* Add GPU offload to sequence.c benchmark
* Only calculate 1 - x tensors once per layer
* use ggml_cpy in sequence mode xx output
* Rename "inputs" to "state" in rwkv_eval_sequence
* Basic sequence mode graph caching

  This is a huge speedup when the same sequence length is used many times in a row. I intend to clean up this code very soon

* Revert "Only calculate 1 - x tensors once per layer"

  It doesn't actually matter

* Clean up code around graph building and ggml contexts
* Remove unused parameter from rwkv_att_wkv_size
* Fix printf integer width in rwkv_eval
* Correct assert return types, whoops
* Free rwkv_context at the end of sequence.c
* Fix typo I didn't make
* Expand single-line return conditions
* Enable sanitizer in macOS workflows

  Sanitizer is enabled to fix issues discovered when testing #89. It needs to be disabled as soon as it is possible (that is, master is able to be built on macOS GitHub runner again)

* Add doc comments and expand ser->serial, seq->sequence
* Adjust doc comment in rwkv.h
* Add thread safety note to rwkv_eval_sequence as well
* Remove entire rwkv.cpp source code from sequence.c
* Don't validate when sequence is NULL lol
* Fix OOM on cuBLAS-enabled quantized models
* Remove sequence.c
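
The commit message notes that caching the sequence mode graph is "a huge speedup when the same sequence length is used many times in a row". The sketch below illustrates that idea in isolation. It is not the code from this commit: `toy_graph`, `toy_graph_cache`, `toy_build_graph`, `toy_cache_get` and `toy_cache_free` are hypothetical names, and a plain scratch buffer stands in for a real ggml graph. It assumes a single cache slot keyed by sequence length, which is only one possible design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a computation graph built for one sequence length. */
struct toy_graph {
    size_t sequence_len;
    float * work; /* scratch buffer whose size depends on sequence_len */
};

/* Hypothetical single-slot cache: remembers the graph for the last length seen. */
struct toy_graph_cache {
    struct toy_graph graph;
    int valid;     /* non-zero when `graph` holds a usable graph */
    size_t builds; /* number of times a graph was actually built */
};

/* Builds a graph for the given length. Returns 1 on success, 0 on failure. */
static int toy_build_graph(struct toy_graph * graph, size_t sequence_len) {
    graph->work = calloc(sequence_len, sizeof(float));
    if (graph->work == NULL) {
        return 0;
    }
    graph->sequence_len = sequence_len;
    return 1;
}

/* Releases the cached graph, if any. Safe to call more than once. */
static void toy_cache_free(struct toy_graph_cache * cache) {
    assert(cache != NULL);
    if (cache->valid) {
        free(cache->graph.work);
    }
    cache->graph.work = NULL;
    cache->valid = 0;
}

/*
 * Returns the cached graph when the requested length matches the cached one,
 * otherwise frees the old graph and builds a new one.
 * Returns NULL for a zero length or on allocation failure.
 */
static struct toy_graph * toy_cache_get(struct toy_graph_cache * cache, size_t sequence_len) {
    assert(cache != NULL);
    if (sequence_len == 0) {
        return NULL;
    }
    if (cache->valid && cache->graph.sequence_len == sequence_len) {
        return &cache->graph; /* cache hit: no rebuild */
    }
    toy_cache_free(cache);
    if (!toy_build_graph(&cache->graph, sequence_len)) {
        return NULL;
    }
    cache->valid = 1;
    cache->builds++;
    return &cache->graph;
}
```

With a zero-initialized `struct toy_graph_cache`, calling `toy_cache_get` twice with length 30 builds once and returns the same pointer both times; a later call with a different length triggers a rebuild. A single slot keeps memory use bounded, at the cost of rebuilding whenever the caller alternates between lengths.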