
Cosmetic changes (code style, documentation, etc.) #97

Merged: 9 commits, Jun 13, 2023
CODE_STYLE.md (7 additions, 1 deletion)

@@ -11,16 +11,22 @@ Overall, keep code in similar style as it was before.
 - Keep lines at 180 characters or shorter.
 - Separate logically grouped pieces of code with empty lines.
 - Surround `if`, `for`, `while`, `do` and other similar statements with empty lines.
+- Add trailing new line to the end of the file.

 ### Comments and messages

 - Write documentation for public functions intended for outside use.
 - Place single-line comments on the line before, not right after the code line.
-- Start comments with a capital letter, use correct grammar and punctuation.
+- Begin comments with a capital letter, use correct grammar and punctuation.
+- Begin messages, including error messages, with a capital letter.

 ## C/C++

 - Use 4 spaces for indentation.
+- Use [The One True Brace Style](https://en.wikipedia.org/wiki/Indentation_style#Variant:_1TBS_(OTBS)):
+  - Place braces on the same line as the statement.
+  - Always add braces to `if`, `for`, `while`, `do` and other similar statements.
+- Prefix top-level function and struct names with `rwkv_`.

 ## Python
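
For illustration, here is a short C sketch that follows the C/C++ rules added above; the `rwkv_example_*` names are hypothetical and not part of rwkv.h:

```c
#include <stdbool.h>
#include <stddef.h>

// Top-level function and struct names carry the rwkv_ prefix.
// Braces follow 1TBS: the opening brace sits on the same line as the statement.
struct rwkv_example_options {
    bool verbose;
};

// Documentation for a public function intended for outside use:
// returns true if the format string names a quantized format such as Q5_1.
bool rwkv_example_is_quantized_format(const char * format) {
    // Single-line comment on the line before the code, starting with a capital letter.
    if (format != NULL && format[0] == 'Q') {
        return true;
    }

    return false;
}
```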
README.md (6 additions, 6 deletions)

@@ -2,7 +2,7 @@

 This is a port of [BlinkDL/RWKV-LM](https://github.com/BlinkDL/RWKV-LM) to [ggerganov/ggml](https://github.com/ggerganov/ggml).

-Besides the usual **FP32**, it supports **FP16**, **quantized INT4, INT5 and INT8** inference. This project is **CPU only**.
+Besides the usual **FP32**, it supports **FP16**, **quantized INT4, INT5 and INT8** inference. This project is **focused on CPU**, but cuBLAS is also supported.

 This project provides [a C library rwkv.h](rwkv.h) and [a convenient Python wrapper](rwkv%2Frwkv_cpp_model.py) for it.

@@ -28,7 +28,7 @@ Below table is for reference only. Measurements were made on 4C/8T x86 CPU with

 #### With cuBLAS

-Measurements were made on 3060Ti 8G + i7 13700K. Latency per token shown.
+Measurements were made on Intel i7 13700K & NVIDIA 3060 Ti 8G. Latency per token shown.

 | Model | Layers on GPU | Format | 24 Threads | 8 Threads | 4 Threads | 2 Threads | 1 Thread |
 |-----------------------|---------------|--------|-------------|------------|------------|------------|------------|
@@ -39,7 +39,7 @@
 | `RWKV-4-Raven-7B-v11` | 32 | `Q4_1` | 94.5 ms | 54.3 ms | 49.7 ms | 51.8 ms | 59.2 ms |
 | `RWKV-4-Raven-7B-v11` | 32 | `Q5_1` | 101.6 ms | 72.3 ms | 67.2 ms | 69.3 ms | 77.0 ms |

-Note: since there is only `ggml_mul_mat()` supported with cuBLAS, we still need to assign few CPU resources to execute remaining operations.
+Note: since cuBLAS is supported only for `ggml_mul_mat()`, we still need to use a few CPU resources to execute the remaining operations.
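
To make that note concrete, here is a hedged C sketch of loading a model with a few CPU threads and offloading layers to the GPU. It assumes the `rwkv.h` API around the time of this PR (`rwkv_init_from_file()`, `rwkv_gpu_offload_layers()`, `rwkv_free()`); exact signatures may differ between versions, and the model file name is only an example:

```c
#include <stdio.h>
#include <stdlib.h>
#include "rwkv.h"

int main(void) {
    // Even with cuBLAS, only matrix multiplications run on the GPU,
    // so keep a few CPU threads for the remaining operations.
    struct rwkv_context * ctx = rwkv_init_from_file("RWKV-4-Raven-7B-v11-Q4_1.bin", 4);

    if (ctx == NULL) {
        fprintf(stderr, "Failed to load model\n");
        return EXIT_FAILURE;
    }

    // Offload all 32 layers of the 7B model to the GPU.
    rwkv_gpu_offload_layers(ctx, 32);

    // ... run inference with rwkv_eval(), then clean up ...

    rwkv_free(ctx);
    return EXIT_SUCCESS;
}
```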

 ## How to use

@@ -79,7 +79,7 @@ If everything went OK, `bin\Release\rwkv.dll` file should appear.

 ##### Windows + cuBLAS

-**Important**: Since there is no cuBLAS static libraries for Windows, after compiling with dynamic libraries following DLLs should be copied from `{CUDA}/bin` into `build/bin/Release`: `cudart64_12.dll`, `cublas64_12.dll`, `cublasLt64_12.dll`.
+**Important**: Since there are no cuBLAS static libraries for Windows, after compiling with dynamic libraries the following DLLs should be copied from `{CUDA}/bin` into `build/bin/Release`: `cudart64_12.dll`, `cublas64_12.dll`, `cublasLt64_12.dll`.
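
For example, that copy step might look like this; a sketch assuming a CUDA 12 installation and the standard `CUDA_PATH` environment variable set by the NVIDIA installer (adjust paths to your setup):

```commandline
copy "%CUDA_PATH%\bin\cudart64_12.dll" build\bin\Release
copy "%CUDA_PATH%\bin\cublas64_12.dll" build\bin\Release
copy "%CUDA_PATH%\bin\cublasLt64_12.dll" build\bin\Release
```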

 ```commandline
 mkdir build
@@ -116,7 +116,7 @@ If everything went OK, `librwkv.so` (Linux) or `librwkv.dylib` (MacOS) file should appear.

 #### Option 3.1. Download pre-quantized Raven model

-There are pre-quantized Raven models available on [Hugging Face](https://huggingface.co/BlinkDL/rwkv-4-raven/tree/main). Check that you are downloading `.bin` file, NOT `.pth`.
+There are pre-quantized Raven models available on [Hugging Face](https://huggingface.co/BlinkDL/rwkv-4-raven/tree/main). Check that you are downloading a `.bin` file, **not** `.pth`.

 #### Option 3.2. Convert and quantize PyTorch model

@@ -222,4 +222,4 @@ See also [FILE_FORMAT.md](FILE_FORMAT.md) for version numbers of `rwkv.cpp` model files

 ## Contributing

-There is no complete contributor guide yet; but we have [CODE_STYLE.md](CODE_STYLE.md).
+Please follow the code style described in [CODE_STYLE.md](CODE_STYLE.md).
extras/CMakeLists.txt (1 addition, 1 deletion)

@@ -7,4 +7,4 @@ endfunction()
 file(GLOB extras *.c)
 foreach (extra ${extras})
     rwkv_add_extra(${extra})
-endforeach()
\ No newline at end of file
+endforeach()
extras/cpu_info.c (1 addition, 1 deletion)

@@ -4,4 +4,4 @@

 int main() {
     printf("%s", rwkv_get_system_info_string());
-}
\ No newline at end of file
+}
extras/quantize.c (12 additions, 12 deletions)

@@ -5,15 +5,6 @@
 #include <stdio.h>
 #include <string.h>

-enum ggml_type type_from_string(const char* string) {
-    if (strcmp(string, "Q4_0") == 0) return GGML_TYPE_Q4_0;
-    if (strcmp(string, "Q4_1") == 0) return GGML_TYPE_Q4_1;
-    if (strcmp(string, "Q5_0") == 0) return GGML_TYPE_Q5_0;
-    if (strcmp(string, "Q5_1") == 0) return GGML_TYPE_Q5_1;
-    if (strcmp(string, "Q8_0") == 0) return GGML_TYPE_Q8_0;
-    return GGML_TYPE_COUNT;
-}
-
 #ifdef _WIN32
 bool QueryPerformanceFrequency(uint64_t* lpFrequency);
 bool QueryPerformanceCounter(uint64_t* lpPerformanceCount);
@@ -31,7 +22,16 @@ bool QueryPerformanceCounter(uint64_t* lpPerformanceCount);
 #define TIME_DIFF(freq, start, end) (double) ((end.tv_nsec - start.tv_nsec) / 1000000) / 1000
 #endif

-int main(int argc, char* argv[]) {
+enum ggml_type type_from_string(const char* string) {
+    if (strcmp(string, "Q4_0") == 0) return GGML_TYPE_Q4_0;
+    if (strcmp(string, "Q4_1") == 0) return GGML_TYPE_Q4_1;
+    if (strcmp(string, "Q5_0") == 0) return GGML_TYPE_Q5_0;
+    if (strcmp(string, "Q5_1") == 0) return GGML_TYPE_Q5_1;
+    if (strcmp(string, "Q8_0") == 0) return GGML_TYPE_Q8_0;
+    return GGML_TYPE_COUNT;
+}
+
+int main(int argc, char * argv[]) {
     if (argc != 4 || type_from_string(argv[3]) == GGML_TYPE_COUNT) {
         fprintf(stderr, "Usage: %s INPUT OUTPUT FORMAT\n\nAvailable formats: Q4_0 Q4_1 Q5_0 Q5_1 Q8_0\n", argv[0]);
         return EXIT_FAILURE;
@@ -40,7 +40,7 @@ int main(int argc, char* argv[]) {
     time_t freq, start, end;
     time_calibrate(freq);

-    fprintf(stderr, "Quantizing ...\n");
+    fprintf(stderr, "Quantizing...\n");

     time_measure(start);
     bool success = rwkv_quantize_model_file(argv[1], argv[2], argv[3]);
@@ -55,4 +55,4 @@ int main(int argc, char* argv[]) {
         fprintf(stderr, "Error in %.3fs: 0x%.8X\n", diff, rwkv_get_last_error(NULL));
         return EXIT_FAILURE;
     }
-}
\ No newline at end of file
+}
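
Going by the usage string above, a hypothetical invocation of the quantization tool looks like this (file names are placeholders; the format must be one of Q4_0, Q4_1, Q5_0, Q5_1, Q8_0):

```commandline
quantize rwkv-model-FP16.bin rwkv-model-Q5_1.bin Q5_1
```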