
Cosmetic changes (code style, documentation, etc.) #97

Merged: 9 commits, Jun 13, 2023
CODE_STYLE.md (7 additions, 1 deletion)

@@ -11,16 +11,22 @@ Overall, keep code in similar style as it was before.
 - Keep lines at 180 characters or shorter.
 - Separate logically grouped pieces of code with empty lines.
 - Surround `if`, `for`, `while`, `do` and other similar statements with empty lines.
+- Add trailing new line to the end of the file.

 ### Comments and messages

 - Write documentation for public functions intended for outside use.
 - Place single-line comments on the line before, not right after the code line.
-- Start comments with a capital letter, use correct grammar and punctuation.
+- Begin comments with a capital letter, use correct grammar and punctuation.
+- Begin messages, including error messages, with a capital letter.

 ## C/C++

 - Use 4 spaces for indentation.
+- Use [The One True Brace Style](https://en.wikipedia.org/wiki/Indentation_style#Variant:_1TBS_(OTBS)):
+  - Place braces on the same line as the statement.
+  - Always add braces to `if`, `for`, `while`, `do` and other similar statements.
+- Prefix top-level function and struct names with `rwkv_`.

 ## Python
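
For illustration, here is a short C sketch that follows the C/C++ rules added above; the `rwkv_example_*` names are hypothetical and not part of rwkv.h:

```c
#include <stdbool.h>
#include <stddef.h>

// Top-level function and struct names carry the rwkv_ prefix.
// Braces follow 1TBS: the opening brace sits on the same line as the statement.
struct rwkv_example_options {
    bool verbose;
};

// Documentation for a public function intended for outside use:
// returns true if the format string names a quantized format such as Q5_1.
bool rwkv_example_is_quantized_format(const char * format) {
    // Single-line comment on the line before the code, starting with a capital letter.
    if (format != NULL && format[0] == 'Q') {
        return true;
    }

    return false;
}
```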
README.md (6 additions, 6 deletions)

@@ -2,7 +2,7 @@

 This is a port of [BlinkDL/RWKV-LM](https://github.com/BlinkDL/RWKV-LM) to [ggerganov/ggml](https://github.com/ggerganov/ggml).

-Besides the usual **FP32**, it supports **FP16**, **quantized INT4, INT5 and INT8** inference. This project is **CPU only**.
+Besides the usual **FP32**, it supports **FP16**, **quantized INT4, INT5 and INT8** inference. This project is **focused on CPU**, but cuBLAS is also supported.

 This project provides [a C library rwkv.h](rwkv.h) and [a convenient Python wrapper](rwkv%2Frwkv_cpp_model.py) for it.

@@ -28,7 +28,7 @@ Below table is for reference only. Measurements were made on 4C/8T x86 CPU with

 #### With cuBLAS

-Measurements were made on 3060Ti 8G + i7 13700K. Latency per token shown.
+Measurements were made on Intel i7 13700K & NVIDIA 3060 Ti 8G. Latency per token shown.

 | Model | Layers on GPU | Format | 24 Threads | 8 Threads | 4 Threads | 2 Threads | 1 Thread |
 |-----------------------|---------------|--------|-------------|------------|------------|------------|------------|
@@ -39,7 +39,7 @@
 | `RWKV-4-Raven-7B-v11` | 32 | `Q4_1` | 94.5 ms | 54.3 ms | 49.7 ms | 51.8 ms | 59.2 ms |
 | `RWKV-4-Raven-7B-v11` | 32 | `Q5_1` | 101.6 ms | 72.3 ms | 67.2 ms | 69.3 ms | 77.0 ms |

-Note: since there is only `ggml_mul_mat()` supported with cuBLAS, we still need to assign few CPU resources to execute remaining operations.
+Note: since cuBLAS is supported only for `ggml_mul_mat()`, we still need to use a few CPU resources to execute the remaining operations.
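
To make that note concrete, here is a hedged C sketch of loading a model with a few CPU threads and offloading layers to the GPU. It assumes the `rwkv.h` API around the time of this PR (`rwkv_init_from_file()`, `rwkv_gpu_offload_layers()`, `rwkv_free()`); exact signatures may differ between versions, and the model file name is only an example:

```c
#include <stdio.h>
#include <stdlib.h>
#include "rwkv.h"

int main(void) {
    // Even with cuBLAS, only matrix multiplications run on the GPU,
    // so keep a few CPU threads for the remaining operations.
    struct rwkv_context * ctx = rwkv_init_from_file("RWKV-4-Raven-7B-v11-Q4_1.bin", 4);

    if (ctx == NULL) {
        fprintf(stderr, "Failed to load model\n");
        return EXIT_FAILURE;
    }

    // Offload all 32 layers of the 7B model to the GPU.
    rwkv_gpu_offload_layers(ctx, 32);

    // ... run inference with rwkv_eval(), then clean up ...

    rwkv_free(ctx);
    return EXIT_SUCCESS;
}
```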

 ## How to use

@@ -79,7 +79,7 @@ If everything went OK, `bin\Release\rwkv.dll` file should appear.

 ##### Windows + cuBLAS

-**Important**: Since there is no cuBLAS static libraries for Windows, after compiling with dynamic libraries following DLLs should be copied from `{CUDA}/bin` into `build/bin/Release`: `cudart64_12.dll`, `cublas64_12.dll`, `cublasLt64_12.dll`.
+**Important**: Since there are no cuBLAS static libraries for Windows, after compiling with dynamic libraries the following DLLs should be copied from `{CUDA}/bin` into `build/bin/Release`: `cudart64_12.dll`, `cublas64_12.dll`, `cublasLt64_12.dll`.
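
For example, that copy step might look like this; a sketch assuming a CUDA 12 installation and the standard `CUDA_PATH` environment variable set by the NVIDIA installer (adjust paths to your setup):

```commandline
copy "%CUDA_PATH%\bin\cudart64_12.dll" build\bin\Release
copy "%CUDA_PATH%\bin\cublas64_12.dll" build\bin\Release
copy "%CUDA_PATH%\bin\cublasLt64_12.dll" build\bin\Release
```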

 ```commandline
 mkdir build
@@ -116,7 +116,7 @@ If everything went OK, `librwkv.so` (Linux) or `librwkv.dylib` (MacOS) file should appear.

 #### Option 3.1. Download pre-quantized Raven model

-There are pre-quantized Raven models available on [Hugging Face](https://huggingface.co/BlinkDL/rwkv-4-raven/tree/main). Check that you are downloading `.bin` file, NOT `.pth`.
+There are pre-quantized Raven models available on [Hugging Face](https://huggingface.co/BlinkDL/rwkv-4-raven/tree/main). Check that you are downloading a `.bin` file, **not** `.pth`.

 #### Option 3.2. Convert and quantize PyTorch model

@@ -222,4 +222,4 @@ See also [FILE_FORMAT.md](FILE_FORMAT.md) for version numbers of `rwkv.cpp` model files

 ## Contributing

-There is no complete contributor guide yet; but we have [CODE_STYLE.md](CODE_STYLE.md).
+Please follow the code style described in [CODE_STYLE.md](CODE_STYLE.md).
extras/CMakeLists.txt (1 addition, 1 deletion)

@@ -7,4 +7,4 @@ endfunction()
 file(GLOB extras *.c)
 foreach (extra ${extras})
     rwkv_add_extra(${extra})
-endforeach()
\ No newline at end of file
+endforeach()
extras/cpu_info.c (1 addition, 1 deletion)

@@ -4,4 +4,4 @@

 int main() {
     printf("%s", rwkv_get_system_info_string());
-}
\ No newline at end of file
+}
extras/quantize.c (12 additions, 12 deletions)

@@ -5,15 +5,6 @@
 #include <stdio.h>
 #include <string.h>

-enum ggml_type type_from_string(const char* string) {
-    if (strcmp(string, "Q4_0") == 0) return GGML_TYPE_Q4_0;
-    if (strcmp(string, "Q4_1") == 0) return GGML_TYPE_Q4_1;
-    if (strcmp(string, "Q5_0") == 0) return GGML_TYPE_Q5_0;
-    if (strcmp(string, "Q5_1") == 0) return GGML_TYPE_Q5_1;
-    if (strcmp(string, "Q8_0") == 0) return GGML_TYPE_Q8_0;
-    return GGML_TYPE_COUNT;
-}
-
 #ifdef _WIN32
 bool QueryPerformanceFrequency(uint64_t* lpFrequency);
 bool QueryPerformanceCounter(uint64_t* lpPerformanceCount);
@@ -31,7 +22,16 @@ bool QueryPerformanceCounter(uint64_t* lpPerformanceCount);
 #define TIME_DIFF(freq, start, end) (double) ((end.tv_nsec - start.tv_nsec) / 1000000) / 1000
 #endif

-int main(int argc, char* argv[]) {
+enum ggml_type type_from_string(const char* string) {
+    if (strcmp(string, "Q4_0") == 0) return GGML_TYPE_Q4_0;
+    if (strcmp(string, "Q4_1") == 0) return GGML_TYPE_Q4_1;
+    if (strcmp(string, "Q5_0") == 0) return GGML_TYPE_Q5_0;
+    if (strcmp(string, "Q5_1") == 0) return GGML_TYPE_Q5_1;
+    if (strcmp(string, "Q8_0") == 0) return GGML_TYPE_Q8_0;
+    return GGML_TYPE_COUNT;
+}
+
+int main(int argc, char * argv[]) {
     if (argc != 4 || type_from_string(argv[3]) == GGML_TYPE_COUNT) {
         fprintf(stderr, "Usage: %s INPUT OUTPUT FORMAT\n\nAvailable formats: Q4_0 Q4_1 Q5_0 Q5_1 Q8_0\n", argv[0]);
         return EXIT_FAILURE;
@@ -40,7 +40,7 @@ int main(int argc, char* argv[]) {
     time_t freq, start, end;
     time_calibrate(freq);

-    fprintf(stderr, "Quantizing ...\n");
+    fprintf(stderr, "Quantizing...\n");

     time_measure(start);
     bool success = rwkv_quantize_model_file(argv[1], argv[2], argv[3]);
@@ -55,4 +55,4 @@ int main(int argc, char* argv[]) {
         fprintf(stderr, "Error in %.3fs: 0x%.8X\n", diff, rwkv_get_last_error(NULL));
         return EXIT_FAILURE;
     }
-}
\ No newline at end of file
+}
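
Going by the usage string above, a hypothetical invocation of the quantization tool looks like this (file names are placeholders; the format must be one of Q4_0, Q4_1, Q5_0, Q5_1, Q8_0):

```commandline
quantize rwkv-model-FP16.bin rwkv-model-Q5_1.bin Q5_1
```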