Support for ggml #417

Closed · philwee opened this issue Apr 18, 2023 · 18 comments
Labels: good first issue (Good for newcomers), help wanted (Contributors and extra help welcome)
philwee (Contributor) commented Apr 18, 2023

Could support for ggml be added soon? 4-bit quantized models are said to be pretty decent, but there is currently no reliable way to evaluate them. It would be nice if the harness supported them.

Thank you!

jon-tow (Member) commented Apr 22, 2023

@philwee tagging the Python bindings you shared, which should make it much easier to add ggml support:

https://github.com/abetlen/llama-cpp-python

haileyschoelkopf (Contributor) commented

If someone wants to work on this I’d be happy to give pointers! All that’s required is a new LM subclass akin to #395 .

I may take a look at working on this integration on our end in ~1 month from now, if no one else has started a PR by then.

philwee (Contributor, Author) commented Apr 29, 2023

I can try to work on this; could you give some pointers?

haileyschoelkopf (Contributor) commented Apr 29, 2023

Of course! I’d recommend looking at the PR I linked to get a sense of what the scope might be.

The process would look something like:

  • Make a new file in lm_eval/models called "ggml_model.py" or similar.
  • In that file, make a BaseLM subclass called GGMLLM or similar. This class should do the following:
      • In initialization, instantiate a model using the Python bindings @jon-tow linked.
      • Implement the loglikelihood_rolling(), loglikelihood(), and greedy_until() class methods to support all 3 completion types (see gpt3.py or BaseLM for a template to compare to).
      • Add any helper methods for those functions!

Lmk if this makes sense! A rough sketch of the skeleton follows below.
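For concreteness, here is a minimal sketch of that skeleton, assuming the BaseLM interface from lm_eval.base and the llama_cpp bindings linked above. The class name, constructor arguments, and method bodies are illustrative only (BaseLM also expects a few more helper properties, elided here), not a finished implementation:

```python
# Hypothetical lm_eval/models/ggml_model.py -- a sketch, not final code.
from lm_eval.base import BaseLM
from llama_cpp import Llama


class GGMLLM(BaseLM):
    def __init__(self, model_path, n_ctx=2048):
        super().__init__()
        # logits_all=True keeps per-token logits around, which the
        # loglikelihood methods need to score continuations.
        self.model = Llama(model_path=model_path, n_ctx=n_ctx, logits_all=True)
        self._max_length = n_ctx

    @property
    def max_length(self):
        return self._max_length

    @property
    def max_gen_toks(self):
        return 256

    def tok_encode(self, string):
        return self.model.tokenize(string.encode("utf-8"))

    def tok_decode(self, tokens):
        return self.model.detokenize(tokens).decode("utf-8")

    def greedy_until(self, requests):
        # requests is a list of (context, until) pairs.
        results = []
        for context, until in requests:
            out = self.model(
                context, max_tokens=self.max_gen_toks, stop=until, temperature=0.0
            )
            results.append(out["choices"][0]["text"])
        return results

    def loglikelihood(self, requests):
        # Would tokenize each (context, continuation) pair, run the model,
        # and sum log-probabilities over the continuation tokens,
        # following the pattern in gpt3.py.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Would score each full string token-by-token in windows of at
        # most max_length tokens, again mirroring gpt3.py.
        raise NotImplementedError
```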

StellaAthena (Member) commented

I asked about this in the ggml library and the response (ggerganov/ggml#120 (comment)) contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

StellaAthena added the help wanted and good first issue labels on Apr 30, 2023
StellaAthena (Member) commented

Carson Poole reports:

ggml does the compute in int4, not just the weight storage; that's how it can be so much faster than a typical CPU implementation, since CPUs are more compute-bound than GPUs for GEMMs. It's also egregiously slow for long input context lengths: even a very unoptimized WebGPU implementation will obliterate ggml's speed on roughly 500-1000 tokens of input.

So it may be worth lowering the priority on this. Of course, implementing it would enable us to better evaluate these claims 🙃

Green-Sky commented Apr 30, 2023

a very unoptimized WebGPU implementation will obliterate ggml's speed on like 500-1000 tokens input

There is BLAS support (OpenBLAS -> CPU; cuBLAS and clblast -> GPU), which outperforms the plain SIMD-tuned code at larger batch sizes.

The BLAS acceleration can already make a difference at single-digit batch sizes.

Edit: also, since only the logits are of interest, eval can be done with very large batch sizes (even better for BLAS).
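For illustration, the prompt-evaluation batch size is exposed in the llama-cpp-python bindings as n_batch, so a BLAS-enabled build can process prompt tokens in large chunks (the model path below is a placeholder):

```python
from llama_cpp import Llama

# Placeholder path; any 4-bit quantized ggml model file would do.
llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",
    n_ctx=2048,
    n_batch=512,      # prompt tokens evaluated per batch; larger values favor BLAS
    logits_all=True,  # keep logits at every position, since the logits are what eval needs
)
```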

Green-Sky commented

I asked about this in the ggml library and the ggerganov/ggml#120 (comment) contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

Personally I think this one is better (no need to call that one a "starting point").

StellaAthena (Member) commented

I asked about this in the ggml library and the ggerganov/ggml#120 (comment) contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

Personally I think this one is better (no need to call that one a "starting point").

I saw that, but per the issue at abetlen/llama-cpp-python#71 it appears to be 5x slower than the underlying implementation.

Green-Sky commented

It might be because it does not build the llama.so/.dll properly, or only in one configuration, so SIMD might be disabled. There is also the fact that there is no official BLAS-enabled build available anywhere (see abetlen/llama-cpp-python#117).

Green-Sky commented

But these problems are "easy" to fix after the fact, since you can build the llama.dll yourself with the build options you like and replace the one shipped with the bindings (recommended right now).

StellaAthena (Member) commented

@Green-Sky I have almost no experience with C, but if you can do that and demonstrate acceptable speed that works for me.

gjmulder commented

@StellaAthena If you want to give me a representative test prompt I can compare llama-cpp-python to native llama.cpp. I also have both a 16 core CPU w/128GB of RAM and a shiny new 3090Ti w/24GB if you need some test cycles.

Here are my (short-run, comparative) perplexity scores to date with the models I have on hand.

StellaAthena (Member) commented

@gjmulder I haven't had the bandwidth to test it yet, but this PR supports saving the actual predictions to disk: #492

You can run LAMBADA, HellaSwag, and ANLI with a limit of 20. If the results end up identical, I think it's safe to assume that generalizes. Maybe throw in a math problem too.

gjmulder commented

llama-cpp-python attempts to implement the OpenAI API, so I may look at simply pointing the harness at an instance of llama-cpp-python and running a few smoke tests.
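A sketch of such a smoke test, assuming a llama-cpp-python server is already running locally (`python -m llama_cpp.server` listens on port 8000 by default); the model name and prompt are placeholders:

```python
import openai  # the pre-1.0 openai client interface

openai.api_base = "http://localhost:8000/v1"  # point the client at the local server
openai.api_key = "none"                       # the local server does not check keys

# Request logprobs too, since loglikelihood-style evals need them.
resp = openai.Completion.create(
    model="ggml-model-q4_0",  # placeholder; the server serves whatever model it loaded
    prompt="Q: What is 2 + 2? A:",
    max_tokens=4,
    temperature=0.0,
    logprobs=5,
)
print(resp["choices"][0]["text"])
```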

StellaAthena (Member) commented

Sounds great!

matthoffner commented

Started adding support for a llama-cpp-python server here: #617

haileyschoelkopf (Contributor) commented

Courtesy of @matthoffner , lm-eval now supports GGML Llama models via llama-cpp-python!
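Usage looks roughly like the following; the model name "ggml" and the base_url argument are my reading of PR #617 and may differ from the merged code:

```python
from lm_eval import evaluator

# Assumes a llama-cpp-python server is already running locally.
results = evaluator.simple_evaluate(
    model="ggml",
    model_args="base_url=http://localhost:8000",
    tasks=["hellaswag"],
    limit=20,  # small smoke-test run, as suggested earlier in the thread
)
print(results["results"])
```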
