
Native Intel IPEX-LLM Support #7190

Open · iamhumanipromise opened this issue May 10, 2024 · 12 comments
Labels: enhancement (New feature or request)

Comments

@iamhumanipromise commented May 10, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

I found this closed issue (linked below), where someone manually implemented IPEX-LLM support (how?). However, I am looking forward to native IPEX-LLM support for Intel Xe iGPUs and Intel Arc dGPUs on Windows and Linux.

#7042

TL;DR: IPEX-LLM now provides a C++ interface that can be used as a backend for running llama.cpp on Intel GPUs. Incorporating this interface into llama.cpp would let it leverage IPEX-LLM's optimized performance.

Motivation

Intel Xe graphics launched in 2020; the Flex and Max data-center cards and the Arc consumer cards for laptops and desktops launched in 2022. That is a lot of devices in production and circulation.

This would "permit" llama.cpp users to utilize their integrated Xe GPUs and dedicated Arc GPUs, Datacenter Flex and Max cards with llama.cpp on BOTH Windows and Linux natively (without a confusing manual build).

Possible Implementation

The implementation of native Intel IPEX-LLM support would go something like: integrate --> test --> document --> release.

  1. Integration with IPEX: Since IPEX-LLM is built on top of Intel Extension for PyTorch (IPEX), the first step would be to ensure seamless integration with IPEX. This would involve linking the llama.cpp build system with the IPEX library and ensuring that all dependencies are correctly managed. Here are links for using llama.cpp with Intel GPUs:

Full manual/guide: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html
Full verified model list: https://ipex-llm.readthedocs.io/en/latest/#verified-models
Github: https://github.com/intel-analytics/ipex-llm

The "owners" of this process will be the devs and engineers here; in this Github (simple nerds such as myself do not have the expertise to tackle something like this... even locally)

For example, from the documentation it looks like the flow would be: create a new conda environment --> set up the environment --> configure the oneAPI variables --> update CMakeLists.txt or the Makefile with paths to the IPEX-LLM library and headers --> then map llama.cpp functionality to the IPEX APIs (which Intel has already done). A rough sketch of the setup and build steps follows.
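
To make that concrete, here is a minimal sketch of what the setup and build could look like. It is modeled on the upstream SYCL flow in README-sycl.md rather than an actual IPEX-LLM integration (which does not exist upstream yet); the conda environment name and the IPEX-LLM-specific CMake option are placeholders/assumptions.

```bash
# Assumes the Intel oneAPI Base Toolkit is installed (provides icx/icpx and the SYCL runtime).
# Hypothetical environment name -- adjust as needed.
conda create -n llamacpp-intel python=3.11 -y
conda activate llamacpp-intel

# Make the oneAPI compilers and runtime visible in this shell.
source /opt/intel/oneapi/setvars.sh

# Build upstream llama.cpp with the existing SYCL backend
# (flag names as of this thread; newer trees renamed LLAMA_SYCL to GGML_SYCL).
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# A native IPEX-LLM path would presumably add something like a -DLLAMA_IPEX_LLM=ON option
# plus include/library paths; that flag is hypothetical and does not exist today.
```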

  2. Testing Across Platforms: Ensuring that the implementation works across different versions of Windows and Linux is crucial. This includes testing on various Intel iGPUs and Arc dGPUs to guarantee broad compatibility. This effort would involve the community here, various Discords and subreddits, and perhaps "roping in" as many laptop/desktop Xe iGPU users and dGPU users as possible -- so that means gamers, too.

The "owners" of this step would be wide-ranging overall.

  3. Documentation and Examples: Someone would have to "own" updating the documentation to guide users on how to enable and use the new IPEX-LLM support. Providing examples and quickstart guides would help significantly; ultimately, though, independent users will be on their own, and GUI and TUI/CLI frontends will need to update their own documentation.

  4. Release: After all of this has been done, go forward to launch. Woot woot.

I'm sure there are many, many steps I am missing here. Just wanted to "kick off" the process.

@iamhumanipromise added the enhancement (New feature or request) label on May 10, 2024
@NeoZhangJianyu (Collaborator)

@iamhumanipromise
Sorry, the description of the idea is not clear to me.

As I understand this issue, you hope to use IPEX-LLM as a backend to support Intel GPUs.
That would be more like using TensorFlow/PyTorch as a backend.

If yes, will it be quicker than TensorFlow/PyTorch? Why not use TensorFlow/PyTorch directly?

@qnixsynapse commented May 10, 2024

IPEX LLM already supports llama.cpp I think: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html

Also, PyTorch's IPEX and OpenXLA both use Intel oneAPI SYCL, which is what llama.cpp's SYCL backend uses. So it is already supported.
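
For anyone who has not tried it, a quick usage sketch of the SYCL backend as it exists today (binary names as of this era of llama.cpp, before the llama-cli rename; the model path and prompt are placeholders):

```bash
# List the SYCL devices the build can see (iGPU, Arc, etc.).
./build/bin/ls-sycl-device

# Run fully offloaded on the GPU; -ngl controls how many layers go to the device.
ZES_ENABLE_SYSMAN=1 ./build/bin/main \
  -m models/llama-3-8b-instruct.Q4_K_M.gguf \
  -p "Building a website can be done in 10 simple steps:" \
  -n 128 -ngl 33
```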

@simonlui commented May 10, 2024

> IPEX LLM already supports llama.cpp I think: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html

What IPEX-LLM has is a fork of llama.cpp and some other projects, with optimizations that have not been upstreamed here for one reason or another. I'm a current user of it, and typically it doubles the speed of upstream. However, it can't support mixed GPU + CPU scenarios, which is the main issue, and new model support may take a while to filter over. Hence why I keep both upstream and the fork for my use cases.
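
For context, "mixed GPU + CPU" means partial offload, i.e. keeping some layers on the CPU via upstream's -ngl / --n-gpu-layers option. A sketch with a placeholder model path:

```bash
# Upstream llama.cpp: offload only 20 layers of a large model to the Intel GPU
# and leave the rest on the CPU (placeholder model path).
./build/bin/main -m models/llama-3-70b-instruct.Q4_K_M.gguf \
  -ngl 20 -c 4096 \
  -p "Building a website can be done in 10 simple steps:"
```

This is the scenario that works upstream but fails in the fork as described above.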

@NeoZhangJianyu (Collaborator)

The SYCL backend is still focused on the missing functions needed to support more features and models.
Performance optimization will be handled next, but we have little spare time to contribute, so I don't think progress will be quick.
Because the SYCL backend covers more Intel GPUs (Max, Flex, Arc, and the iGPU in MTL), performance optimizations need to be verified on all of these GPUs to make sure none of them regress.

@Thellton commented May 13, 2024

Which issue/PR would you recommend we follow for the latest info about the SYCL branch, @NeoZhangJianyu?

EDIT: I take it that [SYCL] Refactor would be it?

> IPEX LLM already supports llama.cpp I think: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html
>
> What IPEX-LLM has is a fork of llama.cpp and some other projects that has optimizations that have not been upstreamed here for one reason or another. I'm a current user of it and typically, usage of it doubles the speed of upstream. However, it can't support mixed GPU + CPU scenarios which is the main issue and new model support may take a while to filter over. Hence why I have upstream and the fork for my use cases.

They seem to be keeping it reasonably up to date, as their published version of LlamaCPP-IPEX uses a week-old version as its baseline so far. I do hope they provide a bit of clarity about how to actually pull new versions of their IPEX branch, though... Also, interesting about the lack of GPU overflow/partial offload capability; I was not aware of that...

@simonlui

> they seem to be keeping it reasonably up to date, as their published version of LlamaCPP-IPEX is using a week old version as its base line so far. I do hope they provide a bit of clarity about how to go about actually pulling new versions of their IPEX branch though... also Interesting about the lack of GPU overflow/partial offload capability, I was not aware of that...

Going to respond to this since the other comment, from another person at Intel, was deleted. I think it should be working, but for some reason it fails: it forces you to fully offload for something like Llama 3 8B, or faults on an illegal instruction for something bigger like Llama 3 70B or Command-R, whose support had just been added from what I tested. I haven't upgraded in a while, so I'll probably recheck this before opening a ticket in the other repository to fix it, since upstream works but the fork doesn't in this situation.

@Thellton

> they seem to be keeping it reasonably up to date, as their published version of LlamaCPP-IPEX is using a week old version as its base line so far. I do hope they provide a bit of clarity about how to go about actually pulling new versions of their IPEX branch though... also Interesting about the lack of GPU overflow/partial offload capability, I was not aware of that...
>
> Going to respond to this since the other comment was deleted from another person from Intel. I think it should be working but for some reason, it fails as it forces you to fully offload in the case of something like Llama 3 8B or faults on an illegal instruction for something bigger like Llama 3 70B or Command-R which had support just added from what I tested. I haven't upgraded in a while, so I'll probably be rechecking this before opening a ticket in the other repository to fix this since upstream works but the fork doesn't in this situation.

Do they have a fork of llama.cpp on GitHub? I actually haven't found it; I just installed from the Read the Docs site that I linked to. Hell, I don't actually know how to go about updating the install; I just have a hypothesis about what I need to do.

@tonym97 commented May 15, 2024

Most of the stuff for IPEX-LLM has been upstreamed into llama.cpp. IPEX-LLM llama.cpp vs llama.cpp (upstream) is basically the same perf at this point. I think the question shouldn't be for IPEX-LLM support, but for SYCL support using upstream llama.cpp (which the IPEX-LLM team is already upstreaming into llama.cpp)

Also note that this doesn't require IPEX itself. IPEX-LLM does, but the native SYCL support does not.

And yes I work for Intel and yes I'm talking to IPEX-LLM teams and others :)

@Thellton commented May 15, 2024

> Most of the stuff for IPEX-LLM has been upstreamed into llama.cpp. IPEX-LLM llama.cpp vs llama.cpp (upstream) is basically the same perf at this point. I think the question shouldn't be for IPEX-LLM support, but for SYCL support using upstream llama.cpp (which the IPEX-LLM team is already upstreaming into llama.cpp)
>
> Also note that this doesn't require IPEX itself. IPEX-LLM does, but the native SYCL support does not.
>
> And yes I work for Intel and yes I'm talking to IPEX-LLM teams and others :)

With a Q6_K quant of Llama 3 that had been quantized from a BF16 GGUF with the correct pre-tokenizer and EOS token, I get 30 tokens per second at the beginning of context with the IPEX branch, compared to 17 tokens per second with the llama.cpp SYCL build at version b2885. That's quite a stark difference in performance as I see it, and if it's possible, it would be awesome to see the IPEX branch's performance become generally available in the standard SYCL backend of llama.cpp, as installing the IPEX branch was troublesome.

So I'll be waiting with bated breath, I guess.

@tonym97 commented May 15, 2024

Yeah that’s fair. It definitely depends on model size etc.

Will work with the team to try to upstream as soon as we can.

@NeoZhangJianyu (Collaborator)

> which issue/pull would you recommend we follow for latest info about the SYCL branch @NeoZhangJianyu?
>
> EDIT: I take it that [SYCL] Refactor would be it?
>
>> IPEX LLM already supports llama.cpp I think: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html
>>
>> What IPEX-LLM has is a fork of llama.cpp and some other projects that has optimizations that have not been upstreamed here for one reason or another. I'm a current user of it and typically, usage of it doubles the speed of upstream. However, it can't support mixed GPU + CPU scenarios which is the main issue and new model support may take a while to filter over. Hence why I have upstream and the fork for my use cases.
>
> they seem to be keeping it reasonably up to date, as their published version of LlamaCPP-IPEX is using a week old version as its base line so far. I do hope they provide a bit of clarity about how to go about actually pulling new versions of their IPEX branch though... also Interesting about the lack of GPU overflow/partial offload capability, I was not aware of that...

I suggest using the latest code in the master branch.
There have been no obvious issues in the SYCL backend recently.

@christianazinn (Contributor)

Can also attest to differences between the SYCL build (as outlined in https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md) and the IPEX-LLM branch. Intel Arc A770M, Llama 3 8B Q8_0, full offload with the prompt "Building a website can be done in 10 simple steps:\nStep 1:": the Win11 and WSL2 Ubuntu SYCL builds get in the 4.3-4.9 tok/s range, while the WSL2 Ubuntu IPEX build from their branch gets 6.5-7.1 tok/s. Looking forward to upstreamed IPEX support!
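
For anyone wanting to reproduce that kind of comparison, running the same llama-bench command from each build should give comparable numbers (the model path is a placeholder):

```bash
# Measure prompt processing (512 tokens) and generation (128 tokens) with full offload;
# repeat from the SYCL build and the IPEX-LLM build and compare the reported t/s.
./build/bin/llama-bench \
  -m models/llama-3-8b-instruct.Q8_0.gguf \
  -p 512 -n 128 -ngl 99
```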
