feat: Add Support for Devices other than CUDA in Telemetry and Web UI #902

cromefire · 2023-11-26T18:59:21Z

Realized via AMD hipBLAS (ROCm/HIP)

Fixes #636

cromefire · 2023-11-26T19:55:19Z

Okay, the GPU name is also shown in the UI now. It's currently closer to an ID, but I've opened an issue with the library to expose the marketing (user-recognizable) name: PTFOPlayer/rocm_smi_lib_rs#2

cromefire · 2023-11-26T20:43:53Z

Documentation is also there now, it should be pretty okay-ish for a start.

wsxiaoys

Thanks for the PR.

First round of comments focusing on source code

crates/tabby/src/main.rs

crates/tabby/src/services/health.rs

crates/llama-cpp-bindings/build.rs

.github/workflows/docker.yml

crates/tabby/src/main.rs

crates/tabby/src/services/health.rs

ee/tabby-ui/app/(dashboard)/components/worker-card.tsx

ee/tabby-ui/app/(dashboard)/page.tsx

cromefire · 2023-11-28T23:02:35Z

crates/tabby-common/src/api/accelerator.rs

+ /// Universally unique ID of the accelerator, if available
+ pub uuid: Option<String>,


The biggest issue now is whether we can prevent this field from being part of the telemetry. If not, it definitely has to be removed, as it's basically almost equivalent to a serial number. It's nice for tracking a GPU in the accelerators, but it's a huge issue if that data leaves your control.
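One way to keep the field out of telemetry can be sketched as follows. This is only an illustration, not the code from this PR: apart from `uuid`, every name here (`display_name`, `without_uuid`, `telemetry_accelerators`) is a hypothetical placeholder.

```rust
/// Hypothetical accelerator record; only `uuid` mirrors the field quoted above.
#[derive(Debug, Clone, PartialEq)]
pub struct Accelerator {
    /// Universally unique ID of the accelerator, if available
    pub uuid: Option<String>,
    /// User-recognizable device name (assumed field, for illustration only)
    pub display_name: String,
}

impl Accelerator {
    /// Returns a copy with the UUID dropped, so it is safe to send off-machine.
    pub fn without_uuid(&self) -> Accelerator {
        Accelerator {
            uuid: None,
            ..self.clone()
        }
    }
}

/// Builds the accelerator list for a telemetry payload (hypothetical helper).
/// The local list keeps its UUIDs; only the returned copies are scrubbed.
pub fn telemetry_accelerators(local: &[Accelerator]) -> Vec<Accelerator> {
    local.iter().map(Accelerator::without_uuid).collect()
}
```

With this shape, the health endpoint could keep returning the full records locally while the telemetry path only ever sees the scrubbed copies.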

cromefire · 2023-11-28T23:09:20Z

As for why I use rocminfo instead of a library: there's no Rust wrapper for libhsa-runtime64. I tried writing my own quickly, but due to C being C it seems to create more problems than it solves (and my Rust FFI and C skills aren't great). You would also have to link it, which is rather fragile, or load it at runtime, which is more of a guessing game, while rocminfo should be available whenever ROCm is installed, is on the path, and just works. And if not, worst case, it doesn't show the GPUs in the health info. Maybe long term, at least on Linux, some sort of generic approach using the kernel interfaces is possible (via DRM or so).
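The rocminfo approach described above might look roughly like the sketch below. It is not the code from this PR. It assumes that `rocminfo` prints one block per agent, introduced by a line starting with `Agent `, and that each block contains `Key: value` lines named `Marketing Name`, `Uuid`, and `Device Type`; the type and function names are hypothetical.

```rust
use std::process::Command;

/// GPU entry recovered from `rocminfo`-style text (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct RocmGpu {
    pub marketing_name: String,
    pub uuid: Option<String>,
}

/// Fields collected for the agent block currently being read.
#[derive(Default)]
struct Pending {
    marketing_name: Option<String>,
    uuid: Option<String>,
    is_gpu: bool,
}

impl Pending {
    /// Emits the collected agent if it is a GPU, then resets the state.
    fn flush(&mut self, out: &mut Vec<RocmGpu>) {
        let done = std::mem::take(self);
        if done.is_gpu {
            if let Some(marketing_name) = done.marketing_name {
                out.push(RocmGpu { marketing_name, uuid: done.uuid });
            }
        }
    }
}

/// Parses text in the assumed `rocminfo` layout and keeps only GPU agents.
pub fn parse_rocminfo(text: &str) -> Vec<RocmGpu> {
    let mut gpus = Vec::new();
    let mut pending = Pending::default();
    for line in text.lines() {
        let line = line.trim();
        if line.starts_with("Agent ") {
            // A new agent block starts: finish the previous one.
            pending.flush(&mut gpus);
            continue;
        }
        if let Some((key, value)) = line.split_once(':') {
            let value = value.trim();
            match key.trim() {
                "Marketing Name" => pending.marketing_name = Some(value.to_string()),
                "Uuid" => pending.uuid = Some(value.to_string()),
                "Device Type" => pending.is_gpu = value == "GPU",
                _ => {}
            }
        }
    }
    pending.flush(&mut gpus);
    gpus
}

/// Runs `rocminfo` from the PATH; any failure yields an empty list,
/// matching the "worst case, it doesn't show the GPUs" fallback.
pub fn detect_rocm_gpus() -> Vec<RocmGpu> {
    match Command::new("rocminfo").output() {
        Ok(out) if out.status.success() => {
            parse_rocminfo(&String::from_utf8_lossy(&out.stdout))
        }
        _ => Vec::new(),
    }
}
```

Keeping the parser separate from the process call means it can be exercised against captured output on machines without ROCm installed.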

wsxiaoys · 2023-11-29T03:23:34Z

Hey, considering the increasing size of this pull request, I have extracted #913 to focus on the basic ROCm support as a preliminary step

Amzd · 2023-12-08T15:02:18Z

The other PR didn't add any docs, so I tried the command from this PR's documentation, but couldn't get it to work. Is there a trick I'm missing?

docker run -it \
  --device /dev/dri --device /dev/kfd \
  -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby-rocm \
  serve --model TabbyML/StarCoder-1B --device rocm

cromefire · 2023-12-08T16:17:28Z

The other PR didn't include the Docker image, so you'll have to build it yourself.

Frozen-byte · 2023-12-09T17:07:06Z

I am sadly unable to run this docker:

$ docker-compose up 
Recreating tabby_tabby_1 ... done
Attaching to tabby_tabby_1
tabby_1  | 2023-12-09T17:02:11.296325Z  INFO tabby::serve: crates/tabby/src/serve.rs:105: Starting server, this might takes a few minutes...
tabby_1  | ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
tabby_1  | ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
tabby_1  | ggml_init_cublas: found 2 ROCm devices:
tabby_1  |   Device 0: AMD Radeon RX 6900 XT, compute capability 10.3
tabby_1  |   Device 1: AMD Radeon Graphics, compute capability 10.3
tabby_1  | ggml_cuda_set_main_device: using device 0 (AMD Radeon RX 6900 XT) as main device
tabby_1  | 
tabby_1  | CUDA error 98 at /root/workspace/crates/llama-cpp-bindings/llama.cpp/ggml-cuda.cu:6779: invalid device function
tabby_1  | current device: 0
tabby_tabby_1 exited with code 1

my docker-compose.yml:

version: '3.5'

services:
  tabby:
    build:
      context: .
      dockerfile: rocm.Dockerfile
    command: serve --model TabbyML/StarCoder-1B --device rocm
    volumes:
      - "$HOME/Docker/tabby/data:/data"
    ports:
      - 8229:8080
    environment:
      - TABBY_DISABLE_USAGE_COLLECTION=1
    devices:
      - /dev/dri
      - /dev/kfd

cromefire · 2023-12-09T17:11:14Z

You'll probably need to remove your iGPU from the container; ROCm sometimes has problems with that. (Alternatively, I think the llama.cpp docs also call out an environment variable to make it use only a specific GPU.)

cromefire · 2023-12-10T12:56:23Z

Updating this from main is starting to get complicated, because there are conflicts in a lot of places, so it would be great to get this in soon... I'm trying to fix it up now, but I'm not sure how long I can keep doing this; it's quite time consuming.

Frozen-byte · 2023-12-10T13:35:23Z

Thank you! I can verify that pinning devices works perfectly now (1bc6f48):

    devices:
      - /dev/dri/card0
      - /dev/dri/renderD128
      - /dev/kfd

cromefire · 2023-12-10T14:01:16Z

Also, the llama.cpp fork should probably be updated sooner rather than later, as there are some fixes for AMD GPU performance in there.

wsxiaoys · 2023-12-10T14:11:59Z

Hey @cromefire - I would suggest first having a PR that checks in the Docker image / binary build; then we can review how to integrate ROCm device information into the dashboard.

cromefire · 2023-12-10T14:17:29Z

Sure. I'll extract that. The changes to support more accelerators are the hardest to maintain though.

PS: Also keep in mind that it soon won't be only ROCm device info, but probably oneAPI and/or Vulkan as well, which is why I made this a more generic Accelerator interface.
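As a rough illustration of what a backend-agnostic accelerator interface could look like, here is a minimal sketch. It is not the interface from this PR; `DeviceType`, `AcceleratorInfo`, `AcceleratorBackend`, `StaticBackend`, and `list_accelerators` are all hypothetical names, and only two backends are modeled.

```rust
/// Kind of compute backend (hypothetical; further variants could be added later).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceType {
    Cuda,
    Rocm,
}

/// Backend-independent description of one device (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct AcceleratorInfo {
    pub device_type: DeviceType,
    pub display_name: String,
}

/// Each backend only has to report its kind and the devices it can see.
pub trait AcceleratorBackend {
    fn device_type(&self) -> DeviceType;
    fn device_names(&self) -> Vec<String>;
}

/// Stand-in backend with a fixed device list, used here instead of real probing.
pub struct StaticBackend {
    pub kind: DeviceType,
    pub names: Vec<String>,
}

impl AcceleratorBackend for StaticBackend {
    fn device_type(&self) -> DeviceType {
        self.kind
    }

    fn device_names(&self) -> Vec<String> {
        self.names.clone()
    }
}

/// Merges the devices of all backends into one list for the UI or health info.
pub fn list_accelerators(backends: &[Box<dyn AcceleratorBackend>]) -> Vec<AcceleratorInfo> {
    backends
        .iter()
        .flat_map(|backend| {
            let kind = backend.device_type();
            backend
                .device_names()
                .into_iter()
                .map(move |display_name| AcceleratorInfo { device_type: kind, display_name })
        })
        .collect()
}
```

The point of the trait is that supporting another backend would mean adding one enum variant and one implementation, without touching the code that consumes the merged list.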

* Added rocm builds and documentation
* Pulled build improvements from #902
* Fixed build container for rocm build
* Install git in rocm container
* Fixed github step
* Try to fix if statement
* Added more generic dependency installation
* upgraded rustup action
* Update sccache
* Try pytorch manylinux image
* Switched location for toolchain parameter
* Downgraded to deprecated action again
* Readded set default step
* Install minimal rocm on the fly
* fixed typo in binary name
* Downgraded checkout action
* Use curl to download
* Add -y flag to yum
* Also install rocblas
* Update release.yml
* Update release.yml
* Update prepare_build_environment.sh
* Update prepare_build_environment.sh
* Update build.rs
* Update build.rs
* Update README.md
* Update website/docs/faq.mdx
* Update index.md
* Update and rename docker-cuda.yml to docker.yml
* Delete .github/workflows/docker-rocm.yml
* Delete rocm.Dockerfile
* Rename cuda.Dockerfile to Dockerfile
* Update docker.yml
* Update website/docs/installation/docker.mdx
* Update website/docs/installation/docker-compose.mdx
* Update docker-compose.mdx
* Update docker-compose.mdx
* Update docker.mdx
* Update docker.mdx
* Update website/docs/faq.mdx

---------

Co-authored-by: Meng Zhang <[email protected]>

cromefire · 2023-12-22T13:19:01Z

Just removed the conflicts and maybe another friendly reminder that these changes suck up a lot of time the longer they lie around and I constantly have to resolve the conflicts...

JumpLink · 2024-01-05T14:11:50Z

super nice, when will the PR be merged?

cromefire and others added 10 commits November 25, 2023 21:29

Added build configurations for Intel and AMD hardware

aa6bf9d

Improved rocm build

5732d7d

Added options for OneAPI and ROCm

1e05350

Build llama using icx

3496bb9

[autofix.ci] apply automated fixes

6098475

Fixed rocm image

7d38fc4

Build ROCm

4f90c98

Tried to adjust compile flags for SYCL

c0804cb

Removed references to oneAPI

8d56d37

Provide info about the used device for ROCm

dd72b82

cromefire changed the title ~~Add support for AMD hardware~~ feat: Add support for AMD hardware Nov 26, 2023

cromefire mentioned this pull request Nov 26, 2023

feat: Add support Intel #895

Closed

Added ROCm documentation

b4a02d1

wsxiaoys requested changes Nov 27, 2023

crates/tabby/src/main.rs (outdated)

crates/tabby/src/services/health.rs (outdated)

crates/llama-cpp-bindings/build.rs (outdated)

.github/workflows/docker.yml (outdated)

Addressed review comments

7b40db6

wsxiaoys requested changes Nov 28, 2023

cromefire added 2 commits November 28, 2023 23:44

Refactored to expose generic accelerator information

55c7884

Pull request cleanup

cbbdf22

cromefire requested a review from wsxiaoys November 28, 2023 22:59

cromefire commented Nov 28, 2023

wsxiaoys mentioned this pull request Nov 29, 2023

feat: add rocm support #913

Merged

autofix-ci bot and others added 2 commits December 10, 2023 12:43

[autofix.ci] apply automated fixes

f80007c

Merge branch 'main' into rocm-support

dd80a3a

Tried to fix most build issues

b993d5d

cromefire added 2 commits December 10, 2023 14:37

Fixed the rest of the build issues

1bc6f48

Merge branch 'main' into rocm-support

c73b518

# Conflicts: # ee/tabby-ui/app/(dashboard)/cluster/components/cluster.tsx

Added rocm binary build

94d22c4

cromefire mentioned this pull request Dec 10, 2023

feat: Add rocm builds and documentation #1012

Merged

cromefire added a commit to cromefire/tabby that referenced this pull request Dec 10, 2023

Pulled build improvements from TabbyML#902

1be9c1e

cromefire changed the title ~~feat: Add support for AMD hardware~~ feat: Add support for telemetry and web UI display for devices other than CUDA Dec 10, 2023

cromefire added 3 commits December 22, 2023 13:00

Added ROCm 6.0 build

b3ef2e2

Fixed build and slimmed down container

df62a7d

cromefire added 4 commits January 2, 2024 18:47

Fixed certificates in docker file

234f58a

Fixed merge issue

28ce9f1

Fixed merge issue in workflow

fd0891b

cromefire changed the title ~~feat: Add support for telemetry and web UI display for devices other than CUDA~~ feat: Add support for devices other than CUDA in telemetry and web UI Jan 2, 2024

cromefire changed the title ~~feat: Add support for devices other than CUDA in telemetry and web UI~~ feat: Add Support for Devices other than CUDA in Telemetry and Web UI Jan 2, 2024

cromefire added 3 commits January 14, 2024 23:47

Merge branch 'main' into rocm-support

b471554

# Conflicts: # crates/tabby/src/main.rs

Added support for specifying targets for docker build

b13bbec

Fixed docker build

99f63f3
