Enhanced GPU discovery and multi-gpu support with concurrency #4517
base: main
Conversation
gpu/gpu.go
Outdated
switch runtime.GOOS {
case "windows":
	oneapiMgmtName = "ze_intel_gpu64.dll"
This DLL gets installed on Windows with Intel iGPUs as part of the OS base install and doesn't always open reliably. It seems to be causing some crashes on both Win10 and Win11, so we may want to put this behind a flag until we resolve those issues.
What I'm thinking is that I'll add a temporary check to see if we have a oneapi runner available, and if not, disable GPU discovery for the oneapi library. That way it can still be built from source and theoretically work, but it will be a true no-op for the official builds until we can test it more fully.
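For context, a rough sketch of the idea being floated here; the helper and variable names are hypothetical, and the follow-up below explains why this approach was ultimately dropped.

```go
package gpu

import "slices"

// bundledRunners stands in for however the build lists its runner payloads
// (hypothetical; not the actual mechanism used by the project).
var bundledRunners = []string{"cpu", "cuda"}

// oneapiDiscoveryEnabled reports whether Level Zero probing should run at all:
// official builds without a oneapi runner skip discovery entirely, while
// source builds that bundle the runner still get it.
func oneapiDiscoveryEnabled() bool {
	return slices.Contains(bundledRunners, "oneapi")
}
```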
SG!
Never mind - this would lead to circular dependencies, since the llm package with the payloads depends on gpu.
I'm pretty sure I fixed the bug that led to the crash on oneapi initialization, so I think we'll be OK leaving this in place.
@@ -232,6 +228,10 @@ func NewLlamaServer(gpus gpu.GpuInfoList, model string, ggml *GGML, adapters, pr

	params = append(params, "--parallel", fmt.Sprintf("%d", numParallel))

	if estimate.TensorSplit != "" {
		params = append(params, "--tensor-split", estimate.TensorSplit)
This is super cool! Can't wait to try it more on 2x, 4x and 8x gpu systems
Overall looks great! Small comment re: some oneapi DLL open panics we are seeing on Windows boxes with iGPUs - we'd want to avoid making that part of the critical path until we resolve this.
gpu/amd_linux.go
Outdated
	continue
}
filename := filepath.Join(devDir, m.filename)
fp, err := os.Open(filename)
os.ReadFile might be better here, since this is reading the full file into memory anyway and it doesn't require a Close.
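For illustration, a minimal sketch of what the suggested helper body could look like. It assumes the file holds a single numeric value, as the sysfs VRAM counters do; the parsing step is an assumption, not the PR's actual code.

```go
// Needs "os", "strconv" and "strings" imports in the enclosing file.
buf, err := os.ReadFile(filename)
if err != nil {
	return 0, err
}
// Trim the trailing newline sysfs files carry before parsing the number.
val, err := strconv.ParseUint(strings.TrimSpace(string(buf)), 10, 64)
if err != nil {
	return 0, err
}
return val, nil
```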
gpu/amd_linux.go
Outdated
// Found the matching DRM directory
slog.Debug("matched", "amdgpu", match, "drm", devDir)
totalFile := filepath.Join(devDir, DRMTotalMemoryFile)
totalFp, err := os.Open(totalFile)
os.ReadFile()?
gpu/amd_linux.go
Outdated
	TotalMemory: totalMemory,
	FreeMemory:  (totalMemory - usedMemory),
},
ID: fmt.Sprintf("%d", gpuID),
Suggested change:
-	ID: fmt.Sprintf("%d", gpuID),
+	ID: strconv.Itoa(gpuID),
gpu/amd_linux.go
Outdated
@@ -276,7 +315,7 @@ func AMDGetGPUInfo() []GpuInfo {
	libDir, err = AMDValidateLibDir()
	if err != nil {
		slog.Warn("unable to verify rocm library, will use cpu", "error", err)
-		return []GpuInfo{}
+		return []RocmGPUInfo{}
Suggested change:
-		return []RocmGPUInfo{}
+		return nil
gpu/amd_linux.go
Outdated
}

func getFreeMemory(usedFile string) (uint64, error) {
	usedFp, err := os.Open(usedFile)
os.ReadFile()?
llm/memory.go
Outdated
if layer, ok := layers["output_norm"]; ok {
	memoryLayerOutput += layer.size()
}

if layer, ok := layers["output"]; ok {
	memoryLayerOutput += layer.size()
} else if layer, ok := layers["token_embd"]; ok {
	memoryLayerOutput += layer.size()
}

if gpus[0].Library == "metal" && opts.UseMMap {
We might want to remove this, since it's due to a bug in llama.cpp caused by allocating memory based on the file offset of tensors, which might accidentally include the output layer.
Can you clarify? Prior to this bug being fixed in llama.cpp, do we run the risk of over-allocating layers if I removed this chunk?
if gpus[0].Library == "metal" && opts.UseMMap {
includeOutput = true
}
llm/memory.go
Outdated
	continue
}
gpusWithSpace = append(gpusWithSpace, gs{i, &gpus[i]})
gpuAllocations[i] += gpus[i].MinimumMemory + layerBuffer // We hold off on graph until we know partial vs. full
IIRC, multi-GPU should always use the partial graph size.
llm/memory.go
Outdated
	gpuAllocations[gpuZeroID] += gpuZeroOverhead
}

layerSizes = make([]uint64, int(ggml.KV().BlockCount()))
This only needs to be computed once, since all repeating layers are the same size.
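A rough sketch of what that could look like; the "blk.0" key is an assumption about how the repeating layer groups are named, not the PR's actual code.

```go
// Size one repeating block and reuse it for every layer, instead of
// summing tensors per layer inside the loop.
blockCount := int(ggml.KV().BlockCount())
var layerSize uint64
if blk, ok := layers["blk.0"]; ok {
	layerSize = blk.size()
}
layerSizes = make([]uint64, blockCount)
for i := range layerSizes {
	layerSizes[i] = layerSize
}
```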
llm/memory_test.go
Outdated
}
projectors := []string{}
opts := api.DefaultOptions()
estimate := EstimateGPULayers(gpus, ggml, projectors, opts)
Any call to EstimateGPULayers should be made inside t.Run so the cases can be run separately.
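For example, a sketch of how a case could be wrapped; the subtest name is illustrative and the setup of gpus, ggml, projectors and opts is elided as above.

```go
t.Run("single gpu", func(t *testing.T) {
	// Each case gets its own subtest so it can be run and reported independently.
	estimate := EstimateGPULayers(gpus, ggml, projectors, opts)
	if estimate.TensorSplit != "" {
		t.Errorf("unexpected tensor split: %q", estimate.TensorSplit)
	}
})
```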
llm/memory_test.go
Outdated
	},
}
// Nested array: GPU0 layer space, GPU1 layer space, expected gpu0, expected gpu1
for i, s := range [][]uint64{
a struct will be easier to read

Suggested change:
-	for i, s := range [][]uint64{
+	for i, s := range []struct {
+		layer0, layer1   uint64
+		expect0, expect1 uint64
+	}{
This reverts commit 476fb8e.
The amdgpu driver's free VRAM reporting omits some other apps, so leverage the upstream DRM driver, which keeps better tabs on things.
Now that we call the GPU discovery routines many times to update memory, this splits initial discovery from free memory updating.
This worked remotely, but wound up trying to spawn multiple servers locally, which doesn't work.
Still not complete; our prediction needs some refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.
Our default behavior today is to try to fit into a single GPU if possible. Some users would prefer the old behavior of always spreading across multiple GPUs even if the model can fit into one. This exposes that tunable behavior.
Adjust timing on some tests so they don't time out on small/slow GPUs.
This library will give us the most reliable free VRAM reporting on Windows to enable concurrent model scheduling.
While models are loading, the VRAM metrics are dynamic, so try to load on a GPU that doesn't have a model actively loading, or wait to avoid races that lead to OOMs
Carries (and obsoletes if we move this one forward first) #4266 and #4441
This refines our GPU discovery to split it into bootstrapping where we discover information about the GPUs once at startup, and then incrementally refresh just free space information, instead of fully rediscovering the GPUs over and over.
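At a high level, the split looks roughly like the sketch below; the type and function names are illustrative stand-ins, not the PR's actual API.

```go
// Illustrative only: static GPU properties are gathered once at startup,
// then only the free-VRAM counter is re-read on later scheduling decisions.
type gpuHandle struct {
	id          string
	totalMemory uint64 // discovered once during bootstrap
	freeMemory  uint64 // refreshed before each load decision
}

// bootstrapGPUs stands in for the expensive one-time pass: load the
// management libraries, enumerate devices, record immutable properties.
func bootstrapGPUs() []gpuHandle {
	return []gpuHandle{{id: "0", totalMemory: 8 << 30, freeMemory: 8 << 30}}
}

// refreshFreeMemory stands in for the cheap incremental pass: only the
// free-VRAM counters of already-known devices are re-queried.
func refreshFreeMemory(gpus []gpuHandle, queryFree func(id string) uint64) {
	for i := range gpus {
		gpus[i].freeMemory = queryFree(gpus[i].id)
	}
}
```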
Fixes #3158
Fixes #4198
Fixes #3765