Enhanced GPU discovery and multi-gpu support with concurrency #4517
base: main
Conversation
gpu/gpu.go
Outdated
switch runtime.GOOS {
case "windows":
	oneapiMgmtName = "ze_intel_gpu64.dll"
This DLL gets installed on Windows with Intel iGPUs as part of the OS base install and doesn't always open reliably. It seems to be causing some crashes on both Win10 and Win11, so we may want to put this behind a flag until we resolve those issues.
What I'm thinking is that I'll add a temporary check to see if we have a oneapi runner available, and if not, disable GPU discovery for the oneapi library. That way it can still be built from source and theoretically work, but it will be a true no-op for the official builds until we can test it more fully.
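For context, a rough sketch of the idea being floated here; the helper and variable names are hypothetical, and the follow-up below explains why this approach was ultimately dropped.

```go
package gpu

import "slices"

// bundledRunners stands in for however the build lists its runner payloads
// (hypothetical; not the actual mechanism used by the project).
var bundledRunners = []string{"cpu", "cuda"}

// oneapiDiscoveryEnabled reports whether Level Zero probing should run at all:
// official builds without a oneapi runner skip discovery entirely, while
// source builds that bundle the runner still get it.
func oneapiDiscoveryEnabled() bool {
	return slices.Contains(bundledRunners, "oneapi")
}
```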
SG!
Never mind - this would lead to circular dependencies, since the llm package with the payloads depends on gpu.
I'm pretty sure I fixed the bug that led to the crash on oneapi initialization, so I think we'll be OK leaving this in place.
@@ -232,6 +228,10 @@ func NewLlamaServer(gpus gpu.GpuInfoList, model string, ggml *GGML, adapters, pr

	params = append(params, "--parallel", fmt.Sprintf("%d", numParallel))

	if estimate.TensorSplit != "" {
		params = append(params, "--tensor-split", estimate.TensorSplit)
This is super cool! Can't wait to try it more on 2x, 4x and 8x gpu systems
Overall looks great! Small comment re: some oneapi DLL open panics we are seeing on Windows boxes with iGPUs - we'd want to avoid making that part of the critical path until we resolve this.
gpu/amd_linux.go
Outdated
	continue
}
filename := filepath.Join(devDir, m.filename)
fp, err := os.Open(filename)
os.ReadFile might be better here, since this is reading the full file into memory anyway and it doesn't require a Close.
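For illustration, a minimal sketch of what the suggested helper body could look like. It assumes the file holds a single numeric value, as the sysfs VRAM counters do; the parsing step is an assumption, not the PR's actual code.

```go
// Needs "os", "strconv" and "strings" imports in the enclosing file.
buf, err := os.ReadFile(filename)
if err != nil {
	return 0, err
}
// Trim the trailing newline sysfs files carry before parsing the number.
val, err := strconv.ParseUint(strings.TrimSpace(string(buf)), 10, 64)
if err != nil {
	return 0, err
}
return val, nil
```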
gpu/amd_linux.go
Outdated
// Found the matching DRM directory
slog.Debug("matched", "amdgpu", match, "drm", devDir)
totalFile := filepath.Join(devDir, DRMTotalMemoryFile)
totalFp, err := os.Open(totalFile)
os.ReadFile()?
gpu/amd_linux.go
Outdated
	TotalMemory: totalMemory,
	FreeMemory:  (totalMemory - usedMemory),
},
ID: fmt.Sprintf("%d", gpuID),
Suggested change:
-	ID: fmt.Sprintf("%d", gpuID),
+	ID: strconv.Itoa(gpuID),
gpu/amd_linux.go
Outdated
@@ -276,7 +315,7 @@ func AMDGetGPUInfo() []GpuInfo {
	libDir, err = AMDValidateLibDir()
	if err != nil {
		slog.Warn("unable to verify rocm library, will use cpu", "error", err)
-		return []GpuInfo{}
+		return []RocmGPUInfo{}
Suggested change:
-		return []RocmGPUInfo{}
+		return nil
gpu/amd_linux.go
Outdated
}

func getFreeMemory(usedFile string) (uint64, error) {
	usedFp, err := os.Open(usedFile)
os.ReadFile()?
llm/memory.go
Outdated
if layer, ok := layers["output_norm"]; ok {
	memoryLayerOutput += layer.size()
}

if layer, ok := layers["output"]; ok {
	memoryLayerOutput += layer.size()
} else if layer, ok := layers["token_embd"]; ok {
	memoryLayerOutput += layer.size()
}

if gpus[0].Library == "metal" && opts.UseMMap {
We might want to remove this, since it's due to a bug in llama.cpp caused by allocating memory based on the file offset of tensors, which might accidentally include the output layer.
Can you clarify? Prior to this bug being fixed in llama.cpp, do we run the risk of over-allocating layers if I removed this chunk?
if gpus[0].Library == "metal" && opts.UseMMap {
includeOutput = true
}
llm/memory.go
Outdated
	continue
}
gpusWithSpace = append(gpusWithSpace, gs{i, &gpus[i]})
gpuAllocations[i] += gpus[i].MinimumMemory + layerBuffer // We hold off on graph until we know partial vs. full
IIRC, multi-GPU should always use the partial graph size.
llm/memory.go
Outdated
	gpuAllocations[gpuZeroID] += gpuZeroOverhead
}

layerSizes = make([]uint64, int(ggml.KV().BlockCount()))
This only needs to be computed once, since all repeating layers are the same size.
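A rough sketch of what that could look like; the "blk.0" key is an assumption about how the repeating layer groups are named, not the PR's actual code.

```go
// Size one repeating block and reuse it for every layer, instead of
// summing tensors per layer inside the loop.
blockCount := int(ggml.KV().BlockCount())
var layerSize uint64
if blk, ok := layers["blk.0"]; ok {
	layerSize = blk.size()
}
layerSizes = make([]uint64, blockCount)
for i := range layerSizes {
	layerSizes[i] = layerSize
}
```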
llm/memory_test.go
Outdated
}
projectors := []string{}
opts := api.DefaultOptions()
estimate := EstimateGPULayers(gpus, ggml, projectors, opts)
Any call to EstimateGPULayers should be made inside t.Run so the cases can be run separately.
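For example, a sketch of how a case could be wrapped; the subtest name is illustrative and the setup of gpus, ggml, projectors and opts is elided as above.

```go
t.Run("single gpu", func(t *testing.T) {
	// Each case gets its own subtest so it can be run and reported independently.
	estimate := EstimateGPULayers(gpus, ggml, projectors, opts)
	if estimate.TensorSplit != "" {
		t.Errorf("unexpected tensor split: %q", estimate.TensorSplit)
	}
})
```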
llm/memory_test.go
Outdated
	},
}
// Nested array: GPU0 layer space, GPU1 layer space, expected gpu0, expected gpu1
for i, s := range [][]uint64{
a struct will be easier to read

Suggested change:
-	for i, s := range [][]uint64{
+	for i, s := range []struct {
+		layer0, layer1   uint64
+		expect0, expect1 uint64
+	}{
This reverts commit 476fb8e.
The amdgpu driver's free VRAM reporting omits some other apps, so leverage the upstream DRM driver, which keeps better tabs on things.
Now that we call the GPU discovery routines many times to update memory, this splits initial discovery from free memory updating.
This worked remotely, but wound up trying to spawn multiple servers locally, which doesn't work.
Still not complete; our prediction needs some refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.
Our default behavior today is to try to fit into a single GPU if possible. Some users would prefer the old behavior of always spreading across multiple GPUs even if the model can fit into one. This exposes that tunable behavior.
Adjust timing on some tests so they don't time out on small/slow GPUs.
This library will give us the most reliable free VRAM reporting on Windows to enable concurrent model scheduling.
While models are loading, the VRAM metrics are dynamic, so try to load on a GPU that doesn't have a model actively loading, or wait to avoid races that lead to OOMs
Carries (and obsoletes if we move this one forward first) #4266 and #4441
This refines our GPU discovery to split it into bootstrapping where we discover information about the GPUs once at startup, and then incrementally refresh just free space information, instead of fully rediscovering the GPUs over and over.
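At a high level, the split looks roughly like the sketch below; the type and function names are illustrative stand-ins, not the PR's actual API.

```go
// Illustrative only: static GPU properties are gathered once at startup,
// then only the free-VRAM counter is re-read on later scheduling decisions.
type gpuHandle struct {
	id          string
	totalMemory uint64 // discovered once during bootstrap
	freeMemory  uint64 // refreshed before each load decision
}

// bootstrapGPUs stands in for the expensive one-time pass: load the
// management libraries, enumerate devices, record immutable properties.
func bootstrapGPUs() []gpuHandle {
	return []gpuHandle{{id: "0", totalMemory: 8 << 30, freeMemory: 8 << 30}}
}

// refreshFreeMemory stands in for the cheap incremental pass: only the
// free-VRAM counters of already-known devices are re-queried.
func refreshFreeMemory(gpus []gpuHandle, queryFree func(id string) uint64) {
	for i := range gpus {
		gpus[i].freeMemory = queryFree(gpus[i].id)
	}
}
```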
Fixes #3158
Fixes #4198
Fixes #3765