vulkan : add dynamic VRAM heuristic for low-VRAM GPUs #17485
base: master
Conversation
Implements a dynamic VRAM allocation heuristic that automatically calculates the optimal number of GPU layers to offload based on available VRAM.

Changes:
- Added ggml_backend_vk_get_device_info and ggml_backend_vk_get_default_gpu_layers to ggml-vulkan.cpp
- Added dynamic heuristic to common_model_params_to_llama in common.cpp
- Added llama-vk-device-info tool for inspecting Vulkan devices
- Added documentation in docs/vulkan_low_vram.md

Tested on AMD RX 6500 XT with 4GB VRAM, achieving a 2.5-3.1x speedup.
common/common.cpp (Outdated)

    if (params.n_gpu_layers != -1) {
        mparams.n_gpu_layers = params.n_gpu_layers;
    }
    #ifdef GGML_USE_VULKAN
Backend-specific code was removed from outside the GGML library a long time ago; you have to use the backend interface to access these features in a generic way. You can already do that for total and free VRAM.
Have you looked into #16653 to see if that already covers your use case? That proposes a backend-agnostic way to automate setting parameters.
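For reference, a minimal sketch of such a backend-agnostic query, assuming the ggml_backend_dev_* device API from ggml-backend.h (illustrative only, not code from this PR):

```cpp
#include <cstdio>

#include "ggml-backend.h"

int main() {
    // Enumerate every registered backend device and report its free/total
    // memory without including any backend-specific header such as ggml-vulkan.h.
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);

        size_t free_mem  = 0;
        size_t total_mem = 0;
        ggml_backend_dev_memory(dev, &free_mem, &total_mem);

        printf("%s: %zu MiB free / %zu MiB total\n",
            ggml_backend_dev_name(dev),
            free_mem  / (1024 * 1024),
            total_mem / (1024 * 1024));
    }
    return 0;
}
```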
Thank you for covering the first part, but you have not answered the questions yet. Have you tried that PR?
Replaces Vulkan-specific calls with ggml_backend_dev_memory to be backend-agnostic. Reverts changes to ggml-vulkan.cpp/h and removes vk_device_info example to comply with reviewer feedback.
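For context, the resulting hook could look roughly like the sketch below, assuming the ggml_backend_dev_* API for device enumeration, type, and memory queries. pick_default_gpu_layers and estimate_gpu_layers are hypothetical names used for illustration, not functions from this PR; a sketch of the heuristic itself appears further down in the description.

```cpp
#include <cstddef>

#include "ggml-backend.h"

// Hypothetical helper implementing the VRAM heuristic (sketched further down
// in the PR description).
int estimate_gpu_layers(size_t free_vram_bytes, size_t model_size_bytes, int n_layers);

// Sketch of a backend-agnostic way to pick n_gpu_layers when the user left it
// at the default of -1 (illustrative, not the exact code in this PR).
int pick_default_gpu_layers(int requested_gpu_layers, size_t model_size_bytes, int n_layers) {
    if (requested_gpu_layers != -1) {
        return requested_gpu_layers; // an explicit user setting wins
    }
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        if (ggml_backend_dev_type(dev) != GGML_BACKEND_DEVICE_TYPE_GPU) {
            continue; // skip CPU and accelerator devices
        }
        size_t free_mem  = 0;
        size_t total_mem = 0;
        ggml_backend_dev_memory(dev, &free_mem, &total_mem);

        // derive the layer count from free VRAM instead of a Vulkan-specific call
        return estimate_gpu_layers(free_mem, model_size_bytes, n_layers);
    }
    return 0; // no GPU device found: keep everything on the CPU
}
```

Considering only the first GPU device is a deliberate simplification in this sketch; multi-GPU setups would need per-device accounting.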
Wow, this is embarrassing. I need to test whether the performance is on par with my heuristic for low-VRAM GPUs and, if not, figure out how I can incorporate it into this awesome PR's code. This is much more elaborate than mine, but sort of a Swiss Army knife approach. Let me dig into this and do some metrics.
Dynamic n_gpu_layers Heuristic for Low-VRAM GPUs

Summary

This PR implements a dynamic n_gpu_layers calculation based on available VRAM to enable optimal GPU offloading on low-VRAM devices like the AMD RX 6500 XT.

Motivation

The primary motivation for this PR is to enable use of llama.cpp on low-VRAM GPUs such as the AMD RX 6500 XT, which is particularly compelling due to its low power consumption and affordability. Many users, including myself, cannot justify purchasing a higher-end GPU, yet still want meaningful acceleration from Vulkan offloading.

Instead of requiring users to manually tune n_gpu_layers, this PR automates the process to prevent OOM crashes while maximizing acceleration.

The design also comports with the expectations outlined in the llama.cpp CONTRIBUTING.md guidelines.
Changes

Core Implementation

- Dynamic Heuristic (common/common.cpp):
  - Computes n_gpu_layers based on available VRAM (see the sketch below)
  - Applies only when n_gpu_layers = -1 (the default)
- VRAM Query API (ggml-vulkan.cpp):
  - Added ggml_backend_vk_get_device_memory() to query available VRAM
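A rough sketch of the kind of calculation such a heuristic can perform is shown below; the 512 MiB overhead reservation, the 80% safety margin, and the function name are illustrative assumptions, not the exact values used in common.cpp.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative heuristic: choose how many layers to offload from free VRAM.
// free_vram_bytes would come from the backend memory query; model_size_bytes
// and n_layers from the metadata of the model being loaded.
int estimate_gpu_layers(size_t free_vram_bytes, size_t model_size_bytes, int n_layers) {
    const size_t ctx_overhead = 512ull * 1024 * 1024; // reserve for KV cache / compute buffers (placeholder)
    if (free_vram_bytes <= ctx_overhead || n_layers <= 0 || model_size_bytes == 0) {
        return 0;                                      // not enough headroom: keep everything on the CPU
    }
    const size_t usable_vram     = (free_vram_bytes - ctx_overhead) * 8 / 10; // 80% safety margin (placeholder)
    const size_t bytes_per_layer = model_size_bytes / (size_t) n_layers;      // average per-layer weight size

    const int layers = (int) (usable_vram / bytes_per_layer);
    return std::min(layers, n_layers + 1);             // +1 allows offloading the output layer too
}
```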
Documentation & Testing

- docs/windows_vulkan_low_vram.md

Performance (llama-bench)
Hardware: AMD RX 6500 XT (4GB VRAM)
Model: Gemma 2B Q4_K_M (1.59 GiB)
Performance Summary
Multi-Model Results
Key Insight: The heuristic maximizes offloading for small models while preventing OOM on larger models.
Testing

Compliance

- clang-format run

Maintainer
Requesting review from @0cc4m (Vulkan backend maintainer per CODEOWNERS).
Willing to maintain this long-term if accepted as a collaborator, and I hope to extend this method to whisper and ggml for the same motivations!