cafeTechne commented Nov 25, 2025

Dynamic n_gpu_layers Heuristic for Low-VRAM GPUs

Summary

This PR implements a dynamic n_gpu_layers calculation based on available VRAM to enable optimal GPU offloading on low-VRAM devices like the AMD RX 6500 XT.

Motivation

The primary motivation for this PR is to enable use of llama.cpp on low-VRAM GPUs such as the AMD RX 6500 XT, which is particularly compelling due to its low power consumption and affordability. Many users, including myself, cannot justify purchasing a higher-end GPU, yet still want meaningful acceleration from Vulkan offloading.

Instead of requiring users to manually tune n_gpu_layers, this PR automates the process to prevent OOM crashes while maximizing acceleration.

The design also comports with the expectations outlined in the llama.cpp CONTRIBUTING.md guidelines:

  • The feature is self-contained and maintains codebase minimalism.
  • It adds functionality without modifying core operators.
  • It uses clear naming conventions and avoids architectural complexity.
  • It provides documentation, benchmarks, and reasoning consistent with contributor requirements.

Changes

Core Implementation

Dynamic Heuristic (common/common.cpp):

  • Queries GGUF metadata for model size and layer count
  • Calculates the optimal n_gpu_layers from available VRAM (see the sketch below)
  • Reserves 800 MB of overhead for the KV cache and compute buffers
  • Triggered when n_gpu_layers = -1 (default)
  • Generalizes across architectures (Gemma, Llama, Qwen, etc.)
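
To make the arithmetic concrete, here is a minimal sketch of the kind of calculation the heuristic performs. It is illustrative only: the function name, the integer types, and the flat per-layer cost estimate are assumptions, not the exact code in this PR.

```cpp
#include <algorithm>
#include <cstdint>

// Minimal sketch of the heuristic: estimate how many whole layers fit in VRAM
// once a fixed overhead for the KV cache and compute buffers is reserved.
// All sizes are in bytes; the names and types here are illustrative.
static int32_t estimate_gpu_layers(uint64_t free_vram, uint64_t model_size, int32_t n_layers) {
    const uint64_t overhead = 800ull * 1024 * 1024; // reserved for KV cache + compute buffers
    if (n_layers <= 0 || model_size == 0 || free_vram <= overhead) {
        return 0; // not enough headroom: fall back to CPU-only
    }
    const uint64_t budget          = free_vram - overhead;
    const uint64_t bytes_per_layer = std::max<uint64_t>(1, model_size / (uint64_t) n_layers);
    const int32_t  fit             = (int32_t) (budget / bytes_per_layer);
    return std::min(fit, n_layers); // never offload more layers than the model has
}
```

An explicit user-supplied n_gpu_layers still takes precedence; the calculation above only applies in the default (-1) case.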

VRAM Query API (ggml-vulkan.cpp):

  • Added ggml_backend_vk_get_device_memory() to query available VRAM (usage sketch below)
  • Exposes device-memory info to the heuristic layer
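
For reference, this is roughly how the heuristic layer would consume that query. The free/total out-parameter signature is an assumption based on how other ggml backends expose device memory; treat it as a sketch rather than the exact interface added by this PR.

```cpp
#include <cstdio>

#include "ggml-vulkan.h"

int main() {
    size_t free_mem  = 0;
    size_t total_mem = 0;
    // Assumed signature: device index plus free/total out-parameters.
    ggml_backend_vk_get_device_memory(/*device =*/ 0, &free_mem, &total_mem);
    printf("Vulkan device 0: %zu MiB free of %zu MiB total\n",
           free_mem / (1024 * 1024), total_mem / (1024 * 1024));
    return 0;
}
```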

Documentation & Testing

  • Added docs/windows_vulkan_low_vram.md
  • Benchmark scripts for validation
  • Inline comments explaining heuristic logic

Performance (llama-bench)

Hardware: AMD RX 6500 XT (4GB VRAM)
Model: Gemma 2B Q4_K_M (1.59 GiB)

Performance Summary

| Metric | CPU-only | GPU Heuristic | Improvement |
|---|---|---|---|
| Prompt processing (pp512) | 497 t/s | 1231 t/s | +147% |
| Token generation (tg128) | 19.4 t/s | 60.4 t/s | +212% |
| Layers offloaded | 0/27 | 26/27 | Auto-optimized |

Multi-Model Results

| Model | Size | Layers Offloaded | Performance |
|---|---|---|---|
| Gemma 2B | 1.6 GB | 26/27 (96%) | 2.5–3.1× faster |
| Llama 3.2 3B | 1.9 GB | 28/29 (97%) | ~2× faster |
| Llama 2 7B | 3.9 GB | 21/33 (64%) | 1.6× faster |

Key Insight: The heuristic maximizes offloading for small models while preventing OOM on larger models.
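
As a worked example of the table above: with roughly 3.4 GB of the 4 GB card reported as free (an assumed figure for illustration), the heuristic budgets about 3.4 GB − 0.8 GB ≈ 2.6 GB for weights; Llama 2 7B at 3.9 GB over 33 layers costs roughly 120 MB per layer, so about 21 layers fit, which matches the 21/33 result reported above.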

Testing

  • llama-bench: Verified 2.5–3.1× speedup on Gemma 2B
  • Multi-model: Tested on Gemma 2B, Llama 2 7B, Llama 2 13B
  • OOM Prevention: Larger models gracefully degrade (no crashes)
  • Platform: Windows 11, AMD RX 6500 XT
  • Cross-platform: Linux/macOS testing pending (the code is platform-agnostic, so I don't anticipate issues there)

Compliance

  • ✅ No third-party dependencies
  • ✅ Follows naming conventions (snake_case, longest prefix)
  • ✅ No ggml operators modified
  • ✅ Trailing whitespace cleaned
  • ✅ clang-format run

Maintainer

Requesting review from @0cc4m (Vulkan backend maintainer per CODEOWNERS).
Willing to maintain this feature long-term if accepted as a collaborator, and I hope to extend the same approach to whisper.cpp and ggml for the same motivations!

github-actions bot added labels on Nov 25, 2025: documentation (Improvements or additions to documentation), testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), examples, ggml (changes relating to the ggml tensor library for machine learning)
cafeTechne requested a review from 0cc4m as a code owner on November 25, 2025
Implements a dynamic VRAM allocation heuristic that automatically calculates
the optimal number of GPU layers to offload based on available VRAM.

Changes:
- Added ggml_backend_vk_get_device_info and ggml_backend_vk_get_default_gpu_layers to ggml-vulkan.cpp
- Added dynamic heuristic to common_model_params_to_llama in common.cpp
- Added llama-vk-device-info tool for inspecting Vulkan devices
- Added documentation in docs/vulkan_low_vram.md

Tested on AMD RX 6500 XT with 4GB VRAM, achieving 2.5-3.1x speedup.
cafeTechne force-pushed the vulkan-dynamic-vram-heuristic branch from 5c8ad2b to 03fe95d on November 27, 2025
cafeTechne force-pushed the vulkan-dynamic-vram-heuristic branch from 2273c93 to e8bf9ed on November 28, 2025
if (params.n_gpu_layers != -1) {
    mparams.n_gpu_layers = params.n_gpu_layers;
}
#ifdef GGML_USE_VULKAN
Collaborator

Backend-specific code has been long removed from outside the GGML library, you have to use the backend interface to access them in a generic way. You can already do that for total and free VRAM.

Have you looked into #16653 to see if that already covers your use case? That proposes a backend-agnostic way to automate setting parameters.

Collaborator

Thank you for covering the first part, but you have not answered the questions yet. Have you tried that PR?

Replaces Vulkan-specific calls with ggml_backend_dev_memory to be backend-agnostic. Reverts changes to ggml-vulkan.cpp/h and removes vk_device_info example to comply with reviewer feedback.
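
For context, a minimal sketch of the backend-agnostic query the revised patch moves to, using the ggml device registry; device selection and error handling are simplified here for illustration.

```cpp
#include <cstdio>

#include "ggml-backend.h"

int main() {
    ggml_backend_load_all(); // make sure all available backends are registered

    // Walk the registered devices and report memory for the first GPU found.
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        if (ggml_backend_dev_type(dev) != GGML_BACKEND_DEVICE_TYPE_GPU) {
            continue; // skip CPU and accelerator devices
        }
        size_t free_mem = 0, total_mem = 0;
        ggml_backend_dev_memory(dev, &free_mem, &total_mem);
        printf("%s: %zu MiB free of %zu MiB total\n",
               ggml_backend_dev_name(dev),
               free_mem / (1024 * 1024), total_mem / (1024 * 1024));
        break; // the first GPU is enough for this heuristic
    }
    return 0;
}
```

Because the query goes through the generic device interface, the same heuristic can work unchanged whichever backend drives the GPU.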
cafeTechne requested a review from 0cc4m on November 29, 2025
cafeTechne (Author) commented Dec 13, 2025 via email
