Description
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 7414 (9d52f17)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
./build/bin/llama-fit-params -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf

Problem description & steps to reproduce
When running the above, I get a report that the entire model fits across the devices and that the only change needed is to reduce the context to ~182k (182230). (Note: there is also a suggested command line printed at the end, which comes up empty; that sounds like a bug too?)
But if I then load the model as suggested, even with a smaller context of 131072, it does not fit; see the sketch below and the relevant log output.
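A minimal repro sketch (both commands and the numbers in the comments are taken directly from the log output below):

./build/bin/llama-fit-params -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf
# reports "-c 182230 -ngl 999" as fitting, but the final "Printing fitted CLI arguments to stdout..." step prints nothing

./build/bin/llama-cli -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf -c 131072 -ngl 999
# fails with cudaMalloc out of memory on device 2 while allocating the KV cache, even though 131072 < 182230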
First Bad Commit
Relevant log output
green@epyc1:~/git/llama.cpp$ ./build/bin/llama-fit-params -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
build: 7414 (9d52f17ae) with GNU 15.2.1 for Linux x86_64
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 4090) : 24082 total, 21215 used, 2471 surplus
llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 4090) : 24082 total, 23634 used, 53 surplus
llama_params_fit_impl: - CUDA2 (NVIDIA GeForce RTX 4090) : 24082 total, 26845 used, 3158 deficit
llama_params_fit_impl: - CUDA3 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition): 97250 total, 100268 used, 3580 deficit
llama_params_fit_impl: - CUDA4 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition): 97250 total, 94730 used, 1958 surplus
llama_params_fit_impl: projected to use 266695 MiB of device memory vs. 266748 MiB of free device memory
llama_params_fit_impl: cannot fulfill margin of 1024 MiB on all devices, need to use 7375 MiB less in total
-c 182230 -ngl 999
llama_params_fit_impl: context size reduced from 202752 to 182230 -> need 7375 MiB less memory in total
llama_params_fit_impl: entire model can be fit across devices by reducing context
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.77 seconds
Printing fitted CLI arguments to stdout...
green@epyc1:~/git/llama.cpp$ ./build/bin/llama-cli -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf -c 131072 -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Loading model... |ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4608.00 MiB on device 2: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA2 buffer of size 4831838208
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
common_init_result: failed to create context with model '/usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf'
common_init_from_params: failed to create context with model '/usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf'
Segmentation fault (core dumped)
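For cross-checking, the actual per-device free VRAM at the time of the failed run can be compared against the tool's projection with nvidia-smi (assuming no other processes are holding GPU memory; output not attached here):

nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv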