
Misc. bug: llama-fit-params works unreliably for models that "fit" on GPUs (or so it thinks) #18066

@verygreen

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 7414 (9d52f17)
built with GNU 15.2.1 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

./build/bin/llama-fit-params -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf

Problem description & steps to reproduce

When running the above, I get a suggestion that nothing needs to be done and the model fits; all I supposedly need to do is reduce the context to about 180k. (Note that the suggested command line printed at the end also comes up empty, which sounds like a bug too?)

But if I then run the model, even with a smaller context (-c 131072) than suggested, it does not fit:
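
For reference, the workflow I expected (as I understand it) is to capture the fitted arguments that llama-fit-params prints to stdout and pass them on to llama-cli. A minimal sketch, assuming the tool really does print only the fitted flags to stdout (in my run that final output came up empty):

    # capture the fitted CLI arguments (e.g. "-c 182230 -ngl 999") from stdout
    FITTED_ARGS=$(./build/bin/llama-fit-params -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf)
    # reuse them for the actual run
    ./build/bin/llama-cli -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf $FITTED_ARGS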

First Bad Commit

b1f3a6e

Relevant log output

green@epyc1:~/git/llama.cpp$ ./build/bin/llama-fit-params -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 4: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
build: 7414 (9d52f17ae) with GNU 15.2.1 for Linux x86_64
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 4090)                          :  24082 total,  21215 used,   2471 surplus
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 4090)                          :  24082 total,  23634 used,     53 surplus
llama_params_fit_impl:   - CUDA2 (NVIDIA GeForce RTX 4090)                          :  24082 total,  26845 used,   3158 deficit
llama_params_fit_impl:   - CUDA3 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition):  97250 total, 100268 used,   3580 deficit
llama_params_fit_impl:   - CUDA4 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition):  97250 total,  94730 used,   1958 surplus
llama_params_fit_impl: projected to use 266695 MiB of device memory vs. 266748 MiB of free device memory
llama_params_fit_impl: cannot fulfill margin of 1024 MiB on all devices, need to use 7375 MiB less in total
-c 182230 -ngl 999
llama_params_fit_impl: context size reduced from 202752 to 182230 -> need 7375 MiB less memory in total
llama_params_fit_impl: entire model can be fit across devices by reducing context
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.77 seconds
Printing fitted CLI arguments to stdout...
green@epyc1:~/git/llama.cpp$ ./build/bin/llama-cli -m /usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf -c 131072 -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 4: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

Loading model... |ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4608.00 MiB on device 2: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA2 buffer of size 4831838208
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
common_init_result: failed to create context with model '/usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf'
common_init_from_params: failed to create context with model '/usr/local/ai/models/GLM4.6/UD-Q4_K_XL/GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf'
Segmentation fault (core dumped)
