
Conversation

@joeldushouyu (Contributor)

Summary

This PR allows running vision models (tested with Gemma 3 4B) on the Hexagon NPU.

For now, it only supports running the FP16×FP32 matrix multiplication on the cDSP.
Note: I am fully aware that the current FP16×FP32 implementation is not optimal. For example, we could reduce redundant data movement by using VTCM as a cache, but I think that should go into a separate PR focused solely on optimization.
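
For reference, here is a minimal scalar sketch of what the FP16×FP32 matrix-vector product computes. The actual cDSP kernel in this PR uses HVX vectors, not this loop; ggml_fp16_t and GGML_FP16_TO_FP32 are ggml's half-precision type and conversion helper, assumed available from ggml's headers, and the function name is illustrative.

// Scalar reference only; illustrates the math, not the HVX implementation.
// Assumes ggml_fp16_t and GGML_FP16_TO_FP32 (ggml.h / ggml-impl.h, depending on ggml version).
#include "ggml.h"

static void mul_mat_f16_f32_ref(const ggml_fp16_t * w, // src0: FP16 weights, rows x cols
                                const float * x,       // src1: FP32 activations, cols
                                float * y,             // dst:  FP32 output, rows
                                int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float sum = 0.0f;
        for (int c = 0; c < cols; c++) {
            sum += GGML_FP16_TO_FP32(w[r * cols + c]) * x[c];
        }
        y[r] = sum;
    }
}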

Test

I used the F16 vision weights and Q4_0 language weights from Unsloth.

1. Build the Hexagon backend in Docker

cmake --preset arm64-android-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon
cmake --install build-snapdragon --prefix pkg-adb/llama.cpp

2. Push the weights to the phone (tested with a Samsung S25 Ultra)

adb push mmproj-F16.gguf /data/local/tmp/gguf
adb push gemma-3-4b-it-Q4_0.gguf /data/local/tmp/gguf
adb push hydro_1.png /data/local/tmp/gguf   #Image for testing 

3. Run the run-mtmd script

E=1 NDEV=1 D=HTP0 MTMD_DEVICE=HTP0 PROF=1 V=1 M=gemma-3-4b-it-Q4_0.gguf MMPROJ=mmproj-F16.gguf IMG=hydro_1.png ./scripts/snapdragon/adb/run-mtmd.sh -p '"What is in this image."'

@joeldushouyu changed the title from "Mtmd hexagon" to "ggml-hexagon: mm for mtmd" on Dec 9, 2025
@joeldushouyu joeldushouyu marked this pull request as ready for review December 9, 2025 22:27
@github-actions bot added the script (Script related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 10, 2025
@joeldushouyu (Contributor Author)

As I mentioned earlier, I think there's still a lot of room to optimize the FP16×FP32 kernel by taking advantage of features like VTCM and DMA. That said, is there any publicly available documentation on how to use the HMX instructions, the built-in matrix-multiplication hardware on the cDSP?

I noticed in the Hexagon SDK docs that the qhl_hmx library was removed starting from SDK 6.0. Is there a specific reason for its removal, and is there any plan to introduce a replacement or an updated HMX library? My impression is that VTCM can help reduce data redundancy, but the HMX systolic core should still offer better compute throughput than implementing matrix multiplies with HVX vector dot products.

@joeldushouyu (Contributor Author) commented Dec 10, 2025

Note: commit c73a2c0 is the fix needed to pass the ggml test cases run with the command below, mainly because the src0 data is non-contiguous in some of them.

HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
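
For context, the failing cases come down to src0 having to be addressed through its byte strides rather than as one packed buffer. Below is a hedged sketch of the kind of check involved; the helper functions are hypothetical, but ggml_is_contiguous() and the nb[]/ne[] tensor fields are ggml's.

// Hypothetical sketch: detect a non-contiguous src0 and address rows via
// the tensor's byte strides nb[] instead of assuming a flat, packed buffer.
#include "ggml.h"

static const void * src0_row(const struct ggml_tensor * t, int64_t i1, int64_t i2, int64_t i3) {
    // nb[0..3] are byte strides; for a contiguous tensor nb[1] == ne[0]*nb[0], etc.
    return (const char *) t->data + i1 * t->nb[1] + i2 * t->nb[2] + i3 * t->nb[3];
}

static void mul_mat_pick_path(const struct ggml_tensor * src0) {
    if (ggml_is_contiguous(src0)) {
        // fast path: rows are packed back to back in memory
    } else {
        // strided path: fetch each row through src0_row() before the dot product
    }
}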

@mediouni-m (Contributor)

Is there a specific reason for its removal, and is there any plan to introduce a replacement or an updated HMX library?

A replacement is in preview: https://softwarecenter.qualcomm.com/catalog/item/Hexagon_KL

That said, support is relatively narrow right now (v73/v75/v79), with no v81.

@max-krasnyansky (Collaborator) left a comment


Fixing F16 MUL_MAT has been on my TODO list for quite some time.
Thanks for making it functional!
It's nice to be able to enable it by default now and use with LMMs, etc.

I did a little sweep of the key models and overall the perf has improved a bit (probably because we claim a few more ops now).
There is a regression in gemma3n, but I bet it just hits that scalar_sum case, which will definitely be slow.
We should be able to fix that quickly. I'm going to merge this now and follow up with a cleanup pass to remove old comments and unused code. We should also be able to fix/remove the volatiles and other issues I was running into originally.

Here is the sweep on Gen5

Llama-3.2-3B-Instruct-Q4_0.gguf kv=q8_0             |  Before                  |  After                   |
   prompt eval time =    2319.13 ms /   205 tokens  |  88.40 tokens per second |  90.68 tokens per second |
          eval time =    2430.84 ms /    63 runs    |  25.92 tokens per second |  26.26 tokens per second |
Llama-3.2-3B-Instruct-Q4_0.gguf kv=f16
   prompt eval time =    2371.51 ms /   205 tokens  |  86.44 tokens per second |  90.31 tokens per second |
          eval time =    2434.16 ms /    63 runs    |  25.88 tokens per second |  25.91 tokens per second |
Llama-3.2-1B-Instruct-Q4_0.gguf kv=q8_0
   prompt eval time =     915.89 ms /   205 tokens  | 223.83 tokens per second | 234.61 tokens per second |
          eval time =    1070.99 ms /    63 runs    |  58.82 tokens per second |  59.71 tokens per second |
Llama-3.2-1B-Instruct-Q4_0.gguf kv=f16
   prompt eval time =     892.62 ms /   205 tokens  | 229.66 tokens per second | 238.01 tokens per second |
          eval time =    1075.86 ms /    63 runs    |  58.56 tokens per second |  59.03 tokens per second |
Qwen3-0.6B-Q4_0.gguf kv=q8_0
   prompt eval time =     776.50 ms /   204 tokens  | 262.72 tokens per second | 270.02 tokens per second |
          eval time =    1049.67 ms /    63 runs    |  60.02 tokens per second |  60.62 tokens per second |
Qwen3-0.6B-Q4_0.gguf kv=f16
   prompt eval time =     729.31 ms /   204 tokens  | 279.72 tokens per second | 284.76 tokens per second |
          eval time =    1030.91 ms /    63 runs    |  61.11 tokens per second |  60.61 tokens per second |
Qwen3-4B-Q4_0.gguf kv=q8_0
   prompt eval time =    3721.08 ms /   204 tokens  |  54.82 tokens per second |  56.79 tokens per second |
          eval time =    3312.48 ms /    63 runs    |  19.02 tokens per second |  19.02 tokens per second |
Qwen3-4B-Q4_0.gguf kv=f16
   prompt eval time =    3736.49 ms /   204 tokens  |  54.60 tokens per second |  56.69 tokens per second |
          eval time =    3269.32 ms /    63 runs    |  19.27 tokens per second |  19.43 tokens per second |
gemma-3n-E2B-it-Q4_0.gguf kv=q8_0
   prompt eval time =    2729.46 ms /   202 tokens  |  74.01 tokens per second |  52.16 tokens per second |
          eval time =    3363.18 ms /    63 runs    |  18.73 tokens per second |  16.96 tokens per second |
gemma-3n-E2B-it-Q4_0.gguf kv=f16
   prompt eval time =    2694.32 ms /   202 tokens  |  74.97 tokens per second |  52.22 tokens per second |
          eval time =    3322.40 ms /    63 runs    |  18.96 tokens per second |  16.85 tokens per second |
LFM2-1.2B-Q4_0.gguf kv=q8_0
   prompt eval time =     994.29 ms /   212 tokens  | 213.22 tokens per second | 220.73 tokens per second |
          eval time =     989.16 ms /    63 runs    |  63.69 tokens per second |  66.07 tokens per second |
LFM2-1.2B-Q4_0.gguf kv=f16
   prompt eval time =     977.25 ms /   212 tokens  | 216.93 tokens per second | 221.07 tokens per second |
          eval time =     978.9  ms /    63 runs    |  64.35 tokens per second |  65.39 tokens per second |

@max-krasnyansky (Collaborator)

As I mentioned earlier, I think there's still a lot of room to optimize the FP16×FP32 kernel by taking advantage of features like VTCM and DMA. That said, is there any publicly available documentation on how to use the HMX instructions, the built-in matrix-multiplication hardware on the cDSP?

VTCM will probably not help unless we operate on FP16 weights, but we can certainly add some l2fetch prefetches and optimize things in general. That was my original plan when I did the first cut of the F16 version.

The HMX version will require VTCM so we could revisit that as part of enabling HMX.
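
As a rough illustration of the l2fetch idea (not code from this PR): the usual pattern is a small wrapper around the Hexagon l2fetch instruction that prefetches the next block of weights into L2 while the current block is being processed. The control-word layout below (direction, stride, width, height packed as 16-bit fields) follows the Hexagon programmer's reference, and the function name and usage line are made up.

// Hypothetical sketch of an L2 prefetch helper for the FP16 weight rows.
#include <stdint.h>

static inline void l2fetch_block(const void * addr, uint32_t stride,
                                 uint32_t width, uint32_t height, uint32_t dir) {
    // Pack direction, stride, width and height into the 64-bit control word
    // expected by l2fetch(Rs, Rtt).
    uint64_t control = ((uint64_t) dir    << 48) |
                       ((uint64_t) stride << 32) |
                       ((uint64_t) width  << 16) |
                       ((uint64_t) height);
    __asm__ __volatile__ ("l2fetch(%0, %1)" : : "r"(addr), "r"(control));
}

// e.g. prefetch the next weight row while the HVX loop works on the current one:
// l2fetch_block(w_next_row, row_bytes, row_bytes, 1, 0);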

@max-krasnyansky merged commit c45f89d into ggml-org:master on Dec 15, 2025 (68 of 69 checks passed)
