
Conversation

@joeldushouyu (Contributor)

Summary

This PR allows running vision models (tested with Gemma 3 4B) on the Hexagon NPU.

For now, it only supports running the FP16×FP32 matrix multiplication on the cDSP.
Note: I am fully aware that the current FP16×FP32 implementation is not optimal. For example, we could reduce redundant data movement by using VTCM as a cache, but I think that should go into a separate PR focused solely on optimization.
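
For reference, here is a minimal scalar sketch of what the FP16×FP32 matrix-vector product computes. The actual cDSP kernel in this PR uses HVX vectors, not this loop; ggml_fp16_t and GGML_FP16_TO_FP32 are ggml's half-precision type and conversion helper, assumed available from ggml's headers, and the function name is illustrative.

// Scalar reference only; illustrates the math, not the HVX implementation.
// Assumes ggml_fp16_t and GGML_FP16_TO_FP32 (ggml.h / ggml-impl.h, depending on ggml version).
#include "ggml.h"

static void mul_mat_f16_f32_ref(const ggml_fp16_t * w, // src0: FP16 weights, rows x cols
                                const float * x,       // src1: FP32 activations, cols
                                float * y,             // dst:  FP32 output, rows
                                int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float sum = 0.0f;
        for (int c = 0; c < cols; c++) {
            sum += GGML_FP16_TO_FP32(w[r * cols + c]) * x[c];
        }
        y[r] = sum;
    }
}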

Test

I used the F16 vision weights and Q4_0 language weights from Unsloth.

1. Build the Hexagon backend in Docker

cmake --preset arm64-android-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon
cmake --install build-snapdragon --prefix pkg-adb/llama.cpp

2. Push the weights to the phone (tested with a Samsung S25 Ultra)

adb push mmproj-F16.gguf /data/local/tmp/gguf
adb push gemma-3-4b-it-Q4_0.gguf /data/local/tmp/gguf
adb push hydro_1.png /data/local/tmp/gguf   #Image for testing 

3. Run the run-mtmd script

E=1 NDEV=1 D=HTP0 MTMD_DEVICE=HTP0 PROF=1 V=1 M=gemma-3-4b-it-Q4_0.gguf MMPROJ=mmproj-F16.gguf IMG=hydro_1.png ./scripts/snapdragon/adb/run-mtmd.sh -p '"What is in this image."'

@joeldushouyu changed the title from "Mtmd hexagon" to "ggml-hexagon: mm for mtmd" on Dec 9, 2025
@joeldushouyu joeldushouyu marked this pull request as ready for review December 9, 2025 22:27
@github-actions bot added the script (Script related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 10, 2025
@joeldushouyu (Contributor Author)

As I mentioned earlier, I think there's still a lot of room to optimize the FP16×FP32 kernel by taking advantage of features like VTCM and DMA. That said, is there any publicly available documentation on how to use the HMX instructions, the built-in matrix-multiplication hardware on the cDSP?

I noticed in the Hexagon SDK docs that the qhl_hmx library was removed starting from SDK 6.0. Is there a specific reason for its removal, and is there any plan to introduce a replacement or an updated HMX library? My impression is that VTCM can help reduce data redundancy, but the HMX systolic core should still offer better compute throughput than implementing matrix multiplies with HVX vector dot products.

@joeldushouyu (Contributor Author) commented Dec 10, 2025

Note: commit c73a2c0 is the fix needed to pass the ggml test cases run with the command below, mainly because the src0 data is non-contiguous in some of them.

HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
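
For context, the failing cases come down to src0 having to be addressed through its byte strides rather than as one packed buffer. Below is a hedged sketch of the kind of check involved; the helper functions are hypothetical, but ggml_is_contiguous() and the nb[]/ne[] tensor fields are ggml's.

// Hypothetical sketch: detect a non-contiguous src0 and address rows via
// the tensor's byte strides nb[] instead of assuming a flat, packed buffer.
#include "ggml.h"

static const void * src0_row(const struct ggml_tensor * t, int64_t i1, int64_t i2, int64_t i3) {
    // nb[0..3] are byte strides; for a contiguous tensor nb[1] == ne[0]*nb[0], etc.
    return (const char *) t->data + i1 * t->nb[1] + i2 * t->nb[2] + i3 * t->nb[3];
}

static void mul_mat_pick_path(const struct ggml_tensor * src0) {
    if (ggml_is_contiguous(src0)) {
        // fast path: rows are packed back to back in memory
    } else {
        // strided path: fetch each row through src0_row() before the dot product
    }
}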

@mediouni-m (Contributor)

Is there a specific reason for its removal, and is there any plan to introduce a replacement or an updated HMX library?

A replacement is in preview: https://softwarecenter.qualcomm.com/catalog/item/Hexagon_KL

That said, support is relatively narrow right now (v73/v75/v79), with no v81.

@max-krasnyansky (Collaborator) left a comment


Fixing F16 MUL_MAT has been on my TODO list for quite some time.
Thanks for making it functional!
It's nice to be able to enable it by default now and use with LMMs, etc.

I did a little sweep of the key models and overall the perf has improved a bit (probably because we claim a few more ops now).
There is a regression in gemma3n, but I bet it just hits that scalar_sum case, which will definitely be slow.
We should be able to fix that quickly. I'm going to merge this now and follow up with a cleanup pass to remove old comments and unused code. We should also be able to fix/remove the volatiles and other issues I was running into originally.

Here is the sweep on Gen5

Llama-3.2-3B-Instruct-Q4_0.gguf kv=q8_0             |  Before                  |  After                   |
   prompt eval time =    2319.13 ms /   205 tokens  |  88.40 tokens per second |  90.68 tokens per second |
          eval time =    2430.84 ms /    63 runs    |  25.92 tokens per second |  26.26 tokens per second |
Llama-3.2-3B-Instruct-Q4_0.gguf kv=f16
   prompt eval time =    2371.51 ms /   205 tokens  |  86.44 tokens per second |  90.31 tokens per second |
          eval time =    2434.16 ms /    63 runs    |  25.88 tokens per second |  25.91 tokens per second |
Llama-3.2-1B-Instruct-Q4_0.gguf kv=q8_0
   prompt eval time =     915.89 ms /   205 tokens  | 223.83 tokens per second | 234.61 tokens per second |
          eval time =    1070.99 ms /    63 runs    |  58.82 tokens per second |  59.71 tokens per second |
Llama-3.2-1B-Instruct-Q4_0.gguf kv=f16
   prompt eval time =     892.62 ms /   205 tokens  | 229.66 tokens per second | 238.01 tokens per second |
          eval time =    1075.86 ms /    63 runs    |  58.56 tokens per second |  59.03 tokens per second |
Qwen3-0.6B-Q4_0.gguf kv=q8_0
   prompt eval time =     776.50 ms /   204 tokens  | 262.72 tokens per second | 270.02 tokens per second |
          eval time =    1049.67 ms /    63 runs    |  60.02 tokens per second |  60.62 tokens per second |
Qwen3-0.6B-Q4_0.gguf kv=f16
   prompt eval time =     729.31 ms /   204 tokens  | 279.72 tokens per second | 284.76 tokens per second |
          eval time =    1030.91 ms /    63 runs    |  61.11 tokens per second |  60.61 tokens per second |
Qwen3-4B-Q4_0.gguf kv=q8_0
   prompt eval time =    3721.08 ms /   204 tokens  |  54.82 tokens per second |  56.79 tokens per second |
          eval time =    3312.48 ms /    63 runs    |  19.02 tokens per second |  19.02 tokens per second |
Qwen3-4B-Q4_0.gguf kv=f16
   prompt eval time =    3736.49 ms /   204 tokens  |  54.60 tokens per second |  56.69 tokens per second |
          eval time =    3269.32 ms /    63 runs    |  19.27 tokens per second |  19.43 tokens per second |
gemma-3n-E2B-it-Q4_0.gguf kv=q8_0
   prompt eval time =    2729.46 ms /   202 tokens  |  74.01 tokens per second |  52.16 tokens per second |
          eval time =    3363.18 ms /    63 runs    |  18.73 tokens per second |  16.96 tokens per second |
gemma-3n-E2B-it-Q4_0.gguf kv=f16
   prompt eval time =    2694.32 ms /   202 tokens  |  74.97 tokens per second |  52.22 tokens per second |
          eval time =    3322.40 ms /    63 runs    |  18.96 tokens per second |  16.85 tokens per second |
LFM2-1.2B-Q4_0.gguf kv=q8_0
   prompt eval time =     994.29 ms /   212 tokens  | 213.22 tokens per second | 220.73 tokens per second |
          eval time =     989.16 ms /    63 runs    |  63.69 tokens per second |  66.07 tokens per second |
LFM2-1.2B-Q4_0.gguf kv=f16
   prompt eval time =     977.25 ms /   212 tokens  | 216.93 tokens per second | 221.07 tokens per second |
          eval time =     978.9  ms /    63 runs    |  64.35 tokens per second |  65.39 tokens per second |

@max-krasnyansky (Collaborator)

As I mentioned earlier, I think there's still a lot of room to optimize the FP16×FP32 kernel by taking advantage of features like VTCM and DMA. That said, is there any publicly available documentation on how to use the HMX instructions, the built-in matrix-multiplication hardware on the cDSP?

VTCM will probably not help unless we operate on FP16 weights, but we can certainly add some l2fetch prefetches and optimize things in general. That was my original plan when I did the first cut of the F16 version.

The HMX version will require VTCM so we could revisit that as part of enabling HMX.
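
As a rough illustration of the l2fetch idea (not code from this PR): the usual pattern is a small wrapper around the Hexagon l2fetch instruction that prefetches the next block of weights into L2 while the current block is being processed. The control-word layout below (direction, stride, width, height packed as 16-bit fields) follows the Hexagon programmer's reference, and the function name and usage line are made up.

// Hypothetical sketch of an L2 prefetch helper for the FP16 weight rows.
#include <stdint.h>

static inline void l2fetch_block(const void * addr, uint32_t stride,
                                 uint32_t width, uint32_t height, uint32_t dir) {
    // Pack direction, stride, width and height into the 64-bit control word
    // expected by l2fetch(Rs, Rtt).
    uint64_t control = ((uint64_t) dir    << 48) |
                       ((uint64_t) stride << 32) |
                       ((uint64_t) width  << 16) |
                       ((uint64_t) height);
    __asm__ __volatile__ ("l2fetch(%0, %1)" : : "r"(addr), "r"(control));
}

// e.g. prefetch the next weight row while the HVX loop works on the current one:
// l2fetch_block(w_next_row, row_bytes, row_bytes, 1, 0);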

@max-krasnyansky merged commit c45f89d into ggml-org:master on Dec 15, 2025 (68 of 69 checks passed)
