ggml-hexagon: mm for mtmd #17894
Conversation
As I mentioned earlier, I think there's still a lot of room to optimize the FP16×FP32 kernel by taking advantage of features like VTCM and DMA. That said, I'm trying to figure out whether there is any publicly available documentation on how to use the HMX instructions (the built-in matrix-multiplication hardware on the CDSP). I noticed in the Hexagon SDK docs that the qhl_hmx library was removed starting from SDK 6.0. Is there a specific reason for its removal, and is there any plan to introduce a replacement or an updated HMX library? My impression is that VTCM can help reduce data redundancy, but the HMX systolic core should still offer better compute throughput than implementing matrix multiplies with HVX vector dot products.
Note: commit c73a2c0 is the patch that fixes the failing ggml test case when running `HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT`.
A replacement is in preview: https://softwarecenter.qualcomm.com/catalog/item/Hexagon_KL. That said, support is relatively narrow right now (v73/v75/v79, no v81).
max-krasnyansky
left a comment
Fixing F16 MUL_MAT has been on my TODO list for quite some time.
Thanks for making it functional!
It's nice to be able to enable it by default now and use with LMMs, etc.
I did a little sweep of the key models and overall the perf has improved a bit (probably because we claim a few more ops now).
There is a regression in gemma3n, but I bet it just hits the scalar_sum case, which will definitely be slow. We should be able to fix that quickly. I'm going to merge this now and follow up with a cleanup pass to remove old comments and unused code. We should be able to fix/remove the volatiles and other issues I was running into originally.
Here is the sweep on Gen5
| Model | KV cache | Metric | Before | After |
|---|---|---|---|---|
| Llama-3.2-3B-Instruct-Q4_0 | q8_0 | prompt eval (2319.13 ms / 205 tokens) | 88.40 tok/s | 90.68 tok/s |
| Llama-3.2-3B-Instruct-Q4_0 | q8_0 | eval (2430.84 ms / 63 runs) | 25.92 tok/s | 26.26 tok/s |
| Llama-3.2-3B-Instruct-Q4_0 | f16 | prompt eval (2371.51 ms / 205 tokens) | 86.44 tok/s | 90.31 tok/s |
| Llama-3.2-3B-Instruct-Q4_0 | f16 | eval (2434.16 ms / 63 runs) | 25.88 tok/s | 25.91 tok/s |
| Llama-3.2-1B-Instruct-Q4_0 | q8_0 | prompt eval (915.89 ms / 205 tokens) | 223.83 tok/s | 234.61 tok/s |
| Llama-3.2-1B-Instruct-Q4_0 | q8_0 | eval (1070.99 ms / 63 runs) | 58.82 tok/s | 59.71 tok/s |
| Llama-3.2-1B-Instruct-Q4_0 | f16 | prompt eval (892.62 ms / 205 tokens) | 229.66 tok/s | 238.01 tok/s |
| Llama-3.2-1B-Instruct-Q4_0 | f16 | eval (1075.86 ms / 63 runs) | 58.56 tok/s | 59.03 tok/s |
| Qwen3-0.6B-Q4_0 | q8_0 | prompt eval (776.50 ms / 204 tokens) | 262.72 tok/s | 270.02 tok/s |
| Qwen3-0.6B-Q4_0 | q8_0 | eval (1049.67 ms / 63 runs) | 60.02 tok/s | 60.62 tok/s |
| Qwen3-0.6B-Q4_0 | f16 | prompt eval (729.31 ms / 204 tokens) | 279.72 tok/s | 284.76 tok/s |
| Qwen3-0.6B-Q4_0 | f16 | eval (1030.91 ms / 63 runs) | 61.11 tok/s | 60.61 tok/s |
| Qwen3-4B-Q4_0 | q8_0 | prompt eval (3721.08 ms / 204 tokens) | 54.82 tok/s | 56.79 tok/s |
| Qwen3-4B-Q4_0 | q8_0 | eval (3312.48 ms / 63 runs) | 19.02 tok/s | 19.02 tok/s |
| Qwen3-4B-Q4_0 | f16 | prompt eval (3736.49 ms / 204 tokens) | 54.60 tok/s | 56.69 tok/s |
| Qwen3-4B-Q4_0 | f16 | eval (3269.32 ms / 63 runs) | 19.27 tok/s | 19.43 tok/s |
| gemma-3n-E2B-it-Q4_0 | q8_0 | prompt eval (2729.46 ms / 202 tokens) | 74.01 tok/s | 52.16 tok/s |
| gemma-3n-E2B-it-Q4_0 | q8_0 | eval (3363.18 ms / 63 runs) | 18.73 tok/s | 16.96 tok/s |
| gemma-3n-E2B-it-Q4_0 | f16 | prompt eval (2694.32 ms / 202 tokens) | 74.97 tok/s | 52.22 tok/s |
| gemma-3n-E2B-it-Q4_0 | f16 | eval (3322.40 ms / 63 runs) | 18.96 tok/s | 16.85 tok/s |
| LFM2-1.2B-Q4_0 | q8_0 | prompt eval (994.29 ms / 212 tokens) | 213.22 tok/s | 220.73 tok/s |
| LFM2-1.2B-Q4_0 | q8_0 | eval (989.16 ms / 63 runs) | 63.69 tok/s | 66.07 tok/s |
| LFM2-1.2B-Q4_0 | f16 | prompt eval (977.25 ms / 212 tokens) | 216.93 tok/s | 221.07 tok/s |
| LFM2-1.2B-Q4_0 | f16 | eval (978.90 ms / 63 runs) | 64.35 tok/s | 65.39 tok/s |
VTCM will probably not help unless we operate on fp16 weights, but we can certainly add some l2fetches and generally optimize things. That was my original plan when I did the first cut of the F16 version. The HMX version will require VTCM, so we could revisit that as part of enabling HMX.
Summary
This PR allows running vision models (tested with Gemma 3 4B) on the Hexagon NPU.
For now, it only supports using the CDSP for FP16×FP32 matrix multiplication.
Note: I am fully aware that the current FP16×FP32 implementation is not optimal. For example, we could easily reduce unnecessary data repetition by using VTCM as a cache, but that should probably go into a separate PR focused solely on optimization.
Test
I used the F16 vision weights and Q4_0 language weights from unsloth.
1. Build hexagon in docker.
2. Push the weights to the phone (tested with a Samsung S25 Ultra):
```
adb push mmproj-F16.gguf /data/local/tmp/gguf
adb push gemma-3-4b-it-Q4_0.gguf /data/local/tmp/gguf
adb push hydro_1.png /data/local/tmp/gguf   # image for testing
```
3. Run the run-mtmd script:
```
E=1 NDEV=1 D=HTP0 MTMD_DEVICE=HTP0 PROF=1 V=1 M=gemma-3-4b-it-Q4_0.gguf MMPROJ=mmproj-F16.gguf IMG=hydro_1.png ./scripts/snapdragon/adb/run-mtmd.sh -p '"What is in this image."'
```