System Info
Hello,
I am writing to report an issue I noticed while evaluating, with lm-eval, the accuracy of a model quantized with FineGrainedFP8: there are significant accuracy discrepancies when the quantized model is deployed with the HF backend versus the vLLM backend.
Models Used:
Interestingly, the difference becomes much more pronounced on tasks that require generating many tokens (e.g., HumanEval, GSM8K) than on tasks that do not (e.g., MMLU). Given that the FineGrainedFP8-quantized model produces results in line with expectations when run with the vLLM backend, but not with the HF backend,
I suspect there may be an issue in FP8Linear (e.g., in the fp8 matmul Triton kernel).
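To help narrow this down, here is a minimal sketch of the kind of check I have in mind: compare the forward output of a single FP8Linear layer against a reference obtained by dequantizing its block-quantized weight and calling F.linear in bf16. The attribute names (weight, weight_scale_inv), the 128x128 block layout, and the multiply-to-dequantize convention are assumptions on my part and may need to be adapted to the actual FP8Linear implementation:

import torch
import torch.nn.functional as F

def dequantize_blockwise(w_fp8, scale_inv, block=128):
    # Expand each per-(block x block) scale over its tile, crop to the weight
    # shape, and dequantize the FP8 weight to bf16.
    # Assumption: dequantization is "fp8_value * scale"; the real format may differ.
    out_f, in_f = w_fp8.shape
    scales = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    scales = scales[:out_f, :in_f]
    return (w_fp8.to(torch.float32) * scales).to(torch.bfloat16)

@torch.no_grad()
def compare_fp8_linear(layer, x):
    # Reference path: dequantized weight + standard bf16 matmul.
    w_ref = dequantize_blockwise(layer.weight, layer.weight_scale_inv)
    y_ref = F.linear(x.to(torch.bfloat16), w_ref, getattr(layer, "bias", None))
    # Kernel path: the layer's own forward (the suspected fp8 matmul kernel).
    y_kernel = layer(x)
    diff = (y_kernel.float() - y_ref.float()).abs()
    print(f"max abs diff: {diff.max().item():.6f}, mean abs diff: {diff.mean().item():.6f}")

If the kernel output drifts noticeably from the dequantized reference for realistic activations, that would point at the matmul kernel rather than at the quantized weights themselves.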
I would appreciate it if you could take a look at this.
Thanks,
Sung Hyuck Hong
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Example command:
lm_eval \
--model hf \
--model_args pretrained="Qwen/Qwen3-8B-FP8" \
--tasks humaneval \
--batch_size auto \
--confirm_run_unsafe_code
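For comparison, the vLLM-backend run can be reproduced with roughly the following command (a sketch assuming lm-eval's vllm model type and a working vLLM install; additional vLLM model_args may be needed on your setup):
lm_eval \
--model vllm \
--model_args pretrained="Qwen/Qwen3-8B-FP8" \
--tasks humaneval \
--batch_size auto \
--confirm_run_unsafe_code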
Expected behavior
Accuracy results similar to those in the attached picture should be obtained.