System Info
Hello,
I am writing to report an issue I noticed while evaluating, with lm-eval, the accuracy of a model quantized with FineGrainedFP8: there are significant accuracy discrepancies when the quantized model is deployed with the HF backend versus the vLLM backend.
Models Used:
Interestingly, the difference becomes much more pronounced on tasks that require generating many tokens (e.g., HumanEval, GSM8K) than on tasks that do not (e.g., MMLU). Given that the FineGrainedFP8-quantized model produces results in line with expectations when run with the vLLM backend, but not with the HF backend,
I suspect there may be an issue in FP8Linear (e.g., in the fp8 matmul Triton kernel).
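To help narrow this down, here is a minimal sketch of the kind of check I have in mind: compare the forward output of a single FP8Linear layer against a reference obtained by dequantizing its block-quantized weight and calling F.linear in bf16. The attribute names (weight, weight_scale_inv), the 128x128 block layout, and the multiply-to-dequantize convention are assumptions on my part and may need to be adapted to the actual FP8Linear implementation:

import torch
import torch.nn.functional as F

def dequantize_blockwise(w_fp8, scale_inv, block=128):
    # Expand each per-(block x block) scale over its tile, crop to the weight
    # shape, and dequantize the FP8 weight to bf16.
    # Assumption: dequantization is "fp8_value * scale"; the real format may differ.
    out_f, in_f = w_fp8.shape
    scales = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    scales = scales[:out_f, :in_f]
    return (w_fp8.to(torch.float32) * scales).to(torch.bfloat16)

@torch.no_grad()
def compare_fp8_linear(layer, x):
    # Reference path: dequantized weight + standard bf16 matmul.
    w_ref = dequantize_blockwise(layer.weight, layer.weight_scale_inv)
    y_ref = F.linear(x.to(torch.bfloat16), w_ref, getattr(layer, "bias", None))
    # Kernel path: the layer's own forward (the suspected fp8 matmul kernel).
    y_kernel = layer(x)
    diff = (y_kernel.float() - y_ref.float()).abs()
    print(f"max abs diff: {diff.max().item():.6f}, mean abs diff: {diff.mean().item():.6f}")

If the kernel output drifts noticeably from the dequantized reference for realistic activations, that would point at the matmul kernel rather than at the quantized weights themselves.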
I would appreciate it if you could take a look at this.
Thanks,
Sung Hyuck Hong
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Example command:
lm_eval \
--model hf \
--model_args pretrained="Qwen/Qwen3-8B-FP8" \
--tasks humaneval \
--batch_size auto \
--confirm_run_unsafe_code
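For comparison, the vLLM-backend run can be reproduced with roughly the following command (a sketch assuming lm-eval's vllm model type and a working vLLM install; additional vLLM model_args may be needed on your setup):
lm_eval \
--model vllm \
--model_args pretrained="Qwen/Qwen3-8B-FP8" \
--tasks humaneval \
--batch_size auto \
--confirm_run_unsafe_code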
Expected behavior
Accuracy results similar to those in the attached picture should be obtained.