Fixes Flash Attention implementation for models #42149
Conversation
vasqu left a comment:
I'd like to fix idefics2 properly or leave it out for now, the others lgtm. Will check with run-slow a bit later
# The call to `_upad_input` in `_flash_attention_forward` is expensive
# So when the `patch_attention_mask` is full of 1s (i.e. attending to the whole sequence),
# avoiding passing the attention_mask, which is equivalent to attending to the full sequence
Let's remove these comments too
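For context, the fast path those quoted comments describe looks roughly like the sketch below (a paraphrase with a hypothetical helper name, not the actual idefics2 modeling code): when the patch mask is all 1s, the mask is dropped entirely so flash attention can skip the costly unpad/repad step.

```python
import torch

def maybe_drop_full_mask(patch_attention_mask: torch.Tensor):
    """Hypothetical helper mirroring the shortcut the removed comments refer to.

    An all-ones mask means every position is attended to, so passing None is
    equivalent and lets `_flash_attention_forward` skip the expensive
    `_upad_input` call.
    """
    if not torch.any(~patch_attention_mask.bool()):
        return None
    return patch_attention_mask
```

Presumably the comments can go because this kind of backend-specific branching now lives in the shared masking utilities rather than being re-implemented in each model.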
from ...cache_utils import Cache, DynamicCache
from ...generation import GenerationMixin
from ...masking_utils import create_bidirectional_mask
from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
Seems like `_prepare_4d_attention_mask` is still used, and it will likely cause similar issues at other points; best to completely remove this usage elsewhere too!
See transformers/src/transformers/models/idefics2/modeling_idefics2.py, lines 746 to 750 at 3ff0e69:
attention_mask = (
    _prepare_4d_attention_mask(attention_mask, latents.dtype, tgt_len=self.n_latents)
    if self.config._attn_implementation != "flash_attention_2"
    else attention_mask
)
(Looks like the flag for that model there also needs to be updated to _supports_flash_attn.)
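For illustration, the replacement being suggested might look roughly like the sketch below. It assumes create_bidirectional_mask (already imported in the diff above) accepts the model config, the query-side embeddings, and the 2D attention mask; the exact signature should be double-checked against masking_utils.py.

```python
from transformers.masking_utils import create_bidirectional_mask

# Before: backend-specific branching, expanding to 4D for everything except FA2.
# attention_mask = (
#     _prepare_4d_attention_mask(attention_mask, latents.dtype, tgt_len=self.n_latents)
#     if self.config._attn_implementation != "flash_attention_2"
#     else attention_mask
# )

# After (sketch): a single call that builds whatever the configured attention
# backend expects (flash_attention_2, sdpa, eager), so the explicit
# `_attn_implementation` check disappears.
attention_mask = create_bidirectional_mask(
    config=self.config,             # assumed keyword, see masking_utils.py
    input_embeds=inputs_embeds,     # query-side embeddings
    attention_mask=attention_mask,  # original 2D (batch, seq_len) mask or None
)
```

Note that this quoted resampler spot is trickier than the plain vision towers, since its query length (n_latents) differs from the key/value length; that mismatch is what the rest of the thread ends up running into. The flag change is presumably just swapping the old `_supports_flash_attn_2 = True` class attribute for `_supports_flash_attn = True`.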
So do I just reset all the changes from idefics2?
Yea, either that or use the create_bidirectional_mask fn here as well if it works; if not, I'd appreciate knowing that too. Means I need to take a look here :D
Alright, I can't run the testing script as I don't have enough VRAM, but is this the correct approach?
Tests seem to be failing, so I don't think it's the correct one.
[For maintainers] Suggested jobs to run (before merge): run-slow: idefics3, smolvlm
zucchini-nlp left a comment:
LGTM! I think there were more models which still check for flash_attention_2; would be nice to batch-update them all. It can totally go in a separate PR later :)
run-slow: idefics3, smolvlm
This comment contains models: ["models/idefics3", "models/smolvlm"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
CI Results: ✅ No failing test specific to this PR 🎉!
vasqu left a comment:
Thank you, let's keep it simple for now. We can address more models in separate PRs
Hmm, it might need a dummy embedding of the correct size along the latents. Would leave this for a different PR.
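One possible reading of the dummy-embedding idea, as a loose sketch only: build a zero tensor whose sequence length matches the latents and hand it to create_bidirectional_mask as the query-side embeddings. Whether the utility supports a query length that differs from the key/value length here is exactly the open question, hence deferring it to another PR.

```python
import torch
from transformers.masking_utils import create_bidirectional_mask

# Hypothetical: `attention_mask`, `config` and `n_latents` come from the
# resampler's forward; the placeholder only needs to carry the query length.
batch_size = attention_mask.shape[0]
dummy_latents = torch.zeros(batch_size, n_latents, 1, device=attention_mask.device)

resampler_mask = create_bidirectional_mask(
    config=config,
    input_embeds=dummy_latents,      # query side: n_latents positions
    attention_mask=attention_mask,   # 2D mask over the key/value side
)
```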
Also, thx for the PR 🤗
Yeah, I didn't want to make a whole lot of changes in a single PR. I find it very confusing; sorry if that wasn't the ideal choice.
What does this PR do?
Fixes the issue where models use an outdated `if self.config._attn_implementation != "flash_attention_2":` check.
Models changed: SmolVLM, idefics3, idefics2
Fixes #42121
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@zucchini-nlp @vasqu