fix: update vocab size of language model's config on resize - mllama #36591
Conversation
@zucchini-nlp the PR is ready for review; the changes have been made as discussed earlier (in the resolved conversation above).
zucchini-nlp
left a comment
Hey, sorry it took so long to come back!
After giving it more thought, I think the easiest solution is to stop passing labels to the language model and calculate the loss from within the ConditionalGeneration class. Something like:
transformers/src/transformers/models/gemma3/modeling_gemma3.py, lines 1366 to 1383 at a861db0:
```python
logits = outputs.logits
loss = None
if labels is not None:
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()
    shift_logits = logits[..., :-1, :]
    shift_labels = labels[..., 1:]
    if attention_mask is not None:
        # we use the input attention mask to shift the logits and labels, because it is 2D.
        # we also crop attn mask in case it is longer, which happens in PrefixTuning with peft
        shift_attention_mask = attention_mask[:, -shift_logits.shape[1] :].to(logits.device)
        shift_logits = shift_logits[shift_attention_mask.to(logits.device) != 0].contiguous()
        shift_labels = shift_labels[shift_attention_mask.to(shift_labels.device) != 0].contiguous()
    else:
        shift_logits = shift_logits.contiguous()
        shift_labels = shift_labels.contiguous()
    # Flatten the tokens
    loss_fct = nn.CrossEntropyLoss()
```
`self.loss_fn`
I realize this is not a perfect solution and doesn't scale, but we are planning a big refactor of multimodal models, and the vocab size issue for labels will be resolved by then. Therefore I don't want us to add more abstraction which might be deleted later on.
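For illustration, a minimal sketch of what computing the loss inside `MllamaForConditionalGeneration.forward` could look like. This is an assumption-laden simplification, not the code that ended up in #36840, and the helper name `compute_shifted_ce_loss` is made up here:

```python
import torch
import torch.nn as nn


def compute_shifted_ce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: shift logits/labels by one position and flatten."""
    logits = logits.float()  # upcast to avoid precision issues, as in the gemma3 snippet
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = nn.CrossEntropyLoss()
    return loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1).to(shift_logits.device),
    )


# Inside MllamaForConditionalGeneration.forward (sketch):
#     outputs = self.language_model(input_ids=..., attention_mask=..., ...)  # labels NOT passed down
#     logits = outputs.logits
#     loss = compute_shifted_ce_loss(logits, labels) if labels is not None else None
```

The point of this shape is that the language model never sees the labels, so its possibly stale `config.vocab_size` no longer matters for the loss computation.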
```python
new_vocab_size = config.get_text_config().vocab_size + 10
model.set_vocab_size(new_vocab_size)
self.assertEqual(model.language_model.get_vocab_size(), new_vocab_size)
# model's get_vocab_size returns the language model's vocab size
self.assertEqual(model.language_model.get_vocab_size(), model.get_vocab_size())
```
To be aligned with the linked issue, let's also test that the model can take labels and return a loss after resizing.
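Such a test could look roughly like the sketch below. It is not the actual test in #36840; `self.model_tester`, `torch_device`, and the input preparation are assumed from the usual transformers test scaffolding:

```python
def test_resize_embeddings_then_forward_with_labels(self):
    # Sketch only; assumes the common transformers test scaffolding for mllama.
    config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
    model = MllamaForConditionalGeneration(config).to(torch_device).eval()

    # Grow the vocabulary, then check the resized model still accepts labels.
    new_vocab_size = config.get_text_config().vocab_size + 10
    model.resize_token_embeddings(new_vocab_size)

    labels = inputs_dict["input_ids"].clone()
    outputs = model(**inputs_dict, labels=labels)

    # Loss should be computed without vocab-size mismatches after the resize.
    self.assertIsNotNone(outputs.loss)
    self.assertEqual(outputs.logits.shape[-1], new_vocab_size)
```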
Good call. I added the test in the follow-up PR #36840.
I also added you as a collaborator to my fork in case it's needed; the invite is pending.
(branch force-pushed from c4c64cc to 9be4728)
@zucchini-nlp sure, we can go with that approach instead. I created a new PR #36840 as I had to undo the changes here. If the solution looks OK, I can continue with the unit tests, thank you!
What does this PR do?
The PR updates the config of the mllama language model to keep its vocab size in sync. This is useful when the vocab size of the `MllamaForConditionalGeneration` class changes while resizing embeddings with `model.resize_token_embeddings(embedding_size)`. More details are included in the issue below. The PR includes one potential fix; there are several ways to do this, and suggestions are welcome!
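One way to express the idea is to propagate the new embedding size into the nested text config after a resize. The following is a sketch of that approach, not necessarily the exact code in this PR:

```python
# Inside MllamaForConditionalGeneration (sketch):
def resize_token_embeddings(self, new_num_tokens=None, pad_to_multiple_of=None):
    model_embeds = super().resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
    # Keep the nested text config in sync, so that code reading
    # config.get_text_config().vocab_size after the resize (e.g. label handling
    # in the language model) sees the updated value.
    self.config.text_config.vocab_size = model_embeds.weight.shape[0]
    return model_embeds
```

As discussed in the review above, the direction ultimately preferred was to compute the loss in the ConditionalGeneration class instead (see #36840), which sidesteps the stale vocab size entirely.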
Fixes #36590
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Models: