Ministral-3-8B-Instruct tokenizer doesn't handle BPE markers properly #42796

@yanancai

Description

System Info

Code:
from transformers import AutoTokenizer, AutoModelForImageTextToText, AutoProcessor
import torch

base_model = "mistralai/Ministral-3-8B-Instruct-2512-BF16"

model = AutoModelForImageTextToText.from_pretrained(base_model, dtype=torch.bfloat16)
model = model.to("cuda:1")
tokenizer = AutoProcessor.from_pretrained(base_model)

user_prompt = "hello how are you?"
messages = [
    {"role": "user", "content": user_prompt},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text=text, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
decoded_output = tokenizer.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]
print(decoded_output)

Output:
Hello!ĠðŁĺĬĠI'mĠjustĠaĠvirtualĠassistant,ĠsoĠIĠdon'tĠhaveĠfeelings,ĠbutĠI'mĠhereĠandĠreadyĠtoĠhelpĠyouĠwithĠanythingĠyouĠneed!ĠHowĠaboutĠyouâĢĶhowĠareĠyouĠdoingĠtoday?ĠAnythingĠfunĠorĠinterestingĠon
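For reference, these are the byte-level BPE surface forms leaking through: Ġ is the byte-level encoding of a space, âĢĶ of an em dash, and ðŁĺĬ of the 😊 emoji. The sketch below reverses the standard GPT-2 byte-to-unicode table and recovers readable text from the string above (it assumes the tokenizer uses that standard table, which is an assumption, not something confirmed here):

# Sketch: undo the GPT-2 byte-level BPE mapping to make the garbled decode readable.
# Assumes the standard byte <-> unicode table used by byte-level BPE tokenizers.
def bytes_to_unicode():
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}

garbled = "Hello!ĠðŁĺĬĠI'mĠjustĠaĠvirtualĠassistant"
recovered = bytearray(byte_decoder[ch] for ch in garbled).decode("utf-8", errors="replace")
print(recovered)  # -> Hello! 😊 I'm just a virtual assistant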

Environment:
Python: 3.12.7
transformers: 5.0.0.dev0 (installed from main branch)
torch: 2.9.0
mistral_common: 1.8.6

The same code works when the tokenizer is loaded with MistralCommonBackend:
Code:
import torch
from transformers import AutoModelForImageTextToText, MistralCommonBackend

base_model = "mistralai/Ministral-3-8B-Instruct-2512-BF16"

tokenizer = MistralCommonBackend.from_pretrained(base_model)
model = AutoModelForImageTextToText.from_pretrained(
    base_model, torch_dtype=torch.bfloat16
)
model = model.to("cuda:2")

user_prompt = "hello how are you?"
messages = [
    {"role": "user", "content": user_prompt},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text=text, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
decoded_output = tokenizer.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]
print(decoded_output)

Output:
Hello! I'm just a program, so I don't have feelings, but I'm here and ready to help you with anything you need. How about you? How are you doing today?[😊]
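Not verified here, but a quick way to narrow down whether the two paths diverge at encoding or at decoding is to run the same prompt through both tokenizers, compare the token ids, and then decode the same ids with each. A sketch using the model id and classes from the snippets above:

# Sketch: check whether the AutoProcessor tokenizer and MistralCommonBackend
# disagree on encoding, decoding, or both.
from transformers import AutoProcessor, MistralCommonBackend

base_model = "mistralai/Ministral-3-8B-Instruct-2512-BF16"
hf_tok = AutoProcessor.from_pretrained(base_model)
mc_tok = MistralCommonBackend.from_pretrained(base_model)

messages = [{"role": "user", "content": "hello how are you?"}]
hf_text = hf_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
mc_text = mc_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

hf_ids = hf_tok(text=hf_text, return_tensors="pt")["input_ids"][0].tolist()
mc_ids = mc_tok(text=mc_text, return_tensors="pt")["input_ids"][0].tolist()
print("same ids:", hf_ids == mc_ids)

# Decode identical ids with both backends; if only this step differs,
# the problem is in decoding rather than encoding.
print("AutoProcessor decode:       ", hf_tok.batch_decode([hf_ids], skip_special_tokens=True)[0])
print("MistralCommonBackend decode:", mc_tok.batch_decode([hf_ids], skip_special_tokens=True)[0])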

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the first code snippet in the System Info section above (tokenizer loaded via AutoProcessor) and compare its output with the MistralCommonBackend snippet.

Expected behavior

Clean decoded text, with byte-level BPE markers converted back to their original characters (spaces instead of Ġ, an em dash instead of âĢĶ, etc.), matching the MistralCommonBackend output above.
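As a concrete acceptance check, a hypothetical helper like the one below (not part of the repro; the name and marker list are made up) should pass on decoded_output from the first snippet:

# Hypothetical check for the expected behavior: no raw byte-level BPE
# symbols should survive in the decoded text.
def assert_clean_decode(decoded_output: str) -> None:
    markers = ("Ġ", "Ċ", "âĢ", "ðŁ")
    leaked = [m for m in markers if m in decoded_output]
    assert not leaked, f"decode leaked byte-level BPE symbols: {leaked}"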
