Add prefix sharing to continuous batching #42094
Conversation
Force-pushed from 43bb315 to 3e1b4f3
Review comment on:

    raise ValueError(f"Invalid group type: {group_type}")
    self.group_cache_managers.append(cm)

    # We only use prefix sharing if the whole model has only full attention layers
Is that a "for the moment" thing?
No, it is not compatible w/ sliding window (VLLM agrees)
I understand, but why not enable it only on the full-attention layers?
For now, there is only sliding-window or full attention. The only other type of attention I know of in transformers is block attention.
Sorry, my question wasn't clear: in models that have a mix of sliding-window and full attention, why not enable prefix caching?
Because the layers with a sliding window overwrite their KV cache once they reach the end of their window, we have to disable prefix caching for those layers. And if we disable prefix caching for one layer of the model, we have to disable it for all layers: we need a full forward pass to build the KV cache for every layer, and we cannot run a forward pass only for the layers without prefix caching.
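A minimal sketch of that gating check, assuming the model exposes a list of per-layer attention types (the names below are illustrative, not necessarily the exact attributes used in this PR):

```python
def supports_prefix_sharing(layer_types: list[str]) -> bool:
    """Prefix sharing is only safe when every layer keeps its full KV cache.

    Sliding-window layers overwrite old cache entries once their window is full,
    so a cached prefix block would no longer match what a fresh forward pass
    would have produced for those layers.
    """
    return all(layer_type == "full_attention" for layer_type in layer_types)


# A model mixing sliding-window and full attention disables the feature entirely.
print(supports_prefix_sharing(["full_attention"] * 4))                       # True
print(supports_prefix_sharing(["sliding_attention", "full_attention"] * 2))  # False
```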
src/transformers/generation/continuous_batching/continuous_api.py (outdated, resolved)
src/transformers/generation/continuous_batching/cache_manager.py (outdated, resolved)
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
Maybe I will proofread again, but I am removing draft for now. Will address review soon.
Removed draft, ready to merge IMO
Benchmarks on H100, with the last version of the example script that adds the

The throughput gain increases as the prefix length increases. For reference, the prefixes in the table above are roughly 0, 40, 60, or 80 tokens long. If we instead add a large system prompt, so that all requests share a 2500-token prefix, the gap is more noticeable. Here are the numbers with flash attention:
ArthurZucker left a comment:
As discussed offline, we might want to simplify the stream: when we request the cache for n blocks, we know the first n-1 are FULL / completed and can thus keep the logic to compute the hash there. This means it can be "scheduled", since you don't need the result until the next forward (so it can be done in the background while the model runs a forward).
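A rough sketch of that scheduling idea, assuming block hashes are plain Python ints and using a single background thread; every name here is hypothetical, not the PR's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)


def _hash_blocks(parent_hash: int | None, completed_blocks: list[list[int]]) -> list[int]:
    """Hash the already-completed (FULL) blocks, chaining each hash to its parent."""
    hashes = []
    for tokens in completed_blocks:
        parent_hash = hash((parent_hash, tuple(tokens)))
        hashes.append(parent_hash)
    return hashes


def schedule_prefix_hashes(parent_hash: int | None, completed_blocks: list[list[int]]):
    """Submit the hashing as a background task: the result is only needed
    before the next forward pass, so it can overlap with the current one."""
    return _executor.submit(_hash_blocks, parent_hash, completed_blocks)


# While the current forward pass runs:
future = schedule_prefix_hashes(None, [[1, 2, 3, 4], [5, 6, 7, 8]])
# ... model forward happens here ...
# Before scheduling the next batch, collect the hashes:
block_hashes = future.result()
```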
| """Returns the number of free blocks left. Both initialized and uninitialized blocks are considered free.""" | ||
| return len(self._uninit_block_ids) + len(self._init_block_ids) | ||
|
|
||
| def is_enough_free_blocks(self, n_blocks: int) -> bool: |
Suggested change:

    -    def is_enough_free_blocks(self, n_blocks: int) -> bool:
    +    def has_enough_free_blocks(self, n_blocks: int) -> bool:
Will do with f2, it's used in other places. Good catch, thanks!
Review comment on:

        # Update loop variables
        parent_hash = block.hash

    def compute_hash(self, parent_hash: int | None, tokens: list[int]) -> int:
Suggested change:

    -    def compute_hash(self, parent_hash: int | None, tokens: list[int]) -> int:
    +    def __hash__(self, parent_hash: int | None, tokens: list[int]) -> int:
`__hash__` only takes `self` as an argument :(
https://docs.python.org/3.5/reference/datamodel.html#object.__hash__
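To illustrate the constraint with a hypothetical `Block` class (not the PR's actual implementation): `object.__hash__` must have the signature `__hash__(self)`, so a hash that also depends on the parent block's hash needs a separate method such as `compute_hash`:

```python
class Block:
    def __init__(self, tokens: list[int]):
        self.tokens = tokens
        self.hash: int | None = None  # filled in once the block is complete

    def __hash__(self) -> int:
        # Python only ever calls this as hash(block): no extra arguments allowed.
        return hash(tuple(self.tokens))

    def compute_hash(self, parent_hash: int | None) -> int:
        # The prefix-sharing hash must also depend on the parent block's hash,
        # so the same tokens at a different position yield a different hash.
        return hash((parent_hash, tuple(self.tokens)))


block = Block([1, 2, 3])
print(hash(block))                 # dispatches to __hash__(self)
print(block.compute_hash(None))    # first block of a sequence
print(block.compute_hash(0xBEEF))  # same tokens, different parent -> different hash in practice
```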
This PR adds a prefix sharing mechanism to the continuous batching API, like the one present in VLLM.
It only activates if the model uses full attention in every layer, as is the case in VLLM.
The mechanism has two main components:
What is missing from this PR:
The PR is a draft until these are resolved, but any early comment is welcome.
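For reference, a minimal, self-contained sketch of the general block-level prefix sharing idea (illustrative only, with hypothetical names and a made-up block size, not the PR's actual data structures): completed KV-cache blocks are indexed by a chained hash of their tokens, and a new request reuses every leading block whose hash is already in the index.

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (made up for the example)


def block_hashes(tokens: list[int]) -> list[int]:
    """Chained hashes of the complete blocks covering a token prefix."""
    hashes, parent = [], None
    for start in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        parent = hash((parent, tuple(tokens[start:start + BLOCK_SIZE])))
        hashes.append(parent)
    return hashes


def match_prefix(tokens: list[int], index: dict[int, int]) -> list[int]:
    """Return the block ids of the longest cached block prefix of `tokens`."""
    matched = []
    for block_hash in block_hashes(tokens):
        if block_hash not in index:
            break
        matched.append(index[block_hash])
    return matched


# A finished request registers its complete blocks in the index...
index: dict[int, int] = {}
prompt_a = list(range(10))
for block_id, block_hash in enumerate(block_hashes(prompt_a)):
    index[block_hash] = block_id

# ...and a new request sharing the first block skips prefill for it.
prompt_b = [0, 1, 2, 3, 99, 100, 101, 102]
print(match_prefix(prompt_b, index))  # -> [0]: only the first block is shared
```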