Conversation

@ddh0 ddh0 commented Dec 11, 2025

This PR implements a new sampler that reshapes token probability distributions to favor tokens near a configurable target probability, rather than selecting from the highest-probability candidates. The technique is called Power Law sampling and it was originally described and implemented by @MrJackSpade here.

How it works

Traditional samplers ask:

"Which tokens are most probable?"

Power Law sampling asks:

"Which tokens are near my target probability?"

This allows controlled exploration of the probability space. Setting a lower target (e.g., 0.45-0.65) favors "interesting but plausible" tokens from the mid-range of the distribution, while higher targets (e.g., 0.85-0.95) behave more like standard samplers. The sampler evolved from ideas similar to Mirostat, but targets probability directly rather than perplexity, for more intuitive control.
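A minimal sketch of this kind of reweighting (my own illustration, not necessarily the PR's exact formula): each candidate's new logit peaks when its original probability equals the target, and falls off quadratically with distance from it. The `width` and `peak` parameters here stand in for the fixed `distribution_width` / `peak_logit_value` constants discussed later in the thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reweighting sketch: tokens whose original probability is
// closest to `target` receive the highest new logit, regardless of where
// they ranked in the original distribution.
std::vector<float> power_law_reweight(const std::vector<float> & probs,
                                      float target,
                                      float width = 0.2f,
                                      float peak  = 3.0f) {
    std::vector<float> logits(probs.size());
    for (size_t i = 0; i < probs.size(); ++i) {
        const float d = (probs[i] - target) / width;
        logits[i] = peak - d * d; // maximum logit when probs[i] == target
    }
    return logits;
}
```

With a target of 0.55, a token that originally had probability 0.55 now out-scores a token that originally had probability 0.9, which is the inversion of "pick the most probable" that the PR describes.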

Adaptive target tracking

The sampler maintains a weighted history of the original probabilities of selected tokens. If recent selections have been higher-probability than the target, it compensates by temporarily lowering the effective target, and vice-versa. This keeps the average selection probability near your configured target over time.
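The adaptive tracking described above could be sketched as follows (names and structure are mine, not the PR's): keep an exponentially decayed average of the original probabilities of selected tokens, and shift the effective target to compensate for drift. With decay d, the average has an effective memory of roughly 1/(1-d) tokens, e.g. d = 0.9 gives a ~10-token history.

```cpp
// Hypothetical sketch of adaptive target tracking via a decayed average.
struct power_law_state {
    float decay;        // e.g. 0.9 -> effective history of ~10 tokens
    float sum  = 0.0f;  // decayed sum of observed original probabilities
    float norm = 0.0f;  // decayed sum of weights, so sum/norm is a mean

    // record the original probability of the token that was selected
    void accept(float orig_prob) {
        sum  = sum  * decay + orig_prob;
        norm = norm * decay + 1.0f;
    }

    // if recent picks ran hotter than the target, aim lower, and vice versa
    float effective_target(float target) const {
        if (norm == 0.0f) {
            return target; // no history yet
        }
        return target - ((sum / norm) - target);
    }
};
```

This is just one plausible compensation rule (a symmetric reflection around the target); the PR's actual weighting of the history may differ.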

Parameters

| Flag | Description | Valid range | Default |
| --- | --- | --- | --- |
| `--power-law-target` | Select tokens near this probability. Negative = disabled. | [0.0, 1.0] | -1.0 |
| `--power-law-decay` | Decay rate for target adaptation over time. Effective history length ≈ 1/(1-decay) tokens. | [0.0, 0.99] | 0.9 |

In most cases, just adjust `--power-law-target`; the decay default of 0.9 (~10-token history) works well. Lower decay values make adaptation more reactive, but the output may start to feel unstable. Higher values like 0.99 make adaptation extremely slow. Decay is clamped to 0.99 to prevent unbounded accumulation, giving a maximum effective history of ~100 tokens.

Negative target values disable the sampler, which then simply samples a token from the untransformed distribution. Since the default target is -1.0, the sampler is disabled by default. This is intentional, since it's a specialized sampler.

Usage notes

This sampler must be last in the chain, like the existing `greedy`, `dist`, or `mirostat` samplers, because it selects a token ID rather than just transforming logits.

The sampler works best when the only other samplers apply light truncation, e.g. `--top-k 64` combined with `--min-p 0.05` to remove very unlikely tokens. Disable penalties, DRY, and most other samplers, as they are not expected to interact well with Power Law.

Example usage

```shell
./build/bin/llama-server -m ~/gguf/my-model.gguf \
    --samplers "top-k;min-p;power-law" \
    --top-k 64 --min-p 0.05 --power-law-target 0.55
```

@ddh0

This comment was marked as outdated.

@ddh0 ddh0 marked this pull request as ready for review December 11, 2025 23:59
@ddh0 ddh0 requested a review from ggerganov as a code owner December 11, 2025 23:59
ddh0 commented Dec 12, 2025

Nevermind, sorry, I think we want to do a little more testing. I'm going to mark this as draft again temporarily.

@ddh0 ddh0 marked this pull request as draft December 12, 2025 02:55
@pnb pnb left a comment
This looks very interesting! I wish the original had compared it to XTC, since the goals seem highly similar.

As an aside, I am curious if there is some way to make it work without selecting a token (i.e., only steps 1-3). I see why token selection is necessary, given the need to save the original probability to the history for the adaptive adjustment part. But, for example, maybe it would suffice instead to save the original probability of the highest-probability token after transforming, regardless of which one is eventually selected by a downstream sampler.


```cpp
// fixed power law transform parameters (from original implementation)
const float distribution_width = 0.2f;
const float peak_logit_value   = 3.0f;
```

Should these parameters be configurable like in the original implementation? There is probably a tradeoff with feature creep, having too many options for users to control, but some of these seem potentially important (especially distribution_width). Also, I noticed peak_logit_value is outside the range suggested in the original implementation; is that intentional?

@ddh0 (Contributor Author) replied:

The original author and I are discussing the parameters over the next few days. I agree that the current implementation is probably not ideal, which is why I marked it back as draft.

I will post a comment in the main thread with an update once we've got it more figured out. Thank you!

@ddh0 ddh0 force-pushed the power-law-sampler branch from 778a00e to 1c58e9a December 15, 2025 04:35
@ddh0 ddh0 marked this pull request as ready for review December 16, 2025 04:35
@ddh0 ddh0 requested a review from ngxson as a code owner December 16, 2025 04:35
ddh0 commented Dec 16, 2025

@pnb I've basically re-done the entire PR since your last comment, as well as updated the top comment with a much clearer explanation. Let me know if I can clear anything up.

Geechan commented Dec 16, 2025

This is a fantastic sampler for creative tasks, and is truly a game changer in this regard. It's difficult to understand how effective it is until you try it for yourself.

I've found a target value of 0.4-0.7 to be excellent for creative tasks, with higher values for more deterministic tasks. It manages to do what XTC does without many of that sampler's pitfalls: a self-correcting, dynamic algorithm keeps it in check far better than a fixed random chance of applying top truncation. A lot of so-called 'AI slop' is heavily over-represented in the top tokens, so this sampler lets a model shine in the still-strong, coherent mid range without drifting too far from its natural distribution (unlike raising the temperature).

I hope to see this merged!

@AesSedai

Also chiming in here as an early tester of this sampler: it's really refreshing for creative tasks, as Geechan mentioned. It breaks the streak of high-confidence token selection that leads to the familiar patterns you get used to, while not impacting coherence.

Overall excited to see this merged in and tested more widely.

@z80maniac
Contributor

How does this sampler handle cases where a high probability is justified? For example, punctuation. Say we have the text `And then he added "Also`. Assuming the model follows English grammar, the next token will be `,` with almost 100% probability. Will the Power Law sampler discard it?

Or what about tokens in the middle of a word? Say there is a text about a man and his tractor, and the prompt ends with `I rode my trac`. The next token must be `tor`, so it will also have almost 100% probability. Will the Power Law sampler discard it?

And if these high probability tokens won't be discarded, then how will the sampler differentiate between useful high-probability tokens and high-probability slop or repetition?

This is all theoretical and maybe it doesn't matter in practice, but I'm just interested if the above cases are somehow accounted for.

@MaggotHATE
Contributor

Very interesting sampler, thank you for the implementation! I like the effect so far, it stays on topic even on long results.

One question: if this sampler must be the last in the chain, why include it alongside other samplers? As it stands, a user can make a mistake by putting it elsewhere, which is probably not what we want. Maybe it's worth adding it to the end of the chain, where `dist` is, and notifying the user that it will always be last if included.


ddh0 commented Dec 17, 2025

How does this sampler handle the cases where high probability is justified? [...] This is all theoretical and maybe it doesn't matter in practice, but I'm just interested if the above cases are somehow accounted for.

The idea is that you're supposed to configure your truncation samplers (like top-k and/or min-p) in such a way that removes garbage tokens from the candidates pool before it even hits Power Law. It's the same for temperature: if you're using a high temperature, you should cut out the nonsense before you apply it. (@z80maniac)

if this sampler must be the last in the chain, why include it alongside other samplers? [...] Maybe it's worth adding it to the end of the chain, where `dist` is, and notifying the user that it will always be last if included.

This is good feedback, thank you. I will consider how to change it so that the power law sampler is guaranteed to always be at the end of the chain, if it's active. (@MaggotHATE)


pnb commented Dec 17, 2025

I took another look through the code and I think the choice of what is a tunable parameter vs. what is a fixed default is great. The knobs to tune make sense, and I tried playing with the other parameters (now constants) without seeing much obvious effect on the text. Overall, the effect of this sampler is a little subtle compared to XTC, but it is noticeable with a low target like 0.05, where lots of excessively popular adverbs disappear from the results.


ddh0 commented Dec 17, 2025

Maybe it's worth adding it into the chain at the end, where the dist is, and notify that it will always be the last one if included.

This is addressed now in 7752998.

Gentle poke to @ggerganov - are there any more changes needed here? What are your thoughts?

6 participants