implement Power Law sampling #17927
Conversation
Nevermind, sorry, I think we want to do a little more testing. I'm going to mark this as draft again temporarily.
pnb left a comment:
This looks very interesting! I wish the original had compared it to XTC, since the goals seem highly similar.
As an aside, I am curious if there is some way to make it work without selecting a token (i.e., only steps 1-3). I see why token selection is necessary, given the need to save the original probability to the history for the adaptive adjustment part. But, for example, maybe it would suffice instead to save the original probability of the highest-probability token after transforming, regardless of which one is eventually selected by a downstream sampler.
src/llama-sampling.cpp
```cpp
// fixed power law transform parameters (from original implementation)
const float distribution_width = 0.2f;
const float peak_logit_value = 3.0f;
```
Should these parameters be configurable like in the original implementation? There is probably a tradeoff with feature creep, having too many options for users to control, but some of these seem potentially important (especially distribution_width). Also, I noticed peak_logit_value is outside the range suggested in the original implementation; is that intentional?
The original author and I are discussing the parameters over the next few days. I agree that the current implementation is probably not ideal, which is why I marked it back as draft.
I will post a comment in the main thread with an update once we've got it more figured out. Thank you!
my git skills are lacking
force-pushed from 778a00e to 1c58e9a
last commit with debug logging!
@pnb I've basically redone the entire PR since your last comment, and updated the top comment with a much clearer explanation. Let me know if I can clear anything up.
This is a fantastic sampler for creative tasks, and truly a game changer in this regard. It's difficult to understand how effective it is until you try it for yourself. I've found a target value of 0.4-0.7 to be excellent for creative tasks, with higher values for more deterministic tasks. It manages to do what XTC does without many of that sampler's pitfalls: a self-correcting, dynamic algorithm keeps it in check much better than a fixed random chance of applying top truncation. A lot of so-called 'AI slop' is heavily over-represented in the top tokens, so this sampler really helps a model shine in the still-strong and coherent mid range, while not drifting too far from established probabilities and the model's natural distribution (unlike adjusting temperature). I hope to see this merged!
Also chiming in here as an early tester of this sampler: it's really refreshing for creative tasks, like Geechan mentioned. It breaks the streak of high-confidence token selections that leads to the familiar patterns you get used to, while not impacting coherence. Overall, excited to see this merged and tested more widely.
How does this sampler handle cases where high probability is justified? For example, punctuation at the end of a sentence. Or what about tokens in the middle of a word? Say there is a text about a man and his tractor, and the prompt ends partway through a word. And if these high-probability tokens won't be discarded, then how will the sampler differentiate between useful high-probability tokens and high-probability slop or repetition? This is all theoretical and maybe it doesn't matter in practice, but I'm just interested in whether the above cases are somehow accounted for.
Very interesting sampler, thank you for the implementation! I like the effect so far; it stays on topic even on long results. One question: if this sampler must be the last in the chain, why include it alongside the other samplers at all? For now it looks like a user can make a mistake by putting it elsewhere, which is probably not what we want. Maybe it's worth adding it into the chain at the end automatically, where the dist sampler would normally go.
The idea is that you're supposed to configure your truncation samplers (like top-k and/or min-p) in such a way that they remove garbage tokens from the candidate pool before it even hits Power Law. It's the same as with temperature: if you're using a high temperature, you should cut out the nonsense before you apply it. (@z80maniac)
This is good feedback, thank you. I will consider how to change it so that the Power Law sampler is guaranteed to always be at the end of the chain when it's active. (@MaggotHATE)
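For illustration, that intended ordering with llama.cpp's sampler-chain API might look like the sketch below. `llama_sampler_init_power_law` and its `(target, decay, seed)` signature are hypothetical stand-ins for whatever this PR actually exposes; the other calls are the existing API.

```cpp
#include "llama.h"

// Sketch: light truncation first, Power Law last (it selects the token).
// llama_sampler_init_power_law(...) is a hypothetical stand-in for the
// PR's actual constructor; everything else is the current llama.cpp API.
static llama_sampler * make_chain() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(64));       // light truncation
    llama_sampler_chain_add(chain, llama_sampler_init_min_p(0.05f, 1)); // cut very unlikely tokens
    llama_sampler_chain_add(chain, llama_sampler_init_power_law(0.5f, 0.9f, LLAMA_DEFAULT_SEED)); // must be last
    return chain;
}
```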
I took another look through the code and I think the choice of what is a tunable parameter vs. what is a fixed default is great. The knobs to tune make sense, and I tried playing with the other parameters (that are now constants) without seeing much obvious effect in the text. Overall I would say the effect of this sampler is a little subtle compared to XTC, but it is noticeable with a low target like 0.05, where lots of excessively popular adverbs disappear from the results.
This is addressed now.
Gentle poke to @ggerganov - are there any more changes needed here? What are your thoughts?
This PR implements a new sampler that reshapes token probability distributions to favor tokens near a configurable target probability, rather than selecting from the highest-probability candidates. The technique is called Power Law sampling and it was originally described and implemented by @MrJackSpade here.
How it works
Traditional samplers ask: "which of the most likely tokens should I pick?"
Power Law sampling asks: "which tokens are closest to my target probability?"
This allows controlled exploration of the probability space. Setting a lower target (e.g., 0.45-0.65) favors "interesting but plausible" tokens from the mid-range of the distribution, while higher targets (e.g., 0.85-0.95) behave more like standard samplers. The sampler evolved from ideas similar to Mirostat, but targets probability directly rather than perplexity, for more intuitive control.
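A rough sketch of the reshaping step, for intuition only: the `candidate` struct and the exact falloff curve are invented here (the PR's real transform may differ), with the two constants taken from the review discussion above.

```cpp
#include <vector>

struct candidate {
    int   id;    // token id
    float p;     // original probability
    float logit; // transformed logit
};

// Assign each candidate a new logit based on how far its original
// probability sits from the target; a softmax over these logits then
// yields the reshaped distribution. Peak when p == target.
static void power_law_transform(std::vector<candidate> & cands, float target) {
    const float distribution_width = 0.2f; // controls how sharply logits fall off
    const float peak_logit_value   = 3.0f; // logit assigned at the target itself
    for (auto & c : cands) {
        const float d = (c.p - target) / distribution_width; // normalized distance
        c.logit = peak_logit_value - 0.5f * d * d;           // peaked, smooth falloff
    }
}
```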
Adaptive target tracking
The sampler maintains a weighted history of the original probabilities of selected tokens. If recent selections have been higher-probability than the target, it compensates by temporarily lowering the effective target, and vice versa. This keeps the average selection probability near your configured target over time.
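A minimal sketch of how that history could work, assuming an exponentially decayed running average (all names here are invented for illustration, not taken from the PR):

```cpp
// After each selection, fold the chosen token's ORIGINAL probability into
// a decayed history; the effective target then compensates for drift.
struct power_law_state {
    float decay    = 0.9f; // ~10-token effective history at the default
    float hist_sum = 0.0f; // decayed sum of selected tokens' original probabilities
    float hist_wt  = 0.0f; // decayed sum of weights, bounded by 1 / (1 - decay)
};

static float effective_target(const power_law_state & s, float target) {
    if (s.hist_wt == 0.0f) {
        return target; // no history yet
    }
    const float recent_avg = s.hist_sum / s.hist_wt; // recent average selection probability
    // if recent picks ran hotter than the target, aim lower (and vice versa)
    return target + (target - recent_avg);
}

static void accept_token(power_law_state & s, float orig_p) {
    s.hist_sum = s.hist_sum * s.decay + orig_p;
    s.hist_wt  = s.hist_wt  * s.decay + 1.0f;
}
```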
Parameters
--power-law-target (default: -1.0)
--power-law-decay (default: 0.9)
In most cases, just play with --power-law-target. The decay default of 0.9 (~10 token history) works well. Lower decay values make adaptation more reactive, but the model may start to feel unstable; higher values like 0.99 equate to extremely slow adaptation over time. Decay is clamped to 0.99 to prevent unbounded accumulation, giving a maximum "effective history size" of ~100 tokens.
Negative target values disable the sampler and just sample a token from the un-transformed distribution. Since the default target is -1.0, the sampler is disabled by default. This is intentional, since it's a specialized sampler.
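For intuition (a quick check of the numbers above, not from the PR itself): with exponentially decayed weights, the total weight is a geometric series summing to 1/(1 - decay), so decay = 0.9 gives an effective history of 1/(1 - 0.9) = 10 tokens, and the 0.99 clamp gives 1/(1 - 0.99) = 100.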
Usage notes
This sampler must be last in the chain, like the existing greedy, dist, or mirostat samplers, because it selects a token ID rather than just transforming logits.
The sampler works best when the only other samplers are light truncation, e.g. --top-k 64 combined with --min-p 0.05 to remove very unlikely tokens. You should disable penalties, DRY, and most other samplers, as they are not expected to play nice.
Example usage
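A plausible invocation combining the new flag with the light truncation suggested above (the model path and prompt are placeholders; --power-law-target is this PR's flag, the rest are existing llama.cpp options):

```sh
llama-cli -m model.gguf \
    --top-k 64 --min-p 0.05 \
    --power-law-target 0.5 \
    -p "Write a short story about a lighthouse."
```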