Exclude Top Choices (XTC): A sampler that boosts creativity, breaks writing clichés, and inhibits non-verbatim repetition #6335

Open · p-e-w wants to merge 5 commits into dev from p-e-w:xtc

Conversation

@p-e-w (Contributor) commented Aug 18, 2024

Background

Apart from some special cases like repetition penalties, all widely used sampling algorithms fall into two categories:

  1. Truncation samplers that remove the "tail" of the predicted probability distribution, with the goal of improving the coherence of the output by getting rid of tokens that don't actually make sense in the current position. Examples: Top-K, Top-P, Min-P, TFS, etc.
  2. Distortion samplers that change the shape of the probability distribution, usually to either improve coherence (by skewing the distribution towards higher-probability tokens) or increase "creativity" (i.e., randomness) by doing the opposite. Examples: Temperature, dynamic temperature, quadratic sampling.

All of these sampling strategies have one thing in common: They don't change the probability order of tokens, and in particular, the most probable tokens from the raw distribution are still the most probable tokens after applying such samplers.

It is therefore unsurprising that existing samplers are somewhat ill-suited for the task of enhancing a model's creativity. The best you can do is either reduce truncation (which will shift the range of acceptable tokens towards the "garbage end" of the distribution), or reshape the distribution to make low probability (garbage) tokens more likely. The result tends to be models going "off the rails" rather than being more creative in the commonly used sense of the word.

What XTC does

This pull request introduces the Exclude Top Choices (XTC) sampling algorithm. XTC is a novel sampler that turns truncation on its head: Instead of pruning the least likely tokens, under certain circumstances, it removes the most likely tokens from consideration.

More precisely, it removes all except the least likely token meeting a given threshold, with a given probability. This ensures that at least one "viable" choice remains, retaining coherence. Truncation samplers can be applied as usual, preventing garbage from being sampled. The result is coherent output (because truncation removes bad tokens) with unprecedented creativity (because XTC removes "boring" tokens).

[Image: xtc]
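
For readers who prefer code to prose, here is a minimal, illustrative sketch of the idea (assumed NumPy code, not the PR's actual implementation, which operates on logits inside TGWUI's sampler stack by setting them to negative infinity): with probability xtc_probability, remove every token that meets xtc_threshold except the least likely one among them, and leave everything else untouched.

import numpy as np

def xtc_sketch(probs, threshold=0.1, probability=0.5):
    # Hypothetical helper: illustrative XTC pass over a 1-D array of token probabilities
    if np.random.random() >= probability:
        return probs                      # XTC only fires with the configured probability

    order = np.argsort(probs)[::-1]       # token indices, most likely first
    above = [i for i in order if probs[i] >= threshold]
    if len(above) < 2:
        return probs                      # fewer than two viable tokens: do nothing

    culled = probs.copy()
    culled[above[:-1]] = 0.0              # exclude all "top choices" except the least likely of them
    return culled / culled.sum()          # renormalize for sampling

# Example: the least likely token above the threshold survives, the rest are excluded
p = np.array([0.5, 0.3, 0.15, 0.05])
print(xtc_sketch(p, threshold=0.1, probability=1.0))   # -> [0.0, 0.0, 0.75, 0.25]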

My experience so far has been that this gives spectacular results. The creativity is off the charts, while the coherence is virtually unchanged. This is especially apparent when regenerating a chat message several times: Models tend to generate roughly the same message structure each time once a sufficiently long context has established expectations. But with XTC enabled, models will often generate messages that are completely different from previous attempts, because eliminating the most likely choices breaks ingrained patterns.

One stone, many birds

XTC doesn't just boost creativity, it also breaks writing clichés and inhibits repetition, including non-verbatim (paraphrased/structural) repetition. It is the first sampler that I'm aware of that can successfully do the latter. Standard repetition penalties operate by first trying to identify repetition and then penalizing tokens accordingly. But detecting paraphrased or structural repetition is extremely difficult, so repetition penalties usually aren't able to prevent it from happening. By contrast, XTC penalizes tokens simply for being very likely, which often includes tokens that reflect the model's tendency to repeat previous output.

Demonstration

The following outputs are not cherry-picked. They were the first outputs I generated with each given configuration.

  • Model: mistral-7b-instruct-v0.2.Q4_K_M
  • Prompt: "Write a story about a wizard who is losing his ability to do magic, and tries everything to get it back."
  • All samplers disabled, except the ones mentioned.

Baseline (Min-P = 0.02)

I'd be happy to help you craft a story about a wizard facing the loss of his magical abilities. Here's a possible beginning:

Once upon a time in the enchanted realm of Eldoria, there lived a revered wizard named Zephyrion. Known for his wisdom and extraordinary magical powers, Zephyrion had served his kingdom faithfully for decades. However, one fateful day, as he attempted to conjure a simple healing spell, his magic failed him. Alarmed, Zephyrion consulted the ancient tomes of Eldoria's magical history, seeking answers to his predicament.

Determined to restore his magical abilities, Zephyrion embarked on a quest. He traveled to far-off lands, seeking the counsel of the most renowned healers, sages, and enchantresses. He delved deep into the forgotten depths of magical knowledge, experimented with rare herbs and potions, and even underwent a series of trials to purge his spirit of any impurities.

As Zephyrion continued his quest, he encountered a myriad of challenges and obstacles. He encountered dark forces that sought to hinder his progress and test the limits of his will and determination. But through perseverance and unwavering resolve, he prevailed, uncovering the secret to reclaiming his magical abilities and, in the process, growing wiser, kinder, and more powerful than ever before.

Would you like me to elaborate on any specific part of this story or create additional characters for the wizard to meet during his journey?

Notes:

  • This is more of an abstract of a story than an actual story. It's extremely generic, with no real plot.
  • "Once upon a time" is the mother of all clichés.
  • "Eldoria" is a well-known name commonly generated by LLMs.

Min-P = 0.02, Temperature = 1.5

Certainly, I'd be happy to help you craft a story about a wizard who is losing his magical abilities. This wizard, named Eldrion, was once the most powerful sorcerer in the realm of Eldoria. However, as he grew older, he began to notice a troubling trend. The spells that once flowed effortlessly from his fingertips now required an unprecedented amount of concentration. The potions that once bubbled to perfection under his gentle stirring now spluttered and threatened to boil over.

Determined not to let his waning powers define his existence, Eldrion embarked on a journey to regain his magical prowess. He consulted the ancient scrolls, sought guidance from the council of elders, and even made a pilgrimage to the enchanted glen where the elves' wise sorceress lived. But as each remedy proved fruitless, a creeping dread began to settle in his heart.

However, he refused to succumb to despair. He began to reflect on the possible sources of his decline - perhaps he had grown complacent in his youth? Or had his magical powers been drawing from the vitality of the natural world without his realizing it? Regardless of the cause, Eldrion believed that the answer must lie within himself, and he devoted himself to mastering the most esoteric of arcane techniques - meditation and self-reflection.

As he journeyed deeper into the meditative arts, he found that his focus grew sharper, and his sense of connection to the natural world was rekindled. Slowly but surely, the power within him stirred once more, manifesting in subtle yet tangible ways. Though his rediscovered abilities were but a shadow of his former brilliance, Eldrion understood that true magical power comes not from the inherent potency of one's abilities, but rather the determination to continue mastering and honing those gifts, no matter the circumstances.

Notes:

  • Definitely more creative than the baseline.
  • "Eldrion from Eldoria" is terrible.
  • Third paragraph is slightly rambling.

Min-P = 0.02, XTC threshold = 0.1, XTC probability = 0.5

In the enchanted realm of Eldrida, resided a wizard named Gadryn, renowned for his profound magical abilities. As the years passed, however, his powers began to wane. Baffled and disheartened, he embarked on a quest to reclaim his lost magical abilities.

He first ventured to the Whispering Woods, where the ancient dryads imparted their ancient wisdom. Gadryn spent days immersed in the forest, meditating and practicing. Yet, no change graced his magic.

Next, he ascended the Crystal Mountains, seeking answers from the sagacious Mountain Spirits. The Spirits instructed him in arcane rituals, but alas, his powers remained elusive.

Gadryn next ventured to the Sapphire Seas, believing that the ancient knowledge concealed within the Ocean's Depths may hold the solution. The Sea Serpent, the guardian of the ocean's deepest wisdom, shared tales of magic being linked to a wizard's pure intention.

Touched by the words of the serpent, Gadryn returned to his tower deep in self-reflection. He questioned the source of his magic's waning, and gradually, he understood: it was his disillusionment with life and the pursuit of power that drained his powers.

Humbled by his experiences, Gadryn embraced a more profound connection to life, nurturing friendships and sharing his knowledge with those around him. And in time, the once waning magic of Gadryn was rekindled.

Notes:

  • No filler phrases where the model tells you what it is about to do.
  • Uses "Eldrida" instead of "Eldoria"! XTC breaks one of the worst clichés in the business.
  • Lots of actual names for places and beings.
  • Clearly the best overall, though obviously not a real story given the primitive prompt and generic model.

How to try out XTC

  1. Check out or merge the xtc branch from my fork.
  2. Start TGWUI, go to the "Parameters" tab, and set xtc_probability to a value greater than zero (0.5 is a good start). I recommend pairing it with Min-P (0.02) and DRY (multiplier 0.8), with all other samplers disabled.
  3. Chat or generate output as usual.

[Screenshot: xtc_params]

If you want to use XTC over the API (e.g. with SillyTavern), you will need to patch the client to send the appropriate XTC parameters, or TGWUI itself to hardcode a non-zero probability. Note that SillyTavern also sends the "sampler priority" parameter, which might interfere with proper operation of XTC unless further patching is done (see next section).

Important note: To use XTC with a GGUF model, you need to use the "llamacpp_HF creator" in the "Model" tab and then load the model using llamacpp_HF, because otherwise Transformers-based samplers have no effect.

Position in the sampler stack

While there is certainly room for experimentation, I strongly recommend placing XTC after all truncation samplers. This ensures that truncation happens based on the original distribution and remains predictable, regardless of how much probability mass is removed by XTC.


@LostRuins commented Aug 19, 2024

Wouldn't you get a similar effect from setting a high temperature after removing all poor candidates?

Let's say you removed all candidates except the top 4 (via top-K/min-P/top-P or whatever)

Bear = 50%
Tree = 30%
Door = 20%
Sword = 10% (after softmax)

Now for more creativity, crank up the temperature, and you end up with
Bear = 27%
Tree = 25%
Door = 24%
Sword = 23%

whereas with XTC you just get
Sword = 100%

how would that be more creative? The first set still has an equally likely chance to pick from a good variety of candidates. Especially considering people like to reroll gens a lot, you probably just end up with many runs of Sword, whereas you have more variety without it.
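
For anyone who wants to poke at the numbers, here is a small illustrative sketch (assumed NumPy code, not from the PR) contrasting the two transformations discussed in this comment, using the rough probabilities above:

import numpy as np

# Rough numbers from the example above (they don't sum exactly to 1, which doesn't matter here)
probs = np.array([0.5, 0.3, 0.2, 0.1])    # Bear, Tree, Door, Sword

# High temperature: divide the log-probabilities by T and re-normalize.
# The distribution flattens, but Bear stays the single most likely token.
T = 8.0
hot = np.exp(np.log(probs) / T)
hot /= hot.sum()
print(hot.round(2))     # ~[0.27, 0.26, 0.24, 0.22]

# XTC with threshold 0.1, firing on this step: every token meeting the threshold except the
# least likely of them is removed, so only Sword survives, exactly as described above.
threshold = 0.1
above = np.where(probs >= threshold)[0]   # all four candidates (probs is already sorted descending)
xtc = probs.copy()
xtc[above[:-1]] = 0.0
xtc /= xtc.sum()
print(xtc.round(2))     # [0.0, 0.0, 0.0, 1.0]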

@LostRuins

The only benefit perhaps would be to remove "toxic" slop tokens from the output, e.g. Shivers down her spine, but then identifying such slop tokens is non-trivial

@lemonteaa

Not sure where to put this: I did a quick, hacky, possibly buggy implementation of XTC for llama-cpp-python using a LogitsProcessor: https://gist.github.com/lemonteaa/d33f33e3f44d009813c02a341d294706#file-xtc-py

(It is hacky because llama-cpp-python's samplers mostly call back to samplers implemented in llama.cpp itself. On the other hand, getting newly proposed samplers merged into the ecosystem's major backends probably requires passing through a rigorous process (for good, legitimate reasons) and is simply going to take time.)

(Disclaimer) The intended usage is for anyone who is impatient and wants a lightweight way to test it out quickly.

@p-e-w (Contributor, Author) commented Aug 19, 2024

@LostRuins

Wouldn't you get a similar effect from setting a high temperature after removing all poor candidates?

I have tried that approach many times. The problem is that this throws away the information contained in the probability distribution, by essentially making all remaining tokens (almost) equally likely. One of the following two things will happen:

If you truncate aggressively, only 1-2 candidates will remain, which are then sampled with near-equal probability. This is the opposite of creativity, as it simply locks in the most likely candidates.

If, on the other hand, you truncate more loosely, the model will start to derail because it can no longer distinguish between likely and less likely tokens. And enhanced creativity is still not guaranteed, because the most likely tokens remain the most likely tokens.

XTC doesn't alter the relative probabilities of tokens, retaining all the information from the distribution. It only excludes high-probability tokens from sampling under certain circumstances.

The output generated with XTC is very different from what happens when you increase the temperature. The best way to convince yourself of that is to try it.

The only benefit perhaps would be to remove "toxic" slop tokens from the output, e.g. Shivers down her spine, but then identifying such slop tokens is non-trivial

Actually, identifying such tokens is quite easy: They will usually be the most probable tokens in the distribution. If the input is "shivers down her", then the probability distribution might be

spine 0.71
back  0.12
...

And in that case, spine will be excluded by XTC. That's precisely why it works. [Now, you may argue that "shivers down her back" is just as bad as "shivers down her spine", but you get my point 😏.]

MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Aug 19, 2024
@LostRuins

Fair enough. This sampler is simple enough that I could probably get a working example sometime soon.

Just thinking, do you have ideas on how a "critical" token can be preserved? I understand that so long as more than one token passes the threshold then the most likely token(s) MUST be discarded - I could imagine some cases where that would lead to extensive degradation of outputs unless the threshold is very high. Would it be worth considering the probability difference between the most likely and the least likely token?

Consider examples with threshold = 0.1

CandidateA = 85%
CandidateB = 11%
CandidatesC+ = ~4%

Here we only have 2 candidates that pass the threshold, and XTC means we MUST pick B. Do you think that is ideal in this scenario, considering how confident the model is of A over B? This is quite different from your Bear/Tree/Door/Sword example. This would more likely be 2 + 3 = Five

Again this is just brainstorming.

@Lissanro

I wanted to test this in SillyTavern (using text-generation-webui with this pull request applied), so I wrote a patch for it. I am sharing it here in case someone else prefers the SillyTavern UI too (at least for me, it makes testing much easier):
https://dragon.studio/2024/08/SillyTaver-Implement-support-for-the-XTC-sampler.patch

Note: In SillyTavern, within "AI Response Configuration" > "Sampler Select" enable "Ooba Sampler Priority Block", "xtc_probability" and "xtc_threshold". Then in "Sampler Priority" section click "Load default order" to make sure it is correct.

@stepfunction83 commented Aug 19, 2024

I played around with this for a while and found that with the default params of 0.1/0.5 it had a tendency to run away with huge amounts of verbosity. I expect this is because the EOS token is being truncated when it really shouldn't be.

I would probably add a parameter to handle the EOS token separately to more accurately control the length of generations. Maybe add a separate probability for excluding EOS from the top candidates where 0 means it's never excluded and 1 means it's always excluded (current behavior).

@LostRuins

Maybe show the logits - why is the EOS not the only candidate with p>0.1 after the AI response was completed? What other candidates were there?

@stepfunction83 commented Aug 19, 2024

This is more of an issue for prose, where the "end" is an open-ended question. There are multiple places where the model could potentially cut itself off.

Per this post on Reddit:

https://www.reddit.com/r/LocalLLaMA/comments/1ev8n2s/comment/livhrdj/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

I'll try setting a logit bias for EOS and see how that works.

@stepfunction83 commented Aug 19, 2024

I'm also wondering whether, instead of excluding all but the minimum above the threshold, the number to exclude could be parameterized as "Exclude Count":

If there were three tokens above the threshold:

"Yes" : 0.3
"No" : 0.2
"Maybe" : 0.1

An "Exclude Count: 1" parameter would exclude the top result, returning "No", while an "Exclude Count: 2" parameter would exclude "Yes" and "No", returning "Maybe". This could be another way of controlling the aggressiveness of the sampler, with 0 defaulting to the normal behavior.

Something like applying the logic only to the top n probs:

        if exclude_n > 0:
            # Mark the top `exclude_n` tokens with probability above the threshold
            sorted_indices_to_remove[..., :exclude_n] = probs[..., 1:exclude_n+1] >= self.threshold
        else:
            # Default behavior: exclude all but one token over the threshold
            sorted_indices_to_remove[..., :-1] = probs[..., 1:] >= self.threshold

I would guess that even excluding just the top one or two could have a large impact on the feel of the result.
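
A small sketch of how such an "Exclude Count" might behave on the example above (hypothetical parameter and helper, following the verbal description rather than the snippet; capping the removal so that at least one above-threshold token always survives is an extra assumption):

import numpy as np

def exclude_count_sketch(probs, threshold=0.1, exclude_n=0):
    # Hypothetical "Exclude Count" variant; probs must be sorted most-to-least likely
    out = probs.copy()
    above = np.where(probs >= threshold)[0]
    if len(above) < 2:
        return out                                        # XTC-style: do nothing with < 2 viable tokens
    if exclude_n > 0:
        removed = above[:min(exclude_n, len(above) - 1)]  # drop at most the top exclude_n, keep one
    else:
        removed = above[:-1]                              # exclude_n == 0: default XTC behavior
    out[removed] = 0.0
    return out / out.sum()

p = np.array([0.3, 0.2, 0.1, 0.05])                       # "Yes", "No", "Maybe", tail
print(exclude_count_sketch(p, exclude_n=1).round(2))      # "Yes" removed -> "No" becomes the top choice
print(exclude_count_sketch(p, exclude_n=2).round(2))      # "Yes" and "No" removed -> "Maybe" on top
print(exclude_count_sketch(p, exclude_n=0).round(2))      # default XTC: only "Maybe" survives above the threshold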

Commit: "XTC only takes effect if at least *two* tokens are above the threshold, so values larger than 0.5 do not make sense"
@p-e-w (Contributor, Author) commented Aug 20, 2024

During development I already considered several additional parameters and mechanisms (probability blending rather than "all or nothing", exclusion count control, token whitelist, ...), but they all add complexity to the user experience, and at the end of the day, there are already two parameters for controlling the strength of XTC and toning down its effects.

As xtc_threshold approaches 0.5, the effect of XTC vanishes, and as xtc_probability approaches 0, the effect of XTC also vanishes. Therefore, if the user experiences detrimental effects (which I don't, with the recommended parameter values), then they can suppress those effects to any desired degree, and still get a one-dimensional control manifold along which to play with the "flavor" of the sampler. That's plenty of control, IMO.

Right now, you can look at any probability distribution, and immediately see what effect a given set of XTC parameters would have on it, without needing to do any computation or even real thinking. The only other sampler for which this is true is Top-K, and it's a feature that I would really like to preserve.

@oobabooga (Owner)

Looks good. I had tried a rudimentary version of this in the past, more similar to top_k/top_p, where the most likely tokens were always removed. Now I see that the missing element was to remove the most likely tokens above a threshold except for the last one. So in cases where only 1 choice is viable, like 4 after 2 + 2 = , that token will still be picked. This also ensures that the language's grammar and punctuation will be respected.

@p-e-w have you experienced any issues with the model failing to stop due to the EOS/EOT tokens not being generated while using these new parameters?

@stepfunction83

That's a fair point. There's something beautiful about min_p: it's simple, elegant, and easy to understand.

Due to the effectiveness of XTC, I would also expect this to become a standard inclusion for any sort of creative generation going forward, so some more granular control may be nice to have.

If there are added parameters, it would still work just fine without using them, but the option could exist all the same for people to experiment with.

Then again an exclude_n may not add any real value, but it does feel like a natural generalization of the base sampler. In fact, it's kind of like an inverted top-k.

@p-e-w (Contributor, Author) commented Aug 20, 2024

@oobabooga

have you experienced any issues with the model failing to stop due to the EOS/EOT tokens not being generated while using these new parameters?

No. In fact, I haven't noticed any artifacts with the recommended parameter values. My real-world testing has mostly consisted of adventure-style roleplay chat with these parameters:

  • Model: L3-8B-Stheno-v3.2
  • Min-P: 0.02
  • DRY multiplier/base/allowed length: 0.8/1.75/2
  • XTC threshold/probability: 0.1/0.5

Message length was what I'm used to, and I saw no special characters or other garbage in the output. Number of paragraphs was also like before, even though XTC can theoretically suppress line breaks in some situations.

What is your opinion on additional parameters like those proposed in the comments above? They are easy enough to add of course, but I'm worried that there will simply be too many knobs to turn. Determining when output is "better" is difficult enough even in the best case, but with more than two parameters that all control XTC's strength in some sense, the parameter space would be overwhelmingly large and any perceived improvement might be little more than noise.

@stepfunction83

I won't belabor the point then. This is a great feature and the amount of testing you've done probably outweighs the few tests I've done. Raising the threshold would probably result in a similar effect to excluding the top n.

@RedDragonGecko

In my limited testing with a Mistral Large model, while it did improve creative writing dramatically, it also seemed to make the model much dumber.

@stepfunction83 commented Aug 20, 2024 via email

@RedDragonGecko

I used your suggested settings: xtc_probability (0.5), Min-P (0.02), DRY (multiplier 0.8), with all other samplers disabled.
I used BigHuggyD_NeverSleep_Lumimaid-v0.2-123B_exl2_5.0bpw_h8 loaded at 32768 context length, split over four 3090s.
In my testing, dumb mistakes like "He got out of the car and drove off." occurred often.

@stepfunction83

Try raising the XTC threshold to 0.2 from 0.1 and see how it feels then. Increasing the threshold should result in fewer low-probability tokens being selected.

@p-e-w (Contributor, Author) commented Aug 21, 2024

@RedDragonGecko

in my limited testing with a mistral large model, while it did improve creative writing dramatically it also seemed to make the model much dumber.

Mistral models are famous for running hot by default (for NeMo, Mistral officially recommends a temperature of just 0.3). This means the top logits are more tightly packed together than for other models, which can lead to unusually many tokens meeting the default threshold of 0.1, which in turn leads to many sensible choices being discarded, so nonsense is occasionally generated.

As suggested by @stepfunction83, you can try raising the threshold to get a more sensible cutoff. You could also lower the temperature, as long as you don't have the "temperature last" sampling option active.

@Lissanro commented Aug 21, 2024

After extensive testing over the last few days, I think what is missing is a list of tokens to exclude from the effect of XTC. It could work just like dry_sequence_breakers, but in this case as a list of tokens exempt from the effect (so if a token is the most probable and it is in the exclusion list, it is not cut off).

As it is now, XTC can cut off end-of-stream tokens, newlines, and other things like "```". This can break workflows: for example, if I want more creativity while generating one prompt at a time in a text block, it can mess up the formatting by missing a newline before ending the text block, or fail to end the message and generate more than one text block. It is even more unstable if more than one block is supposed to be generated.

Just adjusting the threshold or probability does not achieve the desired effect; quite the opposite: the issue can still occur, just less frequently, while the output becomes less creative.

There are more nuanced cases as well. For example, character names can sometimes be chosen incorrectly, especially if a character can be referred to in more than one way, like a title and a name, and throwing out the most probable option causes an unwanted change in style, making it either more formal or more casual than it should be.

Having a field with a list like xtc_sequence_exclusions (implemented similarly to dry_sequence_breakers, as a comma-separated list of quoted strings) would completely solve this. Perhaps consider setting it by default to newlines, "```", and end-of-stream tokens; if someone wants longer paragraphs or messages, they could just delete those from the list, so it should be easy to adjust and understand. And adding custom strings, based on personal use cases, would add great flexibility.

@LostRuins

@p-e-w By the way, I implemented it in koboldcpp too; I think I got it right:
LostRuins/koboldcpp@5bf527a

The only part to confirm: if fewer than 2 tokens exceed xtc_threshold, the sampler does nothing, correct?

@stepfunction83 commented Aug 21, 2024 via email

@Ph0rk0z (Contributor) commented Aug 30, 2024

I wouldn't really hold off on it. Try it on some of your models, it's fun. Keep in mind the limitations.

I finally saw it eat newlines in addition to the EOS token, with Euryale 2.2 (based on Llama 3.1 70B). On Mixtral/Qwen models I was never able to reproduce that.

I'm not sure if this is 100% right (I'm a bit of a dunce at this), but it did seem to make the replies shorter in cursory testing.

# Exclude Top Choices (XTC)
# (imports added for context; in TGWUI they are already available at module level of the sampler code)
import random

import torch
from transformers import LogitsWarper

from modules import shared


class XTCLogitsWarper(LogitsWarper):
    def __init__(self, threshold: float, probability: float, filter_value: float = -float("Inf")):
        self.threshold = threshold
        self.probability = probability
        self.filter_value = filter_value

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Only apply XTC with the configured probability
        if random.random() >= self.probability:
            return scores

        sorted_logits, sorted_indices = torch.sort(scores, descending=True)
        probs = sorted_logits.softmax(dim=-1)

        sorted_indices_to_remove = torch.full_like(probs, False, dtype=torch.bool)

        # Set indices to True for tokens that should be removed:
        # a (sorted) token is removed if the next, less likely token still meets the threshold
        sorted_indices_to_remove[..., :-1] = probs[..., 1:] >= self.threshold

        # Convert sorted_indices_to_remove to the original indices
        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)

        # Ensure newline and EOS tokens are not removed
        indices_to_remove[:, shared.tokenizer.encode("\n")[-1]] = False
        indices_to_remove[:, shared.tokenizer.eos_token_id] = False

        scores = scores.masked_fill(indices_to_remove, self.filter_value)
        return scores

edit: fixed newline

@Kaszebe commented Aug 31, 2024

How do I install this on Ubuntu? Do I just install it and have it as my 2nd installation of Oobabooga?

@ReMeDy-TV commented Sep 1, 2024

How do I Git clone your xtc branch? I'm not convinced it's grabbing the xtc files.

I tried git clone https://github.com/p-e-w/text-generation-webui.git

Or is the only thing that changes the extensions folder? In which case I can just copy that over into my existing Ooba install.

@stepfunction83 commented Sep 1, 2024 via email

@p-e-w (Contributor, Author) commented Sep 2, 2024

Can anyone name a model smaller than 70B with which the "runaway length" problem can reliably be reproduced? I can't run 70B models, and I'm quite reluctant to implement a fix for a problem if I can't verify it actually works.

Thinking about it theoretically, I realized that the naive approach of just preventing EOS/EOT/newline etc. from being eliminated might not be the right solution. That's because by shielding those tokens from being penalized, you are making them more likely to be chosen than they naturally would be. If newline and some other token (and a third one) are above the threshold, and you eliminate the other token but not the newline token because of the proposed countermeasure, you have made newline (potentially a lot) more probable than it was before. As a result, you might regularly see shorter outputs than you want.

A better fix might be to check if one of the protected tokens would be eliminated, and if that is the case, do nothing at all, i.e., don't eliminate any tokens. This would keep the probability of the critical tokens unchanged, and leave output/message length completely unaffected by XTC.

@Ph0rk0z (Contributor) commented Sep 2, 2024

Well... funny you say that. For some reason, preventing removals breaks CR+. Would your approach result in the sampler not activating most of the time? I guess I should try it and just put in a print statement.

OK, so exllama had a problem with CR+, not the sampler.

I tested returning the scores as they were whenever \n or EOS would be removed, and I can see the sampler still activates most of the time. The replies are longer again.

In case anyone wants to try:


# Exclude Top Choices (XTC)
class XTCLogitsWarper(LogitsWarper):
    def __init__(self, threshold: float, probability: float, filter_value: float = -float("Inf")):
        self.threshold = threshold
        self.probability = probability
        self.filter_value = filter_value
 
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if random.random() >= self.probability:
            return scores
 
        sorted_logits, sorted_indices = torch.sort(scores, descending=True)
        probs = sorted_logits.softmax(dim=-1)
 
        sorted_indices_to_remove = torch.full_like(probs, False, dtype=torch.bool)
 
        # Set indices to True for tokens that should be removed
        sorted_indices_to_remove[..., :-1] = probs[..., 1:] >= self.threshold
 
        # Convert sorted_indices_to_remove to the original indices
        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
 
        # Check if newline or EOS tokens would be removed
        newline_removed = indices_to_remove[:, shared.tokenizer.encode("\n")[-1]].any()
        eos_removed = indices_to_remove[:, shared.tokenizer.eos_token_id].any()
 
        # If newline or EOS tokens would be removed, return the original scores
        if newline_removed or eos_removed:
          #  print("we did it reddit")
            return scores
        # Otherwise, remove tokens with the mask
        scores = scores.masked_fill(indices_to_remove, self.filter_value)
        #print("back in the hole!")
        return scores

@oobabooga (Owner)

So this is the relevant addition

        # Check if newline or EOS tokens would be removed
        newline_removed = indices_to_remove[:, shared.tokenizer.encode("\n")[-1]].any()
        eos_removed = indices_to_remove[:, shared.tokenizer.eos_token_id].any()
 
        # If newline or EOS tokens would be removed, return the original scores
        if newline_removed or eos_removed:
          #  print("we did it reddit")
            return scores

@p-e-w I'm fine with doing this under the hood like this and not adding any new parameter.

@Lissanro commented Sep 3, 2024

I think it is fine to handle the EOS token under the hood, but it could be better to have the newline in an exclusion list string similar to the one we already have for DRY (which also includes it by default), so it would be easy to add more exclusions as necessary.

@p-e-w (Contributor, Author) commented Sep 3, 2024

@oobabooga

IMO this change is a bit premature, though it's of course fine to temporarily have in the PR for people to play around with.

The main issue is that unlike with DRY, there is no good theoretical justification for treating some tokens differently than others. DRY prevents repetition of sequences, but some sequences (like instruction templates) must be repeated in order for output to be well-formed, so special-casing them makes sense.

But XTC is supposed to increase creativity. And "creativity" includes creativity with output length. There is nothing wrong with having greater variation in response length any more than having greater variation in the choice of adverbs. A priori, having an exclusion list for XTC makes no more sense than having such a list for Min-P.

I strongly suspect that the problem this hack is supposed to solve can actually be addressed by combining XTC with existing samplers. If you can reproduce the problem, try the following setup:

  1. Sampler priority Min-P -> Temperature -> XTC (with "temperature last" disabled)
  2. Temperature < 1
  3. XTC parameters as recommended above

This increases the "confidence" of the prediction, which should make it more likely that high-probability tokens are the only ones above the threshold and don't get culled.
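
A small numerical sketch of why this helps (illustrative NumPy code with made-up logits, assuming "temperature last" is disabled so temperature is applied before XTC):

import numpy as np

def tokens_above_threshold(logits, temperature, threshold=0.1):
    # How many candidates would XTC treat as "top choices" after temperature scaling?
    p = np.exp(np.array(logits) / temperature)
    p /= p.sum()
    return int((p >= threshold).sum()), p.round(2)

logits = [3.0, 2.0, 1.5, 1.0, 0.0]   # made-up logits

print(tokens_above_threshold(logits, temperature=1.0))
# -> (3, [0.56, 0.21, 0.13, 0.08, 0.03]): when XTC fires, it culls the top two candidates

print(tokens_above_threshold(logits, temperature=0.7))
# -> (2, [0.7, 0.17, 0.08, 0.04, 0.01]): only the single top candidate gets culled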

Several other such sampler adjustments come to mind, but a solution that special-cases some tokens just feels wrong. It's equivalent to allowing users to exclude output length from creativity, which raises the question why not also have an option to exclude names from creativity etc. No other sampler supports such arbitrary special treatment, AFAIK.

@oobabooga (Owner)

I'm not too attached to this addition either; it just seemed like a lot of people were interested in it. As for me personally, I like the idea of the sampler and would rather keep it simple. It's a sampler that messes with logits in an aggressive way, so naturally it may sometimes cause derailments. If it ended up messing with the EOT or the newlines, I'd just try again and regenerate.

Is there a reason to not revert my last commit above and merge this PR?

@Lissanro commented Sep 3, 2024

I actually covered the need to exclude names or other strings in my previous messages. This is especially true for characters that can be addressed in multiple ways, where the most probable way is supposed to remain the most frequent; no setting other than an exclusion list can solve this. The same issue exists with references to places, if I want to keep their probabilities intact. More than that, not being able to exclude custom strings can cause unwanted oscillation effects: first XTC forces the model to choose the least probable way to address a character, then the model sees its own examples and that way becomes the most probable (so XTC removes it), and it switches again. The same can happen with references to places, for the same reason, making it harder to steer the story.

There are other workflows where an exclusion list is essential, for example prompt generation, which can benefit from added creativity but also needs the probabilities of certain keywords, which must remain common, to stay untouched.

Some people suggested having a checkbox for the EOS token. Personally, I do not have a strong opinion about this: I am fine if EOS exclusion is hardcoded, and I am also fine if it becomes a checkbox. But having an exclusion list that I can customize is definitely important, and it can greatly affect the usefulness of the sampler. There are a lot of other things I would like to put in it, including periods and backtick symbols.

The way I see the exclusion list working is that XTC behaves as if it were not enabled when there is a match. People who do not need it can just leave it alone with the default value (for example, a newline string), so I think it would not get in the way; DRY already has a similar string list that could be used as an example in case it is not clear what I mean.

@p-e-w (Contributor, Author) commented Sep 3, 2024

@oobabooga

I propose to leave this PR open for another week or so. Discussion has been very lively, with a variety of opinions voiced. There is now a commit that might address the most common complaint, and at minimum I'd like to hear confirmation from some of the people who have experienced runaway output length that this change indeed fixes the issue. I'm also still hoping that someone might be able to name a model below 70B that exhibits this behavior in the first place, so I can try alternative solutions.

FWIW, I don't see anything wrong with people hacking around on sampler code in their local copies of inference software to tune the output to their own preferences. I do this all the time, and it's not much more complicated than adjusting parameters really. TGWUI even allows doing this through extensions, keeping the code clean. The fact that EOS doesn't have a universal string representation makes it difficult to find a clean parameter-based solution that works for all models here: Either EOS exclusion is hardcoded (even though some people might prefer the default behavior), or ad-hoc placeholder strings need to be introduced (which is really ugly), or numeric token IDs are used (which is cumbersome for the user).

@Ph0rk0z (Contributor) commented Sep 3, 2024

I like creativity as much as the next guy, but rambling blocks of dense text aren't it. It was kind of the equivalent of suppressing the EOS token after a while. With the newer method, it doesn't cut the models short anymore, so it still produces heftier output.

If people want to exclude other things, I'd say passing them into the function and having something pre-tokenize them would be ideal. Then it only has to deal with the token IDs.

Re-rolling helped things out too, but for people who get fewer t/s it might not be ideal to have to do it 2-3 times per output in a chat. For me the decider was when, using a 70B tune, the character went from mid-length creative replies to newline-free 500-token blocks of text, gradually enough for the context to lock it in.

It reproduces regardless of having the samplers set as recommended. With higher temperatures, it seemed to make the model more incoherent, as it started swapping in words that no longer made sense and weren't grammatically correct. It's easy to make that more extreme by simply setting the parameters to 0.01 and 0.9, and then you can look for this kind of effect in your outputs at more sane settings. It can be subtle but still there. For me it's undesirable.

@p-e-w (Contributor, Author) commented Sep 4, 2024

@Ph0rk0z

  1. Why aren't XTC's parameters working for you? What happens if you increase the threshold, or lower the probability? In the limit, those parameters disable XTC, so this is guaranteed to fix any problems you have. Why isn't this sufficient?
  2. Have you tried my above recommendation of combining a low temperature with XTC in the specified sampler order?
  3. Can you reproduce this problem with any model smaller than 70B?

@Ph0rk0z (Contributor) commented Sep 4, 2024

I don't have a lot of models below 70B that I use regularly, and nothing below 30B. I was hoping someone who enjoys those would find one.

I already use low temperature, and the parameters do work for me, but in that case it means turning XTC off or having a negligible effect.
When the threshold is high and the probability is low, I mostly got nothing beyond some extra alignment or the odd word swap.

With the threshold set lower and the probability slightly higher, it generates real kino. Filtering for those two tokens, it isn't as likely to degrade. For me it's a win-win.

@Alexey-Akishin

The latest patch feels like an improvement after using it for a few days. I am testing with Mistral Large. I also tested with Magnum, but it is based on Mistral Large; I did not find any smaller models that work well for me.

I am fine with EOS being hardcoded if it is hard to make it controllable... but I feel that an exclusion list is definitely a necessity for this sampler. It could also make the exclusion of newlines controllable without any additional checkboxes, just by having the newline as an item in the list by default.

When actually using this sampler for writing, I need fine control as a writer. There are many phrases, names, or words which I want ignored by the XTC sampler (so it would behave as if turned off when processing the current token, if the current and previous tokens match a string I added to the exclusion list).

As it is now, if some phrase is supposed to be used most of the time but another phrase with a similar meaning may also be used, then without the possibility to exclude it, the rarer option will be forced much more often, so I have to stop the generation, manually edit it, and continue. Otherwise, not just the phrase but the overall tone of the writing may change in the wrong direction, especially in longer generations. Of course, no matter how good a sampler is, I still have to do manual edits in the middle of a generation; I do not expect the model to be perfect 100% of the time.

This problem arises not just with phrases. Let's say I give context implying that a variety of items may be discussed, with one item that needs to be mentioned more often, but the sampler forces the opposite to happen. So, for example, I make multiple generations, and instead of the expected distribution I get the opposite of what I wanted, because I could not add the necessary keywords to an exclusion list. Even within a single generation it presents a problem and sometimes makes it harder to steer the model the way I want, because it keeps inserting less probable words, names, or phrases more often. Generally this is a good thing, and it is why I like this sampler; but as a writer, if I know the exact words or phrases that I do not want to be affected, I really miss an option to add them. I hope it is possible to implement.

Like someone else mentioned, a simple list of strings like the one for the DRY sampler would work great. I think it is even more important here, because I don't remember ever changing the DRY list of strings... but I definitely would be changing the XTC list of excluded strings if it were implemented, adjusting it depending on the story, and perhaps keeping some custom profiles I could quickly switch between.

@p-e-w (Contributor, Author) commented Sep 8, 2024

I'm at a loss for what to do here. Every single report of problems mentions 70+B models. I don't even have the ghost of a theory for why larger models are affected but smaller ones are not (I now have several thousand messages generated with 4 different models <= 35B, and not a single case of excessive message length or missing newlines).

The best proposed "solution" is either hardcoding or partially-hardcoding a bandaid exclusion list, even though there is no theoretical justification for treating EOS/newline differently than any other token.

I'm leaning towards recommending that this PR be merged in its original form without special-casing any tokens, and that people experiencing problems patch the sampler code themselves to do what they want, until we get a better understanding of what is actually going on here. I accept and believe that there are problems for certain use cases with certain models, but I don't think adding ad-hoc parameters is a good idea just because they have been observed to alleviate some of those problems in some cases. Other truncation samplers don't have special treatment for specific tokens either.

I have described multiple times why it doesn't make sense that XTC should introduce behaviors that don't happen without it, since by construction such behaviors would happen with a significant probability even with XTC disabled. The fact that some reports appear to contradict this demonstrates that we currently lack an understanding of the actual mechanics that cause runaway outputs. I don't believe that trying to fix this issue without understanding it is the right way forward.

@stepfunction83 commented Sep 8, 2024 via email

@p-e-w (Contributor, Author) commented Sep 8, 2024

I would argue that most people would desire the improved word variety that comes with XTC, but would not want it to impact paragraph or generation length.

If they do, that should be an explicit choice to do so.

Every existing sampler impacts paragraph and generation length, and none provide an option to prevent that.

If you have Min-P set to 0.1, and the newline token has a probability below 10% of the top token's probability, then Min-P will set that probability to zero, and suddenly a newline cannot occur at that position when previously it could. If you increase or decrease the temperature, the probability of newlines etc. will be altered.

By definition, samplers modify token probabilities, and since token probabilities control paragraph and output length, all samplers impact those.

But there is no reason to expect that XTC would distort paragraphs and output lengths to an observable degree, and indeed it doesn't, for any model that I have tested. The fact that some users have observed distortions shows that we don't understand what is happening in those cases, and "let's tape over that with a feature that no other sampler provides" isn't the right answer.

I'm unconvinced that the problems seen with some large models cannot be solved by simply combining XTC with other samplers. All existing truncation samplers are content-agnostic, and I don't see why XTC shouldn't be as well.

@stepfunction83

The difference is that existing samplers touch the tail of the distribution, not the head. Min-p excluding low probability tokens is very different than XTC excluding the highest probability ones.

XTC is basically restricting the "intelligence" of the model instead of trying to enhance it, so avoiding applying that to key structural tokens would be desirable.

My other point is that there's no harm in providing flexibility to users. If they don't need it, they can ignore it. If they want to engage with it, they can engage. Limiting functionality just because it's been done that way so far is not a good argument in my view.

@Ph0rk0z (Contributor) commented Sep 8, 2024

I don't see much harm in a checkbox. It makes both sides happy. It still writes long when the tokens are not biased towards EOS/newline. There is no way that any truncating sampler can catch this behavior from the other end. How do you push specific tokens to be in exactly the right middle ground? IME, XTC has had to be run at low temperature anyway to keep the models from going incoherent. I tried to use a token bias on EOS and it didn't work that well either.

The theory seems simple: when newline or EOS is among the top choices, it gets eliminated and the model keeps writing. The alternative is large models outputting long, dense, rambling text after some messages. Set up correctly, XTC is kino. Set up incorrectly, it's thesaurus schizo.

@p-e-w (Contributor, Author) commented Sep 9, 2024

I rented an A100 server this morning, planning to test XTC with large models, but I couldn't get TGWUI to work with CUDA 11.8. So I still haven't been able to reproduce these problems.

This issue is quite deep, and I think it's time to separate it from this PR, which is working fine for many people including myself.

@oobabooga

I propose the following:

  1. Revert 0f62744
  2. Merge this PR in its original form
  3. Anyone who encounters problems with XTC can propose a solution in a new PR, where it can be discussed on its own merits.

That way, the vast majority of users who aren't running 70+B models can enjoy the benefits of XTC, and potential improvements for very large models can be tested and evaluated separately.

@morbidCode

I tried your original settings with Magnum-v2-70b, and I think the result is beautiful.

Prompt: Write a story about a battle to the death between 2 men: Jeff, who controls fire, and John, who controls water.

Output:
It was a hot summer day when Jeff and John faced each other on the empty beach. The waves lapped at their feet as the two men took their stances, each ready to fight to the death for ultimate elemental supremacy.

Jeff focused his mind, raising his arms towards the sky. With a snarl of effort, he summoned flames from thin air. The fire coalesced into a searing blaze in his hands. The heat radiated out, crackling with power.

John responded by diving into the cool waters at his feet. He submerged himself fully, then erupted upwards, spewing a geyser of water into the air. The liquid took shape, cohering around his outstretched arms like a shimmering shield.

"Come then, you charred pretender!" John called out defiantly over the roar of the sea.

Jeff responded with only a glowering glare, flinging balls of scorching fire at his foe. John dodged them nimbly, the flames hissing as they struck the beach. In turn, he sent tidal waves crashing over Jeff, drenching the flames.

They battled back and forth like this for long moments - John's surging deluges snuffing out Jeff's infernal blasts, only for the fire user to conjure new flames, burning through John's drenching waters.

The sand sizzled and steamed from the elemental clash. The sun beat down mercilessly. Finally, they tired of this stalemate.

Simultaneously, the two elemental warriors unleashed their greatest efforts. Jeff called down a pillar of white-hot fire from the heavens above. John called up a tsunami of water from the depths below.

Fire met water in the middle, an explosion of scalding steam filling the sky. The two fighters grappled together amidst the blinding fog of their battle's final release.

When the steam dissipated, only one figure stood. Jeff, blackened and blistered but triumphant.

"I've claimed my rightful place as king of the elements!" he cackled with blackened, burning lips.

But his laughter cut off short, and his body went limp, crumpling to the scoured beach. An arrow, dripping with water's essence, protruded from his smoking breastbone.

John walked out from behind the cover of a nearby rock formation. "You made a critical error, old friend. Did you really think I didn't have one final trick up my sleeve? You were so focused on playing with fire, you forgot about my trusty bow and quiver full of enchanted arrows."

The water user retrieved his bow from the fallen Jeff's corpse. He nocked a single arrow, aimed at the lifeless form. "Now, to the grave with you, traitor and foe!"

He fired one last shaft into Jeff's unmoving body, ensuring the fire mage was well and truly gone. John stood over his enemy, a satisfied grin on his face, triumphant at last over his former companion, and the master of both water and fire.

And that, as they say, is that!

indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)

# If newline or EOS tokens would be removed, return the original scores
if indices_to_remove[:, self.special_token_ids].any()
@feffy380 commented Sep 13, 2024

Missing : at the end. PR doesn't run in the current state :P

@Hardeh commented Sep 19, 2024

I switched to the SillyTavern staging branch and tested XTC with Nemo-based 12B models, and to me it seems like the model is definitely dumber, struggling with facts from the context here and there, at least with the default 0.1/0.5 settings. But yes, the writing is more creative, although I'm not sure it's worth it.

@MaggotHATE

Nemo-based

That's because they are finetuned from an already good model to have better responses at the top; as such, XTC can negate the work done by finetuning. Nemo Instruct works well with 0.1/0.5.

A way to control that would be an upper limit, such as an xtc_threshold_max, which I've implemented for myself. In my experience it helped with mini-magnum, for example. However, it's not part of the base idea.
