Commit 95074e5: add generation quality debugging docs
Andrew Lapp committed Feb 6, 2024 (1 parent: 8896cdb)
Showing 1 changed file with 92 additions and 0 deletions.
---
title: Debugging
---

# Debugging

Language models are a new and complex area of research, and adding constrained generation to them can surface unexpected issues. The debugging methods documented here help determine the source(s) of a problem.

## Understanding Generation Quality Issues

Language models determine the probability of every possible next token based on the full sequence of preceding tokens. In Outlines, we can either choose the highest-probability next token ("Greedy") or select the next token randomly, weighted by its probability ("Multinomial").
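
For intuition, here is a minimal sketch of the two selection strategies on a toy next-token distribution, written in plain PyTorch. The probabilities are made up for illustration; Outlines applies the same idea over the model's full vocabulary.

```python
import torch

# Toy next-token probabilities for four candidate tokens (made-up values).
probs = torch.tensor([0.56, 0.21, 0.12, 0.11])

# Greedy: always pick the single highest-probability token.
greedy_token = torch.argmax(probs).item()

# Multinomial: sample a token at random, weighted by its probability.
multinomial_token = torch.multinomial(probs, num_samples=1).item()

print(greedy_token, multinomial_token)  # e.g. 0 0, or 0 2 on another run
```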
If the output quality is lacking, the model (or prompt) might not be well suited to your particular use case. Logging next-token probabilities, derived from the model's logits, can help with troubleshooting. This can be accomplished via `outlines.logging.enable_logits_logging()`.

_(Note: `enable_logits_logging()` will slow down generation and shouldn't be used in production.)_

### Example

In this debugging example we attempt to extract the sentiment of a restaurant review, but at first the model struggles.
```python
import outlines.logging

# Enable next-token probability logging before loading the model and generating.
outlines.logging.enable_logits_logging()

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?
Review: This restaurant is just awesome!
"""

# Constrain generation to one of the two labels.
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
```

#### Output (click to expand)

<details>

```
Selected: 'N' for batch_item=0
Top Raw Tokens: 'The': 0.560, '\n': 0.212, 'I': 0.106, 'Great': 0.017, 'From': 0.017, 'We': 0.013, 'F': 0.011, 'They': 0.006, EOS: 0.006
Top Guided Tokens: 'P': 0.456, 'N': 0.330, 'Pos': 0.173, 'Ne': 0.018, 'Neg': 0.016, 'Po': 0.007, 'P': 0.000, 'N': 0.000, EOS: -0.000
Selected: 'ega' for batch_item=0
Top Raw Tokens: 'ice': 0.964, 'ut': 0.006, 'at': 0.006, 'ood': 0.004, 'ic': 0.003, 'ick': 0.002, 'ear': 0.002, 'umer': 0.002, EOS: 0.002
Top Guided Tokens: 'eg': 0.514, 'ega': 0.419, 'e': 0.065, 'e': 0.001, '\x00': 0.000, '\x04': 0.000, '': 0.000, '\x01': 0.000, EOS: -0.000
Selected: 't' for batch_item=0
Top Raw Tokens: 'ive': 0.442, 'ative': 0.381, 'тив': 0.054, ':': 0.030, 'iv': 0.006, 'itive': 0.006, 'Review': 0.005, 'ativ': 0.003, EOS: 0.003
Top Guided Tokens: 't': 0.903, 'ti': 0.097, 't': 0.000, '\x04': 0.000, '\x00': 0.000, '': 0.000, '\x01': 0.000, '\x02': 0.000, EOS: -0.000
Selected: 'ive' for batch_item=0
Top Raw Tokens: 'ive': 0.993, 'ion': 0.005, 'iv': 0.002, 've': 0.000, 'ivity': 0.000, 'ively': 0.000, 'ives': 0.000, 'if': 0.000, EOS: 0.000
Top Guided Tokens: 'ive': 0.998, 'iv': 0.002, 'i': 0.000, 'i': 0.000, '\x00': 0.000, '\x04': 0.000, '': 0.000, '\x01': 0.000, EOS: -0.000
Selected: '' for batch_item=0
Top Raw Tokens: ':': 0.625, '\n': 0.227, 'or': 0.073, ',': 0.022, '/': 0.016, '.': 0.008, '?': 0.007, '-': 0.003, EOS: 0.003
Top Guided Tokens: EOS: 1.000, '': 0.000, '\x04': 0.000, '\x01': 0.000, '\x00': 0.000, '': 0.000, '\x02': 0.000, '\x03': 0.000
```

</details>

#### Analysis

The model incorrectly classified the review as "Negative".

We can observe in the "Raw Tokens" output that, prior to constraining generation to "Positive" / "Negative", the most likely next token was `The`, and the tokens that allow a legal generation had very low probabilities:
```
Top Raw Tokens: 'The': 0.560, '\n': 0.212, 'I': 0.106, 'Great': 0.017, 'From': 0.017, 'We': 0.013, 'F': 0.011, 'They': 0.006, EOS: 0.006
Top Guided Tokens: 'P': 0.456, 'N': 0.330, 'Pos': 0.173, 'Ne': 0.018, 'Neg': 0.016, 'Po': 0.007, 'P': 0.000, 'N': 0.000, EOS: -0.000
```
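
As a rough check, we can sum the guided-token probabilities from this first step by the label they begin (values copied from the log above). Even under the constraint, roughly a third of the probability mass starts a "Negative" generation:

```python
# Guided-token probabilities from the first generation step, grouped by label prefix.
positive_prefixes = {"P": 0.456, "Pos": 0.173, "Po": 0.007}
negative_prefixes = {"N": 0.330, "Ne": 0.018, "Neg": 0.016}

print(sum(positive_prefixes.values()))  # ~0.64: chance of starting "Positive"
print(sum(negative_prefixes.values()))  # ~0.36: chance of starting "Negative"
```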

Ideally, the "Raw Tokens" are closely aligned with the "Guided Tokens". To accomplish this, we update the prompt as follows:

```python
prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?
Review: This restaurant is just awesome!
Review label:
"""
```

This change results in substantially better "Raw Tokens" and an accurate label: the model now assigns ~100% probability to "Positive", whereas previously the chance of "Positive" was only ~64%:

```
Selected: 'Pos' for batch_item=0
Top Raw Tokens: 'Pos': 0.858, 'POS': 0.060, '\n': 0.031, 'pos': 0.022, '+': 0.007, '**': 0.007, 'The': 0.004, 'Pos': 0.002, EOS: 0.002
Top Guided Tokens: 'Pos': 1.000, 'P': 0.000, 'Po': 0.000, 'Neg': 0.000, 'Ne': 0.000, 'N': 0.000, 'P': 0.000, 'N': 0.000, EOS: -0.000
```