---
title: Debugging
---

# Debugging

Language models are a new and complex area of research, and adding constrained generation to them can bring up unexpected issues. The debugging methods documented here help determine the source(s) of a problem.

## Understanding Generation Quality Issues

Language models assign a probability to every possible next token based on the full sequence of preceding tokens. In Outlines, we can either choose the highest-probability next token ("Greedy") or sample randomly, weighted by token probability ("Multinomial").
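
For instance, the sampling strategy can be chosen when building a generator. The sketch below assumes the `greedy()` and `multinomial()` helpers from `outlines.samplers`; exact names and defaults may vary across versions:

```python
import outlines
from outlines import samplers

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

# Greedy decoding: always take the single most probable next token.
greedy_generator = outlines.generate.choice(
    model, ["Positive", "Negative"], sampler=samplers.greedy()
)

# Multinomial sampling: draw the next token in proportion to its probability.
multinomial_generator = outlines.generate.choice(
    model, ["Positive", "Negative"], sampler=samplers.multinomial()
)
```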

If the output quality is lacking, the model (or the prompt) might not be well suited to your particular use case. Logging the next-token probabilities derived from the model's "logits" can help with troubleshooting. This can be accomplished via `outlines.logging.enable_logits_logging()`.

_(Note: `enable_logits_logging()` will slow down generation and shouldn't be used in production.)_

### Example

In this debugging example we attempt to extract sentiment from a restaurant review, but the model initially struggles.

```python
import outlines
import outlines.logging

# Enable logits logging before generating so that next-token
# probabilities are printed at each step.
outlines.logging.enable_logits_logging()

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?
Review: This restaurant is just awesome!
"""

# Constrain generation to exactly one of the two labels.
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
```

#### Output (click to expand)

<details>

```
Selected: 'N' for batch_item=0
Top Raw Tokens: 'The': 0.560, '\n': 0.212, 'I': 0.106, 'Great': 0.017, 'From': 0.017, 'We': 0.013, 'F': 0.011, 'They': 0.006, EOS: 0.006
Top Guided Tokens: 'P': 0.456, 'N': 0.330, 'Pos': 0.173, 'Ne': 0.018, 'Neg': 0.016, 'Po': 0.007, 'P': 0.000, 'N': 0.000, EOS: -0.000
Selected: 'ega' for batch_item=0
Top Raw Tokens: 'ice': 0.964, 'ut': 0.006, 'at': 0.006, 'ood': 0.004, 'ic': 0.003, 'ick': 0.002, 'ear': 0.002, 'umer': 0.002, EOS: 0.002
Top Guided Tokens: 'eg': 0.514, 'ega': 0.419, 'e': 0.065, 'e': 0.001, '\x00': 0.000, '\x04': 0.000, '': 0.000, '\x01': 0.000, EOS: -0.000
Selected: 't' for batch_item=0
Top Raw Tokens: 'ive': 0.442, 'ative': 0.381, 'тив': 0.054, ':': 0.030, 'iv': 0.006, 'itive': 0.006, 'Review': 0.005, 'ativ': 0.003, EOS: 0.003
Top Guided Tokens: 't': 0.903, 'ti': 0.097, 't': 0.000, '\x04': 0.000, '\x00': 0.000, '': 0.000, '\x01': 0.000, '\x02': 0.000, EOS: -0.000
Selected: 'ive' for batch_item=0
Top Raw Tokens: 'ive': 0.993, 'ion': 0.005, 'iv': 0.002, 've': 0.000, 'ivity': 0.000, 'ively': 0.000, 'ives': 0.000, 'if': 0.000, EOS: 0.000
Top Guided Tokens: 'ive': 0.998, 'iv': 0.002, 'i': 0.000, 'i': 0.000, '\x00': 0.000, '\x04': 0.000, '': 0.000, '\x01': 0.000, EOS: -0.000
Selected: '' for batch_item=0
Top Raw Tokens: ':': 0.625, '\n': 0.227, 'or': 0.073, ',': 0.022, '/': 0.016, '.': 0.008, '?': 0.007, '-': 0.003, EOS: 0.003
Top Guided Tokens: EOS: 1.000, '': 0.000, '\x04': 0.000, '\x01': 0.000, '\x00': 0.000, '': 0.000, '\x02': 0.000, '\x03': 0.000
```

</details>

#### Analysis

The model incorrectly classified the review as "Negative".

We can observe in the "Raw Tokens" line that, before generation was constrained to "Positive" / "Negative", the most likely next token was `The`, and the tokens that lead to legal generations had very low probabilities:

```
Top Raw Tokens: 'The': 0.560, '\n': 0.212, 'I': 0.106, 'Great': 0.017, 'From': 0.017, 'We': 0.013, 'F': 0.011, 'They': 0.006, EOS: 0.006
Top Guided Tokens: 'P': 0.456, 'N': 0.330, 'Pos': 0.173, 'Ne': 0.018, 'Neg': 0.016, 'Po': 0.007, 'P': 0.000, 'N': 0.000, EOS: -0.000
```
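
At each step, the "Guided Tokens" distribution is derived from the raw one by masking out tokens the constraint disallows and renormalizing what remains, which is why low raw probabilities on legal tokens signal a mismatch between prompt and constraint. Below is a minimal sketch of that masking step (illustrative only; the function name is ours, and Outlines' actual implementation is more involved):

```python
import torch

def guided_probabilities(logits: torch.Tensor, allowed_ids: list[int]) -> torch.Tensor:
    """Mask logits of disallowed tokens, then renormalize with softmax."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0  # constraint-legal tokens keep their raw logits
    return torch.softmax(logits + mask, dim=-1)

# Example: a 5-token vocabulary where only token ids 1 and 3 are legal.
raw_logits = torch.tensor([2.0, -1.0, 0.5, -1.5, 0.1])
print(guided_probabilities(raw_logits, allowed_ids=[1, 3]))
# All probability mass is redistributed over ids 1 and 3.
```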

Ideally, the Raw Tokens align closely with the Guided Tokens. To accomplish this, we update the prompt as follows:

```python
prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?
Review: This restaurant is just awesome!
Review label:
"""
```

This change results in substantially better "Raw Tokens" and an accurate label. The model now assigns ~100% probability to "Positive", whereas previously the combined probability of the tokens leading to "Positive" was only ~64%:

```
Selected: 'Pos' for batch_item=0
Top Raw Tokens: 'Pos': 0.858, 'POS': 0.060, '\n': 0.031, 'pos': 0.022, '+': 0.007, '**': 0.007, 'The': 0.004, 'Pos': 0.002, EOS: 0.002
Top Guided Tokens: 'Pos': 1.000, 'P': 0.000, 'Po': 0.000, 'Neg': 0.000, 'Ne': 0.000, 'N': 0.000, 'P': 0.000, 'N': 0.000, EOS: -0.000
```
