
[Performance]: guided generation is very slow in offline mode #8313

Open · stas00 opened this issue Sep 10, 2024 · 18 comments
Labels: performance (Performance-related issues)

stas00 (Contributor) commented Sep 10, 2024

Proposal to improve performance

With a single request in online mode I'm getting:

  • no guided generation: 300 tok/sec
  • outlines: 150 tok/sec (2x slower)
  • lm-format-enforcer: 90 tok/sec (~3x slower)

With offline mode I get:

  • outlines is about 10-20x slower than no guided generation
  • lm-format-enforcer is about 4x faster than outlines (note that it is slower than outlines online)

For online I was using this schema:

json_template = {
    "type": "object",
    "properties": {
        "criteria": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "response": { "type": "string" }
    },
    "required": ["criteria", "response"]
}
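
(For reference, a sketch of how such a schema can be passed to vllm's OpenAI-compatible server via the guided_json extra; the model name, port and prompt below are placeholders, not my actual setup:)

# sketch only: placeholder model name, port and prompt; guided_json / guided_decoding_backend
# are vllm extensions accepted via extra_body on the OpenAI-compatible server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="placeholder-model-name",
    messages=[{"role": "user", "content": "Evaluate the response against the criteria."}],
    temperature=0.0,
    max_tokens=512,
    extra_body={
        "guided_json": json_template,
        "guided_decoding_backend": "outlines",  # or "lm-format-enforcer"
    },
)
print(completion.choices[0].message.content)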

For offline I was using an even simpler schema:


{
   "type":"object",
   "properties":{
      "name":{
         "type":"string", "minLength":2, "maxLength":5
      },
      "age":{
         "type":"integer"
      }
   },
   "required":[ "name", "age"]
}

The huge performance hit in offline mode is very strange for both backends.

The 2x slowdown in online mode is pretty bad too, as it's already a huge impact. The offline mode can actually tolerate 2x without a problem since there is no human in the loop, but 10-20x is completely impractical.

vllm==0.6.0 and outlines==0.0.46

stas00 added the performance label on Sep 10, 2024
@Quang-elec44 commented:

@stas00 In my experience, guided generation is always slower than normal. I recommend you try sglang instead. Sglang achieves better throughput than vLLM, but the guided generation is still slower.

stas00 (Contributor, Author) commented Sep 10, 2024

Thank you for this suggestion, @Quang-elec44 - I understand that it'll be slower, but it should be marginally slower, not 20x slower. Possibly some problem in the integration?

@robertgshaw2-neuralmagic (Sponsor Collaborator) commented:

> Thank you for this suggestion, @Quang-elec44 - I understand that it'll be slower, but it should be marginally slower, not 20x slower. Possibly some problem in the integration?

@stas00 is this a new issue on v0.6.0?

@robertgshaw2-neuralmagic (Sponsor Collaborator) commented:

@stas00 - looks like someone has just fixed the issue:

@robertgshaw2-neuralmagic (Sponsor Collaborator) commented:

> @stas00 In my experience, guided generation is always slower than normal. I recommend you try sglang instead. Sglang achieves better throughput than vLLM, but the guided generation is still slower.

Hey @Quang-elec44 - this is actually not true as of vllm's current release: https://blog.vllm.ai/2024/09/05/perf-update.html

We also have more performance optimizations that will come out in v0.6.0:

stas00 (Contributor, Author) commented Sep 10, 2024

> @stas00 is this a new issue on v0.6.0?

no, same with older versions - e.g. 0.5.5

> looks like someone has just fixed the issue:

Robert, I have already tried it to no avail.

I'm working on a reproducible test case - will share as soon as I have it.

@robertgshaw2-neuralmagic (Sponsor Collaborator) commented:

Thanks @stas00 - @simon-mo can you copy in the folks who implemented the guided decoding?

Lap1n commented Sep 10, 2024

@stas00 Have you tried (or can you try) vllm v0.5.2 to see if you observe the same performance issue?

In my investigation, I found a regression in performance from v0.5.2 -> 0.5.3 which this PR fixes.

However, in my benchmarks, there seems to be another significant performance regression in guided generation with Outlines from 0.5.3.post1 to 0.5.4 that I have not investigated yet.

Note that in my tests, I only tested in online mode.

stas00 (Contributor, Author) commented Sep 10, 2024

@Lap1n, I can't try v0.5.2 since it didn't support guided generation in offline mode.

But I think the problem here is something entirely different; I'm trying to dig to the root of it.

As I'm writing the offline repro scripts w/ TinyLlama-1.1B-Chat-v0.6, vllm is only 10% slower w/ outlines than w/o it once the FSM has been compiled - which is absolutely normal and expected. So something else is going on - I will update once I have a clearer picture.

It's possible that there are multiple issues here, and that somehow the big overhead is model-dependent (even though in theory the type or size of the model shouldn't matter at all overhead-wise).

stas00 (Contributor, Author) commented Sep 10, 2024

@robertgshaw2-neuralmagic, do you know why the outlines cache doesn't get used and it instead rebuilds the same FSM? E.g. here is one of the repro scripts I created:

import time

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

schema = '{ "type": "object", "properties": { "name": { "type": "string", "minLength": 2,  "maxLength": 5 }, "age": { "type": "integer"} }, "required": ["name", "age"] }'

model = LLM(
    model=model_name_or_path,
    tokenizer=model_name_or_path,
    tokenizer_mode="auto",
    tensor_parallel_size=1,
    trust_remote_code=True,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    guided_decoding_backend="outlines",
)

prompt = "Give an example of a person's profile that fits this JSON schema: {schema}"

sampling_params = SamplingParams(
    temperature=0.0,
    seed=42,
    max_tokens=2048,
)
kwargs = dict(
    sampling_params=sampling_params,
    guided_options_request=dict(
        guided_json=schema,
    ),
)

# warmup
output = model.generate(prompt, **kwargs)

response = ''
start_time = time.time()
for i in range(5):
    output = model.generate(prompt, **kwargs)
    response += output[0].outputs[0].text
end_time = time.time()

print(response)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

response_tokens = tokenizer.encode(response)
response_tokens_len = len(response_tokens)
#print(response_tokens)

duration = end_time - start_time
print(f"decode had {response_tokens_len} tokens")
print(f"decode latency: {duration:0.2f}secs")
print(f"decode throughput: {response_tokens_len/duration:0.2f} tok/sec")

W/o warmup we get about 6 tok/sec due to the FSM setup overhead; if we warm up it's 166 tok/sec.

Re-running it a 2nd time w/o warmup repeats the same recompilation - shouldn't that be pulled from the cache and reused, rather than recompiled?
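
To check whether anything is actually being persisted between runs, here is a small sketch that inspects outlines' on-disk cache directory (assuming the default location ~/.cache/outlines, overridable via the OUTLINES_CACHE_DIR env var - adjust the path if that assumption is wrong):

# sketch: inspect outlines' on-disk cache between runs
# (assumes the default cache dir ~/.cache/outlines, overridable via OUTLINES_CACHE_DIR)
import os
from pathlib import Path

cache_dir = Path(os.environ.get("OUTLINES_CACHE_DIR", str(Path.home() / ".cache" / "outlines")))
if cache_dir.exists():
    files = [f for f in cache_dir.rglob("*") if f.is_file()]
    total_mb = sum(f.stat().st_size for f in files) / 1e6
    print(f"{cache_dir}: {len(files)} files, {total_mb:.1f} MB")
else:
    print(f"{cache_dir} does not exist - nothing was cached to disk")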

Lap1n commented Sep 10, 2024

@stas00 I also hit the same issue; it seems to be a problem with the format of the key that Outlines uses to store cached data. I opened a PR on Outlines' repo a few days ago to fix this.

stas00 (Contributor, Author) commented Sep 10, 2024

So I'm waiting for my colleague to give me a repro case so that I can narrow it down for him.

Meanwhile, why is this pure outlines script so slow?

import outlines
import time
from transformers import AutoTokenizer
import json

model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

schema = '{ "type": "object", "properties": { "name": { "type": "string", "minLength": 2,  "maxLength": 5 }, "age": { "type": "integer"} }, "required": ["name", "age"] }'

model = outlines.models.transformers(model_name_or_path)
generator = outlines.generate.json(model, schema)

prompt = "Give an example of a person's profile that fits this JSON schema: {schema}"

# warmup
response = generator(prompt)

# real
response = ''
start_time = time.time()
for i in range(5):
    response += json.dumps(generator(prompt))
end_time = time.time()
print(response)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

response_tokens = tokenizer.encode(response)
response_tokens_len = len(response_tokens)
print(response_tokens)

duration = end_time - start_time
print(f"decode had {response_tokens_len} tokens")
print(f"decode latency: {duration:0.2f}secs")
print(f"decode throughput: {response_tokens_len/duration:0.2f} tok/sec")

I wrote it first to check that standalone outlines was fast, to use it as a baseline before doing the exact same generation with vllm+outlines as I shared here:
#8313 (comment)

But actually I discovered that it's very slow!

On A100:

$ python outlines-direct.py
{"name": "John", "age": 29}{"name": "John", "age": 30}{"name": "John", "age": 50}{"name": "Dan", "age": 20}{"name": "John", "age": 30}
decode had 70 tokens
decode latency: 15.89secs
decode throughput: 4.41 tok/sec

$ python outlines-vllm.py 
{"name":"John","age":25}{"name":"John","age":25}{"name":"John","age":25}{"name":"John","age":25}{"name":"John","age":25}
decode had 51 tokens
decode latency: 0.28secs
decode throughput: 181.05 tok/sec

That's a 40x difference! Perhaps this is what we are hitting, except in reverse - here vllm+outlines is 40x faster than outlines on its own.

edit: I found the issue with the direct outlines use - I had to put it on cuda; it was 40x slower on cpu:

model = outlines.models.transformers(model_name_or_path, device='cuda:0')

now it's 2x slower than vllm integration:

decode had 67 tokens
decode latency: 0.75secs
decode throughput: 89.89 tok/sec

not sure why this time.

stas00 (Contributor, Author) commented Sep 10, 2024

The other mismatch I noticed is that vllm strips the spaces between JSON elements:

# outlines:
{"name": "John", "age": 29}{"name": "John", "age": 30}

# vllm:
{"name":"John","age":25}{"name":"John","age":25}

I guess that saves a few tokens from needing to be generated w/o compromising the output structure - nice!
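
For anyone curious, here is a sketch of how to inspect the regex that gets built for a given whitespace handling (assuming outlines' build_regex_from_schema accepts an optional whitespace_pattern argument, as it does in outlines==0.0.46):

# sketch: compare the regex outlines builds with default whitespace handling vs. none at all
# (assumes outlines.fsm.json_schema.build_regex_from_schema(schema, whitespace_pattern=None))
from outlines.fsm.json_schema import build_regex_from_schema

schema = '{ "type": "object", "properties": { "name": { "type": "string" }, "age": { "type": "integer" } }, "required": ["name", "age"] }'

default_regex = build_regex_from_schema(schema)                         # default (optional) whitespace
compact_regex = build_regex_from_schema(schema, whitespace_pattern="")  # no whitespace allowed at all

print(default_regex)
print(compact_regex)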

stas00 (Contributor, Author) commented Sep 10, 2024

OK, I see that vllm isn't using the outlines.generate.json generator but builds on the regex engine instead, which has multiple issues:

  1. Why does it recompile the regex on every request?

regex_string = build_regex_from_schema(schema_str, whitespace_pattern)

This should be cached and reused across subsequent requests in the same run! (See the sketch further down in this comment.)

To see that it recompiles, run my script here: #8313 (comment)
and add a print on that line (line 142) - you will see it gets called 6 times for 6 requests.

  2. This leads to vllm+outlines failing to complete the last characters of the json when max_tokens isn't long enough, whereas outlines.generate.json seems to always return the correct (complete) json.

The problem here is that if the generator runs out of max_tokens and the json structure isn't closed - i.e. missing, say, the very last } - vllm will just return a broken json. So if a regex is used, vllm has to know to back off and force the json closure before it runs out of max_tokens. Currently I have to retry multiple times before all prompts generate the correct structure, which sort of defeats the purpose of guided generation as it takes much, much longer.
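
Regarding point 1, this is roughly the kind of per-process caching I'd expect around the regex build (a hypothetical wrapper for illustration, not vllm's actual code; it assumes build_regex_from_schema is deterministic for a given schema / whitespace_pattern pair):

# hypothetical wrapper, for illustration only - not vllm's actual code
# (assumes build_regex_from_schema is deterministic for a given schema/whitespace_pattern pair)
from functools import lru_cache
from typing import Optional

from outlines.fsm.json_schema import build_regex_from_schema


@lru_cache(maxsize=128)
def cached_regex_from_schema(schema_str: str, whitespace_pattern: Optional[str] = None) -> str:
    # only recompiles when a new (schema, whitespace_pattern) pair is seen
    return build_regex_from_schema(schema_str, whitespace_pattern)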

edit: this problem, with repro code, is now filed here: #8350

@robertgshaw2-neuralmagic (Sponsor Collaborator) commented:

@stas00 - Thank you for the detailed analysis here!

QQ - are you planning to open up a PR to fix this? We would definitely appreciate a contribution if you have the bandwidth

stas00 (Contributor, Author) commented Sep 11, 2024

> QQ - are you planning to open up a PR to fix this? We would definitely appreciate a contribution if you have the bandwidth

The summary of things so far:

  1. One problem proved to be a repeated recomputation of the FSM in outlines because the cache was being ignored in the vllm integration. @Lap1n has proposed a PR to fix it: Make Outlines' cache reusable across startup by making the cache's key as string (dottxt-ai/outlines#1129)

  2. The other issue is the re-compiling of the regex on each request - I proposed adding it to this other PR by @Lap1n: [Bugfix] Reenable LRU cache on Outlines' guide getters #8308 (comment), and opened an issue: [Performance]: JSONLogitsProcessor repeats the same build_regex_from_schema again and again #8383

  3. The truncated JSON is being actively discussed in this issue: [Bug]: guided generation can't always finish generating the requested structure #8350

For the bigger issue that started this thread, I'm blocked on my colleague providing a larger repro test case, so that I can verify it and, if confirmed, reduce it to a small repro - please bear with me until they come through.

@Quang-elec44 commented:

> @stas00 In my experience, guided generation is always slower than normal. I recommend you try sglang instead. Sglang achieves better throughput than vLLM, but the guided generation is still slower.

> Hey @Quang-elec44 - this is actually not true as of vllm's current release: https://blog.vllm.ai/2024/09/05/perf-update.html
>
> We also have more performance optimizations that will come out in v0.6.0:

Thanks for the information. I know that vllm==0.6.0 made strong progress compared to sglang==0.3.0. However, in my experiments, sglang still achieves better throughput (my GPUs are A30). I may spend some time running the benchmark on my machine to take a closer look.

@aabbccddwasd commented:

any progress?
