Make Outlines' cache reusable across startup by making the cache's key as string #1129

Open · wants to merge 10 commits into main
Conversation

@Lap1n commented Sep 4, 2024

When using vLLM and Outlines, in our use case the diskcache functionality does not seem to be working: every time the server starts, it fails to reuse the previously computed FSM cache.

One way to fix this issue is to serialize the cache key object as a string.

See this issue.
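
For readers skimming the thread, the shape of the change is roughly the following sketch (the cache directory and decorator name are illustrative, not the actual patch):

```python
from diskcache import Cache

cache = Cache(".outlines_cache")  # hypothetical cache directory

def memoize_with_string_key(fn):
    # Sketch of the PR's idea: store results under a plain string key so
    # diskcache compares text rather than pickled bytes, which can differ
    # across interpreter runs for otherwise-equal argument objects.
    # (Assumes every argument has a stable str() representation.)
    def wrapper(*args, **kwargs):
        key = str((fn.__qualname__, args, tuple(sorted(kwargs.items()))))
        if key in cache:
            return cache[key]
        result = fn(*args, **kwargs)
        cache[key] = result
        return result

    return wrapper
```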

@Lap1n marked this pull request as ready for review September 4, 2024 20:03
@Lap1n changed the title from "fix: cache key as string" to "Make Outlines' cache reusable across startup by making the cache's key as string" on Sep 9, 2024
@stas00 commented Sep 10, 2024

I second this: with vllm, the cache is being ignored and the FSM is recompiled on every use.

I manually applied your patch and validated that it works. It shaves about 3 seconds off the first call, which previously took about 4 seconds.

Thank you for the fix, @Lap1n.

@cpfiffer (Contributor)

Is there a way to add a test for this issue?

@stas00 commented Sep 11, 2024

That's a great idea: run the request twice, capture the log, and check that the "Compiling FSM index for all state transitions" message appears only once.
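
A rough sketch of that test, assuming the message is emitted through Python's logging module; `generate_with_fsm` is a placeholder for whatever entry point triggers FSM compilation:

```python
import logging

def test_fsm_compiled_only_once(caplog):
    with caplog.at_level(logging.INFO):
        generate_with_fsm("same request")  # first call compiles the FSM
        generate_with_fsm("same request")  # second call should hit the cache
    compiles = [
        r for r in caplog.records
        if "Compiling FSM index for all state transitions" in r.getMessage()
    ]
    assert len(compiles) == 1
```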

@cpfiffer (Contributor)

Seems like it's prone to failure if the message ever changes. I'm not super familiar with this part of the codebase, but could we just check the cache directly somehow?

@Lap1n (Author) commented Sep 11, 2024

Good idea, I am looking into adding a test case for it.

@stas00 commented Sep 11, 2024

Seems like it's prone to failure if the message ever changes.

"Compiling FSM index for all state transitions" is printed by outlines so there should be no problem whatsoever to anchor on this string, since if it changes, the test will fail and it will be fixed in the same PR changing the info log.

But of course this is just a proposal; if there are other ways, that works for me too.

@cpfiffer (Contributor)

"Compiling FSM index for all state transitions" is printed by outlines so there should be no problem whatsoever to anchor on this string, since if it changes, the test will fail and it will be fixed in the same PR changing the info log.

Imo this is just kicking the can down the road. We should try to find a good test case early so we don't run into this again later. If it's super hard to do, it's super hard to do, that's fine -- but it is a band-aid test rather than a solid one.

@Lap1n (Author) commented Sep 12, 2024

@cpfiffer I added a test that fails on main and passes on this PR and in vLLM. It seems that the issue is with using a class method as the cached function (vLLM uses it similarly).

@cpfiffer (Contributor) left a review comment

Awesome, thank you! I think this should work. Let me get eyes from the outlines adults.

@brandonwillard (Member) commented Sep 12, 2024

It seems that the issue is with using a class method as the cached function (vLLM uses it similarly).

What is it about a class method that's causing this? Is it adding the class instance argument to the cache information/key?

@Lap1n (Author) commented Sep 12, 2024

Yes, and in addition, re-instantiating the object arguments also causes the issue, probably because the binary data differs even though the object attributes are the same. That is why it is most likely safer to serialize the args as a string rather than as a binary key. In short, I was able to reproduce the issue using a classmethod and objects as arguments.
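
One concrete illustration (a hypothetical class, not from the PR) of how "same attributes, different bytes" can happen: pickling a set attribute, whose iteration order depends on per-process string hash randomization:

```python
import pickle

class TokenizerConfig:  # hypothetical object passed as a cache argument
    def __init__(self):
        # Iteration order of a set of strings depends on PYTHONHASHSEED,
        # so the pickled bytes below can change between interpreter runs
        # even though the attribute values are identical.
        self.special_tokens = {"<s>", "</s>", "<pad>"}

print(pickle.dumps(TokenizerConfig().__dict__).hex())
# Run this script twice: the output may differ, so a cache keyed on the
# pickled bytes would miss on every restart.
```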

@brandonwillard (Member)

That is why it is most likely safer to serialize the args as a string rather than as a binary key. In short, I was able to reproduce the issue using a classmethod and objects as arguments.

From what I recall about the caching setup/dependencies, forcing conversion to a string is problematic in different ways. For example, it removes the ability to set and/or use the underlying serialization implementations for the objects involved, and it leans on __str__ implementations that aren't usually designed for serialization and the like.
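
To make the __str__ concern concrete (hypothetical class): the default implementation embeds the object's memory address, which changes on every run.

```python
class Guide:  # no custom __str__, so the default object repr is used
    def __init__(self, regex):
        self.regex = regex

print(str(Guide(r"\d+")))
# e.g. '<__main__.Guide object at 0x7f3a2c1d5f10>' -- the address changes
# on every run, so a string key built from it would never match again.
```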

@Lap1n (Author) commented Sep 12, 2024

I see, but it would still be better than the current implementation, which does not use the underlying serialization implementations either and instead stores the key as a binary Python object (which is most likely unsafe), correct? Please correct me if I missed something; I'm new to diskcache's internals.

@brandonwillard (Member) commented Sep 12, 2024

I see, but it would still be better than the current implementation, which does not use the underlying serialization implementations either and instead stores the key as a binary Python object (which is most likely unsafe), correct?

That's not clear. It could start failing again if a __str__ implementation changes in any of the objects involved, among other things.

It sounds like we need to directly address the issues involved with applying this to a method.

@Lap1n (Author) commented Sep 12, 2024

But if a __str__ implementation changes (and I guess it would be OK for the cache to fail then), wouldn't it fail with the binary implementation as well?

@Lap1n (Author) commented Sep 12, 2024

I understand your point; we could indeed look for a better solution. I didn't see this fix as a final one either, but mostly as an intermediate one: it supports all the features of the previous approach, with the additional benefit of using serialized data as the key, and it makes vLLM with Outlines usable in a production setting (with autoscaling, etc.), which it currently is not. FSM recompilation can take over a minute for some prompts.

Alternatively, another quick fix could be to modify how vLLM uses Outlines' cache.

@brandonwillard (Member)

Alternatively, another quick fix could be to modify how vLLM uses Outlines' cache.

That sounds like the best approach, because—otherwise—this becomes a feature request to make cache support [class]methods.

If we're specifically talking about vllm's use of *.outlines_logits_processors.*LogitsProcessor._get_guide, it seems like this change could be very straightforward, as well.

@Lap1n (Author) commented Sep 16, 2024

@brandonwillard it seems the error is not related to class methods after all. I've tried extracting the classmethod into an external function and it still doesn't work in vLLM (code here and here). It seems that it is very hard to reproduce with unit tests, but fairly easy when running and restarting any Python script that uses Outlines' cache.

I wonder if it could be related to Diskcache's documented caveats about its default key serialization implementation.

Thoughts?

@brandonwillard (Member)

@brandonwillard it seems the error is not related to class methods after all. I've tried extracting the classmethod into an external function and it still doesn't work in vLLM (code here and here). It seems that it is very hard to reproduce with unit tests, but fairly easy when running and restarting any Python script that uses Outlines' cache.

We'll need to know exactly where the cache misses are. There's a nested application of the cache in the call to create_states_mapping from RegexGuide.__init__, so we need to know if the misses you're observing are there or in vllm's direct use of cache. Also, the latter changes the tokenizer object through _adapt_tokenizer, which could affect the hashing.
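
One quick, unofficial way to see exactly where the misses happen is to wrap the cache and log every lookup, for example:

```python
import logging

from diskcache import Cache

logger = logging.getLogger("cache-debug")

class LoggingCache(Cache):
    # Debugging aid only: report every hit and miss so the failing call
    # site (RegexGuide.__init__ vs. vllm's direct use) becomes visible.
    def __getitem__(self, key):
        try:
            value = super().__getitem__(key)
        except KeyError:
            logger.info("cache MISS: %r", key)
            raise
        logger.info("cache HIT: %r", key)
        return value
```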

@brandonwillard (Member)

I wonder if it could be related to Diskcache's documented caveats about its default key serialization implementation.

This mostly emphasises the need to implement consistent serialization for the objects involved; otherwise, I don't remember where the cloudpickle we're using stands with regard to the tuple-related issues.

@brandonwillard (Member) commented Sep 16, 2024

@Lap1n, I just noticed that cache is being applied to a tokenizer argument in vllm that's typed as PreTrainedTokenizerBase. In outlines, the tokenizer arguments are of Tokenizer type, and, more specifically, the TransformerTokenizer subtype that has custom serialization methods (i.e. __[get|set]state__ methods). That alone could be the issue (either from within outlines itself or vllm).

In other words, vllm may need to wrap those PreTrainedTokenizerBase arguments in a type that implements a cacheable serialization like TransformerTokenizer.

Are you able to reproduce this with only outlines, and, if so, which tokenizer? It could be that the transformers serialization we're effectively leaning on isn't working for this use case.
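
A hypothetical sketch of the wrapping described above, loosely in the spirit of TransformerTokenizer's custom serialization (the attribute choices are assumptions, not a drop-in fix):

```python
from transformers import PreTrainedTokenizerBase

class CacheableTokenizer:
    """Wraps a HF tokenizer so equal tokenizers pickle to equal bytes,
    which is what a pickle-based cache key needs."""

    def __init__(self, tokenizer: PreTrainedTokenizerBase):
        self.tokenizer = tokenizer

    def __getstate__(self):
        # Serialize only stable, value-like state for key derivation;
        # the live object's internals are not deterministic across runs.
        return {
            "name_or_path": self.tokenizer.name_or_path,
            "vocab": sorted(self.tokenizer.get_vocab().items()),
        }
```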

@Lap1n (Author) commented Sep 17, 2024

@brandonwillard I see. I previously tried Hugging Face tokenizers as arguments (e.g. the tokenizer for Llama 3 8B) in the test case when I was trying to reproduce the issue, but the current implementation seems to work with them as well 🤔

I will now try to investigate on the vLLM side to see whether it could be a tokenizer issue.

Thanks for investigating btw!

@brandonwillard (Member)

Thanks for investigating btw!

Thank you for helping find and debug this! Also, don't hesitate to keep us in the loop regarding the vLLM side of things.
