Handle regexps with lookaround #456
Comments
+1, getting similar issues with even a subset of JSON that's flat key/value pairs only. This should be high priority in my opinion, for supporting a realistic set of grammars.
As a quick workaround for those interested particularly in JSON, here's what I've been doing. The unsupported rule from the above grammar is the Lark implementation of
In summary, this does not globally solve the issue, and the broader problem with lookarounds remains open, but hopefully it can help folks who are running into this in the meantime. For other languages whose string implementations require other characters to be escaped, this approach can be adjusted quickly as well.
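For illustration, a string terminal that handles escapes without any lookaround might look like the sketch below. This is not necessarily the pattern used in the workaround above: the regex and the helper name `is_string_literal` are assumptions made for the example, and only Python's standard `re` module is used.

```python
import re

# Lookaround-free double-quoted string: an opening quote, then any mix of
# (a) characters that are neither a quote nor a backslash, or
# (b) a backslash followed by any single character (an escape),
# then a closing quote.
STRING_NO_LOOKAROUND = re.compile(r'"(?:[^"\\]|\\.)*"')


def is_string_literal(text: str) -> bool:
    """Return True if `text` is exactly one well-formed quoted string."""
    return STRING_NO_LOOKAROUND.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ['"abc"', r'"a\"b"', '"a\\"', '"a"b"']:
        print(repr(sample), is_string_literal(sample))
```

Because the pattern uses only character classes, alternation, and repetition, it stays within what an automaton-based compiler can be expected to handle.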
Yeah, we could at the very least offer replacements for the
I don't see any non-regular examples in
We could convert non-regular lark terminals to lark rules with multiple regular lark terminals programmatically at runtime, but this seems like overkill. I think we should just document the fact that "Outlines-lark" is a subset of lark which disallows non-regular terminals.

Another problem I've come across experimenting with the

We either need to monkeypatch

Concretely, the following lark definition
is compiled into
@lapp0, I don't think that compiling to a DFA precludes the handling of lookaround (from a theoretical perspective). Other features of Python regex like backreferences are indeed not regular, but lookaround is a pragmatic add-on that doesn't change expressive power. That being said, it doesn't appear trivial to "desugar" in practice, and many regex engines actually do implement lookaround with backtracking (even though it's not theoretically necessary). Some community discussions I've found on this topic:
It also looks like there's some contemporary work on actually implementing non-backtracking regex engines that can support lookaround, but what I've found so far uses a different symbolic derivative-based formalism, rather than the automata-based formalism of

All this to say, this isn't a theoretical limitation. I just don't know enough to approach this problem in practice, though. I think your suggestion of avoiding lookaround, providing alternatives for common grammars, and attempting to minimize their construction during lark compilation seems like probably the best-value strategy, and it should get things most of the way there for real use cases. Thanks to the maintainers for continuing to work on this issue!
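As a small, hand-worked illustration of the point that lookaround does not add expressive power, the sketch below compares one negative-lookahead pattern with a manually derived lookaround-free equivalent over the alphabet {a, b}. Both patterns and the helper names are assumptions chosen for the example; the equivalence is only checked by brute force up to a bounded length, and nothing here is a general desugaring procedure.

```python
import itertools
import re

# Strings over {a, b} that do not start with "ab", written with a lookahead.
WITH_LOOKAHEAD = re.compile(r"(?!ab)[ab]*")

# The same language written by hand without lookaround: the empty string,
# anything starting with "b", the single string "a", or anything starting
# with "aa".
WITHOUT_LOOKAHEAD = re.compile(r"(?:b[ab]*|a(?:a[ab]*)?)?")


def all_strings(alphabet: str, max_len: int):
    """Yield every string over `alphabet` of length 0..max_len."""
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            yield "".join(chars)


def first_disagreement(max_len: int = 8):
    """Return the first string the two patterns classify differently, or None."""
    for s in all_strings("ab", max_len):
        with_la = WITH_LOOKAHEAD.fullmatch(s) is not None
        without_la = WITHOUT_LOOKAHEAD.fullmatch(s) is not None
        if with_la != without_la:
            return s
    return None


if __name__ == "__main__":
    print(first_disagreement())
```

The manual rewrite is easy here because the lookahead sits at the start of the pattern; the difficulty raised above is doing this mechanically for arbitrary terminals.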
Thanks for the info @benlipkin, glad to see I was mistaken and look-arounds are allowed within a regular language. I've tried to integrate some common lark grammars into Outlines for #562; however, most of them explicitly or implicitly result in terminals with look-arounds. I'll look into how I might integrate lookarounds into
Fixes #823

This comment details the error: #823 (comment)

The reproduction code provided results in a JSON schema with `OneOf[pets]`:

```
class Model(BaseModel):
    pet: Union[Cat, Dog] = Field(..., discriminator='pet_type')
```

Before this PR: `OneOf` uses negative lookaheads to assert that only one schema member is included. This is illegal in `interegular`; more details are available here: #456

After this PR: `OneOf` uses or-joined non-capturing groups, which don't have the same issues with `interegular`.
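A minimal sketch of what "or-joined non-capturing groups" can look like is shown below. The helper `or_join` and the two member patterns are hypothetical and are not taken from the Outlines code base. Note that a plain alternation accepts any one member and does not by itself enforce that exactly one member matches.

```python
import re


def or_join(member_patterns):
    """Combine regex fragments as (?:A)|(?:B)|... without any lookaround."""
    return "|".join(f"(?:{pattern})" for pattern in member_patterns)


# Hypothetical member patterns standing in for two schema alternatives.
CAT = r'\{"pet_type":"cat"\}'
DOG = r'\{"pet_type":"dog"\}'

ONE_OF = re.compile(or_join([CAT, DOG]))

if __name__ == "__main__":
    print(ONE_OF.pattern)
    print(ONE_OF.fullmatch('{"pet_type":"cat"}') is not None)
    print(ONE_OF.fullmatch('{"pet_type":"bird"}') is not None)
```

Wrapping each member in `(?:...)` keeps an inner `|` from leaking into the outer alternation and avoids introducing numbered capture groups.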
What behavior of the library made you think about the improvement?
Outlines currently uses `interegular` to compile regexps to FSMs, but not all features of Python regex are supported, in particular lookarounds.

I ran into this issue when trying to test the following grammar:

In the `CFGFSM` constructor, terminal patterns are compiled to regexps so that they can later be used to initialize `RegexFSM` proposals, e.g., `RegexFSM(terminal.pattern.to_regexp())`. When the supplied regex includes lookarounds, we get an error such as the following:

How would you like it to behave?
It would seem there are at least a couple options here:
`interegular` to expand support to lookarounds. This is on the TODOs outlined in their README, and perhaps could use a push.
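Whichever option is chosen, the failure described above could be surfaced earlier by checking a terminal's regex for lookarounds before it reaches the FSM compiler. The sketch below is one assumed way to do that with the standard library only; `has_lookaround` is a hypothetical helper and is not part of Outlines, lark, or `interegular`. It scans the pattern text, skipping escaped characters and character classes so that a literal `(?=` inside `[...]` or after a backslash is not flagged.

```python
LOOKAROUND_PREFIXES = ("(?=", "(?!", "(?<=", "(?<!")


def has_lookaround(pattern: str) -> bool:
    """Return True if `pattern` contains a lookahead or lookbehind group."""
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # A ']' directly after '[' or '[^' is a literal, not a terminator.
            j = i + 1
            if j < len(pattern) and pattern[j] == "^":
                j += 1
            if j < len(pattern) and pattern[j] == "]":
                i = j
        elif pattern.startswith(LOOKAROUND_PREFIXES, i):
            return True
        i += 1
    return False


if __name__ == "__main__":
    print(has_lookaround(r'".*?(?<!\\)(\\\\)*?"'))
    print(has_lookaround(r'"(?:[^"\\]|\\.)*"'))
```

A caller could use such a check to raise an error that names the offending terminal, rather than failing deep inside regex-to-FSM compilation.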