Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finish new code gen approach for RegexCompiler / source generator #62268

Merged
merged 5 commits into from
Dec 3, 2021

Conversation

stephentoub
Copy link
Member

Contributes to #61902

This enables RegexCompiler / source generator to use the new source gen approach for all expressions except those that employ RegexOptions.RightToLeft and lookbehinds (which are implemented with RightToLeft).

In a subsequent PR, I'll delete all of the now defunct code for the old approach and also clean up the new approach, including adding a lot more comments. I wanted to get this in now though to unblock concurrent efforts to move the code over to be span-based. I also want to then spend some time looking at the generated code and finding ways to clean it up, e.g. ensuring its as pay-for-play, such as not emitting code for undoing captures if we can prove there won't be any to undo. I also want to clean up some of the algorithms in a few places to avoid walking around the tree so much.

I expect a bit of a tail of bugs from this, so it'll be good to also prioritize porting over more tests from other engines, to help fill any gaps we may have.

cc: @joperezr, @danmoseley

@ghost
Copy link

ghost commented Dec 2, 2021

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Contributes to #61902

This enables RegexCompiler / source generator to use the new source gen approach for all expressions except those that employ RegexOptions.RightToLeft and lookbehinds (which are implemented with RightToLeft).

In a subsequent PR, I'll delete all of the now defunct code for the old approach and also clean up the new approach, including adding a lot more comments. I wanted to get this in now though to unblock concurrent efforts to move the code over to be span-based. I also want to then spend some time looking at the generated code and finding ways to clean it up, e.g. ensuring its as pay-for-play, such as not emitting code for undoing captures if we can prove there won't be any to undo. I also want to clean up some of the algorithms in a few places to avoid walking around the tree so much.

I expect a bit of a tail of bugs from this, so it'll be good to also prioritize porting over more tests from other engines, to help fill any gaps we may have.

cc: @joperezr, @danmoseley

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@stephentoub
Copy link
Member Author

(Investigating why this is failing only on mono)

@joperezr
Copy link
Member

joperezr commented Dec 2, 2021

I expect a bit of a tail of bugs from this, so it'll be good to also prioritize porting over more tests from other engines, to help fill any gaps we may have.

Is it reasonable (and possible) to also do a perf test dry run over this PR so that we can catch any potential regressions in perf given this is a bigger re-write? I know that in theory we expect that not to be the case, but it might be good to do a quick run in case we can catch any mishaps in terms of perf for a specific scenario.

@stephentoub
Copy link
Member Author

Is it reasonable (and possible) to also do a perf test dry run over this PR

Yup. I did one yesterday just for https://github.com/dotnet/performance/blob/1d541b2a6ea90ee2abc84d7ab876079d97fb1d1c/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L44 and the three tests there improved a few percent. I can kick off a larger one.

@joperezr
Copy link
Member

joperezr commented Dec 2, 2021

    private const string AdditionalDeclarationsPlaceholder = "<>PLACEHOLDER_FOR_ADDITIONAL_DECLARATIONS";
    private const string AdditionalDeclarationsPlaceholder = "<>PLACEHOLDER_FOR_ADDITIONAL_DECLARATIONS";

Since this specific text will be replaced later, are there any concerns in case there is a wild case where this specific string is part of the consumer's regex? #Resolved


Refers to: src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs:31 in 9dc7dc5. [](commit_id = 9dc7dc5eb30db64a9c4749677f879d2d9db93383, deletion_comment = False)

@stephentoub
Copy link
Member Author

are there any concerns in case there is a wild case where this specific string is part of the consumer's regex?

That's fair. I could add a Guid.NewGuid() to it :) Or if you have other suggestions, I'm open to them. I might be able to just note the position we're at in the underlying builder and insert the text at that position; that might be simpler.

@joperezr
Copy link
Member

joperezr commented Dec 2, 2021

I think this is already a super edge case, but adding a GUID would make it basically impossible to ever collide so that might be a good idea. The other thought is to not force all declarations to be on the top of the method, I understand that it makes the code more readable, but would it be possible instead to just declare the variables we need at the time we need them?

@stephentoub
Copy link
Member Author

stephentoub commented Dec 2, 2021

I understand that it makes the code more readable, but would it be possible instead to just declare the variables we need at the time we need them?

I actually prefer the readability when they are declared on first use, and that's how I initially had it. But the branching that ends up being employed by backtracking results in cases where, even though we know variables will have already been initialized, the C# compiler still complains. Consider a case like:

using System;

public class C
{
    public void M()
    {
        int a = 0;
        goto Switch;

        Zero:
        a = 1;
        int value = 42;
        goto Switch;
        
        One:
        Console.WriteLine(value);
        a = 2;
        goto Switch;
        
        Switch:
        switch (a)
        {
            case 0: goto Zero;
            case 1: goto One;
        }
    }
}

There's no way to end up at that Console.WriteLine(value) without having gone through the int value = 42; line, but the compiler can't see that and ends up failing compilation with an error CS0165: Use of unassigned local variable 'value' error.

My solution then was to force the declarations to the beginning of the method and default initialize them.

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there are a lot of changes here it is possible I may have missed something, but after following the changes going into the Emitter and compiler, things look good and I also tested at least the source generator changes here and even though we could probably simplify some things in the future, I'm happy with the way backtracking is looking now.

@stephentoub
Copy link
Member Author

Great, thanks for reviewing. I'll merge this and have another PR out soon.

@stephentoub stephentoub merged commit fd1a62c into dotnet:main Dec 3, 2021
@stephentoub stephentoub deleted the finishregexcodegen branch December 3, 2021 00:13
@joperezr
Copy link
Member

joperezr commented Dec 3, 2021

Thanks for getting this in, I will start working on the Go methods for these two engines.

@ghost ghost locked as resolved and limited conversation to collaborators Jan 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants