Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaces '.Replace' calls with a Regex #72

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

GStefanowich
Copy link
Contributor

@GStefanowich GStefanowich commented Mar 25, 2024

I removed the block of .Replace calls from the AddressExtractor class.

Not that this is a corporate environment with a bunch of people editing over eachother, but I wrote the filter system so that there wasn't a spaghetti mess of things overlapping (Including people trying to make overlapping edits). Each filter either accepts or denies an email as valid.


I replaced the block of Replaces with a new Filter, so if an address starts with an illegal character ('!`{#\n\) it'll trim them from the start. (Rather than a full replace).

Then if there was a replace it'll return REVALIDATE which will refilter it against other filters to make sure it's still valid.


With filters if one filter removes an illegal character, eg -, and another filter checks for valid domains, then you might have something where you pass in the email @my-test.dom-ain. The first filter might remove our "illegal" - character which results with @mytest.domain, and then the second filter will deny .domain for being an invalid TLD.


For the remaining test you last added EmailAddressesInExtracted with the email foo|test@example.com|bar, the Filter system needs to be modified because you can only currently return one modified Email address. A filter could see | as an illegal character, run a .Split('|') for ["foo", "test@example.com", "bar"] and then run each through the filter system again, where foo and bar would be skipped.

@troyhunt
Copy link
Contributor

With issue #61 in mind, wouldn't it be better just to treat any character not in the allow list as a terminating string that denotes either the start or the end of the address? I can't think of a downside to this, not given the specific criteria required for the email address pattern that occurs within those characters.

@GStefanowich
Copy link
Contributor Author

That's ultimately up to you on how you want to handle it.

It'll muddle up the main Regex more, and goes a bit against the design of "Loose main Regex, filter out the bad results" by making the main Regex stricter.

It could perform better with less Regexes, or it could perform better by quickly capturing loose results and only running the "Heavy" Regexes fewer times. I'm not sure and not about to implement both ways to test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants