Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly filter URL with an email address in it #2690

Open
Tracked by #39
krystofwoldrich opened this issue May 2, 2023 · 12 comments
Open
Tracked by #39

Correctly filter URL with an email address in it #2690

krystofwoldrich opened this issue May 2, 2023 · 12 comments

Comments

@krystofwoldrich
Copy link
Member

Description

A related issue in Dart.

This came up as an issue in sentry-dart but as the logic is the same it's likely an issue in sentry-java too.

We should check it and fix it if needed.

@krystofwoldrich krystofwoldrich added Type: Bug Something isn't working Platform: Java labels May 2, 2023
@adinauer
Copy link
Member

adinauer commented May 3, 2023

Java SDK is affected as well, here's a test:

    @Test
    fun `keeps email address`() {
        val urlDetails = UrlUtils.parseNullable(
            "https://staging.server.com/api/v4/auth/password/reset/email@example.com"
        )!!
        assertEquals("https://staging.server.com/api/v4/auth/password/reset/email@example.com", urlDetails.url)
        assertNull(urlDetails.query)
        assertNull(urlDetails.fragment)
    }

with result:

Expected :https://staging.server.com/api/v4/auth/password/reset/email@example.com
Actual   :https://[Filtered]@example.com

@marandaneto
Copy link
Contributor

@adinauer will you fix it right away? people that rely on transaction name and span desc from this parser, this will be broken, if you figure out how, I can just replicate on Dart.

@adinauer
Copy link
Member

adinauer commented May 8, 2023

Wasn't planning to do this soon. We can bump prio if you want.

@marandaneto
Copy link
Contributor

marandaneto commented May 8, 2023

Wasn't planning to do this soon. We can bump prio if you want.

I believe the priority is rather important, otherwise, every span and transaction name will be wrong and not identified and they all be grouped together because the filtered URL is always the same every time the parsing breaks.

@adinauer
Copy link
Member

adinauer commented May 8, 2023

Yeah, erring on the other side would mean leaking authority of URLs tho, so we have to be careful.

@marandaneto
Copy link
Contributor

Well the logic to strip out the authority is there, it's just a matter of fixing the parsing to keep the domain and its sub/folders, it's a bug in the parsing, not a matter of PII anymore.

@adinauer
Copy link
Member

adinauer commented May 8, 2023

Sure it's a bug but fixing it may break the filtering. It's not quite easy to determine whether something is part of an email or authority of an URL as there doesn't seem to be a clear set of characters we can use.

Take e.g. this URL https://staging.server.com?email=someone@example.com there's no clear way to tell whether staging.server.com?email=someone is the username of the authority section and example.com is the host or whether staging.server.com is the host and ?email=someone@example.com is the query string. There's https://www.rfc-editor.org/rfc/rfc1738 pages 18-20:

user           = *[ uchar | ";" | "?" | "&" | "=" ]
password       = *[ uchar | ";" | "?" | "&" | "=" ]
uchar          = unreserved | escape
unreserved     = alpha | digit | safe | extra
alpha          = lowalpha | hialpha
lowalpha       = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
                 "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
                 "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
                 "y" | "z"
hialpha        = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
escape         = "%" hex hex

So we could use / and # to help the regex but that doesn't solve https://staging.server.com?email=someone@example.com. Best appraoch will be to check if we can rewrite the code to use java.net.URL as it seems to be able to figure this out somehow. Haven't looked at the code in detail but seems a bit complicated at first glance.

@marandaneto
Copy link
Contributor

@adinauer yes, I'd suggest using the builtin Uri or URI class depending on what we need, on Dart, the Uri works pretty well:

  test('test', () {
    final uri =
        Uri.parse('https://staging.server.com/?email=someone@example.com');
    uri.queryParameters;

    uri.queryParameters.forEach((key, value) {
      expect(key, 'email');
      expect(value, 'someone@example.com');
    });
  });

So to be honest, this is there already, maybe we don't even need our own regexes.

@stefanosiano
Copy link
Member

@adinauer What places should we scrubble the email for? Do you have any more insight on this?

@adinauer
Copy link
Member

adinauer commented Nov 9, 2023

Should be a change in UrlUtils by using e.g. java.net.URL and letting it take care of parsing the URL.

@stefanosiano
Copy link
Member

@adinauer And is it only for OkHttp integration? Or other integrations like apollo need it?

@adinauer
Copy link
Member

adinauer commented Nov 9, 2023

This should be the same for all integrations using UrlUtils

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Backlog
Development

No branches or pull requests

5 participants