Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lint: add linter for unicode directional formatting characters #72287

Closed

Conversation

stevendanna
Copy link
Collaborator

This PR adds a linter to disallow the use of directional formatting
characters to help prevent them being used to get malicious code past
code review.

Ideally our code-review tool would highlight such characters for us
since such characters might routinely appear in binary artifacts.

See also:

Release note: None

This PR adds a linter to disallow the use of directional formatting
characters to help prevent them being used to get malicious code past
code review.

Ideally our code-review tool would highlight such characters for us
since such characters might routinely appear in binary artifacts.

See also:

- https://www.trojansource.codes/
- https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
- golang/go#20209

Release note: None
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@stevendanna
Copy link
Collaborator Author

GitHub makes this pretty clear now:

Screenshot 2021-11-01 at 23 41 54

But reviewable does not.

@stevendanna stevendanna requested a review from dt November 1, 2021 23:45
@dt
Copy link
Member

dt commented Nov 2, 2021

Nice!

Personally, I’d probably say the (new, I think?) GitHub banner is sufficient, but folks who still use reviewable might disagree, and I suppose we support that workflow. I don’t feel strongly either way so I’ll defer to @bdarnell or @knz.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm neutral on this. It's pretty cheap and we don't currently have any legitimate use case for these characters in our codebase. But it's also very narrowly targeted and I think there are plenty of other ways to subtly make code not do what it appears to (see the underhanded C contest for example). If there were a well-maintained linter against unicode shenanigans (especially something based on data directly from the unicode consortium or a project like ICU), I'd be in favor of using it. But I don't really like maintaining our own ad-hoc blocklist of unicode characters.

@@ -633,6 +633,54 @@ func TestLint(t *testing.T) {
}
})

t.Run("TestDisallowedCharacterSequences", func(t *testing.T) {
t.Parallel()
disallowedCharacterSequences := []string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this list of characters come from? We should include a link.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was all of the "explicit" formatting characters listed in: https://www.unicode.org/reports/tr9/#Directional_Formatting_Characters

I believe it also matches what rust added a check for in their advisory.

"\u202C", // POP-DIRECTIONAL-FORMATTING
"\u202D", // LEFT-TO-RIGHT-OVERRIDE
"\u202E", // RIGHT-TO-LEFT-OVERRIDE
"\u2066", // LEFT-TO-RIGHT-ISOLATE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the "isolate" characters actually dangerous? I understand the problems with the override characters, but I've also had to use them in a real project to get bidi text to render correctly. The isolate characters were introduced more recently (2013, after the last time I had to deal with these issues) and were supposed to be a safer alternative.

This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code. That may be acceptable in CockroachDB since it is already our policy that source code and comments be in English, but it would not be good if this rule spread into compilers and other language-wide tools. If we could narrow the block list to only the override and embedding characters while allowing isolates, that would be a better outcome since it could address the security concerns while still allowing all languages to be properly represented (I think. I can't read any RTL languages so there's still quite a bit of conjecture here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trojan source paper (https://trojansource.codes/trojan-source.pdf) does have an example of malicious use of right-to-left isolate. (maybe first-strong-isolate is still safe, but I'm not sure).

I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.

Copy link
Collaborator Author

@stevendanna stevendanna Nov 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code

A great point. Further, even in CRDB, it might get in the way of non-english developers who want to leave notes to themselves in the source code while working on something (even if they plan to remove those notes before submitting the PR).

I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.

This check is indeed pretty naive.

Copy link
Collaborator Author

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's also very narrowly targeted and I think there are plenty of other ways to subtly make code not do what it appears to (see the underhanded C contest for example).

💯 There are lots of ways to subvert code review.

Now that GitHub has an explicit warning about this, I think the remaining value here is pretty minimal. I think, let's hold off to see if reviewable might be able to add a similar UI indication. If they can, that seems sufficient to me.

@@ -633,6 +633,54 @@ func TestLint(t *testing.T) {
}
})

t.Run("TestDisallowedCharacterSequences", func(t *testing.T) {
t.Parallel()
disallowedCharacterSequences := []string{
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was all of the "explicit" formatting characters listed in: https://www.unicode.org/reports/tr9/#Directional_Formatting_Characters

I believe it also matches what rust added a check for in their advisory.

"\u202C", // POP-DIRECTIONAL-FORMATTING
"\u202D", // LEFT-TO-RIGHT-OVERRIDE
"\u202E", // RIGHT-TO-LEFT-OVERRIDE
"\u2066", // LEFT-TO-RIGHT-ISOLATE
Copy link
Collaborator Author

@stevendanna stevendanna Nov 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code

A great point. Further, even in CRDB, it might get in the way of non-english developers who want to leave notes to themselves in the source code while working on something (even if they plan to remove those notes before submitting the PR).

I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.

This check is indeed pretty naive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants