Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lint: add linter for unicode directional formatting characters #72287

Closed
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions pkg/testutils/lint/lint_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -633,6 +633,54 @@ func TestLint(t *testing.T) {
}
})

t.Run("TestDisallowedCharacterSequences", func(t *testing.T) {
t.Parallel()
disallowedCharacterSequences := []string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this list of characters come from? We should include a link.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was all of the "explicit" formatting characters listed in: https://www.unicode.org/reports/tr9/#Directional_Formatting_Characters

I believe it also matches what rust added a check for in their advisory.

"\u202A", // LEFT-TO-RIGHT-EMBEDDING
"\u202B", // RIGHT-TO-LEFT-EMBEDDING
"\u202C", // POP-DIRECTIONAL-FORMATTING
"\u202D", // LEFT-TO-RIGHT-OVERRIDE
"\u202E", // RIGHT-TO-LEFT-OVERRIDE
"\u2066", // LEFT-TO-RIGHT-ISOLATE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the "isolate" characters actually dangerous? I understand the problems with the override characters, but I've also had to use them in a real project to get bidi text to render correctly. The isolate characters were introduced more recently (2013, after the last time I had to deal with these issues) and were supposed to be a safer alternative.

This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code. That may be acceptable in CockroachDB since it is already our policy that source code and comments be in English, but it would not be good if this rule spread into compilers and other language-wide tools. If we could narrow the block list to only the override and embedding characters while allowing isolates, that would be a better outcome since it could address the security concerns while still allowing all languages to be properly represented (I think. I can't read any RTL languages so there's still quite a bit of conjecture here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trojan source paper (https://trojansource.codes/trojan-source.pdf) does have an example of malicious use of right-to-left isolate. (maybe first-strong-isolate is still safe, but I'm not sure).

I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.

Copy link
Collaborator Author

@stevendanna stevendanna Nov 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code

A great point. Further, even in CRDB, it might get in the way of non-english developers who want to leave notes to themselves in the source code while working on something (even if they plan to remove those notes before submitting the PR).

I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.

This check is indeed pretty naive.

"\u2067", // RIGHT-TO-LEFT-ISOLATE
"\u2068", // FIRST-STRONG-ISOLATE
"\u2069", // POP-DIRECTIONAL-ISOLATE
}
pattern := strings.Join(disallowedCharacterSequences, "|")
cmd, stderr, filter, err := dirCmd(
crdb.Dir,
"git",
"grep",
"-nE",
pattern,
"--",
":!*.woff2",
":!*.png",
":!*.tgz",
":!pkg/ccl/importccl/testdata/avro/stock-10000.bjson",
":!pkg/ccl/importccl/testdata/avro/stock-10000.ocf",
)
if err != nil {
t.Fatal(err)
}

if err := cmd.Start(); err != nil {
t.Fatal(err)
}

if err := stream.ForEach(filter, func(s string) {
t.Errorf("\n%s <- forbidden use of disallowed character sequence.", s)
}); err != nil {
t.Error(err)
}

if err := cmd.Wait(); err != nil {
if out := stderr.String(); len(out) > 0 {
t.Fatalf("err=%s, stderr=%s", err, out)
}
}
})

t.Run("TestInternalErrorCodes", func(t *testing.T) {
t.Parallel()
cmd, stderr, filter, err := dirCmd(
Expand Down