Add fast path for ASCII in UTF-8 validation #30740

Merged: 2 commits, Jan 16, 2016

bluss
Member

@bluss bluss commented Jan 6, 2016

Add fast path for ASCII in UTF-8 validation

This speeds up the ASCII case (and long stretches of ASCII in otherwise
mixed UTF-8 data) when checking UTF-8 validity.

Benchmark results suggest that on purely ASCII input, we can improve
throughput (megabytes verified / second) by a factor of 13 to 14 (smallish input).
On XML and mostly English-language input (en.wikipedia XML dump),
throughput improves by a factor of 7 (large input).

On mostly non-ASCII input, performance increases slightly or is the
same.

The UTF-8 validation is rewritten to use indexed access; since all
access is preceded by a (mandatory for validation) length check, bounds
checks are statically elided by LLVM and this formulation is in fact the best
for performance. A previous version had losses due to slice to iterator
conversions.
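
As a rough illustration of the indexed-access pattern described above (a sketch only, not the code from this PR; `continuation_bytes_ok` is a hypothetical helper): the length check comes first, since a truncated sequence is invalid anyway, and every later index is then provably in range, which gives the optimizer a chance to drop the per-access bounds checks.

```rust
/// Hypothetical helper: checks that `v[start]` is followed by `extra`
/// continuation bytes (bytes of the form 0b10xx_xxxx).
fn continuation_bytes_ok(v: &[u8], start: usize, extra: usize) -> bool {
    // The length check is needed for correctness in any case:
    // a sequence that is cut off by the end of the input is invalid.
    if start >= v.len() || v.len() - start <= extra {
        return false;
    }
    // Every index used below is < v.len() because of the check above.
    (1..=extra).all(|k| v[start + k] & 0xC0 == 0x80)
}
```

For example, the three bytes of U+20AC (`[0xE2, 0x82, 0xAC]`) pass with `extra = 2`, while the same sequence with its last byte cut off does not.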

A large credit to Björn Steinbrink who improved this patch immensely,
writing this second version.

Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3.

Old code is `regular`, this PR is called `fast`.

Datasets:

  • `ascii` is just ASCII (2.5 kB)
  • `cyr` is Cyrillic script with ASCII spaces (5 kB)
  • `dewik10` is 10MB of a de.wikipedia XML dump
  • `enwik8` is 100MB of an en.wikipedia XML dump
  • `jawik10` is 10MB of a ja.wikipedia XML dump
test from_utf8_ascii_fast        ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
test from_utf8_ascii_regular     ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
test from_utf8_cyr_fast          ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
test from_utf8_cyr_regular       ... bench:      10,944 ns/iter (+/- 795) = 468 MB/s
test from_utf8_dewik10_fast      ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
test from_utf8_dewik10_regular   ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
test from_utf8_enwik8_fast       ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
test from_utf8_enwik8_regular    ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
test from_utf8_jawik10_fast      ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
test from_utf8_jawik10_regular   ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s

Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>

@rust-highfive
Collaborator

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

@bluss
Member Author

bluss commented Jan 6, 2016

Benchmarks using long texts are here: https://gist.github.com/bluss/bf45e07e711238e22b7a

There is a 2-3% slowdown on Japanese and Cyrillic texts that are mostly non-ASCII. I don't have a problem championing that regression, given the speedup on UTF-8 validation for predominantly ASCII input. The example texts are pretty arbitrary; the Wikipedia texts are a /little/ less so.

@@ -468,6 +468,18 @@ fn test_is_utf8() {
assert!(from_utf8(&[0xEF, 0xBF, 0xBF]).is_ok());
assert!(from_utf8(&[0xF0, 0x90, 0x80, 0x80]).is_ok());
assert!(from_utf8(&[0xF4, 0x8F, 0xBF, 0xBF]).is_ok());

// deny embedded in long stretches of ascii
Member

I don't really know what this specific set of tests is doing.

I always have a bit of a sad when there are these giant "test everything" tests; my personal pref would be another test like is_utf8_is_not_tricked_by_non_ascii_in_long_stretches_of_ascii. No need to add test_, no need to have a comment, a failed test tells you what failed. 😸

Member Author

Entirely reasonable, no reason to share test name there, no common setup or anything. Fixed to have its own test function.
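
A standalone test along the lines suggested in this thread might look like the following. This is a sketch, not the test that was merged: the function name is borrowed from the review comment above, and the buffer length and the two invalid bytes are assumptions chosen for illustration.

```rust
use std::str::from_utf8;

// In a real test suite this would carry `#[test]`; it is a plain function
// here so the sketch can be called directly.
fn is_utf8_is_not_tricked_by_non_ascii_in_long_stretches_of_ascii() {
    // Several machine words long, so a word-at-a-time path could be taken.
    const LEN: usize = 64;
    for i in 0..LEN {
        let mut data = [b'a'; LEN];
        assert!(from_utf8(&data).is_ok());

        // A lone continuation byte is never valid on its own.
        data[i] = 0x80;
        let err = from_utf8(&data).unwrap_err();
        assert_eq!(err.valid_up_to(), i);

        // 0xC0 can never appear in well-formed UTF-8.
        data[i] = 0xC0;
        assert!(from_utf8(&data).is_err());
    }
}
```

Placing the bad byte at every position covers each alignment and each position within a word.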

let ptr = v.as_ptr();

let mut offset = 0;
if len >= 2 * usize::BYTES {
Member

Why the 2?

Member Author

The loop is unrolled by 2 (reads 2 usize per lap).

Member

Apologies, I wasn't very clear. I guess it's a two-part question:

  • Why unroll at all?
  • Why only unroll by 2?

Member Author

It's a bit arbitrary. I've only tried 1, 2, and 4 and compared performance, and it's a trade-off. In the memchr code, where this is taken from, the point is to fill a 16-byte register on x86-64, but that doesn't happen here.

Contributor

Can you extract the 2 to a const with a descriptive name about unrolling? Since I don't see any hand-unrolling here, I am guessing that the if statement allows the compiler to do the unrolling according to the unrolling factor. This is not obvious to me. Can you also add a comment explaining?

Edit: Oh, are the duplicated contains_nonascii calls the loop unrolling?

Member Author

Yes, and the two ptr.offset and deref per iteration.
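
To make the "unrolled by 2" answer concrete, here is a minimal safe sketch of a skip loop that tests two `usize` words per lap. The PR itself reads through raw pointers (`ptr.offset` and a dereference); this sketch copies bytes into a word instead, so it shows the shape of the loop rather than its performance. `contains_nonascii` is the name used in this thread; `read_word`, `skip_ascii_blocks`, `UNROLL_BY` and `NONASCII_MASK` are illustrative names.

```rust
use std::mem::size_of;

const USIZE_BYTES: usize = size_of::<usize>();
// Number of words tested per loop iteration.
const UNROLL_BY: usize = 2;
// 0x80 repeated in every byte of a usize, whatever the pointer width.
const NONASCII_MASK: usize = usize::MAX / 255 * 0x80;

/// True if any byte in the word has its high bit set.
fn contains_nonascii(word: usize) -> bool {
    word & NONASCII_MASK != 0
}

/// Reads one native-endian word starting at byte index `at`.
fn read_word(v: &[u8], at: usize) -> usize {
    let mut buf = [0u8; USIZE_BYTES];
    buf.copy_from_slice(&v[at..at + USIZE_BYTES]);
    usize::from_ne_bytes(buf)
}

/// Returns the length of a prefix of `v` that is known to be ASCII,
/// advancing in blocks of `UNROLL_BY` words.
fn skip_ascii_blocks(v: &[u8]) -> usize {
    let block = UNROLL_BY * USIZE_BYTES;
    let mut offset = 0;
    // Guard: only enter a lap if a whole block is still available.
    while v.len() - offset >= block {
        // The "unrolling": two reads and two checks per lap.
        let a = read_word(v, offset);
        let b = read_word(v, offset + USIZE_BYTES);
        if contains_nonascii(a) || contains_nonascii(b) {
            break;
        }
        offset += block;
    }
    offset
}
```

The returned offset is only a lower bound: the caller still has to step byte by byte from there to find the exact position of the first non-ASCII byte.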

@shepmaster
Member

Pedantically, I'd say it should be ASCII (all caps) in comments and prose, as it's an acronym. The same goes for non_ascii, since I'd normally write non-ASCII. All my comments are at your discretion to take or leave! 😇

}
}

// find the byte after the point the loop stopped
Contributor

If the result of (x & 0x80808080_80808080) is non-zero, you can "immediately" find which byte it is using leading_zeros() / 8

Member Author

That depends on endianness; it works fine with .trailing_zeros() on x86-64. It certainly deserved a try, but I couldn't make it an improvement.

What LLVM compiles the current code into, beast that it is, is actually `if contains_nonascii(u | v) { break; }`, which seems to make for a much simpler computation inside the loop, and a tight loop.

I'm not 100% happy with the code in find_nonascii, so any suggestion for improvement would be super welcome, feel free to take the code (from the benchmark link) and find something.
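
For reference, the bit trick discussed in this thread can be sketched as below. This is not the code in the PR (which, as noted above, did not benefit from it); `first_nonascii_in_chunk` is a hypothetical helper, and it works on a fixed 8-byte chunk. Using `u64::from_le_bytes` pins the byte order, so `trailing_zeros` gives the right answer on any host rather than only on little-endian targets such as x86-64.

```rust
const MASK: u64 = 0x8080_8080_8080_8080;

/// Index (0..8) of the first non-ASCII byte within an 8-byte chunk, if any.
fn first_nonascii_in_chunk(chunk: [u8; 8]) -> Option<usize> {
    // from_le_bytes places chunk[0] in the least significant byte,
    // regardless of the host's endianness.
    let hits = u64::from_le_bytes(chunk) & MASK;
    if hits == 0 {
        None
    } else {
        // The lowest set bit is bit 7 of the first offending byte,
        // so dividing by 8 yields that byte's index.
        Some(hits.trailing_zeros() as usize / 8)
    }
}
```

With a big-endian load (`u64::from_be_bytes`), the equivalent computation would use `leading_zeros() / 8` instead, which matches the reviewer's original formulation.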

Contributor

I downloaded the gist, but I am having some trouble getting the datasets you used. Specifically, I assumed that enwik8 should be http://mattmahoney.net/dc/enwik8.zip and that the specific version of the Japanese wiki should not matter much, but I have no idea about big10.

Member Author

Yes, though maybe you can just skip the datasets you don't have? I could have provided everything better.

big10 is the dataset in http://vaskir.blogspot.ru/2015/09/regular-expressions-rust-vs-f.html

so it's the first 10MB of the unzipped file from https://drive.google.com/open?id=0B8HLQUKik9VtUWlOaHJPdG0xbnM

@brson
Contributor

brson commented Jan 12, 2016

Sweet wins. r=me but please do extract 2 to a more descriptive constant.

@brson brson added the relnotes label (marks issues that should be documented in the release notes of the next release) Jan 12, 2016
@bluss
Member Author

bluss commented Jan 12, 2016

OK, I'll see if there's a neat way to write the unrolling factor.

This speeds up the ascii case (and long stretches of ascii in otherwise
mixed UTF-8 data) when checking UTF-8 validity.

Benchmark results suggest that on purely ASCII input, we can improve
throughput (megabytes verified / second) by a factor of 13 to 14!
On XML and mostly English-language input (en.wikipedia XML dump),
throughput increases by a factor of 7.

On mostly non-ASCII input, performance increases slightly or is the
same.

The UTF-8 validation is rewritten to use indexed access; since all
access is preceded by a (mandatory for validation) length check, bounds
checks are statically elided by LLVM and this formulation is in fact the best
for performance. A previous version had losses due to slice to iterator
conversions.

A large credit to Björn Steinbrink who improved this patch immensely,
writing this second version.

Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3.

Old code is `regular`, this PR is called `fast`.

Datasets:

- `ascii` is just ascii (2.5 kB)
- `cyr` is cyrillic script with ascii spaces (5 kB)
- `dewik10` is 10MB of a de.wikipedia xml dump
- `enwik8` is 100MB of an en.wikipedia xml dump
- `jawik10` is 10MB of a ja.wikipedia xml dump

```
test from_utf8_ascii_fast        ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
test from_utf8_ascii_regular     ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
test from_utf8_cyr_fast          ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
test from_utf8_cyr_regular       ... bench:      12,250 ns/iter (+/- 437) = 418 MB/s
test from_utf8_dewik10_fast      ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
test from_utf8_dewik10_regular   ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
test from_utf8_enwik8_fast       ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
test from_utf8_enwik8_regular    ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
test from_utf8_jawik10_fast      ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
test from_utf8_jawik10_regular   ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s
```

Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>
@bluss bluss changed the title from "Add fast path for ascii in UTF-8 validation" to "Add fast path for ASCII in UTF-8 validation" Jan 12, 2016
@bluss
Member Author

bluss commented Jan 12, 2016

I received an improved version by @dotdash (with permission to incorporate, of course!) and it's an improvement you wouldn't believe.

  • No slowdown for non-ASCII cases (the Cyrillic test case improves for some reason)
  • Less unsafe code
  • Even faster on the pure ASCII case.

Updated PR description & benchmarks are in there

@brson I addressed loop unrolling only by adding another comment for it, don't see a nice way to factor it out to a constant

@Gankra
Contributor

Gankra commented Jan 12, 2016

Wow, awesome stuff!

@bluss
Member Author

bluss commented Jan 12, 2016

Pushed a fix; there was a missing conditional. Let's try this in Travis. I can't measure any difference in perf.

Oh and the fix actually has a const UNROLL_BY because we needed to repeat that 2 yet another time.

@shepmaster
Member

As a bit of "real world" performance information, I pulled this down and used it for SXD.

Parsing a 16M XML file

Valgrind reported that str::from_utf8 took this much of the total run time:

| Rust 1.5 | This PR |
|----------|---------|
| 5.47%    | 0.29%   |

And I measured a ~1.25% overall speedup in the program.

Parsing a 111M XML file

| Rust 1.5 | This PR |
|----------|---------|
| 4.12%    | 0.22%   |

And I measured a ~1.1% overall speedup in the program.


Thanks for the awesome performance gains!

@dotdash
Contributor

dotdash commented Jan 13, 2016

Thanks a lot @shepmaster! Always encouraging to get that kind of feedback! 😻 And thanks to @bluss for getting this started, I've been completely blind to the masking quick check when I initially looked into this a few weeks ago! 🍻

@bluss
Member Author

bluss commented Jan 13, 2016

@shepmaster Awesome to see some numbers! I'm guessing your data files are almost purely ASCII (as a lot of the data in the world is).

@brson This is ready for re-review. It's the same algorithm, but with indexed access, and the fast skip-ahead loop is simpler because it's only attempted at aligned locations. The main loop will reach an aligned location quickly anyway if the input is mostly ASCII.
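
The idea of running the word-wise loop only at aligned locations can be sketched with the standard `slice::align_to`, which splits a byte slice into an unaligned head, an aligned run of words, and a tail. This is a swapped-in technique for illustration, not how the PR is written, and `ascii_prefix_len` is a hypothetical helper that only measures an ASCII prefix rather than validating UTF-8.

```rust
use std::mem::size_of;

// 0x80 in every byte of a usize, whatever the pointer width.
const NONASCII_MASK: usize = usize::MAX / 255 * 0x80;

/// Hypothetical helper: length of the longest all-ASCII prefix of `v`.
fn ascii_prefix_len(v: &[u8]) -> usize {
    // SAFETY: every bit pattern is a valid usize, so viewing aligned u8
    // data as usize words is sound; `align_to` takes care of alignment.
    let (head, words, _tail) = unsafe { v.align_to::<usize>() };

    // Byte-wise until the first aligned location.
    if let Some(i) = head.iter().position(|&b| b >= 128) {
        return i;
    }
    // Word-wise over the aligned middle part.
    let clean_words = words
        .iter()
        .take_while(|&&w| w & NONASCII_MASK == 0)
        .count();
    let mut offset = head.len() + clean_words * size_of::<usize>();

    // Step byte-wise from the point where the word-wise loop stopped.
    while offset < v.len() && v[offset] < 128 {
        offset += 1;
    }
    offset
}
```

The final byte-wise loop handles both the word in which a non-ASCII byte was detected and any unaligned tail, so the result does not depend on where the slice happens to start in memory.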

@shepmaster
Member

> I'm guessing your data files are almost purely ASCII

Ah, yes, I meant to mention that. They indeed are pure-ASCII.

}
}
// step from the point where the wordwise loop stopped
while offset < len && v[offset] < 128 {
Member

Reading through this, I thought at first that 128 was another number relating to byte widths, then realized it is the ASCII cutoff value. Since this is also used above (first >= 128), perhaps another constant could be in order?

Member Author

hm, I don't think it's needed

We need to guard that `len` is large enough for the fast skip loop.
@bluss
Member Author

bluss commented Jan 14, 2016

I updated the second commit to use a constant for 2 * usize::BYTES instead, to follow shepmaster's suggestion roughly.

@brson
Contributor

brson commented Jan 16, 2016

@bors r+

@bors
Contributor

bors commented Jan 16, 2016

📌 Commit cadcd70 has been approved by brson

@bors
Contributor

bors commented Jan 16, 2016

⌛ Testing commit cadcd70 with merge e7e4ecc...

bors added a commit that referenced this pull request Jan 16, 2016
Add fast path for ASCII in UTF-8 validation
@bors bors merged commit cadcd70 into rust-lang:master Jan 16, 2016
@bluss bluss deleted the ascii-is-the-best branch January 16, 2016 10:53
@bluss
Member Author

bluss commented Jan 16, 2016

awesome. Thanks @brson and everyone.
