Add fast path for ASCII in UTF-8 validation #30740

Merged 2 commits on Jan 16, 2016

Commits on Jan 12, 2016

  1. Add fast path for ASCII in UTF-8 validation

    This speeds up the ASCII case (and long stretches of ASCII in otherwise
    mixed UTF-8 data) when checking UTF-8 validity.
    
    Benchmark results suggest that on purely ASCII input, throughput
    (megabytes verified per second) improves by a factor of 13 to 14.
    On XML and mostly English-language input (en.wikipedia XML dump),
    throughput increases by a factor of 7.
    
    On mostly non-ASCII input, performance increases slightly or is the
    same.
    
    The UTF-8 validation is rewritten to use indexed access. Since every
    access is preceded by a length check (which validation requires anyway),
    LLVM statically elides the bounds checks, and this formulation is in
    fact the best for performance. A previous version lost performance to
    slice-to-iterator conversions.
    
    A large credit to Björn Steinbrink who improved this patch immensely,
    writing this second version.
    
    Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3.
    
    Old code is `regular`, this PR is called `fast`.
    
    Datasets:
    
    - `ascii` is pure ASCII (2.5 kB)
    - `cyr` is Cyrillic script with ASCII spaces (5 kB)
    - `dewik10` is 10 MB of a de.wikipedia XML dump
    - `enwik8` is 100 MB of an en.wikipedia XML dump
    - `jawik10` is 10 MB of a ja.wikipedia XML dump
    
    ```
    test from_utf8_ascii_fast        ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
    test from_utf8_ascii_regular     ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
    test from_utf8_cyr_fast          ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
    test from_utf8_cyr_regular       ... bench:      12,250 ns/iter (+/- 437) = 418 MB/s
    test from_utf8_dewik10_fast      ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
    test from_utf8_dewik10_regular   ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
    test from_utf8_enwik8_fast       ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
    test from_utf8_enwik8_regular    ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
    test from_utf8_jawik10_fast      ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
    test from_utf8_jawik10_regular   ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s
    ```
    
    Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>
    bluss and dotdash committed Jan 12, 2016
    11e3de3
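
The ASCII fast path and the "length check before indexed access" idea from the commit message above can be sketched roughly as follows. This is a minimal safe-Rust illustration, not the code from the PR: the helper names `ascii_prefix_len` and `is_valid_utf8` are hypothetical, the word-at-a-time high-bit test is an assumed way to realize the ASCII skip, and everything after the ASCII prefix is simply handed to the standard `std::str::from_utf8`.

```rust
/// Hypothetical helper (not from the PR): length of the leading all-ASCII run.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    const WORD: usize = std::mem::size_of::<usize>();
    // 0x80 repeated in every byte of a word: the "non-ASCII" bit of each byte.
    const HIGH_BITS: usize = usize::MAX / 255 * 0x80;

    let len = bytes.len();
    let mut i = 0;

    // Fast path: test one machine word per iteration. The `i + WORD <= len`
    // length check means the slice below is always in bounds.
    while i + WORD <= len {
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(&bytes[i..i + WORD]);
        if usize::from_ne_bytes(buf) & HIGH_BITS != 0 {
            break; // at least one byte in this word is non-ASCII
        }
        i += WORD;
    }

    // Slow path: finish byte by byte, again with the length check first.
    while i < len && bytes[i] < 0x80 {
        i += 1;
    }
    i
}

/// Hypothetical wrapper: skip the ASCII prefix, then validate the remainder
/// with the standard library. ASCII bytes are complete characters, so the
/// remainder starts on a character boundary whenever the input is valid.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    let start = ascii_prefix_len(bytes);
    std::str::from_utf8(&bytes[start..]).is_ok()
}
```

On purely ASCII input the word loop does nearly all the work, which is the case where the benchmarks above show the largest gain. A full validator would presumably re-enter a similar skip loop after each multi-byte sequence, so that long ASCII stretches inside mixed data also benefit.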

Commits on Jan 14, 2016

  1. UTF-8 validation: Add missing if conditional for short input

    We need to guard that `len` is large enough for the fast skip loop.
    bluss committed Jan 14, 2016
    cadcd70
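
To illustrate why a skip loop of this kind may need a guard on `len`, here is a small sketch under stated assumptions. The function name `skip_ascii_words` and the exact shape of the loop are hypothetical and are not taken from the commit. The point is only that a loop bound computed as `len - WORD` underflows when the input is shorter than one word, so it has to sit behind an `if`.

```rust
/// Hypothetical helper (not from the PR): starting at index `i`, skip whole
/// words that contain only ASCII bytes and return the new index.
fn skip_ascii_words(bytes: &[u8], mut i: usize) -> usize {
    const WORD: usize = std::mem::size_of::<usize>();
    // 0x80 repeated in every byte of a word.
    const HIGH_BITS: usize = usize::MAX / 255 * 0x80;

    let len = bytes.len();

    // Guard for short input: without this check, `len - WORD` would
    // underflow whenever fewer than WORD bytes are available.
    if len >= WORD {
        let last_start = len - WORD; // last index where a full word fits
        while i <= last_start {
            let mut buf = [0u8; WORD];
            buf.copy_from_slice(&bytes[i..i + WORD]);
            if usize::from_ne_bytes(buf) & HIGH_BITS != 0 {
                break; // non-ASCII byte found; leave it to the slow path
            }
            i += WORD;
        }
    }
    i
}
```

For input shorter than a word, the function returns `i` unchanged and a byte-by-byte path would take over.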