
Vectorize string parsing #1161

Merged: 9 commits into serde-rs:master from vectorized-string-parsing, Aug 11, 2024

Conversation

@purplesyringa (Contributor) commented Jul 29, 2024

I've implemented your suggestion here, and I need a bit of your input regarding my approach.

What we really want is to find the first character such that ch == b'"' || ch == b'\\' || ch < 0x20. memchr3 supports searches of the kind ch == a || ch == b || ch == c, which is almost what we need. SIMD comparisons for equality and non-equality are not significantly different performance-wise or implementation-wise, so memchr could theoretically help us out. I've filed BurntSushi/memchr#157, but chances are this isn't going to be implemented anytime soon.
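
For reference, here is the scalar form of that search as a minimal sketch (find_special is a hypothetical name for illustration, not serde_json's actual function):

fn find_special(bytes: &[u8]) -> Option<usize> {
    // Stop at the first quote, backslash, or control character.
    bytes
        .iter()
        .position(|&ch| ch == b'"' || ch == b'\\' || ch < 0x20)
}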

The second-best thing -- and I want to stress that this is not the most performant solution -- is to first use SIMD to scan for " or \ and then find escapes in an ad-hoc way. However, it turns out that ch < 0x20 in a hot loop is not any faster than ESCAPES[ch], at least on benchmarks. The likely reason is that the table stays in L1 cache while the string does not, so the cost of streaming the string through memory shadows any concern about table accesses.
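
For context, here is a sketch of what such a table-based check looks like (the boolean construction below is illustrative; serde_json's real table may be defined differently and also drives escape handling):

static ESCAPES: [bool; 256] = {
    let mut table = [false; 256];
    let mut i = 0;
    while i < 0x20 {
        table[i] = true; // control characters must be escaped
        i += 1;
    }
    table[b'"' as usize] = true;
    table[b'\\' as usize] = true;
    table
};

// The hot loop then performs a single indexed load: ESCAPES[ch as usize].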

This forces us to use SIMD in serde-json directly. Luckily, there's a way to implement what we need in an architecture-agnostic way -- SWAR, i.e. simulating SIMD with ordinary integer registers -- and it's faster than using intrinsics on my benchmarks. This unfortunately increases code complexity a bit and requires an #[inline(always)] annotation, but this is what we get in return:

                    old (MB/s)  new (MB/s)  perf increase
canada dom          289         289         0%
canada struct       423         433         +2.4%
citm dom            359         361         +0.1%
citm struct         910         886         -2.6%
twitter dom         302         326         +7.9%
twitter struct      611         754         +23%

I'd say this is a good result overall, but the citm pessimization bothers me a bit. It seems like the new implementation is slower when most of the strings are very short (less than 16 characters, perhaps?). In practice, this happens when the data contains lots of objects with many keys but small values, e.g. {"areaId": 205706007, "blockIds": []} x 10k in citm.

So the questions are:

  • Is the increased mental complexity of the core of serde-json fine?
  • Is significantly optimizing long strings but pessimizing short strings fine?

@purplesyringa marked this pull request as draft on July 29, 2024 at 09:13
This is not backed by benchmarks, but it seems reasonable that we'd be
more starved for cache than CPU in IO-bound tasks. It also simplifies
code a bit and frees up some memory, which is probably a good thing.
@purplesyringa marked this pull request as ready for review on July 29, 2024 at 10:23
@purplesyringa (Contributor, Author)

Also, just to make sure we're on the same page: as simd-json seems to be serde-compatible, what is the main focus of this crate? Is it supposed to be a simple canonical implementation or are more complicated performance optimizations welcome here?

@dtolnay (Member) left a comment

Thank you, this is terrific work.

I was able to come up with two small variations that outperform this PR on my machine even without inline(always).

Variation 1: in Mycroft's algorithm, move & (ONE_BYTES << 7) and != 0 out of the loop. #1167

Variation 2: skip the memchr. Mycroft's algorithm works for finding any specific byte by xor-ing the original chunk and then testing for a zero byte. #1168
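
For illustration, here is a minimal sketch of that zero-byte trick (commonly credited to Alan Mycroft); chunk_contains is a hypothetical helper name, though ONE_BYTES matches the constant used in the PR:

const ONE_BYTES: usize = usize::MAX / 255; // 0x0101...01: the low bit of every byte

fn chunk_contains(chunk: usize, needle: u8) -> bool {
    // Xor with the broadcast needle turns every matching byte into 0x00.
    let x = chunk ^ (ONE_BYTES * needle as usize);
    // (x - 0x01...01) & !x & 0x80...80 is nonzero iff some byte of x is zero.
    x.wrapping_sub(ONE_BYTES) & !x & (ONE_BYTES << 7) != 0
}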

I was measuring using the following program, whose performance is dominated by SliceRead::ignore_str:

const ALIGNMENT_OFFSET: usize = 0;

#[repr(C, align(8))]
struct Static<T: ?Sized = [u8]> {
    offset: [u8; ALIGNMENT_OFFSET],
    json: T,
}

static STATIC: &Static = &Static {
    offset: [0u8; ALIGNMENT_OFFSET],
    json: *include_bytes!("twitter.json"),  // https://github.com/serde-rs/json-benchmark/tree/master/data
};

fn main() {
    let iterations = 10000;
    let json = std::str::from_utf8(&STATIC.json).unwrap();
    let begin = std::time::Instant::now();
    for _ in 0..iterations {
        serde_json::from_str::<serde::de::IgnoredAny>(json).unwrap();
    }
    let bytes = iterations * json.len();
    let throughput = bytes as f64 / 1024.0 / 1024.0 / begin.elapsed().as_secs_f64();
    println!("{:.02} MB/sec", throughput);
}

I am interested in whether you find either of the variations promising to incorporate into this implementation, whether you can reproduce them being any faster, and whether they mitigate the small-string slowdown at all (which I didn't test).

@dtolnay (Member) commented Aug 5, 2024

Is the increased mental complexity of the core of serde-json fine?

Yes, this is fine.

Is significantly optimizing long strings but pessimizing short strings fine?

I think this is fine. The set of three benchmark files you already cited above is chosen to be representative enough of real-world performance-sensitive workflows, including with regard to the distribution of string lengths they contain.

Before merging, I would want to try benchmarking another file that disproportionately contains small strings to try to get an upper bound on how much such data would be pessimized, but I expect it won't be much.

as simd-json seems to be serde-compatible, what is the main focus of this crate? Is it supposed to be a simple canonical implementation or are more complicated performance optimizations welcome here?

Performance optimizations are welcome. The relevant constraint is that the public API needs to remain good enough and approachable for 99% of JSON use cases. Simd-json's API is not that, because so many optimizations have leaked into its public API.

@dtolnay (Member) commented Aug 5, 2024

error: adding items after statements is confusing, since items exist from the start of the scope
   --> src/read.rs:446:9
    |
446 |         const STEP: usize = mem::size_of::<usize>();
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#items_after_statements
    = note: `-D clippy::items-after-statements` implied by `-D clippy::pedantic`
    = help: to override `-D clippy::pedantic` add `#[allow(clippy::items_after_statements)]`

Feel free to add allow(clippy::items_after_statements) in lib.rs. Your code is fine as written and I don't agree with this lint.
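
That is, a one-line addition next to lib.rs's existing crate-level attributes:

// In lib.rs, alongside the crate's other lint configuration:
#![allow(clippy::items_after_statements)]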

@purplesyringa (Contributor, Author)

I'm incredibly happy with the direction this is going. I'll see if I can incorporate your patches and maybe optimize stuff some more soon.

@purplesyringa force-pushed the vectorized-string-parsing branch 2 times, most recently from fce646f to a60f1e1, on August 11, 2024 at 12:05
@purplesyringa (Contributor, Author)

I've pushed a rewrite of #1168 that seems cleaner to me and also seems to produce better x86 code. Compared to the previous version of the PR, the benchmark result is:

                    old (MB/s)  new (MB/s)  perf increase
canada dom          288.0       287.8       -0.1%
canada struct       407.4       425.2       +4.4%
citm dom            382.4       407.4       +6.5%
citm struct         883.0       944.6       +7.0%
twitter dom         324.0       339.4       +4.8%
twitter struct      756.8       799.4       +5.6%

...so there are no regressions.

This was confusing to me at first, because memchr2 is supposed to be faster than the SWAR code. But memory reads are slow, and the previous version made two passes over the data (a memchr2 scan for " and \, then a separate scan for escapes), so perhaps two passes are slower than the single SWAR pass. I think this change will also make performance more consistent by avoiding a performance cliff when the chunk doesn't fit in L1.

I also ran a more precise comparison to the original naive code:

                    old (MB/s)  new (MB/s)  perf increase
canada dom          284.8       287.8       +1.1%
canada struct       424.4       425.2       +0.2%
citm dom            378.6       407.4       +7.6%
citm struct         910.2       944.6       +3.8%
twitter dom         299.8       339.4       +13.2%
twitter struct      613.6       799.4       +30.3%

I'm a bit surprised that it's fast on short strings too. That's probably because we no longer perform an indirect call in memchr2. Also, the branch predictor should be able to optimize the loop and the if.

Some commentary on the code:

  • I removed some unsafety by using chunks_exact and Chunk::from_ne_bytes(chunk.try_into().unwrap()) instead of pointer arithmetic (see the sketch after this list). This is also more efficient than your PR; I'm a bit unsure where the performance increase comes from, but LLVM produces better code this way.
  • I replaced b'"' as Chunk with Chunk::from(b'"'). This seems cleaner to me.
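
For illustration, a minimal sketch of that safe chunking pattern (the scan function and its body are assumptions made for the example; STEP and Chunk mirror names from the PR):

use std::mem;

type Chunk = usize; // the SWAR word: the PR uses the native word size
const STEP: usize = mem::size_of::<Chunk>();

fn scan(bytes: &[u8]) {
    let mut chunks = bytes.chunks_exact(STEP);
    for chunk in &mut chunks {
        // Each `chunk` is exactly STEP bytes, so try_into() cannot fail,
        // and no raw pointer arithmetic is needed.
        let chars = Chunk::from_ne_bytes(chunk.try_into().unwrap());
        // ... the SWAR tests run on `chars` here ...
        let _ = chars;
    }
    // Trailing bytes (fewer than STEP) fall back to the scalar loop.
    let _tail = chunks.remainder();
}

chunks_exact keeps the fixed chunk length visible to LLVM, and remainder() hands back the tail for the scalar fallback.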

@purplesyringa (Contributor, Author)

On the flip side, here's the comparison with the "factor out the check" variation:

                    #1167       new (MB/s)  perf increase
canada dom          289.4       287.8       -0.6%
canada struct       406.8       425.2       +4.5%
citm dom            415.8       407.4       -2.0%
citm struct         895.6       944.6       +5.5%
twitter dom         323.6       339.4       +4.9%
twitter struct      756.6       799.4       +5.7%

canada DOM is a bona fide regression, but still better than the naive code. citm DOM is noise. I think we should just bite the bullet and drop #1167.

@purplesyringa (Contributor, Author)

For completeness, here's the comparison for the benchmark in #1161 (review) (5 runs performed):

                5 runs (MB/s)
naive           918.20  919.05  921.64  917.83  929.76
#1167           979.75  973.36  983.06  976.38  975.24
old PR          983.56  992.25  975.70  987.34  986.94
new             1342.39 1337.61 1334.17 1333.77 1334.33

@purplesyringa (Contributor, Author)

I think this PR is ready for merging (bar squashing, I'll be happy to squash it if you want).

Unrolling the loop manually twice (this is notably distinct from using u128) provides very slightly better performance (around +2%, compared to the +50% this PR already provides) on the synthetic benchmark from your comment, and a <1% speed-up on json-benchmark with some regressions. I don't think it's a viable avenue to pursue.

We'd probably have more luck with the wide crate for further improvements. I can work on that if you'd like me to, but I think it's better to land this PR in the meantime anyway.


Before merging, I would want to try benchmarking another file that disproportionately contains small strings to try to get an upper bound on how much such data would be pessimized, but I expect it won't be much.

Data: {"ABC": "", "ABC": "", "ABC": "", ...}.

                5 runs (MB/s)
Naive           619.33  626.66  629.83  626.22  626.77
Current PR      504.49  503.71  505.75  505.99  498.94

This is a 20% regression.

Data: ["0123456789", "0123456789", "0123456789", ...].

                5 runs (MB/s)
Naive           980.35  996.23  996.59  980.63  995.74
Current PR      1083.53 1079.53 1066.18 1083.57 1078.90

This is a 10% improvement.

Data (strings "0123" and "012", chosen randomly): ["0123", "0123", "0123", "012", "012", "0123", "0123", "012", "012", "0123", "012", "012", "012", ...].

                5 runs (MB/s)
Naive           460.70  460.32  460.03  462.20  461.94
Current PR      647.87  641.29  649.03  647.17  649.25

Overall, this PR regresses strings of perfectly predictable, very short lengths (the breakeven point seems to be around 6 characters). So the main troublemakers are objects with (a) few, (b) short, (c) fixed keys, together with short values.

I'm not sure how practical this concern is. I cherry-picked some datasets that I expected to regress, but none did.

I'm not happy about this, but if this regresses no realistic datasets, we might as well merge it and see who hollers.

@purplesyringa (Contributor, Author)

@rustbot review

@purplesyringa (Contributor, Author) commented Aug 11, 2024

I just found a regression. We shouldn't really be looking for short strings as counterexamples; we should be looking for empty strings! We also effectively treat the gaps between consecutive Unicode escapes as empty strings, and this is very common in escaped Unicode text: a fully escaped Cyrillic word like \u0432\u043e\u0439\u043d\u0430 has only zero-length literal runs between the escapes. For JSON-encoded War and Peace in Russian, I get

                5 runs (MB/s)
Naive           651.89  648.86  649.74  649.98  651.92
Current PR      536.03  533.37  538.01  541.41  536.62

@purplesyringa (Contributor, Author) commented Aug 11, 2024

I've committed a fix that should reduce the performance loss from -17% to -1.7%:

                5 runs (MB/s)
Naive           651.89  648.86  649.74  649.98  651.92
Old PR          536.03  533.37  538.01  541.41  536.62
Current PR      641.89  635.57  638.24  641.14  640.60

I have not noticed any significant regressions on json-benchmark (there are some improvements too, but I believe it's all just noise). I don't believe this deoptimization is a concern anymore.

@dtolnay (Member) left a comment

Thanks!

Comment on lines +457 to +462
// SWAR test for control characters: subtracting 0x20 from each byte
// underflows exactly for bytes below 0x20.
let contains_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20) & !chars;
// Xor with the broadcast quote turns b'"' bytes into 0x00...
let chars_quote = chars ^ (ONE_BYTES * Chunk::from(b'"'));
// ...and this zero-byte test flags any byte that became zero.
let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars_quote;
// The same xor-then-zero-byte test for b'\\'.
let chars_backslash = chars ^ (ONE_BYTES * Chunk::from(b'\\'));
let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars_backslash;
// Keep only the per-byte high bits; a nonzero result means the chunk
// contains a quote, a backslash, or a control character.
let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);
@dtolnay (Member):

A fun followup would be to throw this arithmetic expression into a superoptimizer and see if there is some shorter expression that it is equivalent to. It seems very likely to me that the same quantity can be computed in fewer operations than written here.

@dtolnay (Member):

Some observations from https://rust.godbolt.org/z/66fGEqY6c:

LLVM already rewrites this as:

- let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars_quote;
- let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars_backslash;
+ let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars;
+ let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars;
  let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);

because it has proven that A & !(chars ^ B) & C is equivalent to A & !chars & C whenever B & C == 0, i.e. doing the ^ only affects bits that are later erased by the second &.

Then it factored out the three & !chars into one.

- let contains_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20) & !chars;
- let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars;
- let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars;
- let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);
+ let tmp_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20);
+ let tmp_quote = chars_quote.wrapping_sub(ONE_BYTES);
+ let tmp_backslash = chars_backslash.wrapping_sub(ONE_BYTES);
+ let masked = (tmp_ctrl | tmp_quote | tmp_backslash) & !chars & (ONE_BYTES << 7);

A superoptimizer would do this kind of thing, but better.

@dtolnay merged commit 859ead8 into serde-rs:master on Aug 11, 2024
13 checks passed
@purplesyringa deleted the vectorized-string-parsing branch on August 11, 2024 at 22:02