
Vectorize string parsing #1161

Merged: 9 commits into serde-rs:master from vectorized-string-parsing, Aug 11, 2024

Conversation

@purplesyringa (Contributor) commented Jul 29, 2024

I've implemented your suggestion here, and I need a bit of your input regarding my approach.

What we really want is to find the first character such that ch == b'"' || ch == b'\\' || ch < 0x20. memchr3 supports searches of the kind ch == a || ch == b || ch == c, which is almost what we need. SIMD comparisons for equality and non-equality are not significantly different performance-wise or implementation-wise, so memchr could theoretically help us out. I've filed BurntSushi/memchr#157, but chances are this isn't going to be implemented anytime soon.
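
For reference, here is the scalar form of that search as a minimal sketch (find_special is a hypothetical name for illustration, not serde_json's actual function):

fn find_special(bytes: &[u8]) -> Option<usize> {
    // Stop at the first quote, backslash, or control character.
    bytes
        .iter()
        .position(|&ch| ch == b'"' || ch == b'\\' || ch < 0x20)
}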

The second-best thing -- and I want to stress that this is not the most performant solution -- is to first use SIMD to scan for " or \ and then find escapes in an ad-hoc way. However, it turns out that ch < 0x20 in a hot loop is not any faster than ESCAPES[ch], at least on benchmarks. The likely reason is that the table stays in L1 cache while the string does not, so the cost of streaming the string through memory shadows any concern about table accesses.
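
For context, here is a sketch of what such a table-based check looks like (the boolean construction below is illustrative; serde_json's real table may be defined differently and also drives escape handling):

static ESCAPES: [bool; 256] = {
    let mut table = [false; 256];
    let mut i = 0;
    while i < 0x20 {
        table[i] = true; // control characters must be escaped
        i += 1;
    }
    table[b'"' as usize] = true;
    table[b'\\' as usize] = true;
    table
};

// The hot loop then performs a single indexed load: ESCAPES[ch as usize].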

This forces us to use SIMD in serde-json directly. Luckily, there's a way to implement what we need in an architecture-agnostic way -- SWAR, i.e. simulating SIMD with ordinary integer registers -- and it's faster than using intrinsics on my benchmarks. This unfortunately increases code complexity a bit and requires an #[inline(always)] annotation, but this is what we get in return:

                    old (MB/s)  new (MB/s)  perf increase
canada dom          289         289         0%
canada struct       423         433         +2.4%
citm dom            359         361         +0.1%
citm struct         910         886         -2.6%
twitter dom         302         326         +7.9%
twitter struct      611         754         +23%

I'd say this is a good result overall, but the citm pessimization bothers me a bit. It seems like the new implementation is slower when most of the strings are very short (less than 16 characters, perhaps?). In practice, this happens when the data contains lots of objects with many keys but small values, e.g. {"areaId": 205706007, "blockIds": []} x 10k in citm.

So the questions are:

  • Is the increased mental complexity of the core of serde-json fine?
  • Is significantly optimizing long strings but pessimizing short strings fine?

@purplesyringa marked this pull request as draft on July 29, 2024 at 09:13
This is not backed by benchmarks, but it seems reasonable that we'd be
more starved for cache than CPU in IO-bound tasks. It also simplifies
code a bit and frees up some memory, which is probably a good thing.
@purplesyringa marked this pull request as ready for review on July 29, 2024 at 10:23
@purplesyringa (Contributor, Author)

Also, just to make sure we're on the same page: as simd-json seems to be serde-compatible, what is the main focus of this crate? Is it supposed to be a simple canonical implementation or are more complicated performance optimizations welcome here?

@dtolnay (Member) left a comment

Thank you, this is terrific work.

I was able to come up with two small variations that outperform this PR on my machine even without inline(always).

Variation 1: in Mycroft's algorithm, move & (ONE_BYTES << 7) and != 0 out of the loop. #1167

Variation 2: skip the memchr. Mycroft's algorithm works for finding any specific byte by xor-ing the original chunk and then testing for a zero byte. #1168
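
For illustration, here is a minimal sketch of that zero-byte trick (commonly credited to Alan Mycroft); chunk_contains is a hypothetical helper name, though ONE_BYTES matches the constant used in the PR:

const ONE_BYTES: usize = usize::MAX / 255; // 0x0101...01: the low bit of every byte

fn chunk_contains(chunk: usize, needle: u8) -> bool {
    // Xor with the broadcast needle turns every matching byte into 0x00.
    let x = chunk ^ (ONE_BYTES * needle as usize);
    // (x - 0x01...01) & !x & 0x80...80 is nonzero iff some byte of x is zero.
    x.wrapping_sub(ONE_BYTES) & !x & (ONE_BYTES << 7) != 0
}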

I was measuring using the following program, whose performance is dominated by SliceRead::ignore_str:

const ALIGNMENT_OFFSET: usize = 0;

#[repr(C, align(8))]
struct Static<T: ?Sized = [u8]> {
    offset: [u8; ALIGNMENT_OFFSET],
    json: T,
}

static STATIC: &Static = &Static {
    offset: [0u8; ALIGNMENT_OFFSET],
    json: *include_bytes!("twitter.json"),  // https://github.com/serde-rs/json-benchmark/tree/master/data
};

fn main() {
    let iterations = 10000;
    let json = std::str::from_utf8(&STATIC.json).unwrap();
    let begin = std::time::Instant::now();
    for _ in 0..iterations {
        serde_json::from_str::<serde::de::IgnoredAny>(json).unwrap();
    }
    let bytes = iterations * json.len();
    let throughput = bytes as f64 / 1024.0 / 1024.0 / begin.elapsed().as_secs_f64();
    println!("{:.02} MB/sec", throughput);
}

I am interested in whether you find either of the variations promising to incorporate into this implementation, whether you can reproduce them being any faster, and whether they mitigate the small-string slowdown at all (which I didn't test).

@dtolnay (Member) commented Aug 5, 2024

Is the increased mental complexity of the core of serde-json fine?

Yes, this is fine.

Is significantly optimizing long strings but pessimizing short strings fine?

I think this is fine. The set of three benchmark files you already cited above is chosen to be representative enough of real-world performance-sensitive workflows, including with regard to the distribution of string lengths they contain.

Before merging, I would want to try benchmarking another file that disproportionately contains small strings to try to get an upper bound on how much such data would be pessimized, but I expect it won't be much.

as simd-json seems to be serde-compatible, what is the main focus of this crate? Is it supposed to be a simple canonical implementation or are more complicated performance optimizations welcome here?

Performance optimizations are welcome. The relevant constraint is that the public API needs to remain good enough and approachable for 99% of JSON use cases. Simd-json's API is not that, because so many optimizations have leaked into its public API.

@dtolnay (Member) commented Aug 5, 2024

error: adding items after statements is confusing, since items exist from the start of the scope
   --> src/read.rs:446:9
    |
446 |         const STEP: usize = mem::size_of::<usize>();
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#items_after_statements
    = note: `-D clippy::items-after-statements` implied by `-D clippy::pedantic`
    = help: to override `-D clippy::pedantic` add `#[allow(clippy::items_after_statements)]`

Feel free to add allow(clippy::items_after_statements) in lib.rs. Your code is fine as written and I don't agree with this lint.
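
That is, a one-line addition next to lib.rs's existing crate-level attributes:

// In lib.rs, alongside the crate's other lint configuration:
#![allow(clippy::items_after_statements)]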

@purplesyringa (Contributor, Author)

I'm incredibly happy with the direction this is going. I'll see if I can incorporate your patches and maybe optimize stuff some more soon.

@purplesyringa force-pushed the vectorized-string-parsing branch 2 times, most recently from fce646f to a60f1e1, on August 11, 2024 at 12:05
@purplesyringa (Contributor, Author)

I've pushed a rewrite of #1168 that seems cleaner to me and also seems to produce better x86 code. Compared to the previous version of the PR, the benchmark result is:

                    old (MB/s)  new (MB/s)  perf increase
canada dom          288.0       287.8       -0.1%
canada struct       407.4       425.2       +4.4%
citm dom            382.4       407.4       +6.5%
citm struct         883.0       944.6       +7.0%
twitter dom         324.0       339.4       +4.8%
twitter struct      756.8       799.4       +5.6%

...so there are no regressions.

This was confusing to me at first, because memchr2 is supposed to be faster than the SWAR code. But memory reads are slow, and the previous version made two passes over the data (a memchr2 scan for " and \, then a separate scan for escapes), so perhaps two passes are slower than the single SWAR pass. I think this change will also make performance more consistent by avoiding a performance cliff when the chunk doesn't fit in L1.

I also ran a more precise comparison to the original naive code:

                    old (MB/s)  new (MB/s)  perf increase
canada dom          284.8       287.8       +1.1%
canada struct       424.4       425.2       +0.2%
citm dom            378.6       407.4       +7.6%
citm struct         910.2       944.6       +3.8%
twitter dom         299.8       339.4       +13.2%
twitter struct      613.6       799.4       +30.3%

I'm a bit surprised that it's fast on short strings too. That's probably because we no longer perform an indirect call in memchr2. Also, the branch predictor should be able to optimize the loop and the if.

Some commentary on the code:

  • I removed some unsafety by using chunks_exact and Chunk::from_ne_bytes(chunk.try_into().unwrap()) instead of pointer arithmetic (see the sketch after this list). This is also more efficient than your PR; I'm a bit unsure where the performance increase comes from, but LLVM produces better code this way.
  • I replaced b'"' as Chunk with Chunk::from(b'"'). This seems cleaner to me.
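
For illustration, a minimal sketch of that safe chunking pattern (the scan function and its body are assumptions made for the example; STEP and Chunk mirror names from the PR):

use std::mem;

type Chunk = usize; // the SWAR word: the PR uses the native word size
const STEP: usize = mem::size_of::<Chunk>();

fn scan(bytes: &[u8]) {
    let mut chunks = bytes.chunks_exact(STEP);
    for chunk in &mut chunks {
        // Each `chunk` is exactly STEP bytes, so try_into() cannot fail,
        // and no raw pointer arithmetic is needed.
        let chars = Chunk::from_ne_bytes(chunk.try_into().unwrap());
        // ... the SWAR tests run on `chars` here ...
        let _ = chars;
    }
    // Trailing bytes (fewer than STEP) fall back to the scalar loop.
    let _tail = chunks.remainder();
}

chunks_exact keeps the fixed chunk length visible to LLVM, and remainder() hands back the tail for the scalar fallback.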

@purplesyringa (Contributor, Author)

On the flip side, here's the comparison with the "factor out the check" variation:

                    #1167       new (MB/s)  perf increase
canada dom          289.4       287.8       -0.6%
canada struct       406.8       425.2       +4.5%
citm dom            415.8       407.4       -2.0%
citm struct         895.6       944.6       +5.5%
twitter dom         323.6       339.4       +4.9%
twitter struct      756.6       799.4       +5.7%

canada DOM is a bona fide regression, but still better than the naive code. citm DOM is noise. I think we should just bite the bullet and drop #1167.

@purplesyringa (Contributor, Author)

For completeness, here's the comparison for the benchmark in #1161 (review) (5 runs performed):

                5 runs (MB/s)
naive           918.20  919.05  921.64  917.83  929.76
#1167           979.75  973.36  983.06  976.38  975.24
old PR          983.56  992.25  975.70  987.34  986.94
new             1342.39 1337.61 1334.17 1333.77 1334.33

@purplesyringa (Contributor, Author)

I think this PR is ready for merging (bar squashing, I'll be happy to squash it if you want).

Unrolling the loop manually twice (this is notably distinct from using u128) provides very slightly better performance (around +2%, compared to the +50% this PR already provides) on the synthetic benchmark from your comment, and a <1% speed-up on json-benchmark with some regressions. I don't think it's a viable avenue to pursue.

We'd probably have more luck with the wide crate for further improvements. I can work on that if you'd like me to, but I think it's better to land this PR in the meantime anyway.


Before merging, I would want to try benchmarking another file that disproportionately contains small strings to try to get an upper bound on how much such data would be pessimized, but I expect it won't be much.

Data: {"ABC": "", "ABC": "", "ABC": "", ...}.

                5 runs (MB/s)
Naive           619.33  626.66  629.83  626.22  626.77
Current PR      504.49  503.71  505.75  505.99  498.94

This is a 20% regression.

Data: ["0123456789", "0123456789", "0123456789", ...].

                5 runs (MB/s)
Naive           980.35  996.23  996.59  980.63  995.74
Current PR      1083.53 1079.53 1066.18 1083.57 1078.90

This is a 10% improvement.

Data (strings "0123" and "012", chosen randomly): ["0123", "0123", "0123", "012", "012", "0123", "0123", "012", "012", "0123", "012", "012", "012", ...].

                5 runs (MB/s)
Naive           460.70  460.32  460.03  462.20  461.94
Current PR      647.87  641.29  649.03  647.17  649.25

Overall, this PR regresses strings of perfectly predictable, very short lengths (the breakeven point seems to be around 6 characters). So the main troublemakers are objects with (a) few, (b) short, (c) fixed keys, together with short values.

I'm not sure how practical this concern is. I cherry-picked some datasets that I expected to regress, but none did.

I'm not happy about this, but if this regresses no realistic datasets, we might as well merge it and see who hollers.

@purplesyringa (Contributor, Author)

@rustbot review

@purplesyringa (Contributor, Author) commented Aug 11, 2024

I just found a regression. We shouldn't really be looking for short strings as counterexamples; we should be looking for empty strings! We also effectively treat the gaps between consecutive Unicode escapes as empty strings, and this is very common in escaped Unicode text: a fully escaped Cyrillic word like \u0432\u043e\u0439\u043d\u0430 has only zero-length literal runs between the escapes. For JSON-encoded War and Peace in Russian, I get

                5 runs (MB/s)
Naive           651.89  648.86  649.74  649.98  651.92
Current PR      536.03  533.37  538.01  541.41  536.62

@purplesyringa (Contributor, Author) commented Aug 11, 2024

I've committed a fix that should reduce the performance loss from -17% to -1.7%:

                5 runs (MB/s)
Naive           651.89  648.86  649.74  649.98  651.92
Old PR          536.03  533.37  538.01  541.41  536.62
Current PR      641.89  635.57  638.24  641.14  640.60

I have not noticed any significant regressions on json-benchmark (there are some improvements too, but I believe it's all just noise). I don't believe this deoptimization is a concern anymore.

@dtolnay (Member) left a comment

Thanks!

Comment on lines +457 to +462
// SWAR test for control characters: subtracting 0x20 from each byte
// underflows exactly for bytes below 0x20.
let contains_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20) & !chars;
// Xor with the broadcast quote turns b'"' bytes into 0x00...
let chars_quote = chars ^ (ONE_BYTES * Chunk::from(b'"'));
// ...and this zero-byte test flags any byte that became zero.
let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars_quote;
// The same xor-then-zero-byte test for b'\\'.
let chars_backslash = chars ^ (ONE_BYTES * Chunk::from(b'\\'));
let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars_backslash;
// Keep only the per-byte high bits; a nonzero result means the chunk
// contains a quote, a backslash, or a control character.
let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);
@dtolnay (Member):

A fun followup would be to throw this arithmetic expression into a superoptimizer and see if there is some shorter expression that it is equivalent to. It seems very likely to me that the same quantity can be computed in fewer operations than written here.

@dtolnay (Member):

Some observations from https://rust.godbolt.org/z/66fGEqY6c:

LLVM already rewrites this as:

- let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars_quote;
- let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars_backslash;
+ let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars;
+ let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars;
  let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);

because it has proven that A & !(chars ^ B) & C is equivalent to A & !chars & C whenever B & C == 0, i.e. doing the ^ only affects bits that are later erased by the second &.

Then it factored out the three & !chars into one.

- let contains_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20) & !chars;
- let contains_quote = chars_quote.wrapping_sub(ONE_BYTES) & !chars;
- let contains_backslash = chars_backslash.wrapping_sub(ONE_BYTES) & !chars;
- let masked = (contains_ctrl | contains_quote | contains_backslash) & (ONE_BYTES << 7);
+ let tmp_ctrl = chars.wrapping_sub(ONE_BYTES * 0x20);
+ let tmp_quote = chars_quote.wrapping_sub(ONE_BYTES);
+ let tmp_backslash = chars_backslash.wrapping_sub(ONE_BYTES);
+ let masked = (tmp_ctrl | tmp_quote | tmp_backslash) & !chars & (ONE_BYTES << 7);

A superoptimizer would do this kind of thing, but better.

@dtolnay merged commit 859ead8 into serde-rs:master on Aug 11, 2024
13 checks passed
@purplesyringa deleted the vectorized-string-parsing branch on August 11, 2024 at 22:02