indexeddb: Shrink values in inbound_group_sessions store, improving performance all around #3073

Merged
merged 3 commits into main from andybalaam/shrink-inbound_group_sessions on Jan 31, 2024

Conversation

@andybalaam (Contributor) commented Jan 30, 2024

Fixes #3057.

Change the way we store values in Indexed DB for the crypto store, specifically in the inbound_group_sessions store. Reduce the size of the stored values and replace arrays of numbers with base 64 strings. The rationale and some implementation ideas are documented in #3057. (TL;DR: small values are faster, base64 is faster.)

This change:

  • Introduces a MaybeEncrypted type in matrix_sdk_indexeddb::crypto_store::indexeddb_serializer that encapsulates the fact that we serialize differently if the session is encrypted with a store cipher. (Previously, we just serialized an array of numbers in both cases. This was actually quite confusing, since these arrays of numbers were holding encoded versions of very different values.) This enum is untagged in Serde to avoid increasing the size of Indexed DB records. A rough sketch of the new types follows this list.
  • Adds matrix_sdk_store_encryption::EncryptedValueBase64, which works the same way as EncryptedValue, but stores arrays of numbers as base 64.
  • Adds a migration to the new schema to matrix_sdk_indexeddb::crypto_store::migrations, including adding the backed_up_to property, so we can support a fix for #26892 (see this comment) without doing another migration.
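For illustration, here is a minimal sketch of the shape of these two types. The field names and exact layout are assumptions made for this example, not the definitive matrix-sdk-indexeddb definitions.

```rust
use serde::{Deserialize, Serialize};

/// Sketch of the base64-backed encrypted value: the same idea as
/// `EncryptedValue`, but with the binary parts stored as base64 strings
/// instead of arrays of numbers. (Field names here are assumptions.)
#[derive(Serialize, Deserialize)]
struct EncryptedValueBase64 {
    version: u8,
    nonce: String,      // base64-encoded nonce
    ciphertext: String, // base64-encoded ciphertext
}

/// Sketch of the untagged enum: with `#[serde(untagged)]` no tag field is
/// written, so the stored record does not grow.
#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum MaybeEncrypted {
    /// The value was encrypted with the store cipher.
    Encrypted(EncryptedValueBase64),
    /// No store cipher: the serialized value as a base64 string.
    Unencrypted(String),
}

fn main() {
    let value = MaybeEncrypted::Unencrypted("aGVsbG8".to_owned());
    // Untagged means only the inner value is written: "aGVsbG8"
    println!("{}", serde_json::to_string(&value).unwrap());
}
```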

I also wrote a performance test that runs in a web browser. In the checked-in version, it runs for 2K records, but when I manually increased the number on my machine to 200K, I got these results:

Inserting 200000 records with v8 schema took 216,745ms.
Inserting 200000 records with v10 schema took 34,875ms.

Counting 200000 records with v8 schema took 11,786ms.
Counting 200000 records with v10 schema took 469ms.

So the speedup is impressive.

@andybalaam andybalaam marked this pull request as ready for review January 30, 2024 13:47
@andybalaam andybalaam requested a review from a team as a code owner January 30, 2024 13:48
@andybalaam andybalaam requested review from bnjbvr and removed request for a team January 30, 2024 13:48
codecov bot commented Jan 30, 2024

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (18eefcf) 83.72% compared to head (e3da325) 83.68%.
Report is 20 commits behind head on main.

Files | Patch % | Lines
crates/matrix-sdk-store-encryption/src/lib.rs | 68.75% | 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3073      +/-   ##
==========================================
- Coverage   83.72%   83.68%   -0.05%     
==========================================
  Files         222      222              
  Lines       23357    23389      +32     
==========================================
+ Hits        19556    19573      +17     
- Misses       3801     3816      +15     


@andybalaam andybalaam requested review from BillCarsonFr and Hywan and removed request for bnjbvr January 30, 2024 17:09
@BillCarsonFr (Member) left a comment


Great work, thx!
LGTM. I agree that it's worth including the future backed_up_to field at this point.

@Hywan (Member) left a comment


It looks excellent to me. I have a couple of really tiny bits of feedback, but overall, great work!

@andybalaam andybalaam force-pushed the andybalaam/shrink-inbound_group_sessions branch from 64fb436 to 56ac6df Compare January 31, 2024 12:00
@andybalaam andybalaam force-pushed the andybalaam/shrink-inbound_group_sessions branch from 56ac6df to e3da325 Compare January 31, 2024 12:16
@andybalaam andybalaam merged commit f64af12 into main Jan 31, 2024
34 checks passed
@andybalaam andybalaam deleted the andybalaam/shrink-inbound_group_sessions branch January 31, 2024 12:29
@poljar (Contributor) commented Feb 8, 2024

Oh no: #718.

@dkasak (Member) commented Feb 8, 2024

Great work with the speed-up!

I'm a bit late to the party, but what role does base64-encoding the payload serve here? It seems the trick is to replace an array with lots of values by a single value (a string), which could just as well have been a JSON string. That would mean the base64 step only adds overhead on top of this?

The only way the base64 encoding would make sense is if we opted for a more compact binary representation of the array of numbers (which I think was the ultimate goal of #718, which poljar linked).

@andybalaam (Contributor, PR author)

> Oh no: #718.

@poljar hmm. I think we could move the JSON-specific stuff into the indexeddb crate if we wanted to?

@andybalaam (Contributor, PR author) commented Feb 8, 2024

> Great work with the speed-up!
>
> I'm a bit late to the party, but what role does base64-encoding the payload serve here? It seems the trick is to replace an array with lots of values by a single value (a string), which could just as well have been a JSON string. That would mean the base64 step only adds overhead on top of this?
>
> The only way the base64 encoding would make sense is if we opted for a more compact binary representation of the array of numbers (which I think was the ultimate goal of #718, which poljar linked).

@dkasak the base64 encoding is there to transform an array of bytes into a string, because I found that strings are much faster than arrays of numbers to store inside Indexed DB. When you say a JSON string, do you mean treating the bytes as characters directly? That would result in invalid UTF-8, which is not allowed in JSON/JavaScript, right?
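To make that concrete, here is a small standalone sketch (using the serde_json and base64 crates; not the actual crypto-store code, and the byte values are made up) of how the same bytes end up in the stored JSON before and after the change:

```rust
use base64::{engine::general_purpose::STANDARD_NO_PAD, Engine as _};

fn main() {
    // Made-up stand-in for a pickled session.
    let pickle: Vec<u8> = vec![0x84, 0x01, 0xff, 0x7f];

    // Before: serde_json writes a Vec<u8> as an array of numbers.
    let as_numbers = serde_json::to_string(&pickle).unwrap();
    assert_eq!(as_numbers, "[132,1,255,127]");

    // After: base64-encode the bytes first, so the stored value is one
    // compact JSON string instead of an array.
    let as_base64 = serde_json::to_string(&STANDARD_NO_PAD.encode(&pickle)).unwrap();
    assert_eq!(as_base64, r#""hAH/fw""#);

    println!("{as_numbers} vs {as_base64}");
}
```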

@poljar (Contributor) commented Feb 8, 2024

> > Oh no: #718.
>
> @poljar hmm. I think we could move the JSON-specific stuff into the indexeddb crate if we wanted to?

Yeah, maybe, will think about it.

@dkasak (Member) commented Feb 8, 2024

> When you say a JSON string, do you mean treating the bytes as characters directly? That would result in invalid UTF-8, which is not allowed in JSON/JavaScript, right?

What I mean is using serde_json::to_string (which produces a UTF-8 encoded string representation of the JSON) instead of serde_json::to_vec (which produces a vector of bytes of the same representation), and then skipping the base64 encoding step altogether.

@andybalaam (Contributor, PR author)

We are already encoding to JSON using to_string; the question is what we are encoding. Previously it was a Vec<u8>, and now it's a String, which contains base64.

@andybalaam (Contributor, PR author)

If I remember correctly, the Vec<u8> we have is a pickled version of an InboundGroupSession.

@dkasak (Member) commented Feb 8, 2024

Just to write down the conclusion: after a quick chat with @andybalaam, only the unencrypted case unnecessarily base64-encodes the payload before storing. The encrypted case a couple of lines above does what is expected, so that's why we were talking past each other.
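For anyone reading later, a rough sketch of that distinction (function names and shapes are purely illustrative, not the real crypto-store code): in the unencrypted case the value is already a JSON string, so the base64 step only adds size and work, while in the encrypted case the cipher output is raw bytes, so base64 is the natural way to turn it into a string.

```rust
use base64::{engine::general_purpose::STANDARD_NO_PAD, Engine as _};

// Unencrypted path: the serialized session is already a JSON string, so
// base64-encoding it here is the unnecessary extra step discussed above.
fn store_unencrypted(session_json: &str) -> String {
    STANDARD_NO_PAD.encode(session_json.as_bytes())
}

// Encrypted path: the cipher output is raw bytes, so base64 is how it
// becomes a string suitable for Indexed DB.
fn store_encrypted(nonce: &[u8], ciphertext: &[u8]) -> (String, String) {
    (STANDARD_NO_PAD.encode(nonce), STANDARD_NO_PAD.encode(ciphertext))
}

fn main() {
    println!("{}", store_unencrypted(r#"{"session_id":"abc"}"#));
    let (n, c) = store_encrypted(&[0u8; 12], &[0x01, 0xff, 0x7f]);
    println!("{n} {c}");
}
```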

Successfully merging this pull request may close these issues: Indexed DB performance is slow because of large values (#3057).