Element-R | Resetting key backup takes a long time; and blocks the whole application #26892

Open
BillCarsonFr opened this issue Jan 18, 2024 · 14 comments

@BillCarsonFr
Member

BillCarsonFr commented Jan 18, 2024

Steps to reproduce

  • Have an account with a lot of keys in the local database, and wait for all of them to be uploaded to the backup
  • Go to Security & Privacy, click reset, and follow the steps

=> From this point, if you type a message it won't be sent for several minutes, depending on the number of room keys in the database

What is happening

See here #26783 (comment)

Basically, resetting the backup has to perform a big write transaction that locks the inbound_group_sessions2 object store for several minutes.
It also locks the whole database, because the Rust save_changes_lock is taken when saving changes after a sync.

=> Syncing will be blocked, and sending will be blocked

ANOTHER PROBLEM: If you close the tab or refresh the page, the transaction won't be retried, so the keys won't be marked for upload to the new backup

Difference with legacy

Each entry in the Rust inbound_group_sessions2 object store is bigger (an encrypted byte array of length ~3700) than its legacy counterpart (JSON containing a pickled key, a string of length ~700).
So the Rust DB hits its slowdown point sooner.
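
To make the cost concrete, here is a minimal, purely illustrative Rust sketch of the pattern described under "What is happening": the reset rewrites every stored session inside one pass while holding the same lock that syncing and sending need. All names here (SessionRecord, Store, reset_backup) are invented for illustration and are not the actual matrix-sdk-crypto / matrix-sdk-indexeddb API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical, simplified model: session id -> stored record. In the real
// store each value is an encrypted byte array of ~3700 bytes, and a large
// account can hold hundreds of thousands of them.
struct SessionRecord {
    pickled: Vec<u8>,
    needs_backup: bool,
}

struct Store {
    // Stands in for the single save_changes_lock that syncing and message
    // sending also need: while the reset holds it, they are blocked.
    sessions: Mutex<HashMap<String, SessionRecord>>,
}

impl Store {
    // One big "transaction": every record is rewritten so that it will be
    // re-uploaded to the new backup version.
    fn reset_backup(&self) {
        let mut sessions = self.sessions.lock().unwrap();
        for record in sessions.values_mut() {
            record.needs_backup = true;
            // In the real store the whole (large) value is re-serialized and
            // written back, which is what makes the transaction so slow.
        }
    }
}

fn main() {
    let store = Store { sessions: Mutex::new(HashMap::new()) };
    store.reset_backup();
}
```

With ~300k records of ~3700 bytes each, that single pass is what holds the inbound_group_sessions2 store (and the save_changes_lock) for minutes.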

@BillCarsonFr BillCarsonFr added the A-Element-R Issues affecting the port of Element's crypto layer to Rust label Jan 18, 2024
@richvdh
Member

richvdh commented Jan 18, 2024

Is this not a duplicate of #26873?

@BillCarsonFr
Member Author

Is this not a duplicate of #26873?

It could be one of the causes. This one is specifically about what happens after a backup reset.

@richvdh
Member

richvdh commented Jan 18, 2024

I actually meant #26783.

But the issue is "Message sending is slow when there are a lot of keys to back up", right? We should combine the two issues, imho.

@BillCarsonFr BillCarsonFr added the Z-Element-R-Blocker A blocker for enabling Element R by default label Jan 18, 2024
@richvdh
Member

richvdh commented Jan 18, 2024

But the issue is "Message sending is slow when there are a lot of keys to back up", right? We should combine the two issues, imho.

no, I misunderstood. The issue here is that the actual transaction to reset the key backup takes a very long time. (And if you restart during that time, everything is messed up)

@richvdh richvdh changed the title from "Element-R | After resetting backup on big accounts, Element will stop syncing and sending messages until all keys are marked as not backed-up" to "Element-R | Resetting key backup takes a long time; and blocks the whole application" Jan 18, 2024
@andybalaam andybalaam self-assigned this Jan 18, 2024
@andybalaam
Contributor

Plan:

  • Change "needs_backup" field into "most_recent_backup_to_include_this" so we don't need to modify all records when the current backup version changes.
  • Reduce the size of the values in inbound_group_sessions2 so everything is faster. Do this by avoiding so many JSON re-encodes.

@andybalaam
Contributor

andybalaam commented Jan 23, 2024

Previous plan doesn't work without a migration because we can't efficiently query for records that don't have a matching most_recent... field.

Next idea: assign each backup version an order number, so each time we see a new backup version from the server, we set a current_backup_order value to the previous order + 1. We store against each inbound_group_session the order of the most recent backup that contains this session under e.g. backed_up_to. (0 == never backed up)

In the backup loop we query inbound_group_sessions where backed_up_to < current_backup_order.

The above would require a migration.

Since we don't want to combine this with a migration to reduce value size, we can avoid a further migration by preparing now.

Prep means: add the backed_up_to field and set it to -1. This means these records will appear in the backup loop, and we will migrate them during that loop (instead of in a migration).

The logic within the backup loop looks like:

if backed_up_to == -1:
    if needs_backup:
        backed_up_to = 0
    else:
        backed_up_to = 1

if backed_up_to < current_backup_order:
    back_up_this_session()
    backed_up_to = current_backup_order

[Updated to use backed_up_to:-1 instead of a separate property]
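
A Rust-flavoured sketch of that loop, assuming the field names from the plan above (Session, back_up_this_session and CURRENT_BACKUP_ORDER are illustrative only, not the real matrix-sdk-crypto types):

```rust
// Hypothetical record shape, following the plan above: backed_up_to == -1
// marks a record written before the new field existed.
struct Session {
    needs_backup: bool,
    backed_up_to: i64,
}

// Order number of the backup version we are currently uploading to.
const CURRENT_BACKUP_ORDER: i64 = 1;

fn back_up_this_session(_session: &Session) {
    // Placeholder for the real upload to the key backup.
}

fn backup_loop(sessions: &mut [Session]) {
    for session in sessions.iter_mut() {
        // Lazy migration: old records get a real backed_up_to value the first
        // time the backup loop sees them, instead of in a DB migration.
        if session.backed_up_to == -1 {
            session.backed_up_to = if session.needs_backup { 0 } else { 1 };
        }

        if session.backed_up_to < CURRENT_BACKUP_ORDER {
            back_up_this_session(session);
            session.backed_up_to = CURRENT_BACKUP_ORDER;
        }
    }
}

fn main() {
    let mut sessions = vec![
        Session { needs_backup: true, backed_up_to: -1 },  // will be backed up
        Session { needs_backup: false, backed_up_to: -1 }, // treated as already backed up
    ];
    backup_loop(&mut sessions);
}
```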

@BillCarsonFr
Member Author

Some remarks on

Next idea: assign each backup version an order number, so each time we see a new backup version from the server, we set a current_backup_order value to the previous order + 1.

What happens in situations where new backup versions are created (without deleting the previous ones) while the client wasn't online to be aware of them, and then we start deleting them?

Something like

  • Create a version "1"
  • Create a version "2"
  • Create a version "3"
  • Create a version "4"

The client is aware of version "1" and gives it order 0.
Then it misses versions "2" and "3".
But it gets version "4", which will then have order 1.

Now version "4" is deleted, making the old "3" current again.
The client learns about it and will give it order 2.

I guess it still works, but could there be some edge cases because of that?

@richvdh
Member

richvdh commented Jan 24, 2024

Now version "4" is deleted, making the old "3" current again.
The client learns about it and will give it order 2.

That's fine. The most recent backup is then version "3", and we will try to upload all of our keys to it.

Of course, version "3" might already have all of our keys, so that would be somewhat wasted effort. But it'll sort itself out in the end.

@andybalaam
Contributor

Not blocking the Element R release, because if we do #26930 we will be on par with legacy crypto. (But we should still do this later.)

@BillCarsonFr
Member Author

FTR, with the new DB format it took 3 minutes to reset all my keys (a bit fewer than 300k keys), instead of several tens of minutes previously.

@andybalaam
Contributor

To speed up key backup resets:

  • Store current backup order number and ID in Indexed DB, probably in the core store, or in backup_keys. Default it to 1 and the current backup ID if it is empty.
  • When we receive a new backup version (or when we reset backup?), increment the current backup order number and change the current ID to match the new ID
  • In the backup loop, follow the algorithm from my above comment

To remove legacy code and data:

  • Wait until most users have probably migrated most of their keys
  • Change the logic from the above comment to remove the first if. This means we don't need the not_backed_up property any more. Any non-migrated keys will be backed up again, but this won't do harm except being slow.
  • Create a DB migration that removes the not_backed_up property and its related index.
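
A hedged sketch of the bookkeeping in the first list above (the struct and method names are invented; the real values would live in IndexedDB, in the core store or backup_keys):

```rust
// Hypothetical in-memory stand-in for the "current backup order number and ID"
// that the plan stores in IndexedDB.
struct BackupTracking {
    current_backup_order: u64,
    current_backup_id: Option<String>,
}

impl BackupTracking {
    // Default: order 1, paired with whatever backup ID we currently know.
    fn new(current_backup_id: Option<String>) -> Self {
        Self { current_backup_order: 1, current_backup_id }
    }

    // When we receive a new backup version (or reset the backup), bump the
    // order and remember the new ID, so sessions whose backed_up_to is smaller
    // become eligible for upload again.
    fn on_new_backup_version(&mut self, new_id: String) {
        if self.current_backup_id.as_deref() != Some(new_id.as_str()) {
            self.current_backup_order += 1;
            self.current_backup_id = Some(new_id);
        }
    }
}

fn main() {
    let mut tracking = BackupTracking::new(Some("v1".to_owned()));
    tracking.on_new_backup_version("v2".to_owned());
    assert_eq!(tracking.current_backup_order, 2);
}
```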

@andybalaam
Contributor

andybalaam commented Feb 23, 2024

I just had a conversation with @poljar and @richvdh about how we can implement this. Here are my notes from our conversation:

  • Modify inbound_group_sessions_for_backup and mark_inbound_group_sessions_as_backed_up so they take in the backup version. We checked where they are called, and we do have access to the version in both places. It's stored somewhere deep inside PendingBackupRequest. We discussed and agreed that this is a sensible API change for all crypto storage backends, because in both cases we are talking about the "current backup", which was implicit, and this allows us to be more explicit (maybe even avoiding races) by saying exactly which backup we mean. https://github.com/element-hq/crypto-internal/issues/206
  • Create a new object store backup_versions in Indexed DB that will hold a mapping of backup version -> backup order number. We decided we should keep all of them, not just the latest, so that if some old backup is still running we won't somehow understand that as a new backup and bump our version again https://github.com/element-hq/crypto-internal/issues/207
  • (Added by Andy later) Validate the algorithm using the MemoryStore https://github.com/element-hq/crypto-internal/issues/208
  • Implement the algorithm https://github.com/element-hq/crypto-internal/issues/209:
    • In mark_inbound_group_sessions_as_backed_up:
      • check the version we have been given against the versions we know in backup_versions.
      • If it doesn't exist in there, add it, giving it an order number greater than all the other entries.
      • Look up the order number for this version, and write that into the backed_up_to field for each entry we are saving.
    • In inbound_group_sessions_for_backup:
      • look up the version we have been given in backup_versions.
      • If it exists, use its order as current_backup_order in the algorithm from the above comment.
      • If it doesn't exist then this is a new backup: just return a batch of any entries, since none of them are backed up yet in this new backup. (The entry will get added when the batch succeeds, and mark_inbound_group_sessions_as_backed_up is called.)

Thinking through the case when this has never happened before:

  • First we run a schema upgrade to add the backup_versions object store
  • Then we call inbound_group_sessions_for_backup and find no entry for the current backup. In this special case where backup_versions is empty, we should default to 1 as its order number. Then the algorithm from the comment above will still work.
  • Then we call mark_inbound_group_sessions_as_backed_up. We will add an entry to backup_versions for this backup version. We should choose order number 1 because there are no other entries in the store. From now on things should work as planned because order number 1 means "the backup was current when we started this process" when we don't know better.
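
As a sanity check of these notes, here is an illustrative, MemoryStore-style sketch of the backup_versions mapping and its two lookups. All types and method bodies here are invented; only inbound_group_sessions_for_backup and mark_inbound_group_sessions_as_backed_up are real method names in matrix-sdk-crypto.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the proposed backup_versions object store:
// backup version string -> order number.
#[derive(Default)]
struct BackupVersions {
    orders: HashMap<String, u64>,
}

impl BackupVersions {
    // mark_inbound_group_sessions_as_backed_up path: make sure the version we
    // were given has an order, allocating one bigger than all existing ones.
    fn order_for_marking(&mut self, version: &str) -> u64 {
        if let Some(order) = self.orders.get(version) {
            return *order;
        }
        let next = self.orders.values().max().copied().unwrap_or(0) + 1;
        self.orders.insert(version.to_owned(), next);
        next
    }

    // inbound_group_sessions_for_backup path: if we have never recorded any
    // version, default to order 1 so the algorithm above still works; if this
    // particular version is unknown, it is a brand new backup and every
    // session is a candidate (signalled here with None).
    fn order_for_querying(&self, version: &str) -> Option<u64> {
        if self.orders.is_empty() {
            return Some(1);
        }
        self.orders.get(version).copied()
    }
}

fn main() {
    let mut versions = BackupVersions::default();
    assert_eq!(versions.order_for_querying("v1"), Some(1)); // empty store: default to 1
    assert_eq!(versions.order_for_marking("v1"), 1); // first recorded version gets order 1
    assert_eq!(versions.order_for_marking("v2"), 2);
}
```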

@andybalaam
Contributor

Had a further conversation with @richvdh today where we realised we can simplify this a bit, but still make use of the backed_up_to field we added in the last schema upgrade.

No need for backup_versions mapping

We realised we didn't need the backup_versions mapping. It had 2 purposes:

  1. To make version "1" special: backed_up_to==-1 && !needs_backup => assume backed_up_to==1. Here 1 has the special meaning of "the first backup version we encounter after we start using the new schema". I describe below how we track the "pre-existing backup".
  2. To allow us to compare versions using > or <, to prevent us "downgrading" a session to be backed up by an older version. During our conversation today we decided this can never happen, as I describe below.

So instead of storing a backup order in backed_up_to we can directly store the backup_version in there. The name is already OK, and while it's slightly weird that backed_up_to is already populated with -1, it won't actually do any harm to store strings in there.

No need to compare orders

Previously, we were concerned about receiving an acknowledgement of a backed-up session for an old backup version, when actually a new backup was already in progress.

Today, we reviewed the code in backup() and mark_request_as_sent() in crates/matrix-sdk-crypto/src/backups/mod.rs and convinced ourselves that because the BackupMachine keeps track of the current backup request, it will always reject any outdated response because it doesn't match the request_id we are expecting. So old backup acks will be ignored.

Thus, when we receive a backup ack we can safely assume it is for the latest version, and store that version next to the session so we know it has been backed up. And when looking for sessions to back up, we can just look for rows whose backed_up_to != the supplied version.

We also considered the case where the user deletes a new backup and we start re-using an old one. In this case, the order comparison we suggested was working against us, because we would refuse to back up a session to an "older" backup than the current one; but in the case where the new one has been deleted, we actually want to back up to the older one. So again, the correct comparison is supplied_backup_version != backed_up_to.
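
A tiny illustration of that comparison, using hypothetical names (not real code):

```rust
// Hypothetical helper: with versions stored directly in backed_up_to, a
// session needs uploading whenever it was last backed up to a *different*
// version than the one we are uploading to now, including the case where
// that "different" version is an older one that became current again.
fn needs_upload(backed_up_to: &str, current_backup_version: &str) -> bool {
    backed_up_to != current_backup_version
}

fn main() {
    // The newer "v2" was deleted and "v1" is current again: a session last
    // backed up to "v2" must still be uploaded to "v1".
    assert!(needs_upload("v2", "v1"));
    assert!(!needs_upload("v1", "v1"));
}
```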

Pre-existing backup

If we are looking for sessions to back up and we encounter one with backed_up_to==-1 this means it existed before we started writing to backed_up_to. (I.e. before the code I am working on now.)

In our previous idea we decided to treat backed_up_to==-1 && !needs_backup as if backed_up_to==1, and we knew that the first time we encountered a new backup_version we would set its order to 1.

Instead, we can simply track whether we are currently still on the first backup, or not. So, before reset_backup_state() is called, we assume that backed_up_to==-1 && !needs_backup means the session was backed up in the current backup, and otherwise it was not. So we only need special logic to handle backed_up_to==-1 when we have never reset the backup.

Note: when another client resets, deletes or adds a backup, we will call reset_backup_state(), which is exactly what we need here.

I will describe the code we will write in the next comment.

@andybalaam
Contributor

andybalaam commented Apr 9, 2024

Code

DB migration

Create a store called backup_reset with no index.

inbound_group_sessions_for_backup

Check inside the backup_reset store for a backup_reset property. If it does not exist or its value is not true, set a local variable backup_reset to false. Otherwise set it to true.

Search for sessions whose backed_up_to is != backup_version. On IndexedDB this will require us to iterate everything < backup_version and then everything > backup_version, but this is doable.

If !backup_reset then filter the list, excluding entries where backed_up_to==-1 && !needs_backup.

Process and limit the list as before, and return it.

mark_inbound_group_sessions_as_backed_up

To mark a session as backed up, remove its needs_backup flag as before, and also store backed_up_to=backup_version.

Add a comment to this code explaining that if we receive a call to this function with an old version then it can't be a race because of the request_id mechanism in BackupMachine, so it must be that the newer backup was deleted, so writing the supplied version is always correct.

reset_backup_state

Insert an entry into the backup_reset store with field backup_reset and set it to true. Optional: only do this if it is not already set to true.
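
Pulling the plan above together, an in-memory Rust sketch of the described behaviour. This is illustrative only: the real implementation targets the IndexedDB-backed crypto store, and everything here apart from the method names quoted above is invented.

```rust
use std::collections::HashMap;

// Hypothetical in-memory model of the plan above.
struct Session {
    needs_backup: bool,
    backed_up_to: String, // "-1" for records written before this scheme
}

#[derive(Default)]
struct Store {
    sessions: HashMap<String, Session>,
    backup_reset: bool, // the single flag kept in the backup_reset store
}

impl Store {
    // inbound_group_sessions_for_backup: sessions whose backed_up_to differs
    // from the supplied version, minus pre-existing already-backed-up records
    // if the backup has never been reset since the migration.
    fn inbound_group_sessions_for_backup(&self, backup_version: &str) -> Vec<&String> {
        self.sessions
            .iter()
            .filter(|(_, s)| s.backed_up_to != backup_version)
            .filter(|(_, s)| self.backup_reset || !(s.backed_up_to == "-1" && !s.needs_backup))
            .map(|(id, _)| id)
            .collect()
    }

    // mark_inbound_group_sessions_as_backed_up: clear needs_backup and record
    // the supplied version. If that version looks "old", the newer backup must
    // have been deleted (the request_id check in BackupMachine rules out a
    // race), so writing the supplied version is always correct.
    fn mark_inbound_group_sessions_as_backed_up(&mut self, backup_version: &str, ids: &[String]) {
        for id in ids {
            if let Some(s) = self.sessions.get_mut(id) {
                s.needs_backup = false;
                s.backed_up_to = backup_version.to_owned();
            }
        }
    }

    // reset_backup_state: remember that a reset happened, so the special
    // handling of backed_up_to == "-1" is no longer applied.
    fn reset_backup_state(&mut self) {
        self.backup_reset = true;
    }
}

fn main() {
    let mut store = Store::default();
    store.sessions.insert(
        "session-a".to_owned(),
        Session { needs_backup: true, backed_up_to: "-1".to_owned() },
    );
    let to_upload: Vec<String> =
        store.inbound_group_sessions_for_backup("v1").into_iter().cloned().collect();
    store.mark_inbound_group_sessions_as_backed_up("v1", &to_upload);
    assert!(store.inbound_group_sessions_for_backup("v1").is_empty());
}
```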
