
EDR takes minutes to send messages after importing offline key backup. #26783

Closed
ara4n opened this issue Dec 16, 2023 · 6 comments
Assignees
BillCarsonFr

Labels
A-Element-R: Issues affecting the port of Element's crypto layer to Rust
O-Occasional: Affects or can be seen by some users regularly or most users rarely
S-Major: Severely degrades major functionality or product features, with no satisfactory workaround
T-Defect
Z-Labs

Comments

@ara4n
Member

ara4n commented Dec 16, 2023

Steps to reproduce

  1. I successfully imported my offline key backup (420K keys) into EDR, yay - which probably means element-desktop#1394 ("Importing a large (300MB) megolm backup into EDR fails") is fixed.
  2. Message sending now takes ~5 minutes in E2EE rooms. Rageshakes can also take ~5 minutes to send. There isn't anything consuming CPU.
  3. Disabling key backup (by running await mxMatrixClientPeg.get().getCrypto().backupManager.disableKeyBackup() at the JS console) fixes it.

There's a chance that the slow send isn't related to my having imported the offline key backup, but to having connected to the online backup. However, EWR is fine after connecting to the online backup.

As a final symptom: I'm seeing a lot of E2EE traffic coming in from online backup (with grey shields) rather than direct.

This may be related to #26375

Outcome

What did you expect?

Fast send

What happened instead?

Agonisingly slow send.

Operating system

No response

Application version

Element Nightly version: 2023121601 Crypto version: Rust SDK 0.6.0 (438880a), Vodozemac 0.5.0

How did you install the app?

No response

Homeserver

No response

Will you send logs?

Yes

@ara4n ara4n added the T-Defect label Dec 16, 2023
@ara4n ara4n added the A-Element-R Issues affecting the port of Element's crypto layer to Rust label Dec 16, 2023
@t3chguy t3chguy transferred this issue from element-hq/element-web Dec 16, 2023
@ara4n ara4n transferred this issue from element-hq/element-desktop Dec 18, 2023
@ara4n
Member Author

ara4n commented Dec 18, 2023

might be related to #26488

@BillCarsonFr BillCarsonFr self-assigned this Dec 18, 2023
@dbkr dbkr added S-Major Severely degrades major functionality or product features, with no satisfactory workaround O-Occasional Affects or can be seen by some users regularly or most users rarely labels Dec 19, 2023
@BillCarsonFr
Member

Analysis of the problem:

What happens is that when group sessions are marked as backed_up, the SDK starts a read-write transaction on inbound_group_sessions2.

If a new message is now sent from a room that needs a new session, the SDK creates an outbound session plus the corresponding inbound session that it wants to store.

=> This calls save_changes, which also wants to start a read-write transaction on inbound_group_sessions2.

So the sending of the message is blocked until the mark-as-backed-up transaction is complete. But save_changes also takes the save_changes_lock, thus blocking any other save_changes call that could occur (locking the syncs, for example).

Once the save_changes_lock is taken, it also blocks sends in other rooms, even when inbound_group_sessions2 isn't needed; it locks the updates of the olm sessions used to encrypt, for example.

The fact that mark_inbound_group_sessions_as_backed_up is a very slow transaction easily creates this lock.
The question is why it is so slow.
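
To see the serialization at work, here is a minimal sketch using plain IndexedDB from the JS console. This is not the actual rust-sdk code path; the long cursor walk in tx1 merely stands in for mark_inbound_group_sessions_as_backed_up:

window.indexedDB.open("matrix-js-sdk::matrix-sdk-crypto").onsuccess = (event) => {
    const db = event.target.result;

    // tx1: a long-running read-write transaction, standing in for the
    // mark-as-backed-up pass. Iterating every record keeps it alive.
    const tx1 = db.transaction(["inbound_group_sessions2"], "readwrite");
    tx1.objectStore("inbound_group_sessions2").openCursor().onsuccess = (e) => {
        const cursor = e.target.result;
        if (cursor) cursor.continue();
    };

    // tx2: the read-write transaction a message send would need. Because its
    // scope overlaps tx1's, IndexedDB queues it until tx1 completes.
    const start = Date.now();
    const tx2 = db.transaction(["inbound_group_sessions2"], "readwrite");
    tx2.objectStore("inbound_group_sessions2").get("x"); // any trivial request
    tx2.oncomplete = () => {
        console.info("## tx2 had to wait:", Date.now() - start, "ms");
        db.close();
    };
};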

Can be reproduced with any call that opens a read-write transaction on inbound_group_sessions2.

For example, type this in the console:
mxMatrixClientPeg.get().getCrypto().backupManager.disableKeyBackup()

This will also be slow and locks the inbound group session store.
Now in a room:

  • Type /discardsession
  • Send a message

=> The send will be super slow (depending on your number of keys)
=> Message sending in other rooms will be blocked too

@BillCarsonFr
Member

update:

The fact that mark_inbound_group_sessions_as_backed_up is a very slow transaction easily creates this lock.
The question is why it is so slow.

After a closer look, it's not mark_inbound_group_sessions_as_backed_up that is very slow; it's the enclosing call BackupMachine#mark_request_as_sent, because for tracing purposes we count the backed-up keys. On a test setup, the total time spent in mark_request_as_sent is ~20s, while the actual mark_inbound_group_sessions_as_backed_up takes 300ms.
The count query is read-only so it shouldn't block; I have yet to figure out what could hold the lock for so long and create this scenario.

backupManager.disableKeyBackup() will still trigger it, as it's a very long write transaction.

@richvdh
Member

richvdh commented Jan 12, 2024

We don't consider this a release blocker for Element R, because once your imported keys are (eventually) backed up, performance recovers.

@BillCarsonFr
Member

Some more findings:

When the inbound_group_sessions2 database gets bigger, some operations start to be very slow. It's particularly the case for the queries that count the number of keys, and while such a query is being processed, message sending is blocked.

Can be reproduced by typing this in the console:

// Open the crypto store and time a count() over the inbound group sessions.
window.indexedDB.open("matrix-js-sdk::matrix-sdk-crypto", 8).onsuccess = (event) => {
    const db = event.target.result;
    const start = Date.now();
    const tx = db.transaction(["inbound_group_sessions2"], "readonly");
    const objectStore = tx.objectStore("inbound_group_sessions2");
    const req = objectStore.count();
    req.onsuccess = (event) => {
        console.log("## Count", event.target.result);
    };
    req.onerror = () => { console.info("## failed to count"); };
    tx.oncomplete = () => {
        console.info("## Perf: count took:", Date.now() - start, "ms");
        db.close();
    };
};

Just after starting this, you can try to send a message. It will be blocked until the count request has completed.

The count performances are particularly bad on Firefox from my testing:

Firefox:

23:07:18,566 ## Count 190003 bundle.js:262316:19
23:07:18,579 ## Perf: count took: 39182 ms

23:09:07,297 ## Count 190003 bundle.js:262316:19
23:09:07,322 ## Perf: count took: 38755 ms

The same database on Chrome:

## Count 190003
rageshake.ts:77 ## Perf: count took: 1842 ms

## Count 190003
rageshake.ts:77 ## Perf: count took: 1794 ms

So in my case, while the backup loop is running, a call is made after each chunk to count the remaining keys for signalling, and if you send a message it will be delayed by around ~30s on Firefox.

On legacy crypto, the inbound_group_sessions are not saved the same way in the database, and count performance is a lot better with the same data set (1s instead of 38s).
On legacy, inbound group sessions are stored as a JSON InboundGroupSessionData (~800 chars).
On rust IndexedDB, it's a { pickled_session: Array(~3600) } (encrypted payload).

I did some experiments on the IndexedDB keys used by rust to see whether they impact count performance, like reducing key clustering ((sessionId|roomId) instead of (roomId|sessionId), or just sessionId), but didn't see any noticeable improvement.

So maybe it's related to the size of the data?

Anyhow, for now we might want to stop counting the rows during the key loop as a temporary improvement; see the sketch below.
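
As a rough illustration only (the helper names here are hypothetical, not the actual js-sdk API), that temporary improvement could mean counting once before the loop and tracking the remainder arithmetically, instead of issuing a fresh count() against inbound_group_sessions2 after every chunk:

// Hypothetical sketch: count once, then decrement by chunk size, so the
// backup loop never re-runs the slow count() query while keys are uploading.
// countSessionsToBackUp, uploadNextChunk and reportProgress are assumed
// helpers for illustration, not real matrix-js-sdk functions.
async function backupLoop() {
    let remaining = await countSessionsToBackUp(); // one slow count, up front
    while (remaining > 0) {
        const uploaded = await uploadNextChunk();  // number of keys just sent
        remaining -= uploaded;                     // no per-chunk count() query
        reportProgress(remaining);                 // signalling uses the tracked value
    }
}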

@ara4n
Member Author

ara4n commented Jan 23, 2024

Confirmed fixed by matrix-org/matrix-js-sdk#4015 \o/

@ara4n ara4n closed this as completed Jan 23, 2024