This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Audit device list updates #4115

Closed
erikjohnston opened this issue Oct 30, 2018 · 3 comments
Labels
A-Performance Performance, both client-facing and admin-facing

Comments

@erikjohnston
Member

We're spending quite a bit of CPU and DB time handling device list updates (both local and remote). We should look at whether things are working as expected or if there are some optimisations we can make.

  • Ensure that we're only sending out device list updates over federation when necessary (e.g. jki.re seems to be sending device list updates even when I've not created devices)
  • _prune_old_outbound_device_pokes takes a lot of CPU and DB time
  • Consider not ignoring devices that don't have encryption keys
  • There are a few device lists tables that are huge, do we need to be inserting so much data?
@spantaleev
Contributor

I'm observing slow logins due to Synapse having to announce device updates over federation.

matrix-corporal, having to manage settings for lots of users on a homeserver, frequently logs in with each user account, does some work and soon after logs out.

It sends login requests like this:

{"type": "m.login.password", "user": "..", "password": "..", "device_id": "Matrix-Corporal-Reconciler"}

(Authentication always succeeds due to the use of the matrix-synapse-shared-secret-auth password provider. This, however, is irrelevant).

Because this is a new device_id, Synapse creates it and announces it (yield self.notify_device_update(user_id, [device_id])).

For users in lots of big rooms, this triggers some log entry like this:

2019-01-26 10:14:52,299 - synapse.handlers.device - 282 - INFO - POST-90393 - Sending device list update notif to: { ... 500+ hostnames here ... }

I'm guessing the actual federation transmission happens later (somewhat slowly), but still, one or more of these steps seems to take time:

  • discovering the rooms the user is in
  • building a distinct list of hostnames for these rooms
  • putting many federation requests in the queue

As a result, a single /login (for such a user belonging to large rooms) would take something like 2-5 seconds.
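To make the first two steps in the list above concrete, here is a rough sketch in plain Python. It is not Synapse's actual code: `hosts_to_notify`, `user_rooms` and `room_members` are hypothetical names, and the in-memory dict stands in for what would really be database lookups. It only illustrates why the work grows with the number and size of the rooms a user is in.

```python
from typing import Dict, Iterable, List, Set


def hosts_to_notify(
    user_rooms: Iterable[str],
    room_members: Dict[str, List[str]],
    local_host: str,
) -> Set[str]:
    """Collect the distinct remote hostnames that share a room with a user.

    Hypothetical helper: `room_members` maps a room ID to the Matrix user IDs
    joined to that room (an assumption made for this illustration).
    """
    hosts: Set[str] = set()
    for room_id in user_rooms:
        for member in room_members.get(room_id, []):
            # A Matrix user ID looks like "@localpart:server.name"; everything
            # after the first colon is the server name (it may include a port).
            host = member.split(":", 1)[1]
            if host != local_host:
                hosts.add(host)
    return hosts


# Example data (made up for illustration).
rooms = {
    "!big:example.org": ["@alice:example.org", "@bob:one.test", "@carol:two.test"],
    "!other:example.org": ["@alice:example.org", "@dave:one.test"],
}
hosts = hosts_to_notify(rooms.keys(), rooms, local_host="example.org")
```

Every member of every room has to be visited before the list is de-duplicated, so a user in a few very large rooms pays that cost on each device change.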

matrix-corporal would then proceed to do some work with the access token.


Once matrix-corporal is done (a second or two later), it would play nice and perform a /logout. Since the access token is associated with a device_id, the logout process will call DeviceHandler.delete_device, which would trigger yet another DeviceHandler.notify_device_update, possibly spawning 500+ more federation requests.

Logging out appears to take around 1 second for such user accounts.


It would be nice if:

  • Synapse would record logging-in devices and process them in an async manner later on (not as part of the login request)
  • even with enqueuing, Synapse would delay processing for a logging-in device for a while, rather than processing it immediately
  • if a device gets destroyed (/logout) soon enough, it would get removed from the update queue (undoing the actions above). Federation requests will not get scheduled at all.
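A minimal sketch of what the three wishes above could look like, assuming a simple in-memory queue. `DelayedDeviceUpdateQueue` and its methods are hypothetical names invented for this illustration and do not exist in Synapse; a real implementation would also need persistence and would have to fit into Synapse's own background processing.

```python
import time
from typing import Callable, Dict, List, Tuple

DeviceKey = Tuple[str, str]  # (user_id, device_id)


class DelayedDeviceUpdateQueue:
    """Hypothetical queue that delays announcing newly created devices."""

    def __init__(
        self,
        delay_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._delay = delay_seconds
        self._clock = clock
        # Maps each pending device to the time at which it becomes due.
        self._pending: Dict[DeviceKey, float] = {}

    def device_created(self, user_id: str, device_id: str) -> None:
        """Record a login without doing any federation work yet."""
        self._pending[(user_id, device_id)] = self._clock() + self._delay

    def device_deleted(self, user_id: str, device_id: str) -> bool:
        """Cancel a pending announcement.

        Returns True if the creation had not been announced yet, in which case
        nothing needs to be sent at all. Returns False if the device was
        already announced, so the deletion must be announced as usual.
        """
        return self._pending.pop((user_id, device_id), None) is not None

    def pop_due(self) -> List[DeviceKey]:
        """Return and remove the devices whose delay has elapsed."""
        now = self._clock()
        due = [key for key, deadline in self._pending.items() if deadline <= now]
        for key in due:
            del self._pending[key]
        return due


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
queue = DelayedDeviceUpdateQueue(delay_seconds=30.0, clock=lambda: now[0])
queue.device_created("@a:example.org", "Matrix-Corporal-Reconciler")
queue.device_created("@b:example.org", "PHONE")

now[0] = 5.0  # @a logs out a few seconds later
cancelled = queue.device_deleted("@a:example.org", "Matrix-Corporal-Reconciler")

now[0] = 31.0  # a background task would call pop_due() periodically
due = queue.pop_due()
```

With this shape, a short-lived login/logout pair never produces a federation request, while a device that survives the delay is announced by whatever periodically drains `pop_due()`.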

I guess the above is a bit of a special case. I'm not sure how many other automation projects need to log in and create such short-lived devices.

Still, it would be great for everyone if all the scheduling + federation work happened asynchronously (even without the ability to cancel-it-out).


Another thing that would help my use case is if /logout did not destroy the device. Leaving an orphan device without an associated access token would prevent device updates from being scheduled, and would spare a subsequent /login the same cost. This feels somewhat dirty, though. I wonder if Soft Logout (#4280) aims for something like that.

Alternatively, supporting device-less /login would also work. Looking at the code though, it seems like it's discouraging device_id = None, so I guess this won't be happening.

@richvdh richvdh removed the p1 label Dec 3, 2019
@richvdh richvdh changed the title Audit device lists Audit device list updates Nov 21, 2020
@richvdh
Member

richvdh commented Feb 21, 2022

@erikjohnston how much of this do you still think is relevant?

@erikjohnston
Member Author

I think this can be closed now. We've done a bunch of work touching device list updates in the past couple of years, and they don't seem to have been causing too many issues recently.

If we do find there are still problems, then let's open new issues with more details.
