
MSC2638: Ability for clients to request homeservers to resync device lists #2638

Closed (5 commits, not merged)

Conversation

@babolivier (Contributor) commented Jun 15, 2020

@babolivier added the proposal-in-review, proposal (A matrix spec change proposal), and kind:feature (MSC for not-core and not-maintenance stuff) labels on Jun 15, 2020
@turt2live added the kind:maintenance (MSC which clarifies/updates existing spec) label and removed the kind:feature (MSC for not-core and not-maintenance stuff) label on Jun 15, 2020
@turt2live changed the title from "MSCXXXX: Ability for clients to request homeservers to resync device lists" to "MSC2638: Ability for clients to request homeservers to resync device lists" on Jun 15, 2020
@babolivier force-pushed the babolivier/device_lists_clients branch from bc03f4d to 505d24a on June 15, 2020 at 19:42
@turt2live self-requested a review on June 15, 2020 at 19:42
@turt2live added the kind:core (MSC which is critical to the protocol's success) and kind:feature (MSC for not-core and not-maintenance stuff) labels, and removed the kind:maintenance (MSC which clarifies/updates existing spec) and kind:core (MSC which is critical to the protocol's success) labels, on Jun 15, 2020
@turt2live (Member) left a comment

This looks like it's probably the correct solution for the time being, though the lack of flood protection specified by the MSC is a bit concerning.

Comment on lines +12 to +13
without the proper retry mechanisms. It can also happen if a homeserver was down
for enough time to miss out on device list updates.
Member

It can also happen if a homeserver was down for enough time to miss out on device list updates.

Surely this is an implementation bug if it doesn't send them once the server returns? The spec doesn't say to give up.

@babolivier (Contributor Author), Jul 1, 2020

Implementations are free to decide when they retry. Currently Synapse isn't too clever about that (related: matrix-org/synapse#2526): as far as I know it retries with a backoff, meaning that if a server has been down for long enough it could have to wait a long time (I think we go up to 24h, but I'm not 100% certain) before getting the transactions it's missed. Also, I don't think the "retry as soon as the server has returned" behaviour is mandated in the spec, so it doesn't entirely sound like an implementation bug to me.

The spec doesn't say to give up.

Well, surely implementations will give up at some point. It sounds like a gigantic waste of resources to, e.g., keep retrying hundreds of servers that have been down for weeks. And if the spec doesn't specifically mandate never giving up (which, again, would feel a bit meh to me), then imho it just means that the homeserver is free to decide whether to give up and when to do so.
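To make the backoff point above concrete, here is a minimal sketch (the function names and the 24h cap are illustrative, not Synapse's actual code) of how capped exponential retries can leave a destination waiting hours for missed transactions:

```python
import time

MAX_BACKOFF_SECS = 24 * 60 * 60  # illustrative cap, per the "up to 24h" guess above

def send_with_backoff(send_transaction, max_attempts=20):
    """Retry sending a federation transaction with exponential backoff.

    `send_transaction` is a hypothetical callable returning True on success.
    With a cap this high, a destination that has been down for a while may
    not see another attempt for many hours, so device list updates queued
    for it stay undelivered in the meantime.
    """
    delay = 60  # start at one minute
    for _ in range(max_attempts):
        if send_transaction():
            return True
        time.sleep(delay)
        delay = min(delay * 2, MAX_BACKOFF_SECS)
    return False  # giving up means the pending updates are never delivered
```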

Member

These all pass through the transaction endpoints in the federation spec, which does recommend backing off but doesn't say that the sending server should ditch the request and start at a more convenient place.

Contributor Author

I'm still not seeing how that makes the issue this MSC tries to solve disappear, TBH. Neither backing off for hours/days nor considering the server down and stopping sending it transactions entirely (if it's been down for long enough) sounds unreasonable to me, and neither looks like an implementation bug.

Member

I agree with @turt2live that it feels wrong that we have to bodge around device lists being unreliable by exposing an endpoint for clients to manually refresh them. It'd be better for device lists to be reliable in the first place, which means fixing matrix-org/synapse#2526 - which feels like a much better use of time. Are there other bugs that would cause device lists to get out of sync that we know of beyond that one? (cc @kegsay)

Contributor Author

fixing matrix-org/synapse#2526 - which feels like a much better use of time

Is that (retrying pending transactions as soon as a server comes back to life) something that's enforced by the spec? If so, I agree. Otherwise it feels like we're fixing the issue only for Synapse and ignoring people using an implementation that handles this in another way (either because it's less advanced or by design), which feels wrong to me.

it feels wrong that we have to bodge around device lists being unreliable by exposing an endpoint for clients to manually refresh them

I tend to agree; my concern is that it's a feature that's easy to get wrong (as shown by the multiple Synapse bugs related to it), and any mishap ruins a whole part of the E2EE UX for a user. My opinion was that it could be nice to have a fallback just in case an implementation, whether it's Synapse or another one, screws up. Implementing inbound device list updates over federation requires implementing some things (e.g. retries) the spec doesn't mention, and screwing it up will likely leave the homeserver with a bunch of stale device lists it can't trivially fix even after the code gets fixed.

The proposal also addresses another issue, which is how to repair a server that has screwed up, on top of just fixing the bug in the implementation. For example, with the multiple Synapse bugs, it's fair to say we're likely to have quite a few stale device lists, with no way to fix them other than having the remote user rename a device (which isn't intuitive at all). Homeservers can't really figure out which device lists are stale without resyncing the device list of every single remote user they know of, whereas it's (imho) much more trivial for clients to figure out when a device list is likely to be out of date.
FWIW, the main cause of user grumpiness following the issues we had in Synapse was that users who aren't operating their own homeserver basically can't fix the stale device lists they see for remote users, making the E2EE experience quite disappointing. It seems likely to me that this will happen with other implementations in the future (or even in Synapse again if we introduce a regression in that code by mistake).

I guess my point is that implementing device list updates is quite prone to bugs and design flaws, and either of those can be very annoying to both server admins and users. If this is an issue, then we should probably explore alternative designs for this feature.

Are there other bugs that would cause device lists to get out of sync that we know of beyond that bug?

There is one I know of (though apparently I haven't taken the time to open an issue for it yet): Synapse expects updates to arrive sorted by their stream_id, which the spec doesn't enforce. So if Synapse receives updates in the wrong order and the other homeserver fails to respond to the resync request, the device list will become stale until a later resync attempt succeeds (which can be hampered by the current backoff and by matrix-org/synapse#2526).
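To make that ordering assumption concrete, here is a rough sketch (all names hypothetical, not Synapse's actual code) of update handling that only tolerates in-order stream_ids and otherwise falls back to a full resync:

```python
from dataclasses import dataclass

@dataclass
class CachedDeviceList:
    """Hypothetical local cache entry for one remote user's device list."""
    max_stream_id: int = 0
    pending_resync: bool = False

def handle_device_list_update(update: dict, cached: CachedDeviceList) -> str:
    """Decide what to do with an incoming m.device_list_update EDU.

    Sketch only: if updates arrive out of order (which the spec allows),
    an implementation that assumes ordering can end up discarding them and
    relying on a resync that may itself fail, leaving the list stale.
    """
    prev_ids = update.get("prev_id", [])
    if any(pid > cached.max_stream_id for pid in prev_ids):
        # We are missing earlier updates: the only safe move is to resync
        # the whole list over federation.
        cached.pending_resync = True
        return "resync"
    if update["stream_id"] <= cached.max_stream_id:
        return "ignore"  # dropped as assumed-stale; the spec doesn't mandate ordering
    cached.max_stream_id = update["stream_id"]
    return "apply"
```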

Member

I'm unaware of any other issues which can cause them to get out of sync. However, I would push for an explicit resync mechanism over the proposed "just make /send reliable".

The core issue here is one of timing - much like matrix-org/synapse#2526 - but there are also issues around the unbounded growth of messages to send to servers. Unlike PDUs, servers don't have to store all these device list updates in their DB unless they are forced to, so making them free to discard them is desirable.

With regards to timing, the options are:

  • The sending server immediately informs other servers of device list updates and retries with backoff if they are down. This causes flooding issues (matrix-org/synapse#7721, "Performance of device list updates could be improved (aka /login is unusably slow)"), and the retry mechanism has been a source of bugs and subtleties around the unbounded growth of messages, e.g. matrix-org/synapse#7173 ("Synapse relies on unspecced behaviour for the remote server to resync device lists after prune_outbound_device_list_pokes"). I regard this option as eager sending.
  • The receiving clients request keys as and when they determine that they need the device keys. This doesn't cause flooding issues when someone updates their devices, as people will gradually request keys. However, sometimes clients will not know that they are missing keys (e.g. signing in on a new device), so it would need to be augmented with some sort of ping/poke to inform clients that there are new devices. We already have this mechanism in the CS API (device_lists.changed in /sync) but no such mechanism exists in federation. I regard this option as lazy sending, as the keys are only fetched when requested by a client, even though they've been eagerly notified that there is a change in the device list (though we could be smarter about when we notify servers about the new device, as we know the metadata on events). This would also have averted matrix-org/synapse#7418 ("Cross-signing signatures not being always federated correctly"), because instead of being told "Alice has a new device and here is the key" you'd be told "Alice has a new device" and then have to fetch the list (which would pull in any missing keys).

As for the proposed endpoint, I don't think it's necessary. We already have https://matrix.org/docs/spec/client_server/r0.6.1#post-matrix-client-r0-keys-query so just use that? We may need to add a refresh flag to stop servers from serving up local stale copies, though.
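For illustration, a sketch of what reusing the existing `/keys/query` endpoint could look like from a client; the request shape follows the current CS API, while the `refresh` field is the hypothetical flag suggested above and is not part of the spec:

```python
import requests

def query_device_keys(homeserver: str, access_token: str, user_id: str) -> dict:
    """Query a user's device keys, asking the server not to serve a stale cache.

    Everything here follows POST /_matrix/client/r0/keys/query except the
    `refresh` field, which is the hypothetical flag discussed above.
    """
    resp = requests.post(
        f"{homeserver}/_matrix/client/r0/keys/query",
        headers={"Authorization": f"Bearer {access_token}"},
        json={
            "device_keys": {user_id: []},  # empty list means "all devices"
            "timeout": 10000,
            "refresh": True,  # hypothetical: force a resync over federation
        },
    )
    resp.raise_for_status()
    return resp.json().get("device_keys", {}).get(user_id, {})
```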

Comment on lines 37 to 40
The proposal described here fixes this issue by adding a new endpoint to the
client-server API, allowing clients to request homeservers to resync device
lists. Clients could then either expose a button a user can use when necessary,
or call this endpoint on specific actions, or both.
@turt2live (Member), Jul 1, 2020

Counter-proposal (that I'm not entirely convinced by): implementations could be smarter about how they cache/detect failure. If a client hits the device list endpoint several times (where "several" is subjective), the server could realize that something may be wrong and resync. Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures, and request it automatically.
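As a rough sketch of that heuristic (the thresholds and hook names are hypothetical), a server could count how often it answers device-list queries for the same remote user from cache and schedule a resync once the count looks suspicious:

```python
import time
from collections import defaultdict

WINDOW_SECS = 300        # hypothetical: look at the last five minutes
SUSPICION_THRESHOLD = 3  # hypothetical value for "several" repeated queries

_recent_cache_hits: dict[str, list[float]] = defaultdict(list)

def record_cached_answer(user_id: str, schedule_resync) -> None:
    """Record that a device-list query was answered from cache, and trigger a
    federation resync (via the hypothetical `schedule_resync` hook) if the same
    remote user keeps being asked about within a short window."""
    now = time.time()
    hits = [t for t in _recent_cache_hits[user_id] if now - t < WINDOW_SECS]
    hits.append(now)
    _recent_cache_hits[user_id] = hits
    if len(hits) >= SUSPICION_THRESHOLD:
        _recent_cache_hits[user_id] = []  # reset so we don't resync repeatedly
        schedule_resync(user_id)
```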

Contributor Author

Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures

I don't mention anywhere that servers detect that kind of thing.

Contributor Author

implementations could be smarter about how they cache/detect failure

I don't think that would solve the issue. If we miss a bunch of updates because we went down, when we come back up there's no way to know what we've missed, and I wouldn't be surprised if there were other situations where it would be tricky for the homeserver to figure out when to resync. This proposal is meant as a way for clients to tell the server "you didn't detect this failure correctly, but it failed, please resync".

fwiw this counter proposal is already described in the "Alternatives" section of this MSC, with a summary of the explanation above.

Comment on lines +51 to +52
The homeserver responds to a request to this endpoint with a 202 Accepted
response code and an empty body (`{}`).
Member

This should also support rate limiting (strongly) and error codes/error ranges for typical failures, like the homeserver being invalid, dead, etc or the user being invalid, local, etc.

If this is meant to work for local users too, that should be specified.
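To show what that would mean for a client, here is a hedged sketch of calling the proposed endpoint (the path is a placeholder, since the MSC's exact path isn't quoted here) and handling rate limiting plus a generic failure:

```python
import time
import requests

def request_device_list_resync(homeserver: str, access_token: str, user_id: str) -> None:
    """Ask the homeserver to resync a remote user's device list.

    The path below is a placeholder for the endpoint proposed by this MSC.
    The handling illustrates the responses being asked for: 202 on success,
    429/M_LIMIT_EXCEEDED for rate limiting, and an errcode for other failures
    (invalid or local user, unreachable remote server, and so on).
    """
    resp = requests.post(
        f"{homeserver}/_matrix/client/unstable/device_lists/resync",  # placeholder path
        headers={"Authorization": f"Bearer {access_token}"},
        json={"user_id": user_id},
    )
    if resp.status_code == 202:
        return  # accepted: the updated list arrives later via /sync
    body = resp.json()
    if resp.status_code == 429 and body.get("errcode") == "M_LIMIT_EXCEEDED":
        time.sleep(body.get("retry_after_ms", 5000) / 1000)
        return request_device_list_resync(homeserver, access_token, user_id)
    raise RuntimeError(f"resync rejected: {body.get('errcode')}: {body.get('error')}")
```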

Member

@babolivier this is where the rate limiting needs to be mentioned.

Contributor Author

Right, thanks. That sounded like a security thing to me so I wrote it in the security section. I'll move it here.

Comment on lines +77 to +78
XXX: I'm unsure whether the spec should be mandating when a client uses this
endpoint or whether this is left as an implementation detail.
Member

implementation detail. The spec is a toolbag, and it's your responsibility to grab a hammer to put a nail in the wall.

Contributor Author

Thanks for the clarification!

@babolivier (Contributor Author)

the lack of flood protection

This is already covered in the "Security considerations" section (including rate limiting); is that not the right place for this kind of thing?

If the request to the aforementioned federation endpoint succeeds (i.e. the
destination responds with a 200 response code and a device list for the target
user), then the homeserver would include the updated device list in sync
responses in order to relay it to clients.
@neilalexander (Contributor), Jul 29, 2020

The fact that you have to hit the endpoint and then wait for something to come down /sync is still quite potentially racy and feels like a bad smell. What is the client supposed to do in the meantime? Will a client be able to retry the sync and not lose the device keys? Is the response to /devices undefined in the meantime - how do you know if it has updated?
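A sketch of the flow being questioned (resync path hypothetical, as above) makes the race visible: the client fires the request and then can only watch `/sync` for `device_lists.changed`, with nothing distinguishing "still resyncing" from "resync failed":

```python
import requests

def resync_then_wait(homeserver: str, token: str, user_id: str, since: str) -> str:
    """Request a device-list resync, then long-poll /sync until the target
    user shows up in device_lists.changed.

    Illustrative of the race discussed above: if the federation resync fails
    or is slow, nothing ever comes down /sync for this user, and the client
    cannot tell that apart from a resync that is simply still pending.
    """
    headers = {"Authorization": f"Bearer {token}"}
    requests.post(
        f"{homeserver}/_matrix/client/unstable/device_lists/resync",  # placeholder path
        headers=headers,
        json={"user_id": user_id},
    )
    while True:
        resp = requests.get(
            f"{homeserver}/_matrix/client/r0/sync",
            headers=headers,
            params={"since": since, "timeout": 30000},
        )
        body = resp.json()
        since = body["next_batch"]
        if user_id in body.get("device_lists", {}).get("changed", []):
            return since  # the server has relayed the resynced device list
```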

Co-authored-by: Kegsay <kegan@matrix.org>
@turt2live added the needs-implementation (This MSC does not have a qualifying implementation for the SCT to review; the MSC cannot enter FCP) label on Jun 8, 2021
@babolivier closed this on Mar 1, 2024
@turt2live added the abandoned (A proposal where the author/shepherd is not responsive) label on Mar 2, 2024