
MSC2638: Ability for clients to request homeservers to resync device lists #2638

Closed (5 commits, not merged)

Conversation

@babolivier (Contributor) commented Jun 15, 2020

@babolivier added the proposal-in-review, proposal (A matrix spec change proposal), and kind:feature (MSC for not-core and not-maintenance stuff) labels on Jun 15, 2020
@turt2live added the kind:maintenance (MSC which clarifies/updates existing spec) label and removed the kind:feature (MSC for not-core and not-maintenance stuff) label on Jun 15, 2020
@turt2live changed the title from "MSCXXXX: Ability for clients to request homeservers to resync device lists" to "MSC2638: Ability for clients to request homeservers to resync device lists" on Jun 15, 2020
@babolivier force-pushed the babolivier/device_lists_clients branch from bc03f4d to 505d24a on June 15, 2020 at 19:42
@turt2live self-requested a review on June 15, 2020 at 19:42
@turt2live added the kind:core (MSC which is critical to the protocol's success) and kind:feature (MSC for not-core and not-maintenance stuff) labels, and removed the kind:maintenance (MSC which clarifies/updates existing spec) and kind:core (MSC which is critical to the protocol's success) labels, on Jun 15, 2020
@turt2live (Member) left a comment

This looks like it's probably the correct solution for the time being, though the lack of flood protection specified by the MSC is a bit concerning.

Comment on lines +12 to +13
without the proper retry mechanisms. It can also happen if a homeserver was down
for enough time to miss out on device list updates.
Member

It can also happen if a homeserver was down for enough time to miss out on device list updates.

Surely this is an implementation bug if it doesn't send them once the server returns? The spec doesn't say to give up.

@babolivier (Contributor Author), Jul 1, 2020

Implementations are free to decide when they retry. Currently Synapse isn't too clever about that (related: matrix-org/synapse#2526): as far as I know it retries with a backoff, meaning that if a server has been down for long enough it could have to wait a long time (I think we go up to 24h, but I'm not 100% certain) before getting the transactions it's missed. Also, I don't think the "retry as soon as the server has returned" behaviour is mandated in the spec, so it doesn't entirely sound like an implementation bug to me.

The spec doesn't say to give up.

Well, surely implementations will give up at some point. It sounds like a gigantic waste of resources to, e.g., keep retrying hundreds of servers that have been down for weeks. And if the spec doesn't specifically mandate never giving up (which, again, would feel a bit meh to me), then imho it just means that the homeserver is free to decide whether to give up and when to do so.
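To make the backoff point above concrete, here is a minimal sketch (the function names and the 24h cap are illustrative, not Synapse's actual code) of how capped exponential retries can leave a destination waiting hours for missed transactions:

```python
import time

MAX_BACKOFF_SECS = 24 * 60 * 60  # illustrative cap, per the "up to 24h" guess above

def send_with_backoff(send_transaction, max_attempts=20):
    """Retry sending a federation transaction with exponential backoff.

    `send_transaction` is a hypothetical callable returning True on success.
    With a cap this high, a destination that has been down for a while may
    not see another attempt for many hours, so device list updates queued
    for it stay undelivered in the meantime.
    """
    delay = 60  # start at one minute
    for _ in range(max_attempts):
        if send_transaction():
            return True
        time.sleep(delay)
        delay = min(delay * 2, MAX_BACKOFF_SECS)
    return False  # giving up means the pending updates are never delivered
```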

Member

These all pass through the transaction endpoints in the federation spec, which does recommend backing off but doesn't say that the sending server should ditch the request and start at a more convenient place.

Contributor Author

I'm still not seeing how that makes the issue this MSC tries to solve disappear, TBH. Neither backing off for hours/days nor considering the server down and stopping sending it transactions entirely (if it's been down for long enough) sounds unreasonable to me, and neither looks like an implementation bug.

Member

I agree with @turt2live that it feels wrong that we have to bodge around device lists being unreliable by exposing an endpoint for clients to manually refresh them. It'd be better for device lists to be reliable in the first place, which means fixing matrix-org/synapse#2526 - which feels like a much better use of time. Are there other bugs that would cause device lists to get out of sync that we know of beyond that one? (cc @kegsay)

Contributor Author

fixing matrix-org/synapse#2526 - which feels like a much better use of time

Is that (retrying pending transactions as soon as a server comes back to life) something that's enforced by the spec? If so, I agree. Otherwise it feels like we're fixing the issue only for Synapse and ignoring people using an implementation that handles this in another way (either because it's less advanced or by design), which feels wrong to me.

it feels wrong that we have to bodge around device lists being unreliable by exposing an endpoint for clients to manually refresh them

I tend to agree; my concern is that it's a feature that's easy to get wrong (as shown by the multiple Synapse bugs related to it), and any mishap ruins a whole part of the E2EE UX for a user. My opinion was that it could be nice to have a fallback just in case an implementation, whether it's Synapse or another one, screws up. Implementing inbound device list updates over federation requires implementing some things (e.g. retries) the spec doesn't mention, and screwing it up will likely leave the homeserver with a bunch of stale device lists it can't trivially fix even after the code gets fixed.

The proposal also addresses another issue, which is how to repair a server that has screwed up, on top of just fixing the bug in the implementation. For example, with the multiple Synapse bugs, it's fair to say we're likely to have quite a few stale device lists, with no way to fix them other than having the remote user rename a device (which isn't intuitive at all). Homeservers can't really figure out which device lists are stale without resyncing the device list of every single remote user they know of, whereas it's (imho) much more trivial for clients to figure out when a device list is likely to be out of date.
FWIW, the main cause of user grumpiness following the issues we had in Synapse was that users who aren't operating their own homeserver basically can't fix the stale device lists they see for remote users, making the E2EE experience quite disappointing. It seems likely to me that this will happen with other implementations in the future (or even in Synapse again if we introduce a regression in that code by mistake).

I guess my point is that implementing device list updates is quite prone to bugs and design flaws, and either of those can be very annoying to both server admins and users. If this is an issue, then we should probably explore alternative designs for this feature.

Are there other bugs that would cause device lists to get out of sync that we know of beyond that bug?

There is one I know of (though apparently I haven't taken the time to open an issue for it yet): Synapse expects updates to arrive sorted by their stream_id, which the spec doesn't enforce. So if Synapse receives updates in the wrong order and the other homeserver fails to respond to the resync request, the device list will become stale until a later resync attempt succeeds (which can be hampered by the current backoff and by matrix-org/synapse#2526).
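To make that ordering assumption concrete, here is a rough sketch (all names hypothetical, not Synapse's actual code) of update handling that only tolerates in-order stream_ids and otherwise falls back to a full resync:

```python
from dataclasses import dataclass

@dataclass
class CachedDeviceList:
    """Hypothetical local cache entry for one remote user's device list."""
    max_stream_id: int = 0
    pending_resync: bool = False

def handle_device_list_update(update: dict, cached: CachedDeviceList) -> str:
    """Decide what to do with an incoming m.device_list_update EDU.

    Sketch only: if updates arrive out of order (which the spec allows),
    an implementation that assumes ordering can end up discarding them and
    relying on a resync that may itself fail, leaving the list stale.
    """
    prev_ids = update.get("prev_id", [])
    if any(pid > cached.max_stream_id for pid in prev_ids):
        # We are missing earlier updates: the only safe move is to resync
        # the whole list over federation.
        cached.pending_resync = True
        return "resync"
    if update["stream_id"] <= cached.max_stream_id:
        return "ignore"  # dropped as assumed-stale; the spec doesn't mandate ordering
    cached.max_stream_id = update["stream_id"]
    return "apply"
```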

Member

I'm unaware of any other issues which can cause them to get out of sync. However, I would push for an explicit resync mechanism over the proposed "just make /send reliable".

The core issue here is one of timing - much like matrix-org/synapse#2526 - but there are also issues around the unbounded growth of messages to send to servers. Unlike PDUs, servers don't have to store all these device list updates in their DB unless they are forced to, so making them free to discard them is desirable.

With regards to timing, the options are:

  • The sending server immediately informs other servers of device list updates and retries with backoff if they are down. This causes flooding issues (matrix-org/synapse#7721, "Performance of device list updates could be improved (aka /login is unusably slow)"), and the retry mechanism has been a source of bugs and subtleties around the unbounded growth of messages, e.g. matrix-org/synapse#7173 ("Synapse relies on unspecced behaviour for the remote server to resync device lists after prune_outbound_device_list_pokes"). I regard this option as eager sending.
  • The receiving clients request keys as and when they determine that they need the device keys. This doesn't cause flooding issues when someone updates their devices, as people will gradually request keys. However, sometimes clients will not know that they are missing keys (e.g. signing in on a new device), so it would need to be augmented with some sort of ping/poke to inform clients that there are new devices. We already have this mechanism in the CS API (device_lists.changed in /sync) but no such mechanism exists in federation. I regard this option as lazy sending, as the keys are only fetched when requested by a client, even though they've been eagerly notified that there is a change in the device list (though we could be smarter about when we notify servers about the new device, as we know the metadata on events). This would also have averted matrix-org/synapse#7418 ("Cross-signing signatures not being always federated correctly"), because instead of being told "Alice has a new device and here is the key" you'd be told "Alice has a new device" and then have to fetch the list (which would pull in any missing keys).

As for the proposed endpoint, I don't think it's necessary. We already have https://matrix.org/docs/spec/client_server/r0.6.1#post-matrix-client-r0-keys-query so just use that? We may need to add a refresh flag to stop servers from serving up local stale copies, though.
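For illustration, a sketch of what reusing the existing `/keys/query` endpoint could look like from a client; the request shape follows the current CS API, while the `refresh` field is the hypothetical flag suggested above and is not part of the spec:

```python
import requests

def query_device_keys(homeserver: str, access_token: str, user_id: str) -> dict:
    """Query a user's device keys, asking the server not to serve a stale cache.

    Everything here follows POST /_matrix/client/r0/keys/query except the
    `refresh` field, which is the hypothetical flag discussed above.
    """
    resp = requests.post(
        f"{homeserver}/_matrix/client/r0/keys/query",
        headers={"Authorization": f"Bearer {access_token}"},
        json={
            "device_keys": {user_id: []},  # empty list means "all devices"
            "timeout": 10000,
            "refresh": True,  # hypothetical: force a resync over federation
        },
    )
    resp.raise_for_status()
    return resp.json().get("device_keys", {}).get(user_id, {})
```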

Comment on lines 37 to 40
The proposal described here fixes this issue by adding a new endpoint to the
client-server API, allowing clients to request homeservers to resync device
lists. Clients could then either expose a button a user can use when necessary,
or call this endpoint on specific actions, or both.
@turt2live (Member), Jul 1, 2020

Counter-proposal (that I'm not entirely convinced by): implementations could be smarter about how they cache/detect failure. If a client hits the device list endpoint several times (where "several" is subjective), the server could realize that something may be wrong and resync. Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures, and request it automatically.
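As a rough sketch of that heuristic (the thresholds and hook names are hypothetical), a server could count how often it answers device-list queries for the same remote user from cache and schedule a resync once the count looks suspicious:

```python
import time
from collections import defaultdict

WINDOW_SECS = 300        # hypothetical: look at the last five minutes
SUSPICION_THRESHOLD = 3  # hypothetical value for "several" repeated queries

_recent_cache_hits: dict[str, list[float]] = defaultdict(list)

def record_cached_answer(user_id: str, schedule_resync) -> None:
    """Record that a device-list query was answered from cache, and trigger a
    federation resync (via the hypothetical `schedule_resync` hook) if the same
    remote user keeps being asked about within a short window."""
    now = time.time()
    hits = [t for t in _recent_cache_hits[user_id] if now - t < WINDOW_SECS]
    hits.append(now)
    _recent_cache_hits[user_id] = hits
    if len(hits) >= SUSPICION_THRESHOLD:
        _recent_cache_hits[user_id] = []  # reset so we don't resync repeatedly
        schedule_resync(user_id)
```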

Contributor Author

Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures

I don't mention anywhere that servers detect that kind of thing.

Contributor Author

implementations could be smarter about how they cache/detect failure

I don't think that would solve the issue. If we miss a bunch of updates because we went down, when we come back up there's no way to know what we've missed, and I wouldn't be surprised if there were other situations where it would be tricky for the homeserver to figure out when to resync. This proposal is meant as a way for clients to tell the server "you didn't detect this failure correctly, but it failed, please resync".

fwiw this counter proposal is already described in the "Alternatives" section of this MSC, with a summary of the explanation above.

Comment on lines +51 to +52
The homeserver responds to a request to this endpoint with a 202 Accepted
response code and an empty body (`{}`).
Member

This should also support rate limiting (strongly) and error codes/error ranges for typical failures, like the homeserver being invalid, dead, etc or the user being invalid, local, etc.

If this is meant to work for local users too, that should be specified.
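To show what that would mean for a client, here is a hedged sketch of calling the proposed endpoint (the path is a placeholder, since the MSC's exact path isn't quoted here) and handling rate limiting plus a generic failure:

```python
import time
import requests

def request_device_list_resync(homeserver: str, access_token: str, user_id: str) -> None:
    """Ask the homeserver to resync a remote user's device list.

    The path below is a placeholder for the endpoint proposed by this MSC.
    The handling illustrates the responses being asked for: 202 on success,
    429/M_LIMIT_EXCEEDED for rate limiting, and an errcode for other failures
    (invalid or local user, unreachable remote server, and so on).
    """
    resp = requests.post(
        f"{homeserver}/_matrix/client/unstable/device_lists/resync",  # placeholder path
        headers={"Authorization": f"Bearer {access_token}"},
        json={"user_id": user_id},
    )
    if resp.status_code == 202:
        return  # accepted: the updated list arrives later via /sync
    body = resp.json()
    if resp.status_code == 429 and body.get("errcode") == "M_LIMIT_EXCEEDED":
        time.sleep(body.get("retry_after_ms", 5000) / 1000)
        return request_device_list_resync(homeserver, access_token, user_id)
    raise RuntimeError(f"resync rejected: {body.get('errcode')}: {body.get('error')}")
```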

Member

@babolivier this is where the rate limiting needs to be mentioned.

Contributor Author

Right, thanks. That sounded like a security thing to me so I wrote it in the security section. I'll move it here.

Comment on lines +77 to +78
XXX: I'm unsure whether the spec should be mandating when a client uses this
endpoint or whether this is left as an implementation detail.
Member

implementation detail. The spec is a toolbag, and it's your responsibility to grab a hammer to put a nail in the wall.

Contributor Author

Thanks for the clarification!

@babolivier (Contributor Author)

the lack of flood protection

This is already covered in the "Security considerations" section (including rate limiting); is that not the right place for this kind of thing?

If the request to the aforementioned federation endpoint succeeds (i.e. the
destination responds with a 200 response code and a device list for the target
user), then the homeserver would include the updated device list in sync
responses in order to relay it to clients.
@neilalexander (Contributor), Jul 29, 2020

The fact that you have to hit the endpoint and then wait for something to come down /sync is still quite potentially racy and feels like a bad smell. What is the client supposed to do in the meantime? Will a client be able to retry the sync and not lose the device keys? Is the response to /devices undefined in the meantime - how do you know if it has updated?
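A sketch of the flow being questioned (resync path hypothetical, as above) makes the race visible: the client fires the request and then can only watch `/sync` for `device_lists.changed`, with nothing distinguishing "still resyncing" from "resync failed":

```python
import requests

def resync_then_wait(homeserver: str, token: str, user_id: str, since: str) -> str:
    """Request a device-list resync, then long-poll /sync until the target
    user shows up in device_lists.changed.

    Illustrative of the race discussed above: if the federation resync fails
    or is slow, nothing ever comes down /sync for this user, and the client
    cannot tell that apart from a resync that is simply still pending.
    """
    headers = {"Authorization": f"Bearer {token}"}
    requests.post(
        f"{homeserver}/_matrix/client/unstable/device_lists/resync",  # placeholder path
        headers=headers,
        json={"user_id": user_id},
    )
    while True:
        resp = requests.get(
            f"{homeserver}/_matrix/client/r0/sync",
            headers=headers,
            params={"since": since, "timeout": 30000},
        )
        body = resp.json()
        since = body["next_batch"]
        if user_id in body.get("device_lists", {}).get("changed", []):
            return since  # the server has relayed the resynced device list
```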

Co-authored-by: Kegsay <kegan@matrix.org>
@turt2live added the needs-implementation (This MSC does not have a qualifying implementation for the SCT to review; the MSC cannot enter FCP) label on Jun 8, 2021
@babolivier closed this on Mar 1, 2024
@turt2live added the abandoned (A proposal where the author/shepherd is not responsive) label on Mar 2, 2024