Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Handling of differing set_presence values between devices causes sync tightlooping #16057

Closed
t3chguy opened this issue Aug 3, 2023 · 7 comments · Fixed by #16066
Closed

Handling of differing set_presence values between devices causes sync tightlooping #16057

t3chguy opened this issue Aug 3, 2023 · 7 comments · Fixed by #16066
Assignees
Labels
O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@t3chguy
Copy link
Member

t3chguy commented Aug 3, 2023

Description

Synapse is handling /sync set_presence incorrectly.
The spec says

Controls whether the client is automatically marked as online by polling this API. If this parameter is omitted then the client is automatically marked as online when it uses this API. Otherwise if the parameter is set to “offline” then the client is not marked as being online when it uses this API. When set to “unavailable”, the client is marked as being idle.

This implies the server is responsible for mixing all the various sync streams' set_presence values into one overall value for the user. It seems like it is just doing the equivalent of calling PUT /presence which is known to clobber things between clients for a given user.

Screen.Recording.2023-08-03.at.09.39.44.mov

Steps to reproduce

  1. Have one client session (Device/Token A) run a /sync stream with set_presence=unavailable
  2. Have a different client session (Device/Token B) run a /sync stream without set_presence or set_presence=online
  3. Observe the clients making /sync requests as quickly as possible with the only data they receive being the status update of the other client
  4. (optional) from a different user observe the presence of the first user toggling as fast as possible and also have your sync stream destroyed by all the presence updates

Homeserver

localhost

Synapse Version

{"server_version":"1.83.0","python_version":"3.11.3"}

Installation Method

Docker (matrixdotorg/synapse)

Database

SQLite

Workers

Single process

Platform

Docker on macOS

Configuration

Presence enabled

Relevant log output

Lots of sync calls, nothing useful in the logs

Anything else that would be useful to know?

Related #16039
Related #15980
Related element-hq/element-web#25900

@praet0ri4n
Copy link

Hi all, just reporting the same issue here.

Since yesterday I got one particular user with a blinking online/away icon all the time, except he is working on the computer.
Later I found out he had two open sessions, one as the Windows Element app and the another one as Firefox tab. By closing the FF tab now the status settled to online for now.

Will observe it further more.

@t3chguy
Copy link
Member Author

t3chguy commented Aug 3, 2023

The same can happen with an Element Web in the background (sending set_presence=unavailable) and any other client syncing without set_presence as the default = online, e.g. a mobile phone, so this is quite easy to encounter on a presence-enabled Synapse (default)

@praet0ri4n
Copy link

Thank you for details t3chguy. Just a question about how to handle the presence correctly. We did not experience such behavior yet and users on our server run mobile and destkop apps with browser tabs simultaneously.
At the same time looking through similar issues here I see the major part beeing opened yesterday..
The only bigger update that went through is the FF update which caused several users to have to restart browsers and sessions. Can this issue be narrowed down to recent browser update and the web Element client?

@visubesy
Copy link

visubesy commented Aug 3, 2023

Thank you for details t3chguy. Just a question about how to handle the presence correctly. We did not experience such behavior yet and users on our server run mobile and destkop apps with browser tabs simultaneously. At the same time looking through similar issues here I see the major part beeing opened yesterday.. The only bigger update that went through is the FF update which caused several users to have to restart browsers and sessions. Can this issue be narrowed down to recent browser update and the web Element client?

The problem appears since a recent Element update (element-hq/element-web#25900) but the actual cause is a bug in Synapse, as I understand it.

@t3chguy
Copy link
Member Author

t3chguy commented Aug 3, 2023

@praet0ri4n Element Web v1.11.37 switched from using PUT /presence to /sync set_presence - the former clobbers presence between all your devices, so for example if you walked away from your computer and opened a Matrix client elsewhere (e.g your phone) after 3 minutes Element Web would call PUT /presence and clobber your account-wide presence (and status_msg) with unavailable. set_presence is the proper way to do it as per the spec I quoted in the OP, it lets the server do the mixing for when multiple clients are connected simultaneously.

@t3chguy t3chguy changed the title Handling of differing set_presence between devices causes sync tightlooping Handling of differing set_presence values between devices causes sync tightlooping Aug 3, 2023
@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Major Major functionality / product severely impaired, no satisfactory workaround. and removed S-Minor Blocks non-critical functionality, workarounds exist. labels Aug 9, 2023
@clokep clokep self-assigned this Aug 9, 2023
@clokep clokep linked a pull request Aug 9, 2023 that will close this issue
@clokep
Copy link
Member

clokep commented Aug 9, 2023

I did some manual testing of #16066 and it seems to fix the issue (it also has a test case that seems to illustrate the problem). @erikjohnston and I are trying to figure out the proper way to handle some per-device presence information and define the behavior before we go fixing stuff.

Things left to do here include:

  • Finish defining the behavior.
  • Write more tests for the various behaviors.
  • Ensure workers are handled properly.
  • Ensure devices "timeout" properly.

tl;dr Progress is happening, but this is a very complicated, brittle piece of code we're working in.

@clokep
Copy link
Member

clokep commented Aug 9, 2023

Write more tests for the various behaviors.

@realtyem sent me 8e1e5bd which might be a jumping point for some tests.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants