This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

gappy syncs appear to be slower than you might expect #10786

Open
ara4n opened this issue Sep 8, 2021 · 28 comments
Labels
A-Sync: defects related to /sync
O-Occasional: Affects or can be seen by some users regularly or most users rarely
S-Minor: Blocks non-critical functionality, workarounds exist.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@ara4n
Member

ara4n commented Sep 8, 2021

empirically my iOS app takes aaaages to do an incremental sync in the morning or after any significant time spent offline. Manu and others have spotted it too. We need to check average incremental sync times to see when the step change was.

@anoadragon453
Member

I don't really see an obvious step change across all synchrotrons in the last 50 days on matrix.org. I do, however, see one synchrotron hitting 100% CPU before it relaxes and another one jumping to 100% CPU.

(screenshot: per-synchrotron CPU usage on matrix.org over the last 50 days)

It looks as if one account is causing one synchrotron to become slow. However, people have mentioned incremental syncs becoming slow across more homeservers than just matrix.org. So I think this is just a red herring (although one still worth investigating separately).

@ara4n
Member Author

ara4n commented Sep 9, 2021

the timeframe of change could easily be >50 days here.

@anoadragon453
Member

Result of the discussion is that it would be helpful to have detailed client logs of /sync timings before digging too far into this on the server side. Is the problem purely waiting for the server to return /sync data, or is there an increase in local processing requirements? Has the size of /sync data increased or become more complex? Are /sync requests timing out due to long server-side processing times or simply because there are no messages?

I see a few mentions in features/bugfixes over the past few Synapse releases regarding /sync, though it's not entirely clear where to start looking.

@ara4n
Member Author

ara4n commented Sep 9, 2021

It might be worth profiling some of my catchup syncs via jaeger (which I think is enabled for my account).

For instance, I just got a great example of this on iOS after an hour of being offline - see the linked rageshake.

timezones are UTC+1

I closed the app at 2021-09-09 20:54:16.755, and launched it again at 2021-09-09 21:41:07.810 (console.log.gz). Then, it starts an incremental sync at:

2021-09-09 21:41:12.418 Riot[19788:9287787]  [MXSession] Do a server sync (catching up) from token: s2297526719_757284961_9800653_1015632397_934029780_2607818_323407029_1852203500_207666
2021-09-09 21:41:12.418 Riot[19788:9287787]  [MXHTTPClient] #6 - GET _matrix/client/r0/sync

...which completes after 36 seconds?!

2021-09-09 21:41:48.399 Riot[19788:9287787]  [MXHTTPClient] #6 - GET _matrix/client/r0/sync completed in 35980ms
2021-09-09 21:41:48.399 Riot[19788:9287787]  [MXHTTPClient] cleanupBackgroundTask
2021-09-09 21:41:48.399 Riot[19788:9287787]  [MXBackgroundTask] Stop background task #5 - [MXHTTPClient] startBackgroundTask after 36020.350ms
2021-09-09 21:41:48.407 Riot[19788:9287787]  [MXSession] Received sync response in 35988ms
2021-09-09 21:41:48.407 Riot[19788:9287787]  [MXBaseProfiler] Task startup - incrementalSync for 112 units completed in 35988.647ms

So, let's go marry this up with jaeger and see what on earth it was doing. My phone should be NTP synced, it was on excellent connectivity, we have the since token, so it should be easy to find out wtf synapse was doing.

@ara4n
Member Author

ara4n commented Sep 9, 2021

synchrotron32.log:2021-09-09 20:41:48,329 - synapse.access.http.8163 - 410 - INFO - GET-9076082 - 92.233.2.138 - 8163 - {@matthew:matrix.org} Processed request: 35.347sec/0.547sec (3.833sec, 0.645sec) (13.595sec/28.527sec/3657) 357488B 200 "GET /_matrix/client/r0/sync?filter=139&since=s2297526719_757284961_9800653_1015632397_934029780_2607818_323407029_1852203500_207666&timeout=0 HTTP/1.1" "Riot/1.5.3 (iPhone; iOS 14.7.1; Scale/3.00)" [356 dbevts]
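
For reference, my reading of the fields in that "Processed request" line (this is how I understand Synapse's request logging, and may be slightly off) - a quick sketch that pulls the numbers apart:

```python
# Rough parser for Synapse's "Processed request" access-log line. The field
# meanings in the print statements are my reading of the format and may be off.
import re

LINE = (
    "Processed request: 35.347sec/0.547sec (3.833sec, 0.645sec) "
    "(13.595sec/28.527sec/3657) 357488B 200"
)

PATTERN = re.compile(
    r"Processed request: (?P<processing>[\d.]+)sec/(?P<send>[\d.]+)sec "
    r"\((?P<cpu_user>[\d.]+)sec, (?P<cpu_sys>[\d.]+)sec\) "
    r"\((?P<db_sched>[\d.]+)sec/(?P<db_txn>[\d.]+)sec/(?P<db_txn_count>\d+)\) "
    r"(?P<resp_bytes>\d+)B (?P<status>\d+)"
)

m = PATTERN.search(LINE)
if m:
    print("total processing time:", m["processing"], "sec")
    print("time spent sending the response:", m["send"], "sec")
    print("CPU time (user, system):", m["cpu_user"], m["cpu_sys"], "sec")
    print("DB scheduling wait / DB txn time / txn count:",
          m["db_sched"], m["db_txn"], m["db_txn_count"])
```

On that reading, the bulk of this 35s request is spent waiting for or inside database transactions, which lines up with what the trace shows below.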

@ara4n
Member Author

ara4n commented Sep 9, 2021

From a quick look at the trace, 30s of the 36s is spent in db.get_users_in_room, which in turn is spending a bunch of time in the DB.

Assuming the calling pattern of db.get_users_in_room hasn't changed, this feels like the data might just not be cached on the synchrotron... and it seems my account is moving around a lot between synchrotrons. Grepping for @matthew:.*iOS shows it popping up on synchrotrons 9, 28 and 32 so far today, and the slow request is the first time it turned up on 32 today. So I'm wondering if we've changed our stickiness at some point in the last N months, such that the chances of hitting an entirely cold cache have increased? Alternatively, have we changed the cache sizing, causing entries to expire more quickly?

@ara4n
Member Author

ara4n commented Sep 9, 2021

Huh, the DB requests which are chewing all the time are simply

SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = %s AND membership = %s

...which takes 231ms?! (looking at https://jaeger.int.matrix.org/trace/9c7d5a56eb8ced14?uiFind=2c9e84f049e51f33).

Perhaps this is just a DB indexing problem. It shouldn't take 231ms to pull a single state_key out of the current_state_events table, surely?

EDIT: oh, it must be 231ms because the room has a bazillion users and it's just got to churn through a lot of records to get them.
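
If anyone wants to check the indexing theory directly, here is a minimal sketch (assuming direct access to the Synapse Postgres database; the connection string and room ID are placeholders):

```python
# Minimal sketch for checking whether the slow membership query is an indexing
# problem or just a big result set. Connection details and room ID are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=synapse user=synapse host=localhost")

explain = """
EXPLAIN (ANALYZE, BUFFERS)
SELECT state_key FROM current_state_events
WHERE type = 'm.room.member' AND room_id = %s AND membership = %s
"""

with conn, conn.cursor() as cur:
    cur.execute(explain, ("!someroom:example.org", "join"))
    for (line,) in cur.fetchall():
        # An index scan returning tens of thousands of rows suggests the cost is
        # result-set size; a sequential scan would suggest a missing/unused index.
        print(line)
```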

@ara4n
Member Author

ara4n commented Sep 9, 2021

I think the 231ms is just a particularly bad one, but there are still 6328 calls to get_users_in_room, roughly twice for every room on my account, which seems completely crazy. I also can't see any part of the sync handler which actually leans hard on it; it's only ever called by _generate_sync_entry_for_device_list for newly joined rooms, newly left rooms, and for presence in _generate_sync_entry_for_presence (which is surely disabled).

So I wonder if there's been a terrible regression on calculating newly joined/left rooms...

@ara4n
Member Author

ara4n commented Sep 9, 2021

grafana confirms that it's the get_users_in_room in _generate_sync_entry_for_device_list hammering the DB:

(screenshot, 2021-09-10 00:34: Grafana graph of get_users_in_room database calls made from _generate_sync_entry_for_device_list)

so something is indeed going nuts and making newly_joined_rooms and/or newly_left_rooms go through the roof - even though i don't think i joined or left any rooms in the hour that i was offline.

i've had a quick look in the code and tried to reproduce against a local non-worker synapse, but it looks like it needs something more than just a gappy sync to repro. I also tried adding in some join/parts from other users during the gap, but that didn't trigger it either (they correctly turned up on the device_list, and newly_joined_rooms and newly_parted_rooms look to have been blank). Perhaps you have to join/part something yourself during the gap.

It's worth noting that I /think/ I started noticing this around July, which is roughly when knocking landed and made a bunch of changes around the newly_joined code paths? Anyway, someone in backendland should probably pick this up, especially if it's causing a ~6x slowdown (30s of 36s doing unnecessary work).

@ara4n
Member Author

ara4n commented Sep 10, 2021

bah, couldn't stop thinking about this. reproduced it after disabling caching on my local server (which is why I couldn't see this before). I wonder whether the thing which has exposed the bug is the 30 minute expiration on caches on matrix.org (as well as the lack of stickiness).

repro steps:

  • disable caching on synapse
  • log into element web twice as a test user in a few rooms
  • pause one element web in the JS debugger
  • join a room in the other one
  • unpause the first one and inspect its most recent /sync response, and correlate it with the synapse logs
  • observe that the sync block hammered get_users_in_room for every room you are in while calling _generate_sync_entry_for_device_list
2021-09-10 01:49:22,846 - synapse.util.metrics - 137 - DEBUG - GET-145 - Entering block _generate_sync_entry_for_device_list
2021-09-10 01:49:22,847 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_rooms_for_user_with_stream_ordering-62b}
2021-09-10 01:49:22,847 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_rooms_for_user_with_stream_ordering-62b} SELECT room_id, e.instance_name, e.stream_ordering FROM current_state_events AS c INNER JOIN events AS e USING (room_id, event_id) WHERE c.type = 'm.room.member' AND state_key = ? AND c.membership = ?
2021-09-10 01:49:22,848 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_rooms_for_user_with_stream_ordering-62b} ('@matthew:bellerophon', 'join')
2021-09-10 01:49:22,849 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_rooms_for_user_with_stream_ordering-62b} 0.000063 sec
2021-09-10 01:49:22,849 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_rooms_for_user_with_stream_ordering-62b} 0.002834 sec
2021-09-10 01:49:22,851 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-62c}
2021-09-10 01:49:22,851 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-62c} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,852 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-62c} ('!dprKjMDJTQbVbutttb:bellerophon', 'join')
2021-09-10 01:49:22,853 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-62c} 0.000041 sec
2021-09-10 01:49:22,853 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-62c} 0.002855 sec
2021-09-10 01:49:22,854 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-62d}
2021-09-10 01:49:22,855 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-62d} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,856 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-62d} ('!mQSqavcZeWYZncQwJh:bellerophon', 'join')
2021-09-10 01:49:22,857 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-62d} 0.000053 sec
2021-09-10 01:49:22,857 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-62d} 0.002847 sec
2021-09-10 01:49:22,858 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-62e}
2021-09-10 01:49:22,859 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-62e} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,860 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-62e} ('!JjMuBUGyRFIHpUvqBp:bellerophon', 'join')
2021-09-10 01:49:22,860 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-62e} 0.000034 sec
2021-09-10 01:49:22,861 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-62e} 0.002630 sec
2021-09-10 01:49:22,862 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-62f}
2021-09-10 01:49:22,862 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-62f} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,863 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-62f} ('!nLFjWkXmrEScbXhLmI:bellerophon', 'join')
2021-09-10 01:49:22,864 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-62f} 0.000030 sec
2021-09-10 01:49:22,864 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-62f} 0.002385 sec
2021-09-10 01:49:22,865 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-630}
2021-09-10 01:49:22,866 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-630} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,867 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-630} ('!dAHSpVBCjyiitQGwRn:bellerophon', 'join')
2021-09-10 01:49:22,868 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-630} 0.000035 sec
2021-09-10 01:49:22,868 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-630} 0.003061 sec
2021-09-10 01:49:22,869 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-631}
2021-09-10 01:49:22,870 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-631} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,870 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-631} ('!DKgHddIbuazlkglUdu:bellerophon', 'join')
2021-09-10 01:49:22,871 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-631} 0.000030 sec
2021-09-10 01:49:22,871 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-631} 0.002362 sec
2021-09-10 01:49:22,872 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-632}
2021-09-10 01:49:22,873 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-632} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,874 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-632} ('!aTKzWEOvZRvouBktlN:bellerophon', 'join')
2021-09-10 01:49:22,874 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-632} 0.000030 sec
2021-09-10 01:49:22,875 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-632} 0.002457 sec
2021-09-10 01:49:22,876 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-633}
2021-09-10 01:49:22,876 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-633} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,877 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-633} ('!tvslgMmeyHGMpWGGIX:bellerophon', 'join')
2021-09-10 01:49:22,877 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-633} 0.000031 sec
2021-09-10 01:49:22,878 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-633} 0.002365 sec
2021-09-10 01:49:22,879 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-634}
2021-09-10 01:49:22,879 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-634} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,880 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-634} ('!FbTgWTlLSDXqPspmzq:bellerophon', 'join')
2021-09-10 01:49:22,881 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-634} 0.000076 sec
2021-09-10 01:49:22,882 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-634} 0.003240 sec
2021-09-10 01:49:22,883 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-635}
2021-09-10 01:49:22,884 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-635} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,884 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-635} ('!EgeuEnoxEjtDFiCGyq:bellerophon', 'join')
2021-09-10 01:49:22,885 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-635} 0.000033 sec
2021-09-10 01:49:22,885 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-635} 0.002419 sec
2021-09-10 01:49:22,886 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-636}
2021-09-10 01:49:22,887 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-636} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,887 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-636} ('!IiIuPTaEyjYtlzVpWs:bellerophon', 'join')
2021-09-10 01:49:22,888 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-636} 0.000031 sec
2021-09-10 01:49:22,889 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-636} 0.002365 sec
2021-09-10 01:49:22,890 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-637}
2021-09-10 01:49:22,890 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-637} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,891 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-637} ('!mQSqavcZeWYZncQwJh:bellerophon', 'join')
2021-09-10 01:49:22,891 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-637} 0.000030 sec
2021-09-10 01:49:22,892 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-637} 0.002469 sec
2021-09-10 01:49:22,893 - synapse.util.metrics - 145 - DEBUG - GET-145 - Exiting block _generate_sync_entry_for_device_list

@ara4n
Member Author

ara4n commented Sep 10, 2021

Right, got to the bottom of it. The thing which is hammering get_users_in_room is actually get_users_who_share_room_with_user - the first thing that _generate_sync_entry_for_device_list calls, in order to get the list of users whose device list may have changed. It's nothing to do with newly joined/left rooms. get_users_who_share_room_with_user by definition has to iterate over all your rooms and get the membership of all of them - something like 450,000 users for my matrix.org account.

So, I think this has become way heavier because we expire caches after 30 minutes, hence making things feel bad.

A better solution might be to calculate the users who need to be told about devicelist changes whenever a devicelist change happens, rather than checking every possible user every time we call /sync.
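
To illustrate the shape of that idea (purely a sketch with made-up names, not a proposed Synapse API): fan out once when a device list actually changes, so that /sync only has to do a cheap lookup rather than walking every shared room.

```python
# Sketch of the "calculate at change time" idea; all names here are hypothetical.
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set

# user_id -> set of users whose device list changes they still need to hear about
pending_device_list_changes: Dict[str, Set[str]] = defaultdict(set)

def on_device_list_change(
    changed_user_id: str,
    rooms_of_changed_user: Iterable[str],
    get_users_in_room: Callable[[str], Set[str]],
) -> None:
    """Fan out once, at write time, to everyone sharing a room with the changed user."""
    for room_id in rooms_of_changed_user:
        for user_id in get_users_in_room(room_id):
            pending_device_list_changes[user_id].add(changed_user_id)

def device_list_changes_for_sync(syncing_user_id: str) -> Set[str]:
    """At /sync time this becomes a single lookup instead of a walk over every room."""
    return pending_device_list_changes.pop(syncing_user_id, set())
```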

@anoadragon453
Member

Another storage method in the sync process, get_users_whose_devices_changed, just took 3s on an incremental sync for me.

@ara4n
Member Author

ara4n commented Sep 10, 2021

yup, there are other bad ones too - i just picked the worst (30s in get_users_who_share_room_with_user) according to the jaeger trace. (Call-tree view in jaeger is great)

@anoadragon453
Member

anoadragon453 commented Sep 13, 2021

Annoyingly, get_users_who_share_room_with_user is a particularly difficult query to cache as it stands, as it's potentially invalidated every time a room membership changes.

@ara4n
Member Author

ara4n commented Sep 13, 2021

presumably one could cache the per-room results, and so avoid re-running all the per-room queries every time a single room's membership changes - only the room which changed would need re-querying?
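
Roughly the shape I have in mind (a hypothetical sketch, not Synapse's actual cache machinery): keep the cache keyed per room and take the union at query time, so a membership change only evicts the affected room.

```python
# Hypothetical per-room cache sketch; not Synapse's actual cache machinery.
from typing import Callable, Dict, Iterable, Set

_room_members_cache: Dict[str, Set[str]] = {}

def users_in_room(room_id: str, fetch_from_db: Callable[[str], Set[str]]) -> Set[str]:
    if room_id not in _room_members_cache:
        _room_members_cache[room_id] = fetch_from_db(room_id)
    return _room_members_cache[room_id]

def users_who_share_room_with_user(
    joined_room_ids: Iterable[str], fetch_from_db: Callable[[str], Set[str]]
) -> Set[str]:
    # Union of per-room entries: most rooms stay warm in the cache...
    shared: Set[str] = set()
    for room_id in joined_room_ids:
        shared |= users_in_room(room_id, fetch_from_db)
    return shared

def on_membership_change(room_id: str) -> None:
    # ...because a membership change only evicts the one room that changed.
    _room_members_cache.pop(room_id, None)
```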

@callahad
Contributor

As a temporary patch, let's try exempting just the relevant cache from time-based eviction and see if that improves things on matrix.org. Assigning to Erik (knows the code) and David (current maintainer)

@DMRobertson
Contributor

Cache entries on get_users_who_share_room_with_user are invalidated when the relevant entries in get_rooms_for_user and get_users_in_room are invalidated. So whenever you join or see a join, your next sync is going to have to query the DB directly.

So turning off time-based eviction might help, but only to a certain extent (because other people joining will nuke Synapse's cache entries).

@DMRobertson
Contributor

Note: expiry time is currently 1h on matrix.org.

@anoadragon453
Member

If cache invalidation for get_users_who_share_room_with_user is based on any membership changes occurring on the homeserver, I'd actually be surprised if entries made it to being 1hr old.

@DMRobertson
Contributor

If cache invalidation for get_users_who_share_room_with_user is based on any membership changes occurring on the homeserver

For completeness/context, I had in mind the use of on_invalidate in this snippet:

@cached(max_entries=500000, cache_context=True, iterable=True)
async def get_users_who_share_room_with_user(
    self, user_id: str, cache_context: _CacheContext
) -> Set[str]:
    """Returns the set of users who share a room with `user_id`"""
    room_ids = await self.get_rooms_for_user(
        user_id, on_invalidate=cache_context.invalidate
    )

    user_who_share_room = set()
    for room_id in room_ids:
        user_ids = await self.get_users_in_room(
            room_id, on_invalidate=cache_context.invalidate
        )
        user_who_share_room.update(user_ids)

    return user_who_share_room

@erikjohnston
Member

A better solution might be to calculate the users who need to be told about devicelist changes whenever a devicelist change happens, rather than checking every possible user every time we call /sync.

Another potential solution here is to record which rooms need to be told about a device list change, which saves us from ever having to call get_users_who_share_room_with_user at any point while processing device list changes. This would be efficient because a) get_users_in_room is fairly quick, and b) in the sync code we would already have loaded the list of rooms the user is currently in anyway.
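
A sketch of how that could look (illustrative only; the names are made up and the stream handling is simplified):

```python
# Illustrative sketch of recording device list changes per room, so /sync can
# intersect them with the rooms it has already loaded. Names are hypothetical.
from typing import Dict, Iterable, List, Set, Tuple

# room_id -> list of (stream_id, user_id whose device list changed); a stand-in
# for what would really be a database table.
_device_list_changes_in_room: Dict[str, List[Tuple[int, str]]] = {}

def record_device_list_change(
    changed_user_id: str, rooms_of_changed_user: Iterable[str], stream_id: int
) -> None:
    for room_id in rooms_of_changed_user:
        _device_list_changes_in_room.setdefault(room_id, []).append(
            (stream_id, changed_user_id)
        )

def changed_device_lists_since(
    syncing_users_rooms: Iterable[str], since_stream_id: int
) -> Set[str]:
    """Sync already knows the user's joined rooms, so there is no need to call
    get_users_who_share_room_with_user at all."""
    changed: Set[str] = set()
    for room_id in syncing_users_rooms:
        for stream_id, user_id in _device_list_changes_in_room.get(room_id, []):
            if stream_id > since_stream_id:
                changed.add(user_id)
    return changed
```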

@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Sep 22, 2021
@richvdh
Member

richvdh commented Nov 23, 2021

anyone remember what we're doing about this? it seems like the subject needs an update.

@callahad
Contributor

The decision was to let this sit where it is for the time being: we don't have the bandwidth to dig deeper, and any investment will likely be rendered moot by Sync V3 next year.

@callahad callahad added P4 (OBSOLETE: use S- labels.) Okay backlog: will not schedule, will accept patches and removed X-Needs-Discussion labels Nov 25, 2021
@callahad
Contributor

If Sync V3 starts looking like it won't deliver its anticipated improvements, or its delivery date gets pushed too far out, then we'll re-prioritize this work. We'd similarly re-prioritize if the problem becomes significantly worse in the interim.

@ShadowJonathan
Contributor

I have this same problem; I developed #9798 for it and applied it to my own server. However, I soon dropped it because it required a bunch of changes in sytest, which I wasn't qualified to make.

@richvdh richvdh changed the title anecdata says gappy syncs have slowed down by 3-4x at some point in the last N months. gappy syncs appear slower than you might expect Nov 26, 2021
@richvdh richvdh changed the title gappy syncs appear slower than you might expect gappy syncs appear to be slower than you might expect Nov 26, 2021
@richvdh
Member

richvdh commented Nov 26, 2021

I've updated the subject as best I can.

erikjohnston added a commit that referenced this issue Jul 18, 2022
See #10826 and #10786 for context as to why we had to disable pruning on
those caches.

Now that `get_users_who_share_room_with_user` is called frequently only
for presence, we just need to make calls to it less frequent and then we
can remove the various levels of caching that is going on.
@MadLittleMods MadLittleMods added the A-Sync defects related to /sync label Jul 19, 2022
erikjohnston added a commit that referenced this issue Jul 22, 2022
erikjohnston added a commit that referenced this issue Jul 25, 2022
@MadLittleMods
Contributor

MadLittleMods commented Oct 27, 2022

Just experienced this with a 6-month old Element session (which doesn't handle it well, element-hq/element-web#23637, element-hq/element-web#9198).

Some of my /sync requests took 2629.03s (43 minutes) to complete, and the rest slowly made their way down to durations where the client wouldn't time out anymore. Almost all of the time is being spent in get_device_list_changes_in_rooms (2490s of that request). The SQL for that was:

SELECT DISTINCT user_id FROM device_lists_changes_in_room WHERE room_id = ANY(?) AND stream_id >= ?

(each dot is a /sync request)

The trace is https://jaeger.proxy.matrix.org/trace/657fa919e14b7196 for those that catch it in the few days before it expires. Otherwise you can drag and drop this synapse-long-sync-trace-657fa919e14b7196.json onto Jaeger.

The rest of the long /sync requests were just waiting for the response cache (ResponseCache[sync].wait). I assume this is some work de-duplication we have.
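
For context on that last point, here is a simplified sketch of the kind of in-flight de-duplication I assume ResponseCache is doing (not the real class, just the idea): identical concurrent requests await one shared computation instead of each redoing the expensive work.

```python
# Simplified sketch of in-flight request de-duplication along the lines of what
# I assume ResponseCache does; this is not Synapse's actual implementation.
import asyncio
from typing import Awaitable, Callable, Dict, Hashable, TypeVar

T = TypeVar("T")

class InFlightDeduplicator:
    def __init__(self) -> None:
        self._pending: Dict[Hashable, asyncio.Task] = {}

    async def wrap(self, key: Hashable, compute: Callable[[], Awaitable[T]]) -> T:
        # If an identical request is already being computed, wait for that shared
        # result rather than starting the same expensive work again.
        task = self._pending.get(key)
        if task is None:
            task = asyncio.ensure_future(compute())
            self._pending[key] = task
            task.add_done_callback(lambda _t: self._pending.pop(key, None))
        return await task
```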

@MadLittleMods MadLittleMods added O-Occasional Affects or can be seen by some users regularly or most users rarely and removed P4 (OBSOLETE: use S- labels.) Okay backlog: will not schedule, will accept patches labels Oct 27, 2022