Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Speed up push actions/unread counting #13846

Open
Fizzadar opened this issue Sep 20, 2022 · 7 comments
Open

Speed up push actions/unread counting #13846

Fizzadar opened this issue Sep 20, 2022 · 7 comments
Labels
A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@Fizzadar
Copy link
Contributor

Fizzadar commented Sep 20, 2022

I've been spending a bunch of time looking into push action processing & badge counting and I think there'd be a real benefit to separating out push actions from summaries (notification/unread/highlight counts). There is a lot of code and complexity introduced in the mechanism of rolling event push actions into summaries, and the push actions table conflates push notifications and push counters. Suggestions:

Push actions

Firstly push actions stay as is using event_push_actions, this gets deleted on receipt and read/deleted by the pusher instances, no need for any background work. No longer used for any badge counting.

Badge counts

For badge counts we add a new table, event_push_counts, that looks roughly like (pseudo-SQL):

CREATE TABLE event_push_counts (
    user_id text,
    room_id text,
    thread_id text,
    event_stream_ordering bigint,
    notifs bigint,
    unreads bigint,
    highlights bigint,
)
ALTER TABLE `event_push_counts` ADD CONSTRAINT uniq (user_id, room_id, thread_id, event_stream_ordering);

The key here is that this is not unique per user/room/thread but also event stream ordering. This means that as events come in new rows simply get appended according to the push actions per user. This prevents any contention issues during the critical event insertion path.

This makes counting a users total unreads very simple - instead of the current loop & count per room, simply:

# Counting all events for push badges
SELECT SUM(unreads)
FROM event_push_counts
WHERE user_id = '@blah:matrix.org'

# Counting unread rooms for push badges
SELECT COUNT(room_id)
FROM event_push_counts
WHERE user_id = '@blah:matrix.org'
GROUP BY room_id

# Separated room/thread counts for sync responses
SELECT SUM(unreads), SUM(notifs), SUM(highlights), room_id, thread_id
FROM event_push_counts
WHERE user_id = '@blah:matrix.org'
AND room_id IN ('!abc:matrix.org', '!def:beeper.com')
GROUP BY room_id, thread_id

The same applies to counting unreads for rooms in sync responses.

It is still possible to summarise these by merging rows into a higher stream ordering. Like the current system this doesn't account for receipts not at the latest stream ordering, but the summarisaton could be delayed to provide a window of support for this if desired. The table is leaner than the push actions table so this shouldn't be such an issue (but still important to do to keep the table fast).

Finally, rows could be cleaned out either on receipt on as a background job processing receipts. If there was sufficient (24h?) delay before any summarisation phase, deleting on receipt shouldn't result in much contention on the table.


(If this seems sensible, I can invest time to implement over the next few weeks)

@clokep
Copy link
Member

clokep commented Sep 20, 2022

@Fizzadar Just a heads up that I've been doing very invasive work on push actions & summaries to add support for threads and would hate to see us generate massive conflicts! See #12550 for a jumping off point.

@H-Shay H-Shay added T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. S-Minor Blocks non-critical functionality, workarounds exist. labels Sep 20, 2022
@Fizzadar
Copy link
Contributor Author

@clokep 👍 ty for the pointer! Have updated my post to reflect inclusion of thread_id column and query to fetch the sync unread counts!

@erikjohnston
Copy link
Member

instead of the current loop & count per room

Argh that is terrible loop and I want to kill it with fire. Though surely we can do something better with the current set up? Though might be annoyingly complicated.

Finally, rows could be cleaned out either on receipt on as a background job processing receipts. If there was sufficient (24h?) delay before any summarisation phase, deleting on receipt shouldn't result in much contention on the table.

Ah, so the penny has finally dropped in my head. This is very similar to what we had before (except re-using the event_push_actions table and doing two queries for lookup), but it had an annoying edge case that absolutely tanked performance: if a user sends a bunch of messages into a room with a huge number of members it resulted in a huge number of entries in event_push_actions, which was enough to cause serious slowness when summing over the table.

Having a dedicated event_push_counts would likely be a bit better (since it'd be smaller), but I think would suffer the same failure mode, if I'm understanding correctly.


As a side note: I wonder if most of the complexity comes from the sheer amount of compat code going on, and if just clearing that out would make things much easier to reason about?

@Fizzadar
Copy link
Contributor Author

Fizzadar commented Oct 4, 2022

Argh that is terrible loop and I want to kill it with fire. Though surely we can do something better with the current set up? Though might be annoyingly complicated.

I think this is a good point, I did briefly look into this I wonder how we could get the aggregate unread count for a user in a single DB txn vs. the loop.

Having a dedicated event_push_counts would likely be a bit better (since it'd be smaller), but I think would suffer the same failure mode, if I'm understanding correctly.

Yeah this sounds about right, smaller table but would still have to be regularly aggregated for performance (I think) which would roughly end up where we are anyway.

As a side note: I wonder if most of the complexity comes from the sheer amount of compat code going on, and if just clearing that out would make things much easier to reason about?

I think so! Wonder if removing some of that would open new doors for optimising the current tables ...


Also thinking moving some of the deletion logic into receipts would help simplify things here (#13834), as would no longer need the background worker to process receipts. Need to confirm that the change doesn't break things though.

@Fizzadar Fizzadar changed the title Separate push actions & summary table Speed up push actions/unread counting Oct 10, 2022
@Fizzadar
Copy link
Contributor Author

Pulled together the aggregate query to remove that loop in #14255. I think the queries there highlight well the complex nature of the current table -> summarisation table setup. I don't have any idea of how to simplify it yet though!

@clokep
Copy link
Member

clokep commented Feb 9, 2023

@Fizzadar Is this all done then with the work in #14255 or do you think there's more to do?

@Fizzadar
Copy link
Contributor Author

@clokep I'd still like to PR my suggested schema in the original post. I have an experimental implementation in our fork (not currently in use for users, but POC it works). Still believe separating out the counting from push actions is the right direction here, just need to find some time to pull together something more concrete :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

No branches or pull requests

4 participants