Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

it's possible for stream_ids to go backwards in a replication stream #7360

Closed
richvdh opened this issue Apr 28, 2020 · 8 comments · Fixed by #7648
Closed

it's possible for stream_ids to go backwards in a replication stream #7360

richvdh opened this issue Apr 28, 2020 · 8 comments · Fixed by #7648
Assignees
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) z-bug (Deprecated Label) z-p2 (Deprecated Label)

Comments

@richvdh
Copy link
Member

richvdh commented Apr 28, 2020

It's possible to receive an RDATA <stream> 1 after a POSITION <stream> 2 - which is to say that the stream id goes backwards. It's not obvious that the replication clients always handle this correctly, and is probably a thing we should make sure doesn't happen.

@clokep
Copy link
Member

clokep commented May 19, 2020

What needs to be done here? Is it adding some assertions or do we believe there's a bug to fix?

@richvdh
Copy link
Member Author

richvdh commented May 19, 2020

I believe we should make sure this doesn't happen, with some tests. That said, I can't remember what led me to believe it could happen :/

@richvdh
Copy link
Member Author

richvdh commented Jun 4, 2020

right, a little context on this here, though the upshot is:

This test file contains three lines labelled "workaround for #7360". I don't believe they should be required, but if you remove them, the tests fail.

The problem is that if the replication_notifier isn't running often enough (which could happen in then the replication source could end up sending POSITION events 108 followed by RDATA for events 7-108. This isn't really compliant with the way the replication is meant to work and could lead to bugs down the line.

I'm not quite sure why it's happening off-hand.

@erikjohnston
Copy link
Member

Ah, I think this is probably stemming from the fact that sending a POSITION can come de-synced from sending RDATA for a stream. The RDATA gets sent by looping around fetching updates from the database and then sending them, so if we send a POSITION with the current token that may be sent before the rows before the token actually have been pulled from the database and sent. I think in production the window for this happening is really small, but it does exist.

Two solutions:

  1. Ask the sending loop for the current position when sending POSITION (or get it to send the POSITION).
  2. Have clients drop RDATA which have tokens from before any previous POSITION

I think 2. would be nice to have in general, though it is complicated by the fact that stream IDs can reset back to zero for ephemeral streams like typing so its not quite that simple. Though given the resetting of stream IDs handling is fragile atm I would be inclined to still take option 2 and figure out a better way of doing ephemeral streams/resetting stream IDs.

@clokep clokep self-assigned this Jun 5, 2020
@clokep
Copy link
Member

clokep commented Jun 5, 2020

I'm going to take a look at this.

@clokep
Copy link
Member

clokep commented Jun 5, 2020

though it is complicated by the fact that stream IDs can reset back to zero for ephemeral streams like typing so its not quite that simple

We had a brief chat about this:

but I think it should be safe to assert stream ID doesn't go backwards on RDATA, unless we get a POSITION with a smaller stream ID first?
I'd probably be inclined to scream loudly if a stream goes backwards we don't expect to

@clokep
Copy link
Member

clokep commented Jun 5, 2020

And this applies to the typing and federation streams only.

@erikjohnston
Copy link
Member

And this applies to the typing and federation streams only.

Possibly presence too. And maybe some others I'm forgetting.

@richvdh richvdh added the A-Workers Problems related to running Synapse in Worker Mode (or replication) label Feb 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) z-bug (Deprecated Label) z-p2 (Deprecated Label)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants