
progress tracking for keep alive reply #64

Closed
wants to merge 3 commits

Conversation

DaemonSnake
Contributor

context:

When receiving a keep-alive request from Postgres, replying with Postgres' current wal_end + 1
notifies the Postgres server that we have fully processed everything up to its wal_end + 1.
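For context, a keep-alive request arrives over the replication stream as a primary keepalive message: a `?k` byte, the server's current wal_end, the server clock, and a reply-requested flag. A minimal sketch of parsing it (illustrative module name, not this project's actual code):

```elixir
defmodule KeepAliveExample do
  # Primary keepalive message (B): Byte1('k'), Int64 wal_end,
  # Int64 server clock, Byte1 reply-requested flag.
  def parse(<<?k, wal_end::64, clock::64, reply::8>>) do
    %{wal_end: wal_end, clock: clock, reply_requested: reply == 1}
  end
end
```

When `reply_requested` is set, the client is expected to send a standby status update acknowledging an LSN; this is where the choice of which LSN to send matters.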

issue:

As Replication.Server communicates with Replication.Publisher asynchronously, this means that,
if an error occurs during the processing of a message,
we can acknowledge many more messages than we actually processed.

For a durable slot, this means that when the slot restarts, it will do so at the last wal_end + 1
we replied with (losing events).
The longer a transaction takes to be fully processed (many records, etc.), the higher the risk of this happening.

solution proposed:

Adding a Replication.Progress Agent that stores the LSNs of transactions in a :gb_sets (ordered set).
When we start receiving a transaction, we push its LSN;
we drop it when the processing is done.
In Replication.Server, we then only need to use the wal_end of the smallest in-progress LSN as the keep-alive reply.
If no transaction is in progress, we can return the received wal_end + 1 instead, as currently.
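A minimal sketch of such an agent, assuming the shape described above (function names are illustrative, not necessarily the PR's exact API):

```elixir
defmodule Replication.Progress do
  use Agent

  # :gb_sets keeps elements ordered, so the oldest in-progress LSN
  # is always :gb_sets.smallest/1.
  def start_link(_opts) do
    Agent.start_link(fn -> :gb_sets.empty() end, name: __MODULE__)
  end

  # Called when we start receiving a transaction.
  def begin(lsn), do: Agent.update(__MODULE__, &:gb_sets.add(lsn, &1))

  # Called once the transaction has been fully processed.
  def done(lsn), do: Agent.update(__MODULE__, &:gb_sets.delete_any(lsn, &1))

  # Smallest in-progress LSN, or nil when nothing is being processed.
  def oldest do
    Agent.get(__MODULE__, fn set ->
      if :gb_sets.is_empty(set), do: nil, else: :gb_sets.smallest(set)
    end)
  end
end
```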

This way we can track which wal_end values are still being processed.
This will be useful when some transactions have long processing times,
and to improve the keep-alive reply.

Currently we always return the server's wal_end + 1.
This approach has a few limits and can cause some issues.

If we start processing wal_end A,
then receive a keep-alive request with wal_end = A + x and reply with A + x + 1
before A has finished processing,
and our process then crashes, the server will restart at A + x + 1 and
we will never receive A again.

To avoid this, this PR returns wal_end + 1 only if there are no in-progress transactions;
if there are, it returns the wal_end of the oldest transaction still in progress.

This way we are guaranteed to start back from there (A, not A + x + 1) if we crash.
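The decision above can be sketched as a small helper (hypothetical name, with the oldest in-progress LSN passed in rather than fetched from an agent, to keep the example self-contained):

```elixir
defmodule ReplyExample do
  # Pick the LSN to acknowledge in the keep-alive reply.
  # Nothing in flight: acknowledge everything the server sent (wal_end + 1).
  def reply_lsn(nil, server_wal_end), do: server_wal_end + 1
  # Otherwise: hold the acknowledgement at the oldest unfinished transaction,
  # so a crash restarts the slot from there.
  def reply_lsn(oldest_in_progress, _server_wal_end), do: oldest_in_progress
end
```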
@DaemonSnake
Contributor Author

oh, forgot that the tests use Registry.child_spec() from the other PR

@DaemonSnake DaemonSnake marked this pull request as draft May 22, 2024 09:39
@DaemonSnake DaemonSnake deleted the keep-alive branch May 22, 2024 13:45
@cpursley
Owner

This is a good idea, what else is needed? I just merged your other small tweak branch.

@DaemonSnake
Contributor Author

oh sorry, I thought I had left a comment when closing the PR

There are a few things where I'm not certain enough of the outcome.
I think it might be wrong to send the same value three times:

Int64 -> The location of the last WAL byte + 1 received and written to disk in the standby.
Int64 -> The location of the last WAL byte + 1 flushed to disk in the standby.
Int64 -> The location of the last WAL byte + 1 applied in the standby.

It's not entirely clear, but I'm afraid that replying with the current wal_end + 1 for the first field will cause Postgres to re-send packets.
I need to investigate further
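For reference, the three Int64 fields quoted above belong to the standby status update message the client sends back. A sketch of building it (illustrative module; the caller chooses what to put in each of the three positions, which is exactly the open question here):

```elixir
defmodule StatusUpdateExample do
  # Timestamps in the replication protocol count microseconds
  # since the Postgres epoch (2000-01-01).
  @pg_epoch ~U[2000-01-01 00:00:00Z]

  # Standby status update (F): Byte1('r'), Int64 written, Int64 flushed,
  # Int64 applied, Int64 client clock, Byte1 reply-requested flag.
  def status_update(written, flushed, applied, reply? \\ false) do
    clock = DateTime.diff(DateTime.utc_now(), @pg_epoch, :microsecond)
    flag = if reply?, do: 1, else: 0
    <<?r, written::64, flushed::64, applied::64, clock::64, flag::8>>
  end
end
```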

@cpursley
Owner

cpursley commented May 23, 2024

I see. Do you mind opening a Discussion on this topic? Maybe others have some insights.
