[DPE-3684] Reinitialise raft #611
base: main
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@ Coverage Diff @@
##             main     #611      +/-   ##
==========================================
+ Coverage   70.89%   72.08%   +1.18%
==========================================
  Files          12       12
  Lines        3048     3238     +190
  Branches      539      593      +54
==========================================
+ Hits         2161     2334     +173
- Misses        771      780       +9
- Partials      116      124       +8

☔ View full report in Codecov by Sentry.
@pytest.mark.group(1)
@markers.juju3
Only tested on Juju 3 for the moment. We should be able to reuse the wrapper around force removal to enable the tests on Juju 2.
):
    logger.info("%s is raft candidate" % self.charm.unit.name)
    data_flags["raft_candidate"] = "True"
    self.charm.unit_peer_data.update(data_flags)
These are the initial flags that trigger the recovery attempt. If we ever need to disable this (for stable releases?), this would be the easiest place to do it.
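If that becomes necessary, here is a minimal sketch of gating the trigger behind a charm config option; the `experimental_raft_recovery` option is hypothetical, not something the charm currently exposes:

```python
# Hypothetical config gate: skip flagging this unit as a recovery candidate
# when automatic raft recovery is disabled.
if not self.charm.config.get("experimental_raft_recovery", True):
    logger.info("Automatic raft recovery is disabled, not flagging candidate")
    return

data_flags["raft_candidate"] = "True"
self.charm.unit_peer_data.update(data_flags)
```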
for attempt in Retrying(wait=wait_fixed(5)):
    with attempt:
We have to define a sane timeout for the cases where the recovery hijacks execution.
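For example, tenacity's `stop_after_delay` could bound the loop. The 300-second budget and the `_recovery_finished` helper below are illustrative assumptions, not values from the PR:

```python
from tenacity import RetryError, Retrying, stop_after_delay, wait_fixed

# Bound the retry loop so a stuck recovery cannot hijack the hook forever.
try:
    for attempt in Retrying(wait=wait_fixed(5), stop=stop_after_delay(300)):
        with attempt:
            if not self._recovery_finished():  # hypothetical check
                raise Exception("raft recovery still in progress")
except RetryError:
    logger.warning("Raft recovery did not complete within the timeout")
```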
partner_addrs=self.charm.async_replication.get_partner_addresses()
if not no_peers
else [],
peers_ips=self.peers_ips if not no_peers else set(),
Ignoring the other nodes when recovering; they should rejoin afterwards.
self.app_peer_data.pop("raft_selected_candidate", None)
self.app_peer_data.pop("raft_followers_stopped", None)

def _raft_reinitialisation(self) -> None:
If there's only one unit (sync standby and leader), this should execute in one go.
# Check whether raft is stuck.
if self.has_raft_keys():
    self._raft_reinitialisation()
    logger.debug("Early exit on_peer_relation_changed: stuck raft recovery")
    return False
This will hijack execution until recovery completes. We should think of ways to detect manual recovery.
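One possible shape for detecting manual intervention, assuming a hypothetical `raft_manual_recovery` peer-data key that an operator could set to take over:

```python
# If an operator has flagged a manual recovery, step aside and let them drive.
if self.app_peer_data.get("raft_manual_recovery"):  # hypothetical key
    logger.info("Manual raft recovery flagged, skipping automatic handling")
    return False

# Check whether raft is stuck.
if self.has_raft_keys():
    self._raft_reinitialisation()
    logger.debug("Early exit on_peer_relation_changed: stuck raft recovery")
    return False
```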
except Exception:
    logger.warning("Remove raft member: Unable to get health status")
    health_status = {}
if health_status.get("role") in ("leader", "master") or health_status.get(
We shouldn't have a stuck cluster that still reports a leader role, but this handles it just in case.
if not candidate:
    logger.warning("Stuck raft has no candidate")
    return
We can't proceed with automatic recovery if there's no sync standby: the first unit in the new raft cluster will become the leader, and promoting an async replica may cause data loss. We should consider setting a status and providing a manual recovery path if the user wants to promote a given replica anyway.
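A sketch of what surfacing that state could look like, assuming this handler can reach the charm's unit object; the status message and the idea of a later manual-promotion path are assumptions, not part of this PR:

```python
from ops.model import BlockedStatus

if not candidate:
    logger.warning("Stuck raft has no candidate")
    # Tell the operator that automatic recovery cannot proceed safely.
    self.charm.unit.status = BlockedStatus(
        "Raft recovery needs manual promotion: no sync standby available"
    )
    return
```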
if not raft_status["has_quorum"] and (
    not raft_status["leader"] or raft_status["leader"].host == member_ip
):
    logger.warning("Remove raft member: Stuck raft cluster detected")
We should most probably set a status here to indicate what's going on.
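For example, a maintenance status could be set where the stuck cluster is detected (again assuming access to the charm's unit object; the wording is illustrative):

```python
from ops.model import MaintenanceStatus

logger.warning("Remove raft member: Stuck raft cluster detected")
# Make the ongoing recovery visible in `juju status`.
self.charm.unit.status = MaintenanceStatus("Reinitialising stuck raft cluster")
```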
@@ -710,6 +731,57 @@ def primary_changed(self, old_primary: str) -> bool:
        primary = self.get_primary()
        return primary != old_primary

    def remove_raft_data(self) -> None:
We can most probably converge this with https://github.com/canonical/postgresql-operator/blob/main/src/relations/async_replication.py#L695 as a follow-up.
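As a follow-up, both call sites could converge on a small shared helper along these lines; the name, signature, and placement are assumptions:

```python
import shutil
from pathlib import Path


def clean_data_directories(paths: list[str]) -> None:
    """Remove stale Patroni/raft state so the member can be bootstrapped anew."""
    for path in paths:
        directory = Path(path)
        if directory.exists():
            shutil.rmtree(directory)
            directory.mkdir(parents=True, exist_ok=True)
```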
LGTM! I just have one theoretical question:
Suppose a cluster with 5 nodes: 1 leader, 2 sync replicas, and 2 async replicas. Let's say that, for some reason, this cluster loses connectivity in a weird way and gets split into 2 partitions: leader + sync + async (partition 1) and sync + async (partition 2).
Since we eliminated the need for a quorum when electing a new leader, couldn't the sync node in partition 2 elect itself as the new leader while partition 1 keeps working the same way because of quorum, leaving us with 2 distinct clusters?
It's a weird edge case, yes; I'm just curious whether we have an answer to it.
await ops_test.model.wait_for_idle(status="active", timeout=600, idle_period=45)

await are_writes_increasing(ops_test, secondary)
Another theoretical question here: in a 2-node cluster (primary + sync replica), if the sync replica goes down or becomes unresponsive, how can the primary continue to accept write requests (writes increasing)? Isn't the whole point of having a sync replica to make sure that every write operation is acknowledged by another node besides the primary?
> Since we eliminated the need for a quorum when electing a new leader, couldn't the sync node in partition 2 elect itself as the new leader while partition 1 keeps working the same way because of quorum, leaving us with 2 distinct clusters?

We rely on Juju to know the full number of units: if not all units detect a loss of quorum, we don't reinitialise the raft cluster. The Juju leader selects the new primary when looping over the potential candidates and puts it in the app peer data (sketched below).

> Another theoretical question here: in a 2-node cluster (primary + sync replica), if the sync replica goes down or becomes unresponsive, how can the primary continue to accept write requests (writes increasing)? Isn't the whole point of having a sync replica to make sure that every write operation is acknowledged by another node besides the primary?

There's logic that maintains a list of units in the peer data; it should detect that the unit is gone and downgrade the cluster to a single node.
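A simplified sketch of that candidate-selection flow, assuming a hypothetical `units_peer_data` accessor for reading every unit's peer databag; only the Juju leader picks a candidate and publishes it in the app peer data:

```python
def _select_raft_candidate(self) -> None:
    """Leader-only: pick one raft recovery candidate and publish it."""
    if not self.charm.unit.is_leader():
        return
    # Only units that flagged themselves (sync standbys) are considered.
    for unit, data in self.charm.units_peer_data.items():  # hypothetical accessor
        if data.get("raft_candidate") == "True":
            self.charm.app_peer_data["raft_selected_candidate"] = unit.name
            return
```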
The Syncobj raft implementation, used as a standalone DCS for Patroni, cannot elect a leader once the cluster loses quorum, and the cluster becomes read-only. This prevents Patroni from automatically switching over, even when sync_standbys are available in the cluster and could take over as primary.
This PR adds logic to detect when the raft cluster becomes read-only and to reinitialise it, provided a sync_standby is available to become the primary.
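For orientation, a condensed sketch of the overall flow described above; the method names used here (`_get_raft_status`, `is_sync_standby`, `_raft_reinitialisation` as a single entry point) are illustrative simplifications of the diff excerpts, not the exact API:

```python
def check_and_recover_raft(self) -> None:
    """Detect a read-only raft cluster and kick off reinitialisation."""
    raft_status = self._get_raft_status()  # assumed wrapper around the syncobj status output
    # Read-only raft: quorum lost and no reachable leader left.
    if raft_status["has_quorum"] or raft_status["leader"]:
        return
    # Sync standbys volunteer themselves as recovery candidates...
    if self.is_sync_standby():
        self.unit_peer_data["raft_candidate"] = "True"
    # ...and recovery proceeds once the Juju leader has picked one candidate
    # and published it in the app peer data.
    if self.app_peer_data.get("raft_selected_candidate"):
        self._raft_reinitialisation()
```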