[DPE-3684] Reinitialise raft #611

Open · wants to merge 90 commits into main

Conversation

@dragomirp (Contributor) commented on Sep 3, 2024:

The Syncobj Raft implementation, used as a standalone DCS for Patroni, cannot elect a leader if the cluster loses quorum and becomes read-only. This prevents Patroni from automatically switching over, even when sync_standbys are available in the cluster and could take over as primary.

This PR adds logic to detect when the Raft cluster becomes read-only and to reinitialise it if a sync_standby is available to become the new primary.
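
For context, a minimal sketch of what "read-only" means here, mirroring the raft status fields used later in this diff (`has_quorum` plus an optional leader with a `host` attribute); this is an illustration, not the PR's exact code:

```python
# Minimal sketch, assuming the raft status dict exposes "has_quorum" and an
# optional "leader" object with a .host attribute (as used later in this diff).
def raft_cluster_is_stuck(raft_status: dict, member_ip: str) -> bool:
    """A read-only raft cluster: quorum is lost and there is no usable leader."""
    return not raft_status["has_quorum"] and (
        not raft_status["leader"] or raft_status["leader"].host == member_ip
    )
```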

codecov bot commented on Sep 3, 2024:

Codecov Report

Attention: Patch coverage is 85.07463% with 30 lines in your changes missing coverage. Please review.

Project coverage is 72.08%. Comparing base (7219d1e) to head (546549b).

Files with missing lines    Patch %    Lines
src/charm.py                80.64%     15 Missing and 9 partials ⚠️
src/cluster.py              92.20%     4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
+ Coverage   70.89%   72.08%   +1.18%     
==========================================
  Files          12       12              
  Lines        3048     3238     +190     
  Branches      539      593      +54     
==========================================
+ Hits         2161     2334     +173     
- Misses        771      780       +9     
- Partials      116      124       +8     




@pytest.mark.group(1)
@markers.juju3
@dragomirp (author):

Only tested on Juju 3 for the moment. We should be able to reuse the wrapper around force removal to enable the tests on Juju 2.

):
logger.info("%s is raft candidate" % self.charm.unit.name)
data_flags["raft_candidate"] = "True"
self.charm.unit_peer_data.update(data_flags)
@dragomirp (author):

These are the initial flags that trigger the recovery attempt. If we need to disable this (e.g. on stable releases), it would be easiest to do from here.
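
If such a switch is wanted, a minimal sketch of a config gate building on the snippet above; the `experimental_raft_recovery` option is purely hypothetical and not part of this PR:

```python
# Hypothetical config gate (the option name is an assumption, not in this PR):
# skip flagging this unit as a recovery candidate when recovery is disabled.
if not self.charm.config.get("experimental_raft_recovery", True):
    logger.info("Raft recovery disabled by config; not flagging %s", self.charm.unit.name)
    return
data_flags["raft_candidate"] = "True"
self.charm.unit_peer_data.update(data_flags)
```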

Comment on lines +763 to +764
for attempt in Retrying(wait=wait_fixed(5)):
with attempt:
@dragomirp (author):

We have to define a sane timeout for the cases when the recovery hijacks execution.
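
A sketch of one way to bound this loop with tenacity, which is already in use here (`wait_fixed`); the 300-second cap is an arbitrary example, not an agreed value:

```python
# Sketch: cap how long the recovery retry loop may hijack execution.
from tenacity import Retrying, stop_after_delay, wait_fixed

for attempt in Retrying(wait=wait_fixed(5), stop=stop_after_delay(300)):
    with attempt:
        ...  # existing recovery step
```

When the stop condition fires after a failed attempt, `Retrying` raises `RetryError` (or re-raises the last exception with `reraise=True`), so the caller would need to catch that and decide how to surface it.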

Comment on lines +607 to +610
partner_addrs=self.charm.async_replication.get_partner_addresses()
if not no_peers
else [],
peers_ips=self.peers_ips if not no_peers else set(),
@dragomirp (author):

Ignoring the other nodes when recovering; they should rejoin afterwards.

self.app_peer_data.pop("raft_selected_candidate", None)
self.app_peer_data.pop("raft_followers_stopped", None)

def _raft_reinitialisation(self) -> None:
@dragomirp (author):

If there's only one unit (sync standby and leader), this should execute in one go.

Comment on lines +650 to +654
# Check whether raft is stuck.
if self.has_raft_keys():
self._raft_reinitialisation()
logger.debug("Early exit on_peer_relation_changed: stuck raft recovery")
return False
@dragomirp (author):

This will hijack execution until recovery completes. We should think of a way to detect manual recovery.

except Exception:
logger.warning("Remove raft member: Unable to get health status")
health_status = {}
if health_status.get("role") in ("leader", "master") or health_status.get(
@dragomirp (author):

We shouldn't end up with a stuck cluster that still reports a leader role, but this handles it just in case.

Comment on lines +533 to +535
if not candidate:
logger.warning("Stuck raft has no candidate")
return
@dragomirp (author):

We can't proceed with automatic recovery if there's no sync standby, since the first unit in the new raft cluster will become the leader, and promoting an async replica may cause data loss. We should consider setting a status and providing a manual recovery path in case the user wants to promote a given replica anyway.
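
As a sketch of the status idea only (the message text and the assumption that the unit is reachable via `self.charm` here are mine):

```python
# Sketch: surface the dead end instead of returning silently.
from ops.model import BlockedStatus

if not candidate:
    logger.warning("Stuck raft has no candidate")
    # Assumes this code can reach the unit via self.charm; message is illustrative.
    self.charm.unit.status = BlockedStatus(
        "raft recovery found no sync standby; manual promotion required"
    )
    return
```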

@dragomirp changed the title from [WIP][DPE-3684] Reinitialise raft to [DPE-3684] Reinitialise raft on Sep 20, 2024
if not raft_status["has_quorum"] and (
not raft_status["leader"] or raft_status["leader"].host == member_ip
):
logger.warning("Remove raft member: Stuck raft cluster detected")
@dragomirp (author):

We should most probably set a status here to indicate what's going on.

@dragomirp dragomirp marked this pull request as ready for review September 20, 2024 10:33
@dragomirp dragomirp requested review from a team, taurus-forever, marceloneppel and lucasgameiroborges and removed request for a team September 20, 2024 10:33
@@ -710,6 +731,57 @@ def primary_changed(self, old_primary: str) -> bool:
primary = self.get_primary()
return primary != old_primary

def remove_raft_data(self) -> None:

@lucasgameiroborges (Member) left a comment:

LGTM! I just have one theoretical question:

Suppose a cluster with 5 nodes: 1 leader, 2 sync replicas and 2 async replicas. Let's say that for some reason this cluster loses connectivity in a weird way and gets split into 2 partitions: leader + sync + async (partition 1) and sync + async (partition 2).

Since we eliminated the need for a quorum when electing a new leader, couldn't the sync node in partition 2 elect itself as the new leader while partition 1 keeps working as before because it still has quorum, leaving us with 2 distinct clusters?

It's a weird edge case, yes; I'm just curious whether we have an answer to it.


await ops_test.model.wait_for_idle(status="active", timeout=600, idle_period=45)

await are_writes_increasing(ops_test, secondary)
Reviewer (Member):

Another theoretical question here: in a 2-node cluster (primary + sync replica), if the sync replica goes down or becomes unresponsive, how can the primary continue to accept write requests (writes increasing)? Isn't the whole point of having a sync replica to make sure that every write operation is acknowledged by another node besides the primary?

@dragomirp (author):

> Since we eliminated the need for a quorum when electing a new leader, couldn't the sync node in partition 2 elect itself as the new leader while partition 1 keeps working as before because it still has quorum, leaving us with 2 distinct clusters?

We rely on Juju to know the full number of units. If not all units detect the loss of quorum, we don't reinitialise the raft cluster. When looping over the potential candidates, the Juju leader selects the new primary and puts it in the app peer data (see the sketch after this reply).

> Another theoretical question here: in a 2-node cluster (primary + sync replica), if the sync replica goes down or becomes unresponsive, how can the primary continue to accept write requests (writes increasing)? Isn't the whole point of having a sync replica to make sure that every write operation is acknowledged by another node besides the primary?

There's logic that maintains a list of units in the peer data; it should detect that the unit is gone and downgrade the cluster to a single node.
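
For illustration, the guard described above could look roughly like this (the helper name is a placeholder, not the PR's actual code; `Application.planned_units()` is the ops API for the Juju-side unit count):

```python
# Illustrative guard: only start reinitialisation once every planned Juju unit
# reports the raft cluster as stuck, so a partitioned minority can never
# trigger recovery on its own.
def _can_start_raft_recovery(self) -> bool:
    planned = self.app.planned_units()  # Juju-side view of the full unit count
    flagged = self._count_units_reporting_stuck_raft()  # hypothetical helper
    return flagged == planned
```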
