[DPE-3684] Reinitialise raft #611

Open · wants to merge 90 commits into main

Conversation

@dragomirp (Contributor) commented on Sep 3, 2024:

The Syncobj Raft implementation, used as a standalone DCS for Patroni, cannot elect a leader if the cluster loses quorum and becomes read-only. This prevents Patroni from automatically switching over, even when sync_standbys are available in the cluster and could take over as primary.

This PR adds logic to detect when the Raft cluster becomes read-only and to reinitialise it if a sync_standby is available to become the new primary.
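
For context, a minimal sketch of what "read-only" means here, mirroring the raft status fields used later in this diff (`has_quorum` plus an optional leader with a `host` attribute); this is an illustration, not the PR's exact code:

```python
# Minimal sketch, assuming the raft status dict exposes "has_quorum" and an
# optional "leader" object with a .host attribute (as used later in this diff).
def raft_cluster_is_stuck(raft_status: dict, member_ip: str) -> bool:
    """A read-only raft cluster: quorum is lost and there is no usable leader."""
    return not raft_status["has_quorum"] and (
        not raft_status["leader"] or raft_status["leader"].host == member_ip
    )
```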

codecov bot commented on Sep 3, 2024:

Codecov Report

Attention: Patch coverage is 85.07463% with 30 lines in your changes missing coverage. Please review.

Project coverage is 72.08%. Comparing base (7219d1e) to head (546549b).

Files with missing lines    Patch %    Lines
src/charm.py                80.64%     15 Missing and 9 partials ⚠️
src/cluster.py              92.20%     4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
+ Coverage   70.89%   72.08%   +1.18%     
==========================================
  Files          12       12              
  Lines        3048     3238     +190     
  Branches      539      593      +54     
==========================================
+ Hits         2161     2334     +173     
- Misses        771      780       +9     
- Partials      116      124       +8     




@pytest.mark.group(1)
@markers.juju3
@dragomirp (author):

Only tested on Juju 3 for the moment. We should be able to reuse the wrapper around force removal to enable the tests on Juju 2.

):
logger.info("%s is raft candidate" % self.charm.unit.name)
data_flags["raft_candidate"] = "True"
self.charm.unit_peer_data.update(data_flags)
@dragomirp (author):

These are the initial flags that trigger the recovery attempt. If we need to disable this (e.g. on stable releases), it would be easiest to do from here.
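
If such a switch is wanted, a minimal sketch of a config gate building on the snippet above; the `experimental_raft_recovery` option is purely hypothetical and not part of this PR:

```python
# Hypothetical config gate (the option name is an assumption, not in this PR):
# skip flagging this unit as a recovery candidate when recovery is disabled.
if not self.charm.config.get("experimental_raft_recovery", True):
    logger.info("Raft recovery disabled by config; not flagging %s", self.charm.unit.name)
    return
data_flags["raft_candidate"] = "True"
self.charm.unit_peer_data.update(data_flags)
```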

Comment on lines +763 to +764
for attempt in Retrying(wait=wait_fixed(5)):
with attempt:
@dragomirp (author):

We have to define a sane timeout for the cases when the recovery hijacks execution.
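
A sketch of one way to bound this loop with tenacity, which is already in use here (`wait_fixed`); the 300-second cap is an arbitrary example, not an agreed value:

```python
# Sketch: cap how long the recovery retry loop may hijack execution.
from tenacity import Retrying, stop_after_delay, wait_fixed

for attempt in Retrying(wait=wait_fixed(5), stop=stop_after_delay(300)):
    with attempt:
        ...  # existing recovery step
```

When the stop condition fires after a failed attempt, `Retrying` raises `RetryError` (or re-raises the last exception with `reraise=True`), so the caller would need to catch that and decide how to surface it.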

Comment on lines +607 to +610
partner_addrs=self.charm.async_replication.get_partner_addresses()
if not no_peers
else [],
peers_ips=self.peers_ips if not no_peers else set(),
@dragomirp (author):

Ignoring the other nodes when recovering; they should rejoin afterwards.

self.app_peer_data.pop("raft_selected_candidate", None)
self.app_peer_data.pop("raft_followers_stopped", None)

def _raft_reinitialisation(self) -> None:
@dragomirp (author):

If there's only one unit (sync standby and leader), this should execute in one go.

Comment on lines +650 to +654
# Check whether raft is stuck.
if self.has_raft_keys():
self._raft_reinitialisation()
logger.debug("Early exit on_peer_relation_changed: stuck raft recovery")
return False
@dragomirp (author):

This will hijack execution until recovery completes. We should think of a way to detect manual recovery.

except Exception:
logger.warning("Remove raft member: Unable to get health status")
health_status = {}
if health_status.get("role") in ("leader", "master") or health_status.get(
@dragomirp (author):

We shouldn't end up with a stuck cluster that still reports a leader role, but this handles it just in case.

Comment on lines +533 to +535
if not candidate:
logger.warning("Stuck raft has no candidate")
return
@dragomirp (author):

We can't proceed with automatic recovery if there's no sync standby, since the first unit in the new raft cluster will become the leader, and promoting an async replica may cause data loss. We should consider setting a status and providing a manual recovery path in case the user wants to promote a given replica anyway.
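
As a sketch of the status idea only (the message text and the assumption that the unit is reachable via `self.charm` here are mine):

```python
# Sketch: surface the dead end instead of returning silently.
from ops.model import BlockedStatus

if not candidate:
    logger.warning("Stuck raft has no candidate")
    # Assumes this code can reach the unit via self.charm; message is illustrative.
    self.charm.unit.status = BlockedStatus(
        "raft recovery found no sync standby; manual promotion required"
    )
    return
```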

@dragomirp changed the title from [WIP][DPE-3684] Reinitialise raft to [DPE-3684] Reinitialise raft on Sep 20, 2024
if not raft_status["has_quorum"] and (
not raft_status["leader"] or raft_status["leader"].host == member_ip
):
logger.warning("Remove raft member: Stuck raft cluster detected")
@dragomirp (author):

We should most probably set a status here to indicate what's going on.

@dragomirp dragomirp marked this pull request as ready for review September 20, 2024 10:33
@dragomirp dragomirp requested review from a team, taurus-forever, marceloneppel and lucasgameiroborges and removed request for a team September 20, 2024 10:33
@@ -710,6 +731,57 @@ def primary_changed(self, old_primary: str) -> bool:
primary = self.get_primary()
return primary != old_primary

def remove_raft_data(self) -> None:

@lucasgameiroborges (Member) left a comment:

LGTM! I just have one theoretical question:

Suppose a cluster with 5 nodes: 1 leader, 2 sync replicas and 2 async replicas. Let's say that for some reason this cluster loses connectivity in a weird way and gets split into 2 partitions: leader + sync + async (partition 1) and sync + async (partition 2).

Since we eliminated the need for a quorum when electing a new leader, couldn't the sync node in partition 2 elect itself as the new leader while partition 1 keeps working as before because it still has quorum, leaving us with 2 distinct clusters?

It's a weird edge case, yes; I'm just curious whether we have an answer to it.


await ops_test.model.wait_for_idle(status="active", timeout=600, idle_period=45)

await are_writes_increasing(ops_test, secondary)
Reviewer (Member):

Another theoretical question here: in a 2-node cluster (primary + sync replica), if the sync replica goes down or becomes unresponsive, how can the primary continue to accept write requests (writes increasing)? Isn't the whole point of having a sync replica to make sure that every write operation is acknowledged by another node besides the primary?

@dragomirp (author):

> Since we eliminated the need for a quorum when electing a new leader, couldn't the sync node in partition 2 elect itself as the new leader while partition 1 keeps working as before because it still has quorum, leaving us with 2 distinct clusters?

We rely on Juju to know the full number of units. If not all units detect the loss of quorum, we don't reinitialise the raft cluster. When looping over the potential candidates, the Juju leader selects the new primary and puts it in the app peer data (see the sketch after this reply).

> Another theoretical question here: in a 2-node cluster (primary + sync replica), if the sync replica goes down or becomes unresponsive, how can the primary continue to accept write requests (writes increasing)? Isn't the whole point of having a sync replica to make sure that every write operation is acknowledged by another node besides the primary?

There's logic that maintains a list of units in the peer data; it should detect that the unit is gone and downgrade the cluster to a single node.
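
For illustration, the guard described above could look roughly like this (the helper name is a placeholder, not the PR's actual code; `Application.planned_units()` is the ops API for the Juju-side unit count):

```python
# Illustrative guard: only start reinitialisation once every planned Juju unit
# reports the raft cluster as stuck, so a partitioned minority can never
# trigger recovery on its own.
def _can_start_raft_recovery(self) -> bool:
    planned = self.app.planned_units()  # Juju-side view of the full unit count
    flagged = self._count_units_reporting_stuck_raft()  # hypothetical helper
    return flagged == planned
```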
