[DPE-3684] Reinitialise raft #611

Open · wants to merge 90 commits into base: main

Commits (90)
05833c7
Use syncobj lib directly
dragomirp Aug 26, 2024
5753d1f
Bump libs
dragomirp Aug 26, 2024
e2b082e
Add raft removal test
dragomirp Aug 26, 2024
df0c37e
Merge branch 'main' into dpe-3684-syncobj
dragomirp Aug 29, 2024
c31c0d5
Merge branch 'main' into dpe-3684-syncobj
dragomirp Sep 2, 2024
6ae0dcc
Merge branch 'main' into dpe-3684-syncobj
dragomirp Sep 3, 2024
538f721
Reinitialise RAFT WIP
dragomirp Sep 3, 2024
29aaf18
Initial scaling test
dragomirp Sep 3, 2024
d050e6e
Endless loop
dragomirp Sep 3, 2024
7626c99
Try to do fast shutdown
dragomirp Sep 4, 2024
c869331
Merge branch 'main' into dpe-3684-reinitialise-raft
dragomirp Sep 4, 2024
774ea95
Correct rerendering of yaml
dragomirp Sep 5, 2024
76bbe84
Unit tests
dragomirp Sep 5, 2024
bdae203
Ignore connection error
dragomirp Sep 5, 2024
fb922e5
Catch exception
dragomirp Sep 5, 2024
e0bbb08
Sync replica stereo test
dragomirp Sep 5, 2024
895a8c4
Fix test
dragomirp Sep 5, 2024
e739a56
No down unit
dragomirp Sep 5, 2024
6f71b36
Merge branch 'main' into dpe-3684-reinitialise-raft
dragomirp Sep 9, 2024
ed56f3d
Add check_writes to test
dragomirp Sep 9, 2024
7c8c95c
Unit synchronisation WIP
dragomirp Sep 10, 2024
c7b26d2
Tweak synchronisation
dragomirp Sep 10, 2024
a4f0b39
Add logging
dragomirp Sep 10, 2024
02d08ff
Update endpoints after raft nuke
dragomirp Sep 10, 2024
b454d06
Fix unit tests
dragomirp Sep 10, 2024
a281f7a
Track down unit
dragomirp Sep 10, 2024
0f9ec81
Debug down unit exclusion
dragomirp Sep 11, 2024
c1f9596
Reraise
dragomirp Sep 11, 2024
47b2b6a
Debug excess key
dragomirp Sep 11, 2024
ad5b8e8
Ignore down unit
dragomirp Sep 11, 2024
3951708
Add logging to writes increasing check
dragomirp Sep 11, 2024
6afe354
Tweaks writes check
dragomirp Sep 11, 2024
243f1fb
Fix down unit skipping
dragomirp Sep 11, 2024
73abcd9
Merge branch 'main' into dpe-3684-reinitialise-raft
dragomirp Sep 11, 2024
15dfeb7
Cleanup and ip cache
dragomirp Sep 12, 2024
1717f27
Add tests
dragomirp Sep 12, 2024
d08eb1d
Add logging on early exit
dragomirp Sep 12, 2024
862660b
Protect against None peers
dragomirp Sep 12, 2024
c3bcac1
Fix quatro test
dragomirp Sep 12, 2024
90e0d83
Majority removal test
dragomirp Sep 12, 2024
507665d
Log cluster roles
dragomirp Sep 12, 2024
0dfa199
Test sleeps
dragomirp Sep 12, 2024
1b8c2d8
Try to remove the intialised flag
dragomirp Sep 12, 2024
ad2d7a9
Use private address
dragomirp Sep 13, 2024
a7bbe67
Log ip
dragomirp Sep 13, 2024
89c368f
Use primary key
dragomirp Sep 13, 2024
9efc6ca
Check for None leader
dragomirp Sep 13, 2024
ec6f88c
Try to reuse general health call
dragomirp Sep 13, 2024
de838b5
Force emit event changed
dragomirp Sep 14, 2024
1c56adf
Force emit event
dragomirp Sep 14, 2024
85d0ba3
Use app data to sync rejoin
dragomirp Sep 16, 2024
c317104
Missing var
dragomirp Sep 16, 2024
887f008
App data sync for stuck cluster detection
dragomirp Sep 16, 2024
24a3fe0
Cleanup logic
dragomirp Sep 16, 2024
ac7e217
Rename app candidate flag
dragomirp Sep 16, 2024
26a2bc6
Wrong condition
dragomirp Sep 16, 2024
63b5d2c
Check all raft_ keys
dragomirp Sep 17, 2024
3a4eda7
Rework rejoin logic
dragomirp Sep 17, 2024
bca3e3a
Cleanup cleanup
dragomirp Sep 17, 2024
0e9686a
Delete peers just once
dragomirp Sep 17, 2024
3974ea2
Increase timout
dragomirp Sep 17, 2024
4048015
Force start on rejoin
dragomirp Sep 17, 2024
819748f
Supress db reinit
dragomirp Sep 17, 2024
761d28d
Higher coverage
dragomirp Sep 17, 2024
0163017
Unit tests
dragomirp Sep 17, 2024
db875ad
Try not to reset members cache
dragomirp Sep 17, 2024
89a1fee
Get selected candidate from peer data
dragomirp Sep 18, 2024
95c5884
Reset members
dragomirp Sep 18, 2024
09b3fea
Rework synchronising cluster stop
dragomirp Sep 18, 2024
5d253c7
Make sure unit data is clean before cleaning the app data
dragomirp Sep 18, 2024
92dd704
No double reinitialisation
dragomirp Sep 18, 2024
723306d
Only fire in relation changed
dragomirp Sep 18, 2024
c86e7c7
Still exit if secret event and raft flags
dragomirp Sep 18, 2024
42049d3
Try to simplifly coordinated stopping
dragomirp Sep 18, 2024
4fae252
Add logging
dragomirp Sep 18, 2024
c9b1211
Try to set status
dragomirp Sep 19, 2024
cf77dfe
Log event
dragomirp Sep 19, 2024
d59626e
Supress transient secrets events
dragomirp Sep 19, 2024
400529a
Merge branch 'main' into dpe-3684-reinitialise-raft
dragomirp Sep 19, 2024
45f98e6
Multiple down units
dragomirp Sep 19, 2024
c413f76
Conditional cleanup
dragomirp Sep 19, 2024
5a57952
Reenable secret hooks check
dragomirp Sep 19, 2024
2ee87ed
Sync on reset primary
dragomirp Sep 19, 2024
aa61f37
Fix tests
dragomirp Sep 19, 2024
5396607
Add test to remove both async replicas
dragomirp Sep 20, 2024
ca127cb
Handle missing member in patroni cluster
dragomirp Sep 20, 2024
cf4e8ff
Merge branch 'main' into dpe-3684-reinitialise-raft
dragomirp Sep 20, 2024
88edde1
increase idle period
dragomirp Sep 20, 2024
d11b87e
Try to skip hooks that take too long
dragomirp Sep 22, 2024
546549b
Merge branch 'main' into dpe-3684-reinitialise-raft
dragomirp Sep 23, 2024
41 changes: 40 additions & 1 deletion poetry.lock

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -18,6 +18,8 @@ pydantic = "^1.10.18"
poetry-core = "^1.9.0"
pyOpenSSL = "^24.2.1"
jinja2 = "^3.1.4"
pysyncobj = "^0.3.12"
psutil = "^6.0.0"

[tool.poetry.group.charm-libs.dependencies]
# data_platform_libs/v0/data_interfaces.py
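For context, `pysyncobj` is the library Patroni's raft DCS is built on; the first commit ("Use syncobj lib directly") suggests the charm now talks to it in-process rather than shelling out to the `syncobj_admin` CLI. A minimal sketch of querying a raft node's status with the library follows — the host, port and password are illustrative placeholders, not values taken from this PR:

```python
# Minimal sketch: ask a Patroni raft node for its status via pysyncobj.
# Host, port and password below are illustrative placeholders.
from typing import Optional

from pysyncobj.utility import TcpUtility, UtilityException


def raft_status(host: str = "127.0.0.1", port: int = 2222, password: Optional[str] = None) -> dict:
    """Return the status dict reported by the raft node at host:port."""
    utility = TcpUtility(password=password, timeout=3)
    # The returned dict typically includes keys such as "leader" and "has_quorum".
    return utility.executeCommand(f"{host}:{port}", ["status"])


if __name__ == "__main__":
    try:
        print(raft_status())
    except UtilityException as err:
        print(f"Could not reach the raft node: {err}")
```

Run against a unit's raft port, this kind of probe shows whether the cluster still has quorum before any recovery decision is made.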
191 changes: 180 additions & 11 deletions src/charm.py
@@ -418,6 +418,10 @@ def _on_peer_relation_departed(self, event: RelationDepartedEvent) -> None:
logger.debug("Early exit on_peer_relation_departed: Skipping departing unit")
return

if self.has_raft_keys():
logger.debug("Early exit on_peer_relation_departed: Raft recovery in progress")
return

# Remove the departing member from the raft cluster.
try:
departing_member = event.departing_unit.name.replace("/", "-")
@@ -429,6 +433,12 @@ def _on_peer_relation_departed(self, event: RelationDepartedEvent) -> None:
)
event.defer()
return
except RetryError:
logger.warning(
"Early exit on_peer_relation_departed: Cannot get %s member IP"
% event.departing_unit.name
)
return

# Allow leader to update the cluster members.
if not self.unit.is_leader():
@@ -508,20 +518,163 @@ def _on_pgdata_storage_detaching(self, _) -> None:
if self.primary_endpoint:
self._update_relation_endpoints()

def _on_peer_relation_changed(self, event: HookEvent):
"""Reconfigure cluster members when something changes."""
def _stuck_raft_cluster_check(self) -> None:
"""Check for stuck raft cluster and reinitialise if safe."""
raft_stuck = False
all_units_stuck = True
candidate = self.app_peer_data.get("raft_selected_candidate")
for key, data in self._peers.data.items():
if key == self.app:
continue
if "raft_stuck" in data:
raft_stuck = True
else:
all_units_stuck = False
if not candidate and "raft_candidate" in data:
candidate = key

if not raft_stuck:
return

if not all_units_stuck:
logger.warning("Stuck raft not yet detected on all units")
return

if not candidate:
logger.warning("Stuck raft has no candidate")
return
Comment on lines +543 to +545 — Contributor Author (dragomirp):
We can't proceed with automatic recovery if there's no sync standby, since the first unit on the new raft cluster will become leader and there may be data loss if an async replica is promoted. We should consider setting a status and providing a path for manual recovery if the user wants to promote a given replica anyway.
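A manual override along these lines might be one option (purely illustrative — the action name, parameter and status handling below are assumptions, not part of this PR):

```python
# Hypothetical sketch of a manual override when no sync standby survived.
# The "promote-raft-candidate" action, its "unit-name" parameter and this
# handler are illustrative assumptions, not part of this PR.
from ops.model import MaintenanceStatus

def _on_promote_raft_candidate_action(self, event) -> None:
    """Let an operator accept the data-loss risk and pick an async replica."""
    if not self.unit.is_leader():
        event.fail("Run this action on the leader unit.")
        return
    if "raft_selected_candidate" in self.app_peer_data:
        event.fail("A raft candidate has already been selected.")
        return
    # Store the unit name so _raft_reinitialisation() treats it as the candidate.
    self.app_peer_data["raft_selected_candidate"] = event.params["unit-name"]
    self.unit.status = MaintenanceStatus("reinitialising raft with a manually selected candidate")
```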

if "raft_selected_candidate" not in self.app_peer_data:
logger.info("%s selected for new raft leader" % candidate.name)
self.app_peer_data["raft_selected_candidate"] = candidate.name

def _stuck_raft_cluster_rejoin(self) -> None:
"""Reconnect cluster to new raft."""
primary = None
for key, data in self._peers.data.items():
if key == self.app:
continue
if "raft_primary" in data:
primary = key
break
if primary and "raft_reset_primary" not in self.app_peer_data:
logger.info("Updating the primary endpoint")
self.app_peer_data.pop("members_ips", None)
self._add_to_members_ips(self._get_unit_ip(primary))
self.app_peer_data["raft_reset_primary"] = "True"
self._update_relation_endpoints()
if (
"raft_rejoin" not in self.app_peer_data
and "raft_followers_stopped" in self.app_peer_data
and "raft_reset_primary" in self.app_peer_data
):
logger.info("Notify units they can rejoin")
self.app_peer_data["raft_rejoin"] = "True"

def _stuck_raft_cluster_stopped_check(self) -> None:
"""Check that the cluster is stopped."""
if "raft_followers_stopped" in self.app_peer_data:
return

for key, data in self._peers.data.items():
if key == self.app:
continue
if "raft_stopped" not in data:
return

logger.info("Cluster is shut down")
self.app_peer_data["raft_followers_stopped"] = "True"

def _stuck_raft_cluster_cleanup(self) -> None:
"""Clean up raft recovery flags from app data once no unit still has raft keys set."""
for key, data in self._peers.data.items():
if key == self.app:
continue
for flag in data.keys():
if flag.startswith("raft_"):
return

logger.info("Cleaning up raft app data")
self.app_peer_data.pop("raft_rejoin", None)
self.app_peer_data.pop("raft_reset_primary", None)
self.app_peer_data.pop("raft_selected_candidate", None)
self.app_peer_data.pop("raft_followers_stopped", None)

def _raft_reinitialisation(self) -> None:
Comment — Contributor Author (dragomirp):
If there's only one unit (sync standby and leader), this should execute in one go.

"""Handle raft cluster loss of quorum."""
# Skip to cleanup if rejoining
if "raft_rejoin" not in self.app_peer_data:
if self.unit.is_leader():
self._stuck_raft_cluster_check()

if (
candidate := self.app_peer_data.get("raft_selected_candidate")
) and "raft_stopped" not in self.unit_peer_data:
self.unit_peer_data.pop("raft_stuck", None)
self.unit_peer_data.pop("raft_candidate", None)
self._patroni.remove_raft_data()
logger.info("Stopping %s" % self.unit.name)
self.unit_peer_data["raft_stopped"] = "True"

if self.unit.is_leader():
self._stuck_raft_cluster_stopped_check()

if (
candidate == self.unit.name
and "raft_primary" not in self.unit_peer_data
and "raft_followers_stopped" in self.app_peer_data
):
logger.info("Reinitialising %s as primary" % self.unit.name)
self._patroni.reinitialise_raft_data()
self.unit_peer_data["raft_primary"] = "True"

if self.unit.is_leader():
self._stuck_raft_cluster_rejoin()

if "raft_rejoin" in self.app_peer_data:
logger.info("Cleaning up raft unit data")
self.unit_peer_data.pop("raft_primary", None)
self.unit_peer_data.pop("raft_stopped", None)
self._patroni.start_patroni()

if self.unit.is_leader():
self._stuck_raft_cluster_cleanup()

def has_raft_keys(self):
"""Checks for the presence of raft recovery keys in peer data."""
for key in self.app_peer_data.keys():
if key.startswith("raft_"):
return True

for key in self.unit_peer_data.keys():
if key.startswith("raft_"):
return True
return False

def _peer_relation_changed_checks(self, event: HookEvent) -> bool:
"""Split off to reduce complexity."""
# Prevents the cluster from being reconfigured before it's bootstrapped in the leader.
if "cluster_initialised" not in self._peers.data[self.app]:
logger.debug("Deferring on_peer_relation_changed: cluster not initialized")
event.defer()
return
return False

# Check whether raft is stuck.
if self.has_raft_keys():
self._raft_reinitialisation()
logger.debug("Early exit on_peer_relation_changed: stuck raft recovery")
return False
Comment on lines +660 to +664 — Contributor Author (dragomirp):
This will hijack execution until recovery completes. We should think of a way to detect manual recovery.
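One way such manual recovery could be detected is to probe raft health before committing to the recovery path and drop the flags if quorum is already back. Purely illustrative — `has_raft_quorum` is an assumed helper, not something this PR adds:

```python
# Illustrative guard only: self._patroni.has_raft_quorum() is an assumed helper,
# not part of this PR. If quorum returned (e.g. the operator fixed raft by hand),
# clear the recovery flags instead of hijacking every hook.
if self.has_raft_keys():
    if self._patroni.has_raft_quorum():
        logger.info("Raft quorum restored externally; clearing recovery flags")
        for key in [k for k in self.unit_peer_data if k.startswith("raft_")]:
            self.unit_peer_data.pop(key, None)
        if self.unit.is_leader():
            for key in [k for k in self.app_peer_data if k.startswith("raft_")]:
                self.app_peer_data.pop(key, None)
    else:
        self._raft_reinitialisation()
        logger.debug("Early exit on_peer_relation_changed: stuck raft recovery")
        return False
```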


# If the unit is the leader, it can reconfigure the cluster.
if self.unit.is_leader() and not self._reconfigure_cluster(event):
event.defer()
return
return False

if self._update_member_ip():
return False
return True

def _on_peer_relation_changed(self, event: HookEvent):
"""Reconfigure cluster members when something changes."""
if not self._peer_relation_changed_checks(event):
return

# Don't update this member before it's part of the members list.
@@ -563,7 +716,8 @@ def _on_peer_relation_changed(self, event: HookEvent):
# Restart the workload if it's stuck on the starting state after a timeline divergence
# due to a backup that was restored.
if (
not self.is_primary
not self.has_raft_keys()
and not self.is_primary
and not self.is_standby_leader
and (
self._patroni.member_replication_lag == "unknown"
@@ -712,14 +866,16 @@ def add_cluster_member(self, member: str) -> None:
def _get_unit_ip(self, unit: Unit) -> Optional[str]:
"""Get the IP address of a specific unit."""
# Check if host is current host.
ip = None
if unit == self.unit:
return str(self.model.get_binding(PEER).network.bind_address)
ip = self.model.get_binding(PEER).network.bind_address
# Check if host is a peer.
elif unit in self._peers.data:
return str(self._peers.data[unit].get("private-address"))
ip = self._peers.data[unit].get("private-address")
# Return None if the unit is neither a peer nor the current unit.
else:
return None
if ip:
return str(ip)
return None

@property
def _hosts(self) -> set:
@@ -911,6 +1067,10 @@ def _on_leader_elected(self, event: LeaderElectedEvent) -> None:
if self.get_secret(APP_SCOPE, key) is None:
self.set_secret(APP_SCOPE, key, new_password())

if self.has_raft_keys():
self._raft_reinitialisation()
return

# Update the list of the current PostgreSQL hosts when a new leader is elected.
# Add this unit to the list of cluster members
# (the cluster should start with only this member).
@@ -1371,6 +1531,10 @@ def _can_run_on_update_status(self) -> bool:
if "cluster_initialised" not in self._peers.data[self.app]:
return False

if self.has_raft_keys():
logger.debug("Early exit on_update_status: Raft recovery in progress")
return False

if not self.upgrade.idle:
logger.debug("Early exit on_update_status: upgrade in progress")
return False
@@ -1417,7 +1581,8 @@ def _handle_workload_failures(self) -> bool:
return False

if (
not is_primary
not self.has_raft_keys()
and not is_primary
and not is_standby_leader
and not self._patroni.member_started
and "postgresql_restarted" in self._peers.data[self.unit]
@@ -1618,7 +1783,7 @@ def _can_connect_to_postgresql(self) -> bool:
return False
return True

def update_config(self, is_creating_backup: bool = False) -> bool:
def update_config(self, is_creating_backup: bool = False, no_peers: bool = False) -> bool:
"""Updates Patroni config file based on the existence of the TLS files."""
enable_tls = self.is_tls_enabled
limit_memory = None
@@ -1644,7 +1809,11 @@ def update_config(self, is_creating_backup: bool = False) -> bool:
self.app_peer_data.get("require-change-bucket-after-restore", None)
),
parameters=pg_parameters,
no_peers=no_peers,
)
if no_peers:
return True

if not self._is_workload_running:
# If Patroni/PostgreSQL has not started yet and TLS relations was initialised,
# then mark TLS as enabled. This commonly happens when the charm is deployed
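For reference, here is the raft recovery handshake as it reads from the diff above, condensed into a single sketch. The flag names are the ones the PR uses; the driver function itself is only a reading aid, not code from the charm:

```python
# Reading aid reconstructed from the diff above; not code from the PR itself.
# Unit flags live in unit peer data, app flags in app peer data.
UNIT_FLAGS = ("raft_stuck", "raft_candidate", "raft_stopped", "raft_primary")
APP_FLAGS = (
    "raft_selected_candidate",  # leader picks a sync standby as the new primary
    "raft_followers_stopped",   # leader saw raft_stopped on every unit
    "raft_reset_primary",       # leader rebuilt members_ips/endpoints around the new primary
    "raft_rejoin",              # leader tells the remaining units to restart Patroni
)

def recovery_sequence() -> list:
    """Rough order in which the flags appear during a successful recovery."""
    return [
        "units set raft_stuck (and raft_candidate on sync standbys) elsewhere in the charm",
        "leader sets raft_selected_candidate once every unit reports raft_stuck",
        "each unit removes its raft data, stops, and sets raft_stopped",
        "leader sets raft_followers_stopped",
        "the candidate reinitialises raft, starts as primary, sets raft_primary",
        "leader sets raft_reset_primary and then raft_rejoin",
        "units restart Patroni and drop raft_primary/raft_stopped",
        "leader clears the raft_* app flags once no unit flags remain",
    ]
```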