
The MySQL cluster is not recovered automatically under certain scenarios. #485

Open

peterctl opened this issue Aug 18, 2024 · 2 comments

Labels: bug (Something isn't working)
peterctl commented Aug 18, 2024

Killing multiple MySQL pods at the same time, without waiting for them to fully come back online, puts the cluster in an unhealthy state, but does not trigger the "reboot cluster from complete outage" flow to recover it.
The same happens during a network outage, which in this case was simulated by taking down all NICs on all the microk8s machines in a single AZ.

Steps to reproduce

  1. By killing pods.
    a. Kill multiple MySQL pods at the same time:
    kubectl delete pod mysql-0
    kubectl delete pod mysql-1
  2. By simulating a network outage.
    a. Take down all NICs on the microk8s machines in the affected AZ:
    juju machines -m microk8s | awk '/AZ3/ {print $1}' | while read machine; do
      juju ssh $machine '
        for nic in bond0 bondM; do  # adjust as necessary
          sudo ip link set dev $nic down
        done
      '
    done

    b. Wait 15 minutes for the network outage to affect MySQL, then reboot all the machines in the AZ to bring the network back online.
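Step 2b's reboot loop can be sketched as a dry run against canned input; the sample `juju machines` output, machine IDs, and column layout below are assumptions for illustration, and the reboot command is echoed rather than executed:

```shell
# Hypothetical 'juju machines -m microk8s' output; machine IDs, addresses,
# and AZ names are assumptions for illustration.
machines_output='0  started  10.0.0.1  juju-0  ubuntu@22.04  AZ1
1  started  10.0.0.2  juju-1  ubuntu@22.04  AZ3
2  started  10.0.0.3  juju-2  ubuntu@22.04  AZ3'

# Same AZ filter as the reproduction step; echo the reboot command
# instead of running it (dry run).
echo "$machines_output" | awk '/AZ3/ {print $1}' | while read machine; do
  echo "juju ssh $machine 'sudo reboot'"
done
```

In a live environment, replace the echo with the real `juju ssh $machine 'sudo reboot'` and the canned variable with `juju machines -m microk8s`.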

Expected behavior

The cluster goes offline and the "reboot cluster from complete outage" flow is triggered to recover it.

Actual behavior

The leader unit does not go offline, so the reboot cluster flow is never triggered, leaving the cluster in an inconsistent state:

mysql/0          maintenance  idle       10.1.1.10         offline
mysql/1          maintenance  idle       10.1.1.11         offline
mysql/2*         maintenance  idle       10.1.1.12         Unable to get member state

Versions

Operating system: Ubuntu 22.04 Jammy Jellyfish
Juju CLI: 3.4.3
Juju agent: 3.4.3
Charm revision: 153 (channel 8.0/stable)
microk8s: 1.28

Additional context

To recover the cluster, the mysqld_safe Pebble service needs to be restarted inside the leader unit:

juju ssh --container mysql mysql/leader /charm/bin/pebble restart mysqld_safe
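After that restart, the charm's reboot-from-complete-outage flow should bring the units back. A minimal sketch for confirming that every unit reports a healthy workload status again; the `juju status` lines below are assumed sample output (real usage would pipe `juju status mysql` instead):

```shell
# Sample (assumed) 'juju status' unit lines after a successful recovery.
sample_status='mysql/0   active  idle  10.1.1.10
mysql/1   active  idle  10.1.1.11
mysql/2*  active  idle  10.1.1.12'

# Any unit whose workload status (second column) is not "active" is
# still unhealthy after the restart.
unhealthy=$(echo "$sample_status" | awk '$2 != "active" {print $1}')
if [ -z "$unhealthy" ]; then
  echo "cluster recovered"
else
  echo "still unhealthy: $unhealthy"
fi
```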
@peterctl peterctl added the bug Something isn't working label Aug 18, 2024
@paulomach (Contributor) commented:
Thanks for the detailed report @peterctl, it will be useful to create a test.
We will queue it for a fix.

@shayancanonical shayancanonical self-assigned this Sep 19, 2024