
The MySQL cluster is not recovered automatically under certain scenarios. #485

Open

peterctl opened this issue Aug 18, 2024 · 2 comments

Labels: bug (Something isn't working)
peterctl commented Aug 18, 2024

Killing multiple MySQL pods at the same time, without waiting for them to fully come back online, puts the cluster in an unhealthy state, but does not trigger the "reboot cluster from complete outage" flow to recover it.
The same happens during a network outage, which in this case was simulated by taking down all NICs on all the microk8s machines in a single AZ.

Steps to reproduce

  1. By killing pods.
    a. Kill multiple MySQL pods at the same time:
    kubectl delete pod mysql-0
    kubectl delete pod mysql-1
  2. By simulating a network outage.
    a. Take down all NICs on the microk8s machines in the affected AZ:
    juju machines -m microk8s | awk '/AZ3/ {print $1}' | while read machine; do
      juju ssh $machine '
        for nic in bond0 bondM; do  # adjust as necessary
          sudo ip link set dev $nic down
        done
      '
    done

    b. Wait 15 minutes for the network outage to affect MySQL, then reboot all the machines in the AZ to bring the network back online.
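Step 2b's reboot loop can be sketched as a dry run against canned input; the sample `juju machines` output, machine IDs, and column layout below are assumptions for illustration, and the reboot command is echoed rather than executed:

```shell
# Hypothetical 'juju machines -m microk8s' output; machine IDs, addresses,
# and AZ names are assumptions for illustration.
machines_output='0  started  10.0.0.1  juju-0  ubuntu@22.04  AZ1
1  started  10.0.0.2  juju-1  ubuntu@22.04  AZ3
2  started  10.0.0.3  juju-2  ubuntu@22.04  AZ3'

# Same AZ filter as the reproduction step; echo the reboot command
# instead of running it (dry run).
echo "$machines_output" | awk '/AZ3/ {print $1}' | while read machine; do
  echo "juju ssh $machine 'sudo reboot'"
done
```

In a live environment, replace the echo with the real `juju ssh $machine 'sudo reboot'` and the canned variable with `juju machines -m microk8s`.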

Expected behavior

The cluster goes offline and the "reboot cluster from complete outage" flow is triggered to recover it.

Actual behavior

The leader unit does not go offline, so the reboot cluster flow is never triggered, leaving the cluster in an inconsistent state:

mysql/0          maintenance  idle       10.1.1.10         offline
mysql/1          maintenance  idle       10.1.1.11         offline
mysql/2*         maintenance  idle       10.1.1.12         Unable to get member state

Versions

Operating system: Ubuntu 22.04 Jammy Jellyfish
Juju CLI: 3.4.3
Juju agent: 3.4.3
Charm revision: 153 (channel 8.0/stable)
microk8s: 1.28

Additional context

To recover the cluster, the mysqld_safe Pebble service needs to be restarted inside the leader unit:

juju ssh --container mysql mysql/leader /charm/bin/pebble restart mysqld_safe
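After that restart, the charm's reboot-from-complete-outage flow should bring the units back. A minimal sketch for confirming that every unit reports a healthy workload status again; the `juju status` lines below are assumed sample output (real usage would pipe `juju status mysql` instead):

```shell
# Sample (assumed) 'juju status' unit lines after a successful recovery.
sample_status='mysql/0   active  idle  10.1.1.10
mysql/1   active  idle  10.1.1.11
mysql/2*  active  idle  10.1.1.12'

# Any unit whose workload status (second column) is not "active" is
# still unhealthy after the restart.
unhealthy=$(echo "$sample_status" | awk '$2 != "active" {print $1}')
if [ -z "$unhealthy" ]; then
  echo "cluster recovered"
else
  echo "still unhealthy: $unhealthy"
fi
```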
@peterctl peterctl added the bug Something isn't working label Aug 18, 2024
@paulomach (Contributor) commented:
Thanks for the detailed report @peterctl, it will be useful to create a test.
We will queue it for a fix.

@shayancanonical shayancanonical self-assigned this Sep 19, 2024