Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network checkpoints don't work reliably on RHEL 8.2 #13532

Closed
martinpitt opened this issue Feb 11, 2020 · 3 comments · Fixed by #13534
Closed

Network checkpoints don't work reliably on RHEL 8.2 #13532

martinpitt opened this issue Feb 11, 2020 · 3 comments · Fixed by #13534
Assignees
Labels
flake unstable test

Comments

@martinpitt
Copy link
Member

martinpitt commented Feb 11, 2020

testCheckpoint on RHEL 8.2 is very unreliable. It often needs 3 retries, and several more test runs to succeed.

This reproduces locally very reliably. When this happens, the web UI just shows "Deactivating.." for the interface and never shows the "Testing connection..." curtain. The interface in the VM immediately gets shut down. After a few minutes the UI gives up and shows the "Disconnected" curtain that the test logs also show.

Journal (from virsh console) during this:

NetworkManager[830]: <info>  [1581412395.1358] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=1349 uid=0 result="success"
NetworkManager[830]: <info>  [1581412395.1586] device (eth0): state change: activated -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
NetworkManager[830]: <info>  [1581412395.1589] manager: NetworkManager state is now CONNECTED_LOCAL
NetworkManager[830]: <info>  [1581412395.1598] audit: op="device-disconnect" interface="eth0" ifindex=2 pid=1349 uid=0 result="success"
dbus-daemon[764]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.9' (uid=0 pid=830 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
NetworkManager[830]: <info>  [1581412396.2177] checkpoint[0x563e2f5e9210]: destroy /org/freedesktop/NetworkManager/Checkpoint/1
NetworkManager[830]: <info>  [1581412396.2181] audit: op="checkpoint-destroy" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=1349 uid=0 result="success"
PackageKit[1399]: get-updates transaction /1_cadcbcbd from uid 0 finished with failed after 2930ms
systemd[1]: Starting Network Manager Script Dispatcher Service...
dbus-daemon[764]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
NetworkManager[830]: <info>  [1581412397.7787] device (eth0): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
NetworkManager[830]: <info>  [1581412397.7798] dhcp4 (eth0): canceled DHCP transaction
NetworkManager[830]: <info>  [1581412397.7798] dhcp4 (eth0): state changed bound -> done
systemd[1]: Started Network Manager Script Dispatcher Service.

The checkpointing seems to work reliably when I sit() right before the test disables the interface, and I either do it manually (several times), or I just simply let some seconds pass. This suggests a race condition in either NM or the UI. I somewhat suspect NM as this doesn't happen on any other OS.

@martinpitt martinpitt added the flake unstable test label Feb 11, 2020
@martinpitt
Copy link
Member Author

martinpitt commented Feb 11, 2020

When it works, it looks like this:

<info>  [1581413526.6813] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=1381 uid=0 result="success"
<info>  [1581413526.7072] device (eth0): state change: activated -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1581413526.7077] manager: NetworkManager state is now CONNECTED_LOCAL
<info>  [1581413526.7089] audit: op="device-disconnect" interface="eth0" ifindex=2 pid=1381 uid=0 result="success"
[system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.9' (uid=0 pid=838 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Starting Network Manager Script Dispatcher Service...
[system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Started Network Manager Script Dispatcher Service.
<info>  [1581413526.9960] device (eth0): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1581413526.9971] dhcp4 (eth0): canceled DHCP transaction
<info>  [1581413526.9972] dhcp4 (eth0): state changed bound -> done
<info>  [1581413533.6885] checkpoint[0x55c564705210]: rollback of /org/freedesktop/NetworkManager/Checkpoint/1
<info>  [1581413533.6902] device (eth0): Activation: starting connection 'System eth0' (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03)
<info>  [1581413533.6905] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
<info>  [1581413533.6913] manager: NetworkManager state is now CONNECTING
<info>  [1581413533.6917] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
<info>  [1581413533.7170] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
<info>  [1581413533.7174] dhcp4 (eth0): activation: beginning transaction (timeout in 45 seconds)
<info>  [1581413533.7476] dhcp4 (eth0): option dhcp_lease_time      => '86400'
<info>  [1581413533.7477] dhcp4 (eth0): option expiry               => '1581499933'
<info>  [1581413533.7478] dhcp4 (eth0): option host_name            => 'm1.cockpit.lan'
<info>  [1581413533.7478] dhcp4 (eth0): option ip_address           => '172.27.0.15'
<info>  [1581413533.7479] dhcp4 (eth0): option next_server          => '172.27.0.2'
<info>  [1581413533.7479] dhcp4 (eth0): option requested_broadcast_address => '1'
<info>  [1581413533.7479] dhcp4 (eth0): option requested_domain_name => '1'
<info>  [1581413533.7479] dhcp4 (eth0): option requested_domain_name_servers => '1'
<info>  [1581413533.7479] dhcp4 (eth0): option requested_domain_search => '1'
<info>  [1581413533.7480] dhcp4 (eth0): option requested_host_name  => '1'
<info>  [1581413533.7480] dhcp4 (eth0): option requested_interface_mtu => '1'
<info>  [1581413533.7480] dhcp4 (eth0): option requested_ms_classless_static_routes => '1'
<info>  [1581413533.7480] dhcp4 (eth0): option requested_nis_domain => '1'
<info>  [1581413533.7480] dhcp4 (eth0): option requested_nis_servers => '1'
<info>  [1581413533.7481] dhcp4 (eth0): option requested_ntp_servers => '1'
<info>  [1581413533.7481] dhcp4 (eth0): option requested_rfc3442_classless_static_routes => '1'
<info>  [1581413533.7481] dhcp4 (eth0): option requested_root_path  => '1'
<info>  [1581413533.7481] dhcp4 (eth0): option requested_routers    => '1'
<info>  [1581413533.7481] dhcp4 (eth0): option requested_static_routes => '1'
<info>  [1581413533.7482] dhcp4 (eth0): option requested_subnet_mask => '1'
<info>  [1581413533.7482] dhcp4 (eth0): option requested_time_offset => '1'
<info>  [1581413533.7482] dhcp4 (eth0): option requested_wpad       => '1'
<info>  [1581413533.7482] dhcp4 (eth0): option subnet_mask          => '255.255.255.0'
<info>  [1581413533.7482] dhcp4 (eth0): state changed unknown -> bound
<info>  [1581413533.7504] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
<info>  [1581413533.7676] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
<info>  [1581413533.7679] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
<info>  [1581413533.7684] manager: NetworkManager state is now CONNECTED_LOCAL
<info>  [1581413533.7692] device (eth0): Activation: successful, device activated.
iscsi.service: Unit cannot be reloaded because it is inactive.
<info>  [1581413535.1845] manager: NetworkManager state is now CONNECTED_SITE
<info>  [1581413535.1849] policy: set 'System eth0' (eth0) as default for IPv6 routing and DNS
<info>  [1581413538.7725] audit: op="checkpoint-destroy" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=1381 uid=0 result="fail" reason="checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 does not exist"
<info>  [1581413539.0682] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/2" pid=1381 uid=0 result="success"
<info>  [1581413539.0877] audit: op="connection-update" uuid="5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03" name="System eth0" args="connection.timestamp,ipv4.addresses,ipv4.method" pid=1381 uid=0 result="success"

@martinpitt
Copy link
Member Author

I just (re)discovered the carefully tweaked timeouts in pkg/networkmanager/interfaces.js: curtain_time, settle_time, and rollback_time. At the start of the test, the VM is still so busy that the checkpoint gets destroyed before eth0 gets fully shut down, so the destroying succeeds. Unfortunately the .modify() call is async, it doesn't wait until the operation is actually finished.

@martinpitt
Copy link
Member Author

This seems to work reasonably reliably with settle_time = 3.0. I'll send a PR and discuss it there.

@martinpitt martinpitt self-assigned this Feb 11, 2020
martinpitt added a commit to martinpitt/cockpit that referenced this issue Feb 11, 2020
Similar to commit f1d06ab, but bump it harder. Right after
starting the test and cockpit, the VMs are fairly loaded (particularly
rhel-8-2), so that eth0 does not get shut down quickly enough and the
checkpoint_destroy() succeeds. As a result, Cockpit just disconnects,
the curtain never comes up, and there is never a restore.

Fixes cockpit-project#13532
martinpitt added a commit that referenced this issue Feb 12, 2020
Similar to commit f1d06ab, but bump it harder. Right after
starting the test and cockpit, the VMs are fairly loaded (particularly
rhel-8-2), so that eth0 does not get shut down quickly enough and the
checkpoint_destroy() succeeds. As a result, Cockpit just disconnects,
the curtain never comes up, and there is never a restore.

Fixes #13532
Closes #13534
martinpitt added a commit to martinpitt/cockpit that referenced this issue Feb 19, 2020
Similar to commit f1d06ab, but bump it harder. Right after
starting the test and cockpit, the VMs are fairly loaded (particularly
rhel-8-2), so that eth0 does not get shut down quickly enough and the
checkpoint_destroy() succeeds. As a result, Cockpit just disconnects,
the curtain never comes up, and there is never a restore.

Fixes cockpit-project#13532

Cherry-picked from master commit 7b115d1
martinpitt added a commit that referenced this issue Feb 19, 2020
Similar to commit f1d06ab, but bump it harder. Right after
starting the test and cockpit, the VMs are fairly loaded (particularly
rhel-8-2), so that eth0 does not get shut down quickly enough and the
checkpoint_destroy() succeeds. As a result, Cockpit just disconnects,
the curtain never comes up, and there is never a restore.

Fixes #13532

Cherry-picked from master commit 7b115d1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flake unstable test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant