test/verify/check-networking-settings TestNetworking.testNoConnectionSettings broken of F32 #13821

marusak · 2020-03-30T15:55:40Z

On F32 this test fails very often. (not broken though, sometimes also succeeds)

Screenshot always shows this 'Connection will be lost' message, which is obviously related.
@martinpitt maybe you have some idea what is exactly happening?

Logs:
1
2
3

martinpitt · 2020-03-30T16:30:27Z

This is due to the carefully selected timing in pkg/networkmanager/interfaces.js ; I already tried to adjust it in PR #13534 and PR #13545, but apparently that still was not enough.

marusak · 2020-03-30T16:47:59Z

ah right. can we do something like the following code?

for _ in 10:
    self.toggle_onoff(".panel-heading:contains('%s')" % iface)
    try:
        b.wait_in_text("tr:contains('Status')", "10.111.")
        break
    except timeoutException:
        <if dialog is open, close it>

martinpitt · 2020-03-31T05:47:43Z

@marusak: yes, that should work. For bonus points it could use a with testvm.TimeOut

martinpitt · 2020-04-30T11:03:47Z

Locally, it either works, or it already fails much earlier, like here or here

  File "test/verify/check-networking-settings", line 59, in testNoConnectionSettings
    b.wait_in_text("tr:contains('IPv4')", "Address 1.2.3.4/24")

Only just now I had another case of this; stack trace is

  File "test/verify/check-networking-settings", line 78, in testNoConnectionSettings
    b.wait_in_text("tr:contains('Status')", "10.111.")

Here is a recent CI example for that, the original logs above were all cleaned up.

martinpitt · 2020-04-30T14:55:28Z

I am trying with this patch:

--- test/verify/check-networking-settings
+++ test/verify/check-networking-settings
@@ -73,9 +73,15 @@ class TestNetworking(NetworkCase):
         m.execute('nmcli con del "%s"' % con_id)
         b.wait_in_text("tr:contains('IPv4')", "Automatic")
 
-        # Activate with ghost settings
+        # Activate with ghost settings; on busy machines this may run into the checkpointing
         self.toggle_onoff(".panel-heading:contains('%s')" % iface)
-        b.wait_in_text("tr:contains('Status')", "10.111.")
+        try:
+            b.wait_in_text("tr:contains('Status')", "10.111.")
+        except Error:
+            self.assertTrue(b.is_visible("#confirm-breaking-change-text"))
+            print("Got checkpoint dialog, acknowledging")
+            b.click("#confirm-breaking-change-popup .pf-m-danger")
+            b.wait_in_text("tr:contains('Status')", "10.111.")
 
         # Check again that we now have connection settings
         con_id = wait(lambda: self.iface_con_id(iface))

But it doesn't work -- after accepting the change in the dialog, the interface is still in the previous "manual" mode, and has IP 1.2.3.4:

# nmcli con show eth2 
ipv4.method:                            manual
ipv4.addresses:                         1.2.3.4/24

In the journal I see that NM attempted DHCP, but the checkpoint rolled it back in one (!!) second:

<info>  [1588257364.1028] manager: (eth2): new Ethernet device (/org/freedesktop/NetworkManager/Devices/6)
<info>  [1588257364.1829] device (eth2): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
<info>  [1588257364.1848] device (eth2): carrier: link connected
<info>  [1588257364.1861] device (eth2): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
<info>  [1588257372.7236] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=20551 uid=0 result="success"
<info>  [1588257375.7154] audit: op="connection-add" uuid="7e380287-06b2-49a6-ab1f-fbfe5244884a" name="eth2" pid=20551 uid=0 result="success"
<info>  [1588257375.7174] policy: auto-activating connection 'eth2' (7e380287-06b2-49a6-ab1f-fbfe5244884a)
<info>  [1588257375.7189] device (eth2): Activation: starting connection 'eth2' (7e380287-06b2-49a6-ab1f-fbfe5244884a)
<info>  [1588257375.7209] device (eth2): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.7230] device (eth2): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.7654] device (eth2): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.7799] device (eth2): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.9903] device (eth2): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.9910] device (eth2): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.9983] device (eth2): Activation: successful, device activated.
<info>  [1588257377.6599] audit: op="checkpoint-create" arg="" pid=20551 uid=0 result="fail" reason="device 'eth2' is already included in checkpoint /org/freedesktop/NetworkManager/Checkpoint/1"
<info>  [1588257377.6679] device (eth2): state change: activated -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1588257377.6770] audit: op="device-disconnect" interface="eth2" ifindex=6 pid=20551 uid=0 result="success"
<info>  [1588257377.6776] device (eth2): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1588257377.9697] audit: op="connection-delete" uuid="7e380287-06b2-49a6-ab1f-fbfe5244884a" name="eth2" pid=23244 uid=0 result="success"
<info>  [1588257378.2102] audit: op="checkpoint-create" arg="" pid=20551 uid=0 result="fail" reason="device 'eth2' is already included in checkpoint /org/freedesktop/NetworkManager/Checkpoint/1"
<info>  [1588257378.2268] device (eth2): Activation: starting connection 'eth2' (001aec24-37af-4930-9ada-e5174aca235d)
<info>  [1588257378.2282] audit: op="connection-add-activate" uuid="001aec24-37af-4930-9ada-e5174aca235d" name="eth2" pid=20551 uid=0 result="success"
<info>  [1588257378.2307] device (eth2): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
<info>  [1588257378.2355] device (eth2): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257378.2588] device (eth2): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257378.2606] dhcp4 (eth2): activation: beginning transaction (timeout in 45 seconds)
<info>  [1588257379.7252] checkpoint[0x7ff5e8003e50]: rollback of /org/freedesktop/NetworkManager/Checkpoint/1
<info>  [1588257379.7258] device (eth2): state change: ip-config -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1588257379.7303] device (eth2): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1588257379.7356] dhcp4 (eth2): canceled DHCP transaction
<info>  [1588257379.7362] dhcp4 (eth2): state changed unknown -> done
<info>  [1588257379.7753] audit: op="checkpoint-destroy" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=20551 uid=0 result="fail" reason="checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 does not exist"
<info>  [1588257438.2958] audit: op="connection-add" uuid="7e380287-06b2-49a6-ab1f-fbfe5244884a" name="eth2" pid=20551 uid=0 result="success"
<info>  [1588257438.2990] policy: auto-activating connection 'eth2' (7e380287-06b2-49a6-ab1f-fbfe5244884a)
<info>  [1588257438.3120] device (eth2): Activation: starting connection 'eth2' (7e380287-06b2-49a6-ab1f-fbfe5244884a)
<info>  [1588257438.3151] device (eth2): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.3299] device (eth2): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.3620] device (eth2): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.3697] device (eth2): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.4210] device (eth2): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.4215] device (eth2): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.4256] device (eth2): Activation: successful, device activated.

I don't understand what's going on there -- checkpoint 1 gets created three times, rolled back 7s after initial creation and 1s after last creation, and DHCP gets cut off after 1s. Thus the whole transaction does not just include the last step 'activate device', but apparently the previous nmcli con del as well.

@mvollmer, can you make some sense of this? At this point it seems to me that it's not really the test that is broken, but the checkpoint logic?

mvollmer · 2020-05-04T10:20:40Z

can you make some sense of this?

Just from reading the log:

Chjeckpoint 1 is only created once, but there are two more attempts at creating checkpoints which both fail because Checkpoint 1 is still around. (This shouldn't happen ideally, but there is no code in Cockpit to somehow serialize checkpoint creation for a single interface.)

Checkpoint 1 is created at time 1588257372.7236, rolled back at time 1588257379.7252 and attempted to be destroyed at time 1588257379.7753. This is the normal pattern for when a checkpoint has prevented a disconnection, and the "this will disconnect you" dialog should be open at this point.

However, I don't think we expect a disconnect here (right?), and Cockpit should destroy the checkpoint before it is rolled back, right after "connection-add" has succeded at 1588257375.9983

So at first look:

Cockpit takes much too long to destroy the checkpoint after the guarded operation has succeeded.
The test manages to trigger actions in Cockpit in that time anyway, which shouldn't happen ideally but is probably harmless here.

I'll do some code reading and I'll try to reproduce this.

mvollmer · 2020-05-04T10:54:25Z

Ok, this seems to be the very race that is explained in the comments:

 * The basic idea is to perform the following steps:
 *
 * 1) Create a checkpoint with automatic rollback
 * 2) Make the change
 * 3) Destroy the checkpoint

 * For a change that _doesn't_ break connectivity, this whole process
 * is inherently a race: Steps 2 and 3 need to complete before the
 * checkpoint created in step 1 reaches its timeout.

settle_time is pretty long now since 7b115d1, 3 seconds, while rollback_time is kinda low, 7 seconds.

The "connection-add" operation seems to be pretty slow: 1588257375.9983 - 1588257372.7236 = 3.274 and together with a settle_time of 3 we get pretty close to the rollback_time of 7.

I propose to bring the settle_time down to something better suited for normal scenarios, and to switch off checkpointing entirely during CI tests (except for the tests that actually need it).

A long settle_time time causes the checkpointing race to get pretty tight, and checkpoints would be unexpectedly rolled back in our integration tests. On the other hand, we had to increased the settle_time in 7b115d1 and f1d06ab to make the checkpointing tests themselves more robust. Thus, we reset the settle_time that is appropriate for manual use, switch off checkpointing entirely for most of the integration tests, and use a long settle_time when testing checkpoints themselves. Fixes #13821

A long settle_time time causes the checkpointing race to get pretty tight, and checkpoints would be unexpectedly rolled back in our integration tests. On the other hand, we had to increased the settle_time in 7b115d1 and f1d06ab to make the checkpointing tests themselves more robust. Thus, we reset the settle_time to something that is appropriate for manual use, switch off checkpointing entirely for most of the integration tests, and use a long settle_time when testing checkpoints themselves. Fixes cockpit-project#13821

A long settle_time time causes the checkpointing race to get pretty tight, and checkpoints would be unexpectedly rolled back in our integration tests. On the other hand, we had to increased the settle_time in 7b115d1 and f1d06ab to make the checkpointing tests themselves more robust. Thus, we reset the settle_time to something that is appropriate for manual use, switch off checkpointing entirely for most of the integration tests, and use a long settle_time when testing checkpoints themselves. Fixes #13821 Closes #14026

marusak added the flake unstable test label Mar 30, 2020

marusak mentioned this issue Apr 10, 2020

Image refresh for fedora-32 cockpit-project/bots#741

Merged

1 task

martinpitt self-assigned this Apr 30, 2020

martinpitt assigned mvollmer and unassigned martinpitt Apr 30, 2020

martinpitt mentioned this issue May 1, 2020

Fix some network test flakes #14000

Merged

mvollmer mentioned this issue May 4, 2020

networking: Avoid checkpointing in tests and use custom settle_time #14016

Closed

mvollmer mentioned this issue May 5, 2020

networking: Avoid checkpointing in tests and use custom settle_time #14026

Merged

martinpitt closed this as completed in #14026 May 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test/verify/check-networking-settings TestNetworking.testNoConnectionSettings broken of F32 #13821

test/verify/check-networking-settings TestNetworking.testNoConnectionSettings broken of F32 #13821

marusak commented Mar 30, 2020

martinpitt commented Mar 30, 2020

marusak commented Mar 30, 2020

martinpitt commented Mar 31, 2020

martinpitt commented Apr 30, 2020

martinpitt commented Apr 30, 2020

mvollmer commented May 4, 2020

mvollmer commented May 4, 2020

test/verify/check-networking-settings TestNetworking.testNoConnectionSettings broken of F32 #13821

test/verify/check-networking-settings TestNetworking.testNoConnectionSettings broken of F32 #13821

Comments

marusak commented Mar 30, 2020

martinpitt commented Mar 30, 2020

marusak commented Mar 30, 2020

martinpitt commented Mar 31, 2020

martinpitt commented Apr 30, 2020

martinpitt commented Apr 30, 2020

mvollmer commented May 4, 2020

mvollmer commented May 4, 2020