Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test/verify/check-networking-settings TestNetworking.testNoConnectionSettings broken of F32 #13821

Closed
marusak opened this issue Mar 30, 2020 · 7 comments · Fixed by #14026
Closed
Assignees
Labels
flake unstable test

Comments

@marusak
Copy link
Member

marusak commented Mar 30, 2020

On F32 this test fails very often. (not broken though, sometimes also succeeds)

Screenshot always shows this 'Connection will be lost' message, which is obviously related.
@martinpitt maybe you have some idea what is exactly happening?

Logs:
1
2
3

@marusak marusak added the flake unstable test label Mar 30, 2020
@martinpitt
Copy link
Member

This is due to the carefully selected timing in pkg/networkmanager/interfaces.js ; I already tried to adjust it in PR #13534 and PR #13545, but apparently that still was not enough.

@marusak
Copy link
Member Author

marusak commented Mar 30, 2020

ah right. can we do something like the following code?

for _ in 10:
    self.toggle_onoff(".panel-heading:contains('%s')" % iface)
    try:
        b.wait_in_text("tr:contains('Status')", "10.111.")
        break
    except timeoutException:
        <if dialog is open, close it>

@martinpitt
Copy link
Member

@marusak: yes, that should work. For bonus points it could use a with testvm.TimeOut

@martinpitt
Copy link
Member

Locally, it either works, or it already fails much earlier, like here or here

  File "test/verify/check-networking-settings", line 59, in testNoConnectionSettings
    b.wait_in_text("tr:contains('IPv4')", "Address 1.2.3.4/24")

Only just now I had another case of this; stack trace is

  File "test/verify/check-networking-settings", line 78, in testNoConnectionSettings
    b.wait_in_text("tr:contains('Status')", "10.111.")

Here is a recent CI example for that, the original logs above were all cleaned up.

@martinpitt martinpitt self-assigned this Apr 30, 2020
@martinpitt
Copy link
Member

I am trying with this patch:

--- test/verify/check-networking-settings
+++ test/verify/check-networking-settings
@@ -73,9 +73,15 @@ class TestNetworking(NetworkCase):
         m.execute('nmcli con del "%s"' % con_id)
         b.wait_in_text("tr:contains('IPv4')", "Automatic")
 
-        # Activate with ghost settings
+        # Activate with ghost settings; on busy machines this may run into the checkpointing
         self.toggle_onoff(".panel-heading:contains('%s')" % iface)
-        b.wait_in_text("tr:contains('Status')", "10.111.")
+        try:
+            b.wait_in_text("tr:contains('Status')", "10.111.")
+        except Error:
+            self.assertTrue(b.is_visible("#confirm-breaking-change-text"))
+            print("Got checkpoint dialog, acknowledging")
+            b.click("#confirm-breaking-change-popup .pf-m-danger")
+            b.wait_in_text("tr:contains('Status')", "10.111.")
 
         # Check again that we now have connection settings
         con_id = wait(lambda: self.iface_con_id(iface))

But it doesn't work -- after accepting the change in the dialog, the interface is still in the previous "manual" mode, and has IP 1.2.3.4:

# nmcli con show eth2 
ipv4.method:                            manual
ipv4.addresses:                         1.2.3.4/24

In the journal I see that NM attempted DHCP, but the checkpoint rolled it back in one (!!) second:

<info>  [1588257364.1028] manager: (eth2): new Ethernet device (/org/freedesktop/NetworkManager/Devices/6)
<info>  [1588257364.1829] device (eth2): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
<info>  [1588257364.1848] device (eth2): carrier: link connected
<info>  [1588257364.1861] device (eth2): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
<info>  [1588257372.7236] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=20551 uid=0 result="success"
<info>  [1588257375.7154] audit: op="connection-add" uuid="7e380287-06b2-49a6-ab1f-fbfe5244884a" name="eth2" pid=20551 uid=0 result="success"
<info>  [1588257375.7174] policy: auto-activating connection 'eth2' (7e380287-06b2-49a6-ab1f-fbfe5244884a)
<info>  [1588257375.7189] device (eth2): Activation: starting connection 'eth2' (7e380287-06b2-49a6-ab1f-fbfe5244884a)
<info>  [1588257375.7209] device (eth2): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.7230] device (eth2): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.7654] device (eth2): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.7799] device (eth2): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.9903] device (eth2): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.9910] device (eth2): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
<info>  [1588257375.9983] device (eth2): Activation: successful, device activated.
<info>  [1588257377.6599] audit: op="checkpoint-create" arg="" pid=20551 uid=0 result="fail" reason="device 'eth2' is already included in checkpoint /org/freedesktop/NetworkManager/Checkpoint/1"
<info>  [1588257377.6679] device (eth2): state change: activated -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1588257377.6770] audit: op="device-disconnect" interface="eth2" ifindex=6 pid=20551 uid=0 result="success"
<info>  [1588257377.6776] device (eth2): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1588257377.9697] audit: op="connection-delete" uuid="7e380287-06b2-49a6-ab1f-fbfe5244884a" name="eth2" pid=23244 uid=0 result="success"
<info>  [1588257378.2102] audit: op="checkpoint-create" arg="" pid=20551 uid=0 result="fail" reason="device 'eth2' is already included in checkpoint /org/freedesktop/NetworkManager/Checkpoint/1"
<info>  [1588257378.2268] device (eth2): Activation: starting connection 'eth2' (001aec24-37af-4930-9ada-e5174aca235d)
<info>  [1588257378.2282] audit: op="connection-add-activate" uuid="001aec24-37af-4930-9ada-e5174aca235d" name="eth2" pid=20551 uid=0 result="success"
<info>  [1588257378.2307] device (eth2): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
<info>  [1588257378.2355] device (eth2): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257378.2588] device (eth2): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257378.2606] dhcp4 (eth2): activation: beginning transaction (timeout in 45 seconds)
<info>  [1588257379.7252] checkpoint[0x7ff5e8003e50]: rollback of /org/freedesktop/NetworkManager/Checkpoint/1
<info>  [1588257379.7258] device (eth2): state change: ip-config -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1588257379.7303] device (eth2): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
<info>  [1588257379.7356] dhcp4 (eth2): canceled DHCP transaction
<info>  [1588257379.7362] dhcp4 (eth2): state changed unknown -> done
<info>  [1588257379.7753] audit: op="checkpoint-destroy" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=20551 uid=0 result="fail" reason="checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 does not exist"
<info>  [1588257438.2958] audit: op="connection-add" uuid="7e380287-06b2-49a6-ab1f-fbfe5244884a" name="eth2" pid=20551 uid=0 result="success"
<info>  [1588257438.2990] policy: auto-activating connection 'eth2' (7e380287-06b2-49a6-ab1f-fbfe5244884a)
<info>  [1588257438.3120] device (eth2): Activation: starting connection 'eth2' (7e380287-06b2-49a6-ab1f-fbfe5244884a)
<info>  [1588257438.3151] device (eth2): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.3299] device (eth2): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.3620] device (eth2): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.3697] device (eth2): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.4210] device (eth2): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.4215] device (eth2): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
<info>  [1588257438.4256] device (eth2): Activation: successful, device activated.

I don't understand what's going on there -- checkpoint 1 gets created three times, rolled back 7s after initial creation and 1s after last creation, and DHCP gets cut off after 1s. Thus the whole transaction does not just include the last step 'activate device', but apparently the previous nmcli con del as well.

@mvollmer, can you make some sense of this? At this point it seems to me that it's not really the test that is broken, but the checkpoint logic?

@mvollmer
Copy link
Member

mvollmer commented May 4, 2020

can you make some sense of this?

Just from reading the log:

Chjeckpoint 1 is only created once, but there are two more attempts at creating checkpoints which both fail because Checkpoint 1 is still around. (This shouldn't happen ideally, but there is no code in Cockpit to somehow serialize checkpoint creation for a single interface.)

Checkpoint 1 is created at time 1588257372.7236, rolled back at time 1588257379.7252 and attempted to be destroyed at time 1588257379.7753. This is the normal pattern for when a checkpoint has prevented a disconnection, and the "this will disconnect you" dialog should be open at this point.

However, I don't think we expect a disconnect here (right?), and Cockpit should destroy the checkpoint before it is rolled back, right after "connection-add" has succeded at 1588257375.9983

So at first look:

  • Cockpit takes much too long to destroy the checkpoint after the guarded operation has succeeded.
  • The test manages to trigger actions in Cockpit in that time anyway, which shouldn't happen ideally but is probably harmless here.

I'll do some code reading and I'll try to reproduce this.

@mvollmer
Copy link
Member

mvollmer commented May 4, 2020

Ok, this seems to be the very race that is explained in the comments:

 * The basic idea is to perform the following steps:
 *
 * 1) Create a checkpoint with automatic rollback
 * 2) Make the change
 * 3) Destroy the checkpoint
 * For a change that _doesn't_ break connectivity, this whole process
 * is inherently a race: Steps 2 and 3 need to complete before the
 * checkpoint created in step 1 reaches its timeout.

settle_time is pretty long now since 7b115d1, 3 seconds, while rollback_time is kinda low, 7 seconds.

The "connection-add" operation seems to be pretty slow: 1588257375.9983 - 1588257372.7236 = 3.274 and together with a settle_time of 3 we get pretty close to the rollback_time of 7.

I propose to bring the settle_time down to something better suited for normal scenarios, and to switch off checkpointing entirely during CI tests (except for the tests that actually need it).

mvollmer added a commit that referenced this issue May 4, 2020
A long settle_time time causes the checkpointing race to get pretty
tight, and checkpoints would be unexpectedly rolled back in our
integration tests.  On the other hand, we had to increased the
settle_time in 7b115d1 and f1d06ab to make the
checkpointing tests themselves more robust.

Thus, we reset the settle_time that is appropriate for manual use,
switch off checkpointing entirely for most of the integration tests,
and use a long settle_time when testing checkpoints themselves.

Fixes #13821
mvollmer added a commit to mvollmer/cockpit that referenced this issue May 5, 2020
A long settle_time time causes the checkpointing race to get pretty
tight, and checkpoints would be unexpectedly rolled back in our
integration tests.  On the other hand, we had to increased the
settle_time in 7b115d1 and f1d06ab to make the
checkpointing tests themselves more robust.

Thus, we reset the settle_time to something that is appropriate for
manual use, switch off checkpointing entirely for most of the
integration tests, and use a long settle_time when testing checkpoints
themselves.

Fixes cockpit-project#13821
martinpitt pushed a commit that referenced this issue May 5, 2020
A long settle_time time causes the checkpointing race to get pretty
tight, and checkpoints would be unexpectedly rolled back in our
integration tests.  On the other hand, we had to increased the
settle_time in 7b115d1 and f1d06ab to make the
checkpointing tests themselves more robust.

Thus, we reset the settle_time to something that is appropriate for
manual use, switch off checkpointing entirely for most of the
integration tests, and use a long settle_time when testing checkpoints
themselves.

Fixes #13821
Closes #14026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flake unstable test
Projects
None yet
3 participants