Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

machine: Put SSHConnection.execute() with stdout under timeout #517

Merged
merged 1 commit into from
Feb 11, 2020

Conversation

martinpitt
Copy link
Member

This caused tests hanging for half an hour on e. g. cleanup commands
like calling "journalctl" in cockpit's BaseCase.copy_journal().

Fixes #516

This caused tests hanging for half an hour on e. g. cleanup commands
like calling "journalctl" in cockpit's `BaseCase.copy_journal()`.

Fixes cockpit-project#516
@martinpitt
Copy link
Member Author

I'm deliberately triggering loads of tests here to make sure that we don't have legitimate .execute() calls that take longer than 2 minutes. If we do, we need to fix them with an explicit timeout= first.

@martinpitt
Copy link
Member Author

cockpit's rhel-8.2 branch also needs the less fix backported, I sent cockpit-project/cockpit#13529 for that.

@martinpitt
Copy link
Member Author

Here testCheckpoint fails (NM bug?), and it now only takes 140s.

@martinpitt
Copy link
Member Author

The subscription-manager failure is unrelated, due to a less 3.11 regression. Let's give upstream a day or two to fix it, otherwise I'll send a workaround PR. But I went through integration-tests/check-subscriptions and it doesn't use .execute(..., stdout=...), so this is unaffected by this change.

@marusak
Copy link
Member

marusak commented Feb 11, 2020

Here testCheckpoint fails (NM bug?), and it now only takes 140s.

I am actually bit concern about this. A bit afraid that this could bring a lot of flakes when it needs a bit more time. But it seems that when it gets broken, more time won't help, in which case it makes sense.

@martinpitt
Copy link
Member Author

@marusak: The timeout can be changed, the default is 2 mins; but normally checkpoint/restore should take about 10s. The failures that we saw so far were due to checkpointing not working at all, not it being slow.

@martinpitt
Copy link
Member Author

The subscription-manager failure

less 3.11.1 got released which fixes that. The test is green now.

However, I'm going to take a look at testCheckpoints on RHEL 8.2. This fails a little too often, and this at least deserves a bug report.

@martinpitt
Copy link
Member Author

A bit afraid that this could bring a lot of flakes when it needs a bit more time.

To clarify: This change does not affect the actual test code. It does affect the post-test cleanup that calls journalctl to get the logs. That's the bit that uses Machine.execute() with the stdout= option. Hence I think this is pretty safe, as the journalctl call is the only thing in Cockpit's test suite that uses stdout=.

Copy link
Member

@marusak marusak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: This change does not affect the actual test code. It does affect the post-test cleanup that calls journalctl to get the logs. That's the bit that uses Machine.execute() with the stdout= option. Hence I think this is pretty safe, as the journalctl call is the only thing in Cockpit's test suite that uses stdout=.

But if it fails, the whole test fails (example). But seems that indeed it works just fine and is much faster, so have my ack :)

@martinpitt
Copy link
Member Author

@marusak: Right, it would have failed before as well. It's another failure in the exception handling path of the original failure. :-) BTW, cockpit-project/cockpit#13534 is an attempt to actually fix the test on rhel-8-2.

@martinpitt martinpitt merged commit 8ad0049 into cockpit-project:master Feb 11, 2020
@martinpitt martinpitt deleted the ssh-exec-timeout branch February 11, 2020 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

testCheckpoint takes unreasonably long to fail
2 participants