
Add process locking #851

Merged · pirat89 merged 1 commit from the lock branch into oamg:master on May 9, 2024
Conversation

@dkubek (Member) commented Feb 22, 2024

This change addresses the potential risk of running multiple instances of Leapp simultaneously on a single system.

Considerations:

  • Insights Tasks (preupgrade and upgrade) currently allow multiple executions on the same system.
  • The absence of a filter/tooltip in the Insights user interface makes it challenging to determine whether a task is already running on a system.

A simple lock mechanism using a BSD lock has been implemented to prevent concurrent executions (see 1).

JIRA: https://issues.redhat.com/browse/OAMG-9827

@dkubek dkubek changed the title Add process locking to [WIP] Add process locking on Feb 22, 2024

Thank you for contributing to the Leapp project!

Please note that every PR needs to comply with the Leapp Guidelines and must pass all tests in order to be mergeable.
If you want to request a review or rebuild a package in copr, you can use the following commands as a comment:

  • review please @oamg/developers to notify leapp developers of the review request
  • /packit copr-build to submit a public copr build using packit

To launch regression testing, public members of the oamg organization can leave the following comments:

  • /rerun to schedule basic regression tests using this pr build and leapp-repository `master` as artifacts
  • /rerun 42 to schedule basic regression tests using this pr build and leapp-repository `PR42` as artifacts
  • /rerun-sst to schedule sst tests using this pr build and leapp-repository `master` as artifacts
  • /rerun-sst 42 to schedule sst tests using this pr build and leapp-repository `PR42` as artifacts

Please open a ticket in case you experience a technical problem with the CI (RH internal only).

Note: In case there are problems with tests not being triggered automatically on a new PR/commit, or tests pending for a long time, please consider rerunning the CI by commenting `leapp-ci build` (might require several comments). If the problem persists, contact leapp-infra.

@dkubek dkubek added the wip label Feb 22, 2024
@dkubek dkubek force-pushed the lock branch 2 times, most recently from 5a78eb1 to ac48f22, on March 1, 2024 at 11:22
@dkubek dkubek changed the title [WIP] Add process locking Add process locking Mar 1, 2024
@dkubek dkubek removed the wip label Mar 1, 2024
@fernflower (Member) commented:

(not a real review yet) Could you please rebase? The tests should be fixed by this

[Review thread on leapp/utils/lock.py — outdated, resolved]
@dkubek (Member, Author) commented Mar 13, 2024

@oamg/developers There is currently one point of failure I want to discuss, though I am not sure how relevant it is. It is possible that, with this change, leapp can fail under some extreme circumstances. The only one I can think of right now: leapp is killed (kill -9) after finishing the upgrade process but before releasing the lock file, leaving e.g. PID 1234 recorded in it. When we reboot and run leapp in the initramfs, some other process with the same PID might already be running (which we don't detect), and we would fail. This in itself is not very likely, as PID numbers will generally be quite high, while in the initramfs they will all be low.

Is it worth considering something like this? In such a case we could implement a flag to instruct leapp to ignore locking in the initramfs.

@pirat89 (Member) commented Mar 13, 2024

@dkubek hmm... I'm thinking about whether we could just read /proc/<pid>/comm to check that we find leapp there, or whether there could be some problem with that.
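For illustration, a minimal sketch of that check; the helper name, and the assumption that leapp's comm value is exactly `leapp`, are mine rather than anything from this PR:

```python
import os


def is_leapp_process(pid):
    """Best-effort check whether `pid` belongs to a leapp process.

    Reads /proc/<pid>/comm; if the process is gone, the lockfile is stale.
    Note the kernel truncates comm to 15 characters.
    """
    try:
        with open('/proc/{}/comm'.format(pid)) as f:
            comm = f.read().strip()
    except (IOError, OSError):
        return False  # no such process (or unreadable) -> treat as stale
    return comm == 'leapp'  # assumption: leapp's comm value is 'leapp'
```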

@abadger (Member) commented Mar 13, 2024

NOTE: I have some knowledge of locking but have not looked at this PR yet. I'm just responding to one part of this comment.

> When we reboot and run leapp in the initramfs, some other process with the same PID might already be running (which we don't detect), and we would fail.

We ought to be able to determine whether the lockfile was created before the last reboot. As long as this lockfile's purpose is only to ensure leapp is not running twice, we can use the fact that there has been a reboot since the lockfile's creation to decide that the lockfile is stale.

Maybe compare the lockfile's ctime with either the output of running `last` or the boot time calculated from the values in /proc/uptime?
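A rough sketch of that staleness check via /proc/uptime (the helper name and lockfile path are illustrative, not from the PR):

```python
import os
import time


def lockfile_predates_boot(lockfile_path):
    """Return True if the lockfile was created before the last reboot."""
    with open('/proc/uptime') as f:
        uptime_seconds = float(f.read().split()[0])
    boot_time = time.time() - uptime_seconds
    # ctime is the inode change time; good enough to detect a pre-reboot file
    return os.stat(lockfile_path).st_ctime < boot_time
```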

@vinzenz (Member) left a comment:

You have to rethink the lock cleanup design here.

[Review thread on leapp/config.py — resolved]
@vinzenz (Member) commented Mar 13, 2024

I am not really sure right now if it would work for leapp, but did you consider opening the SQLite db exclusively?

@vinzenz (Member) commented Mar 13, 2024

> I am not really sure right now if it would work for leapp, but did you consider opening the SQLite db exclusively?

Or rather a SQLite db that will tell you that you can't use it 😁
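A hedged sketch of what "opening the SQLite db exclusively" could look like with the stdlib sqlite3 module; the db path is illustrative, and this is not code from the PR:

```python
import sqlite3

conn = sqlite3.connect('/var/lib/leapp/leapp.db', timeout=0)
conn.execute('PRAGMA locking_mode=EXCLUSIVE')
try:
    # BEGIN EXCLUSIVE takes the write lock immediately; with timeout=0 a
    # second instance fails right away with "database is locked".
    conn.execute('BEGIN EXCLUSIVE')
except sqlite3.OperationalError:
    raise SystemExit('Another leapp instance appears to be running.')
```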

[Review threads on leapp/utils/lock.py — outdated, resolved]
@abadger (Member) commented Mar 15, 2024

I've looked over the PR now and have a few higher-level questions... In the past, I've implemented locks using filesystem atomicity (the key filesystem operation here is link(), which fails if the target file already exists). That is robust and relatively intuitive (compared to fcntl and flock) but has one problem: there is an unavoidable race condition when dealing with stale lockfiles. I have usually forced the user to clean up stale lockfiles for this reason.

I see in this PR that although you are creating a lockfile and adding the PID to it (things which are needed for a filesystem-driven lockfile approach), you're using fcntl() (one of the POSIX APIs) for locking. However, you aren't using it to determine directly whether another instance of LEAPP is already running; you're just using it to protect the lockfile while determining whether the lockfile is currently allocated by another process, to record our ownership of it by writing our PID to it (these two needs are what link() addresses in the filesystem-atomicity case), and to ensure that no one else takes the lock while we're evaluating whether the lockfile is stale.

What I'm wondering is what problems we are trying to avoid by doing that instead of relying on the lock for the length of the process. The traditional problem with fcntl() is that the lock is associated with the process rather than the specific file descriptor. So you can't synchronize resources across threads of a single process (the lock is process-wide), and when the process closes any open file descriptor for that inode, the lock is released (even if it is a different file descriptor than the one the lock was taken against). Since we're only using the lock at the top level of the LEAPP cli, I don't think either of these applies. We can simply open the file descriptor in the context manager, successfully take the lock on it, then hold onto that file descriptor until the context manager exits. That should eliminate the problem of having a stale lockfile, as the contents of the file won't matter any longer.

Let me know if I'm missing something.

[*] The other problem with fcntl() locking is that it may or may not work on NFS, depending on how all the NFS clients are configured (flock() is even worse in this regard). I kind of think this is not a problem here, though. Is the lockfile ever going to be present on an NFS share, or would that cause other problems for the upgrade?
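For contrast, a bare-bones sketch of the link()-based approach described above; the path and function name are illustrative, and the race-prone stale-lockfile cleanup is deliberately omitted:

```python
import errno
import os

LOCKFILE = '/var/run/leapp.pid'  # illustrative path


def take_lock():
    # Write our PID to a unique temp file, then atomically link() it into
    # place; link(2) fails with EEXIST if the lockfile already exists.
    tmp = '{}.{}'.format(LOCKFILE, os.getpid())
    with open(tmp, 'w') as f:
        f.write(str(os.getpid()))
    try:
        os.link(tmp, LOCKFILE)
    except OSError as e:
        if e.errno == errno.EEXIST:
            raise SystemExit('leapp seems to be running already')
        raise
    finally:
        os.unlink(tmp)
```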

[Review thread on leapp/utils/lock.py — outdated, resolved]
@dkubek (Member, Author) commented Mar 20, 2024

@abadger Thanks for the awesome comment!

DISCLAIMER: I am very new to this topic, and it might soon become obvious.

> I see in this PR that although you are creating a lockfile and adding the PID to it (things which are needed for a filesystem-driven lockfile approach), you're using fcntl() (one of the POSIX APIs) for locking. However, you aren't using it to determine directly whether another instance of LEAPP is already running; you're just using it to protect the lockfile while determining whether the lockfile is currently allocated by another process, to record our ownership of it by writing our PID to it (these two needs are what link() addresses in the filesystem-atomicity case), and to ensure that no one else takes the lock while we're evaluating whether the lockfile is stale.

My original idea was exactly something like you are describing with the link() call: basically, just create a file, write the PID inside, and check if it exists. Instead of just using a plain Python open(), I tried to introduce atomicity. A quick Google search suggested using filesystem locks, and I then basically emulated the link() approach (which I was not aware of) using flock(), which resulted in the current implementation.

So correct me if I'm wrong. What you are suggesting as a better approach is to acquire a filesystem lock, keep it for the whole duration of the process, and release it when leapp ends. Am I correct? This would solve all the problems with stale lockfiles.

@abadger (Member) commented Mar 20, 2024 via email

@dkubek (Member, Author) commented Mar 21, 2024

Thanks everybody for the awesome feedback! These last changes should fix all of the problems noted above.

Firstly, as per @abadger's awesome suggestion, we no longer rely on the PID in the lockfile, but instead on the BSD lock itself, which we keep for the duration of the execution. The lockfile has been moved to /var/run/leapp.pid (thanks @vinzenz!) and stores the PID purely for informational purposes (so we can inform the user which process is blocking). This solves the issues around stale lockfiles. The issue raised by @fernflower should also be obsoleted by this.
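A minimal sketch of this final design, assuming the lockfile path from the comment above; it illustrates the approach rather than reproducing the PR's actual code:

```python
import fcntl
import os
from contextlib import contextmanager

LOCKFILE = '/var/run/leapp.pid'


@contextmanager
def leapp_lock():
    fd = os.open(LOCKFILE, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # BSD lock, non-blocking: fails immediately if another leapp holds it.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        other_pid = os.read(fd, 32).decode().strip() or 'unknown'
        os.close(fd)
        raise SystemExit('leapp is already running (PID {})'.format(other_pid))
    # The PID is recorded purely for the error message above.
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())
    try:
        yield
    finally:
        os.close(fd)  # closing the fd releases the flock
```

The key point is that the file descriptor stays open for the whole run, so even a kill -9 releases the lock automatically and there is never a stale lock to clean up.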

@abadger (Member) left a comment:

Code reviewed. I didn't find anything major, just some adjustments that I think would make the code cleaner.

[Review threads on leapp/exceptions.py and leapp/utils/lock.py — resolved]
Commit message:

This commit addresses the potential risk of running multiple instances of Leapp simultaneously on a single system. It implements a simple lock mechanism to prevent concurrent executions, using a BSD lock (`flock(2)`).

The lock is acquired at the start of the execution and the PID is stored in the lockfile. The PID in the lockfile is currently purely informational.
@abadger (Member) left a comment:

I think all the issues have been resolved, so I'm approving this. There is the open question of whether an incorrectly formatted lockfile should warn or error (currently it errors), but I don't feel strongly one way or the other, and it seems like Petr can see both sides of it as well. I'm okay with leaving that as it is and revisiting it in the future if it becomes a problem.

@pirat89 (Member) commented Mar 29, 2024

/packit build

@pirat89 pirat89 merged commit b64c44b into oamg:master May 9, 2024
20 checks passed
@pirat89 pirat89 added this to the 8.10/9.5 milestone May 9, 2024
@pirat89 pirat89 added the changelog-checked and enhancement labels May 9, 2024
pirat89 added a commit to pirat89/leapp that referenced this pull request Aug 16, 2024
## Packaging
- Start building for EL 9 in the upstream repository on COPR (oamg#855)

## Framework
### Enhancements
- Minor update in the summary overview to highlight what is present in the pre-upgrade report (oamg#858)
- Store metadata about actors, workflows, and dialogs inside leapp audit db (oamg#847, oamg#867)

## Leapp (tool)
### Enhancements
- Implement singleton leapp execution to prevent multiple leapp instances running on the system at the same time (oamg#851)

## stdlib
### Fixes
- Close properly all file descriptors when executing shell commands via `run` (oamg#880)

## Modifications
- Code is now Python 3.12 compatible (oamg#855)
@pirat89 pirat89 mentioned this pull request Aug 16, 2024
pirat89 added a commit that referenced this pull request Aug 16, 2024 (with the same changelog as above)
Labels: changelog-checked, enhancement
Projects: none
6 participants