
Add process locking #851

Merged · pirat89 merged 1 commit from the lock branch into oamg:master on May 9, 2024
Conversation

@dkubek (Member) commented Feb 22, 2024

This change addresses the potential risk of running multiple instances of Leapp simultaneously on a single system.

Considerations:

  • Insights Tasks (preupgrade and upgrade) currently allow multiple executions on the same system.
  • The absence of a filter/tooltip in the Insights user interface makes it challenging to determine whether a task is already running on a system.

A simple lock mechanism using a BSD lock has been implemented to prevent concurrent executions (see 1).

JIRA: https://issues.redhat.com/browse/OAMG-9827

@dkubek dkubek changed the title Add process locking to [WIP] Add process locking on Feb 22, 2024

Thank you for contributing to the Leapp project!

Please note that every PR needs to comply with the Leapp Guidelines and must pass all tests in order to be mergeable.
If you want to request a review or rebuild a package in copr, you can use the following commands as a comment:

  • review please @oamg/developers to notify leapp developers of the review request
  • /packit copr-build to submit a public copr build using packit

To launch regression testing, public members of the oamg organization can leave the following comments:

  • /rerun to schedule basic regression tests using this pr build and leapp-repository `master` as artifacts
  • /rerun 42 to schedule basic regression tests using this pr build and leapp-repository `PR42` as artifacts
  • /rerun-sst to schedule sst tests using this pr build and leapp-repository `master` as artifacts
  • /rerun-sst 42 to schedule sst tests using this pr build and leapp-repository `PR42` as artifacts

Please open a ticket in case you experience a technical problem with the CI (RH internal only).

Note: In case there are problems with tests not being triggered automatically on a new PR/commit, or tests pending for a long time, please consider rerunning the CI by commenting `leapp-ci build` (might require several comments). If the problem persists, contact leapp-infra.

@dkubek dkubek added the wip label Feb 22, 2024
@dkubek dkubek force-pushed the lock branch 2 times, most recently from 5a78eb1 to ac48f22, on March 1, 2024 at 11:22
@dkubek dkubek changed the title [WIP] Add process locking Add process locking Mar 1, 2024
@dkubek dkubek removed the wip label Mar 1, 2024
@fernflower (Member) commented:

(not a real review yet) Could you please rebase? The tests should be fixed by this

[Review thread on leapp/utils/lock.py — outdated, resolved]
@dkubek (Member, Author) commented Mar 13, 2024

@oamg/developers There is currently one point of failure I want to discuss, though I am not sure how relevant it is. It is possible that, with this change, leapp can fail under some extreme circumstances. The only one I can think of right now: leapp is killed (kill -9) after finishing the upgrade process but before releasing the lock file, leaving e.g. PID 1234 recorded in it. When we reboot and run leapp in the initramfs, some other process with the same PID might already be running (which we don't detect), and we would fail. This in itself is not very likely, as PID numbers will generally be quite high, while in the initramfs they will all be low.

Is it worth considering something like this? In such a case we could implement a flag to instruct leapp to ignore locking in the initramfs.

@pirat89 (Member) commented Mar 13, 2024

@dkubek hmm... I'm thinking about whether we could just read /proc/<pid>/comm to check that we find leapp there, or whether there could be some problem with that.
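For illustration, a minimal sketch of that check; the helper name, and the assumption that leapp's comm value is exactly `leapp`, are mine rather than anything from this PR:

```python
import os


def is_leapp_process(pid):
    """Best-effort check whether `pid` belongs to a leapp process.

    Reads /proc/<pid>/comm; if the process is gone, the lockfile is stale.
    Note the kernel truncates comm to 15 characters.
    """
    try:
        with open('/proc/{}/comm'.format(pid)) as f:
            comm = f.read().strip()
    except (IOError, OSError):
        return False  # no such process (or unreadable) -> treat as stale
    return comm == 'leapp'  # assumption: leapp's comm value is 'leapp'
```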

@abadger (Member) commented Mar 13, 2024

NOTE: I have some knowledge of locking but have not looked at this PR yet. I'm just responding to one part of this comment.

> When we reboot and run leapp in the initramfs, some other process with the same PID might already be running (which we don't detect), and we would fail.

We ought to be able to determine whether the lockfile was created before the last reboot. As long as this lockfile's purpose is only to ensure leapp is not running twice, we can use the fact that there has been a reboot since the lockfile's creation to decide that the lockfile is stale.

Maybe compare the lockfile's ctime with either the output of running `last` or the boot time calculated from the values in /proc/uptime?
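A rough sketch of that staleness check via /proc/uptime (the helper name and lockfile path are illustrative, not from the PR):

```python
import os
import time


def lockfile_predates_boot(lockfile_path):
    """Return True if the lockfile was created before the last reboot."""
    with open('/proc/uptime') as f:
        uptime_seconds = float(f.read().split()[0])
    boot_time = time.time() - uptime_seconds
    # ctime is the inode change time; good enough to detect a pre-reboot file
    return os.stat(lockfile_path).st_ctime < boot_time
```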

@vinzenz (Member) left a comment:

You have to rethink the lock cleanup design here.

[Review thread on leapp/config.py — resolved]
@vinzenz (Member) commented Mar 13, 2024

I am not really sure right now if it would work for leapp, but did you consider opening the SQLite db exclusively?

@vinzenz (Member) commented Mar 13, 2024

> I am not really sure right now if it would work for leapp, but did you consider opening the SQLite db exclusively?

Or rather a SQLite db that will tell you that you can't use it 😁
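A hedged sketch of what "opening the SQLite db exclusively" could look like with the stdlib sqlite3 module; the db path is illustrative, and this is not code from the PR:

```python
import sqlite3

conn = sqlite3.connect('/var/lib/leapp/leapp.db', timeout=0)
conn.execute('PRAGMA locking_mode=EXCLUSIVE')
try:
    # BEGIN EXCLUSIVE takes the write lock immediately; with timeout=0 a
    # second instance fails right away with "database is locked".
    conn.execute('BEGIN EXCLUSIVE')
except sqlite3.OperationalError:
    raise SystemExit('Another leapp instance appears to be running.')
```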

[Review threads on leapp/utils/lock.py — outdated, resolved]
@abadger (Member) commented Mar 15, 2024

I've looked over the PR now and have a few higher-level questions... In the past, I've implemented locks using filesystem atomicity (the key filesystem operation here is link(), which fails if the target file already exists). That is robust and relatively intuitive (compared to fcntl and flock) but has one problem: there is an unavoidable race condition when dealing with stale lockfiles. I have usually forced the user to clean up stale lockfiles for this reason.

I see in this PR that although you are creating a lockfile and adding the PID to it (things which are needed for a filesystem-driven lockfile approach), you're using fcntl() (one of the POSIX APIs) for locking. However, you aren't using it to determine directly whether another instance of LEAPP is already running; you're just using it to protect the lockfile while determining whether the lockfile is currently allocated by another process, to record our ownership of it by writing our PID to it (these two needs are what link() addresses in the filesystem-atomicity case), and to ensure that no one else takes the lock while we're evaluating whether the lockfile is stale.

What I'm wondering is what problems we are trying to avoid by doing that instead of relying on the lock for the length of the process. The traditional problem with fcntl() is that the lock is associated with the process rather than the specific file descriptor. So you can't synchronize resources across threads of a single process (the lock is process-wide), and when the process closes any open file descriptor for that inode, the lock is released (even if it is a different file descriptor than the one the lock was taken against). Since we're only using the lock at the top level of the LEAPP cli, I don't think either of these applies. We can simply open the file descriptor in the context manager, successfully take the lock on it, then hold onto that file descriptor until the context manager exits. That should eliminate the problem of having a stale lockfile, as the contents of the file won't matter any longer.

Let me know if I'm missing something.

[*] The other problem with fcntl() locking is that it may or may not work on NFS, depending on how all the NFS clients are configured (flock() is even worse in this regard). I kind of think this is not a problem here, though. Is the lockfile ever going to be present on an NFS share, or would that cause other problems for the upgrade?
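For contrast, a bare-bones sketch of the link()-based approach described above; the path and function name are illustrative, and the race-prone stale-lockfile cleanup is deliberately omitted:

```python
import errno
import os

LOCKFILE = '/var/run/leapp.pid'  # illustrative path


def take_lock():
    # Write our PID to a unique temp file, then atomically link() it into
    # place; link(2) fails with EEXIST if the lockfile already exists.
    tmp = '{}.{}'.format(LOCKFILE, os.getpid())
    with open(tmp, 'w') as f:
        f.write(str(os.getpid()))
    try:
        os.link(tmp, LOCKFILE)
    except OSError as e:
        if e.errno == errno.EEXIST:
            raise SystemExit('leapp seems to be running already')
        raise
    finally:
        os.unlink(tmp)
```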

[Review thread on leapp/utils/lock.py — outdated, resolved]
@dkubek (Member, Author) commented Mar 20, 2024

@abadger Thanks for the awesome comment!

DISCLAIMER: I am very new to this topic, and it might soon become obvious.

> I see in this PR that although you are creating a lockfile and adding the PID to it (things which are needed for a filesystem-driven lockfile approach), you're using fcntl() (one of the POSIX APIs) for locking. However, you aren't using it to determine directly whether another instance of LEAPP is already running; you're just using it to protect the lockfile while determining whether the lockfile is currently allocated by another process, to record our ownership of it by writing our PID to it (these two needs are what link() addresses in the filesystem-atomicity case), and to ensure that no one else takes the lock while we're evaluating whether the lockfile is stale.

My original idea was exactly something like you are describing with the link() call: basically, just create a file, write the PID inside, and check if it exists. Instead of just using a plain Python open(), I tried to introduce atomicity. A quick Google search suggested using filesystem locks, and I then basically emulated the link() approach (which I was not aware of) using flock(), which resulted in the current implementation.

So correct me if I'm wrong. What you are suggesting as a better approach is to acquire a filesystem lock, keep it for the whole duration of the process, and release it when leapp ends. Am I correct? This would solve all the problems with stale lockfiles.

@abadger (Member) commented Mar 20, 2024 via email

@dkubek (Member, Author) commented Mar 21, 2024

Thanks everybody for the awesome feedback! These last changes should fix all of the problems noted above.

Firstly, as per @abadger's awesome suggestion, we no longer rely on the PID in the lockfile, but instead on the BSD lock itself, which we keep for the duration of the execution. The lockfile has been moved to /var/run/leapp.pid (thanks @vinzenz!) and stores the PID purely for informational purposes (so we can inform the user which process is blocking). This solves the issues around stale lockfiles. The issue raised by @fernflower should also be obsoleted by this.
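A minimal sketch of this final design, assuming the lockfile path from the comment above; it illustrates the approach rather than reproducing the PR's actual code:

```python
import fcntl
import os
from contextlib import contextmanager

LOCKFILE = '/var/run/leapp.pid'


@contextmanager
def leapp_lock():
    fd = os.open(LOCKFILE, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # BSD lock, non-blocking: fails immediately if another leapp holds it.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        other_pid = os.read(fd, 32).decode().strip() or 'unknown'
        os.close(fd)
        raise SystemExit('leapp is already running (PID {})'.format(other_pid))
    # The PID is recorded purely for the error message above.
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())
    try:
        yield
    finally:
        os.close(fd)  # closing the fd releases the flock
```

The key point is that the file descriptor stays open for the whole run, so even a kill -9 releases the lock automatically and there is never a stale lock to clean up.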

@abadger (Member) left a comment:

Code reviewed. I didn't find anything major, just some adjustments that I think would make the code cleaner.

[Review threads on leapp/exceptions.py and leapp/utils/lock.py — resolved]
Commit message:

This commit addresses the potential risk of running multiple instances of Leapp simultaneously on a single system. It implements a simple lock mechanism to prevent concurrent executions, using a BSD lock (`flock(2)`).

The lock is acquired at the start of the execution and the PID is stored in the lockfile. The PID in the lockfile is currently purely informational.
@abadger (Member) left a comment:

I think all the issues have been resolved, so I'm approving this. There is the open question of whether an incorrectly formatted lockfile should warn or error (currently it errors), but I don't feel strongly one way or the other, and it seems like Petr can see both sides of it as well. I'm okay with leaving that as it is and revisiting it in the future if it becomes a problem.

@pirat89 (Member) commented Mar 29, 2024

/packit build

@pirat89 pirat89 merged commit b64c44b into oamg:master May 9, 2024
20 checks passed
@pirat89 pirat89 added this to the 8.10/9.5 milestone May 9, 2024
@pirat89 pirat89 added the changelog-checked and enhancement labels May 9, 2024
pirat89 added a commit to pirat89/leapp that referenced this pull request Aug 16, 2024
## Packaging
- Start building for EL 9 in the upstream repository on COPR (oamg#855)

## Framework
### Enhancements
- Minor update in the summary overview to highlight what is present in the pre-upgrade report (oamg#858)
- Store metadata about actors, workflows, and dialogs inside leapp audit db (oamg#847, oamg#867)

## Leapp (tool)
### Enhancements
- Implement singleton leapp execution to prevent multiple leapp instances running on the system at the same time (oamg#851)

## stdlib
### Fixes
- Close properly all file descriptors when executing shell commands via `run` (oamg#880)

## Modifications
- Code is now Python 3.12 compatible (oamg#855)
@pirat89 pirat89 mentioned this pull request Aug 16, 2024
pirat89 added a commit that referenced this pull request Aug 16, 2024 (with the same changelog as above)
Labels: changelog-checked, enhancement
Projects: none
6 participants