Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make fs::remove_dir_all race-resistant #170

Closed
pitaj opened this issue Jan 30, 2023 · 10 comments
Closed

Make fs::remove_dir_all race-resistant #170

pitaj opened this issue Jan 30, 2023 · 10 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@pitaj
Copy link

pitaj commented Jan 30, 2023

Proposal

Problem statement

Currently, fs::remove_dir_all will fail if the directory in question does not exist. It will also fail if any descendant fails deletion for any reason, even if that's because it doesn't exist.

This means the API is not idempotent, and can fail when multiple threads or processes call it concurrently.

Motivation, use-cases

The current documentation for this function does not directly state the above behavior, instead simply punting to the docs for fs::remove_file and fs::remove_dir.

The current behavior is also in stark contrast to fs::create_dir_all, which has been race-resistant since 2017:

This function will return an error in the following situations, but is not limited to just these cases:

  • If any directory in the path specified by path does not already exist and it could not be created otherwise. The specific error conditions for when a directory is being created (after it is determined to not exist) are outlined by fs::create_dir.

Notable exception is made for situations where any of the directories specified in the path could not be created as it was being created concurrently. Such cases are considered to be successful. That is, calling create_dir_all concurrently from multiple threads or processes is guaranteed not to fail due to a race condition with itself.

Solution sketches

Change the remove_dir_all algorithm to work similarly to create_dir_all:

  • If the directory at path does not exist, immediately exit Ok(()).
  • Accept ErrorKind::NotFound as if the deletion succeeded.
  • Any other errors from read_dir and remove_dir are immediately returned.
pub fn remove_dir_all(path: &Path) -> io::Result<()> {
    let filetype = match fs::symlink_metadata(path) {
        Ok(x) => x.file_type(),
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
        Err(e) => return Err(e),
    };
    if filetype.is_symlink() {
        match fs::remove_file(path) {
            Ok(_) => Ok(()),
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
            Err(e) => Err(e),
        }
    } else { remove_dir_all_recursive(path) }
}

fn remove_dir_all_recursive(path: &Path) -> io::Result<()> {
    let dir = match fs::read_dir(path) {
        Ok(x) => x,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
        Err(e) => return Err(e),
    };
    for child in dir {
        let child = match child {
            Ok(x) => x,
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        };
        let file_type = match child.file_type() {
            Ok(x) => x,
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        };
        if file_type.is_dir() {
            remove_dir_all_recursive(&child.path())?;
        } else {
            match fs::remove_file(&child.path())  {
                Ok(_) => (),
                Err(e) if e.kind() == io::ErrorKind::NotFound => (),
                Err(e) => return Err(e),
            }
        }
    }
    match fs::remove_dir(path) {
        Ok(_) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}

Links and related work

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

@pitaj pitaj added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Jan 30, 2023
@pitaj pitaj changed the title ACP: Make fs::remove_dir_all race-resistant Make fs::remove_dir_all race-resistant Jan 30, 2023
@the8472
Copy link
Member

the8472 commented Jan 30, 2023

On windows there can be an error if multiple processes have a handle open for a directory object in the tree that's to be deleted and that can't be ignored because it means the directory may not have been deleted.
Which means if two threads are racing to delete a directory tree they'll end stepping on each other's toes.

@ChrisDenton
Copy link
Member

I don't think that affects the suggested solution? Returning Ok(()) on ErrorKind::NotFound can be done well enough.

And on modern versions of Windows, with POSIX delete semantics, any stepping on toes can be mitigated to an extent unless there's an actual lock involved.

@pitaj
Copy link
Author

pitaj commented Jan 30, 2023

It's worth noting that this could be considered a breaking change: any code that relies on this returning an error when the path does not exist could break.

I'm not sure what such code would look like, but if the risk level is deemed to be too high, removing the first match statement in pub fn remove_dir_all above would suffice to prevent that change in behavior, while remaining race-resistant.

However, that alternate function would not be idempotent. And I'd still consider that behavior to be inconsistent with create_dir_all and unexpected, given the behavior of rm -rf.

@cuviper
Copy link
Member

cuviper commented Jan 30, 2023

I do think the top dir should report when it's missing, just like the single remove_dir or remove_file would, but it seems okay to ignore inner-traversed items that go missing.

@the8472
Copy link
Member

the8472 commented Jan 30, 2023

I don't think that affects the suggested solution? Returning Ok(()) on ErrorKind::NotFound can be done well enough.

In those cases you wouldn't get NotFound though, you'd get whatever the "handle in use" error is on windows, no?

This is an ACP, so I assume it's about providing the guarantee of concurrent idempotence. Which we can't if races can lead to cross-thread interference.

@pitaj
Copy link
Author

pitaj commented Jan 30, 2023

It's an ACP because it changes the observed runtime behavior of a stable API. Providing some guarantees would be nice, but that is not a requirement as far as I'm concerned.

@the8472
Copy link
Member

the8472 commented Jan 30, 2023

Well, the problem statement of this ACP is

This means the API is not idempotent, and can fail when multiple threads or processes call it concurrently.

To make it usable for that scenario we would have to provide a guarantee, no? Otherwise we'd be sending users to their probabilistic doom.

@pitaj
Copy link
Author

pitaj commented Jan 30, 2023

We could guarantee the behavior only on some OSs.

But regardless, the proposed behavior is, IMO, strictly an improvement over the current behavior.

@the8472
Copy link
Member

the8472 commented Jan 30, 2023

Another issue, rust-lang/rust#95925 would change the unix impl to be iterative instead of recursive and wants to open .. on directory fds. That would have a similar issue of threads stepping on each others toes because an unlinked directory file descriptor doesn't have a valid .. anymore and you'll get a ENOENT that can't be easily ignored because that parent is needed for ascending out of the directory tree.

@the8472
Copy link
Member

the8472 commented Jan 30, 2023

I do think the top dir should report when it's missing, just like the single remove_dir or remove_file would, but it seems okay to ignore inner-traversed items that go missing.

Yeah, if it were only that then imo it wouldn't even need an ACP, it could be considered a QoI thing.

But I'm not sure if "breaks under current use 1/10 times" to "breaks 1/100 times" (numbers made up) is really an improvement, it might just delay the errors until you're on production.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

4 participants