fix(turborepo): Size comes from the thing being read. #6092

Merged Oct 17, 2023 (10 commits)
30 changes: 16 additions & 14 deletions crates/turborepo-scm/src/manual.rs
@@ -1,4 +1,4 @@
-use std::{fs::Metadata, io::Read};
+use std::io::{ErrorKind, Read};

use globwalk::fix_glob_pattern;
use hex::ToHex;
@@ -9,13 +9,13 @@ use wax::{any, Glob, Pattern};

use crate::{package_deps::GitHashes, Error};

-fn git_like_hash_file(path: &AbsoluteSystemPath, metadata: &Metadata) -> Result<String, Error> {
+fn git_like_hash_file(path: &AbsoluteSystemPath) -> Result<String, Error> {
let mut hasher = Sha1::new();
let mut f = path.open()?;
let mut buffer = Vec::new();
-f.read_to_end(&mut buffer)?;
+let size = f.read_to_end(&mut buffer)?;
hasher.update("blob ".as_bytes());

Contributor Author

read_to_end already attempts to do this optimally in terms of syscall count; we don't really need to help out.

read_to_end is this:

// Reserves space in the buffer based on the file size when available.
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
    let size = buffer_capacity_required(self);
    buf.reserve(size.unwrap_or(0));
    io::default_read_to_end(self, buf, size)
}

buffer_capacity_required is this:

fn buffer_capacity_required(mut file: &File) -> Option<usize> {
    let size = file.metadata().map(|m| m.len()).ok()?;
    let pos = file.stream_position().ok()?;
    // Don't worry about `usize` overflow because reading will fail regardless
    // in that case.
    Some(size.saturating_sub(pos) as usize)
}
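
Putting the two pieces together, a minimal standalone sketch of the same idea (assuming the sha1 and hex crates, which manual.rs already uses): the blob header size comes from the count returned by read_to_end, and the result should line up with what git hash-object --stdin prints for the same bytes.

use std::io::Read;

use sha1::{Digest, Sha1};

// Hash `reader` the way git hashes a blob, taking the size for the header from
// the number of bytes actually read rather than from a separate metadata call.
fn git_blob_hash(mut reader: impl Read) -> std::io::Result<String> {
    let mut buffer = Vec::new();
    let size = reader.read_to_end(&mut buffer)?;
    let mut hasher = Sha1::new();
    hasher.update("blob ".as_bytes());
    hasher.update(size.to_string().as_bytes());
    hasher.update([b'\0']);
    hasher.update(buffer.as_slice());
    Ok(hex::encode(hasher.finalize()))
}

fn main() -> std::io::Result<()> {
    // Should match `printf 'hello\n' | git hash-object --stdin`.
    println!("{}", git_blob_hash(&b"hello\n"[..])?);
    Ok(())
}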

-hasher.update(metadata.len().to_string().as_bytes());
+hasher.update(size.to_string().as_bytes());
hasher.update([b'\0']);
hasher.update(buffer.as_slice());
let result = hasher.finalize();
@@ -30,17 +30,19 @@ pub(crate) fn hash_files(
let mut hashes = GitHashes::new();
for file in files.into_iter() {
let path = root_path.resolve(file.as_ref());
-let metadata = match path.symlink_metadata() {
-Ok(metadata) => metadata,

Contributor Author (@nathanhammond, Oct 4, 2023)

Point of failure one: lstat. The size we get from a symlink_metadata call is nonsense when the thing being hashed is a symlink.
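
To make the failure concrete, a small Unix-only sketch with hypothetical paths: lstat (symlink_metadata) reports the length of the link itself, i.e. the length of its target path string, not the size of the file behind it.

use std::{fs, os::unix::fs::symlink};

fn main() -> std::io::Result<()> {
    // Hypothetical setup: a 1 MiB file and a symlink pointing at it.
    fs::write("/tmp/big-file", vec![0u8; 1_048_576])?;
    let _ = fs::remove_file("/tmp/link-to-big-file");
    symlink("/tmp/big-file", "/tmp/link-to-big-file")?;

    // lstat reports the link's own size: the length of the target path string
    // ("/tmp/big-file" = 13 bytes), not the 1 MiB behind it.
    let lstat = fs::symlink_metadata("/tmp/link-to-big-file")?;
    assert_eq!(lstat.len(), "/tmp/big-file".len() as u64);

    // stat/open follow the link and see the real contents.
    assert_eq!(fs::metadata("/tmp/link-to-big-file")?.len(), 1_048_576);
    Ok(())
}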

-Err(e) => {
-if allow_missing && e.is_io_error(std::io::ErrorKind::NotFound) {
-continue;
+match git_like_hash_file(&path) {
+Ok(hash) => hashes.insert(file.as_ref().to_unix(), hash),

Contributor Author

So, now we just attempt to hash the file, and let read_to_end do its thing.

We can then investigate the returned error to see if it is ENOENT.

+Err(e) => match e {
+Error::Io(ref io_error, _) => {
+if allow_missing && io_error.kind() == ErrorKind::NotFound {
+continue;
+} else {
+return Err(e);
+}
+}
-return Err(e.into());
-}
+_ => return Err(e),
+},
};
-let hash = git_like_hash_file(&path, &metadata)?;
-hashes.insert(file.as_ref().to_unix(), hash);

Contributor Author (@nathanhammond, Oct 4, 2023)

Point of failure two: fopen. fopen follows symlinks, which means that fopen can fail while symlink_metadata succeeds.
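
A small Unix-only sketch of that gap, using a hypothetical dangling link: symlink_metadata succeeds on the link itself, while File::open follows it to the missing target and fails with NotFound.

use std::{fs, fs::File, io::ErrorKind, os::unix::fs::symlink};

fn main() -> std::io::Result<()> {
    // Hypothetical setup: a symlink whose target does not exist.
    let _ = fs::remove_file("/tmp/dangling-link");
    symlink("/tmp/does-not-exist", "/tmp/dangling-link")?;

    // lstat inspects the link itself, so it succeeds...
    assert!(fs::symlink_metadata("/tmp/dangling-link").is_ok());

    // ...while open() follows the link to its missing target and fails.
    let err = File::open("/tmp/dangling-link").unwrap_err();
    assert_eq!(err.kind(), ErrorKind::NotFound);
    Ok(())
}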

}
Ok(hashes)
}
@@ -109,7 +111,7 @@ pub(crate) fn get_package_file_hashes_from_processing_gitignore<S: AsRef<str>>(
if metadata.is_symlink() {
continue;
}
-let hash = git_like_hash_file(path, &metadata)?;

Comment on lines 113 to 115
Contributor Author

This line of code + follow_links = false means that if the symlink target is outside of the package, or in the package but otherwise ignored, we get a bad hash.

I believe that we should be following symlinks, and if it is a broken symlink, hash it the same way git does (by target).
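
For reference, git records a symlink as a mode-120000 blob whose content is the target path string, so hashing "the way git does" would look roughly like the following (sketch only, assuming the sha1 and hex crates):

use sha1::{Digest, Sha1};

// How git hashes a symlink: a blob whose content is the link's target path
// string, whether or not that target exists.
fn git_symlink_hash(target: &str) -> String {
    let mut hasher = Sha1::new();
    hasher.update("blob ".as_bytes());
    hasher.update(target.len().to_string().as_bytes());
    hasher.update([b'\0']);
    hasher.update(target.as_bytes());
    hex::encode(hasher.finalize())
}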

Contributor Author (@nathanhammond, Oct 4, 2023)

This would also mean that if we ls-tree and have a symlink, we need to replace it with the target hash if the target exists and is a file, and, if the target is a directory, walk it.

What should we do about .gitignore when walking a symlink'd directory's contents? 😅🤷‍♂️

Contributor

Interesting behaviour: I did a little test, and it seems that ls-tree indeed does not expose symlinks. Git does at least give the file mode, so we can manually check for symlinks (120000).

I think in this case, there is an optimisation we could apply with a caveat:

  • Evaluate the symlink, and try to find files with the same path prefix as the destination in the ls-tree output rather than having to walk the fs
  • In that case the 'gitignore stack' would be that of the target rather than of the source
  • If the destination is not in the ls-tree output (i.e. outside the repo), we would either need to warn and ignore, or error explicitly
  • With the target being outside the repo, it would not have a 'gitignore stack', so we would have to use the stack of the source

These two behaviours conflict, but I don't think we should support symlinks outside the repo root anyway, so perhaps erroring is sufficient.
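
A rough sketch of the mode check mentioned above, with illustrative ls-tree entries (the helper and object ids are made up; the field layout is git's documented "<mode> <type> <object>\t<path>"):

// Given one `git ls-tree` entry ("<mode> <type> <object>\t<path>"), report
// whether it describes a symlink (mode 120000).
fn is_symlink_entry(line: &str) -> bool {
    line.split_whitespace().next() == Some("120000")
}

fn main() {
    // Hypothetical entries; the object ids are placeholders.
    let file = "100644 blob aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\tREADME.md";
    let link = "120000 blob bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\tdocs/link";
    assert!(!is_symlink_entry(file));
    assert!(is_symlink_entry(link));
}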

Contributor (@gsoltis, Oct 4, 2023)


The intention with the change was to hash symlink files, not their targets. There's been a series of differently-incorrect behaviors here:

  • As @arlyon notes, ls-tree does list a hash for the symlink. So without inputs, they are wholesale ignored.
  • When using inputs, and at least since switching to the encoder setup in Go, symlinks caused an error and we fell back to manual hashing. It's very possible that was the behavior before as well; I haven't checked.
  • manual hashing in Go calls os.Open, which reads the target.
  • Rust ls-tree has the same restriction as the Go implementation, it ignores symlinks
  • Rust git-based hashing, with inputs, was intended to switch to hashing the contents of the symlink, but there are several problems:
    1. We eventually get to fopen as @nathanhammond notes. This follows the link. The Go manual file hashing implementation has the same behavior. At least for the Rust implementation, that was not the intention, and I would consider it a bug at least insofar as it doesn't do what was intended.
    2. As noted in the Slack thread, the raw contents of the link are not portable, so probably not what we want to hash
    3. We currently have checks in place to skip over hashing symlinks only when using manual hashing

What I think should happen is the following:

  • Short term: update all paths through hashing to match. Probably ignore symlinks for now. The link target should be required to be covered by inputs anyway. This leaves the existing bug of not hashing symlinks in place, but makes us consistent about it.
  • Medium term: Hash a portable representation of the contents of the symlink, not the target. This saves us from having to care about file vs directory, broken vs working, and in-repo vs not, which in the presence of inputs is not even necessarily a clear distinction (do we follow in-repo, but out-of-inputs links?). Users will be required to ensure that the target is covered by their inputs if they want the target hashed. This mirrors our behavior for caching: we don't cache the target of symlinks. If you want it cached, ensure that the target is also covered by the outputs globs. (A rough sketch of one possible portable representation follows this list.)
  • Longer term: determine how to handle symlinks when inputs is not specified.
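
One possible reading of the "portable representation" idea, purely illustrative and not what this PR implements: hash the link's own target string after normalizing separators, so the digest does not depend on the platform or on whether the target exists (the "link " header and the normalization rule here are made up for the sketch).

use std::path::Path;

use sha1::{Digest, Sha1};

// Illustrative only: hash the link itself (not whatever it points at), using a
// separator-normalized copy of its target string so Unix and Windows agree.
fn hash_symlink_portably(link: &Path) -> std::io::Result<String> {
    let target = std::fs::read_link(link)?;
    // Normalize `\` to `/` so the representation does not depend on the OS.
    let portable = target.to_string_lossy().replace('\\', "/");
    let mut hasher = Sha1::new();
    hasher.update("link ".as_bytes());
    hasher.update(portable.len().to_string().as_bytes());
    hasher.update([b'\0']);
    hasher.update(portable.as_bytes());
    Ok(hex::encode(hasher.finalize()))
}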

+let hash = git_like_hash_file(path)?;
hashes.insert(relative_path, hash);
}
Ok(hashes)