Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added archive validation for all available algorithms in hashlib. #8851

Merged
merged 15 commits into from
Jan 20, 2024
Merged

Added archive validation for all available algorithms in hashlib. #8851

merged 15 commits into from
Jan 20, 2024

Conversation

GMouzourou
Copy link
Contributor

@GMouzourou GMouzourou commented Jan 5, 2024

Resolves: #4578

May also contribute to #7881 , as it's unclear if the known hashes being checked are all sha256.

  • Added tests for changed code.
  • Updated documentation for changed code.

@radoering
Copy link
Member

Some thoughts:

  • Regardless of the changes in this PR, it looks like we are checking that the hash of an archive matches the hash of any known archive. Actually, we should check that it matches the hash of the correct archive since we know the name of the archive, shouldn't we?
  • With this PR we calculate multiple hashes if we have several files with different hash types. What's the performance impact? (This would be a non-issue if we'd fix the above point first.)

@dimbleby
Copy link
Contributor

dimbleby commented Jan 5, 2024

This certainly has nothing to do with #7778

@GMouzourou
Copy link
Contributor Author

GMouzourou commented Jan 5, 2024

Hi @radoering,

When calculating the known hashes, we filter for if f["file"] == archive.name. So we only check hashes for the file given.

What would impact performance is if we had multiple hashes types for the same given file (i.e. sha256, sha512, ...). This can be seen on the line were we generate archive_hashes. This could be changed to just picking the first hash type from hash_types if required.

@radoering
Copy link
Member

Thanks for the hint, I overlooked this. So in theory, the same archive could appear several times (with different hash types) in the list. (I'm not sure whether this can happen in practice but let's assume it can.)

This could be changed to just picking the first hash type from hash_types if required.

I don't think picking the first hash type is good enough. We do not want to rely on md5 if there are more secure hashes. I suppose, we have to put a list of preferred hash types somewhere (maybe in poetry.utils.helpers). Then, we can use the first hash type in the list that is available and just fallback to a "random" choice if no hash type is in our list.

I assume normally there shouldn't be other hashes than these in the lockfile anyway so this might all be a theoretical issue after all:

if not link.hash or (
link.hash_name is not None
and link.hash_name not in ("sha256", "sha384", "sha512")
and hasattr(hashlib, link.hash_name)
):
file_hash = self.calculate_sha256(link) or file_hash

@GMouzourou
Copy link
Contributor Author

I've created a get_highest_priority_hash_type function in poetry.utils.helpers as suggested. So now only one hash algorithm is used per archive (the most cryptographically secure). If you have issue with the priority of the algorithms, let me know the priority you'd like and I'll update the list.

I've also added tests for multiple hash types of the same file, multiple files in the same package, and an unsupported hash type.

@radoering
Copy link
Member

I think the approach makes sense in general. We just have to clarify some details.

I don't really like the repetetiveness of get_highest_priority_hash_type(). Further, you might get the log message that the hash is not in the priority list if it is in the list but not in hashlib.algorithms_available. I would probably rewrite it like this:

def get_preferred_available_hash_type(hash_types: set[str]) -> str | None:
    if not hash_types:
        return None

    # ordered by preference (most preferred first)
    preferred_hashes = ('sha512', 'sha256', ...)
    for chosen in preferred_hashes:
        if chosen in hash_types:
            if chosen in hashlib.algorithms_available:
                return chosen
            logger.debug("Hash type %s not available in hashlib", chosen)

    # do not consider hashes that are not available in hashlib again
    hash_types = hash_types - set(preferred_hashes)

    if hash_types:
        logger.debug("No hash type of %s in preferred hashes list", hash_types)
        # choose any available hash type
        for chosen in hash_types:
            if chosen in hashlib.algorithms_available:
                return chosen
            logger.debug("Hash type %s not available in hashlib", chosen)

    return None

We definitively want a parameterized unit test for this function that checks the special cases (regardless of the exact implementation).

the most cryptographically secure

That's probably fine. I don't expect performance to be an issue but if it was that would also be a factor to consider...

If you have issue with the priority of the algorithms, let me know the priority you'd like and I'll update the list.

LGTM but I'm not an expert. I suppose we can change it later if someone with more expertise comes along. Just one thing: I'd probably omit md5 (and maybe sha1?) because they are known to be insecure. If we omit them from the preferred/priority list, we still have a chance to choose a better algorithm, which is unknown to us. Not something you should rely on but anyway...

@GMouzourou
Copy link
Contributor Author

GMouzourou commented Jan 6, 2024

I didn't particularly like that big block of if statements either. I did originally write it as a match\case, but then realised poetry supports versions of python older than 3.10.

I think I got the gist of where you were going in your code snippet. I pulled out the creation of the prioritised hash types into a global, as I didn't like the idea of doing all that collection manipulation for every archive and to only get the same answer every time. Also python set is an unordered collection, so you wouldn't have got your intended behaviour.

As to your MD5 and SHA1 comment, preferring other algorithms is a must (which is what this implementation does). Omitting them from the list will remove the ability to ever use them. I took it that was not your intent, but if you did want to prohibit the use of archives with only an MD5 or SHA1 hash then I will remove them. That does tend to force people to point to git repositories to avoid any checks (or to lesser dependency management tools 😝), which is obviously worse. Is there a good mechanism to warn users if we have had to check an archive using MD5 or SHA1? Or perhaps create a feature to add a configuration option?

@radoering
Copy link
Member

I pulled out the creation of the prioritised hash types into a global, as I didn't like the idea of doing all that collection manipulation for every archive and to only get the same answer every time.

Makes sense.

Also python set is an unordered collection, so you wouldn't have got your intended behaviour.

Not sure what your are referring to. The preferred/prioritized hashes must be ordered, but the hash_types we pass to the function can be a set.

Omitting them from the list will remove the ability to ever use them.

Not when using my code snippet. 😉 My idea is that we still choose an available hash type even if none is in our preferred list. That part is missing now. I'll add a comment in the code to make it clearer.

Is there a good mechanism to warn users if we have had to check an archive using MD5 or SHA1?

We could have another list/set of deprecated hash types. However, I'm not sure if it's worth it.

src/poetry/utils/helpers.py Outdated Show resolved Hide resolved
src/poetry/utils/helpers.py Outdated Show resolved Hide resolved
@GMouzourou
Copy link
Contributor Author

So I've:

  • Removed MD5 and SHA1 from the prioritised list.
  • Introduced your non_prioritized_available_hash_types and the processing that goes with it.

I didn't include the warning log you had as this catered for in executor.py.

@GMouzourou
Copy link
Contributor Author

Also python set is an unordered collection, so you wouldn't have got your intended behaviour.

Not sure what your are referring to. The preferred/prioritized hashes must be ordered, but the hash_types we pass to the function can be a set.

I thought preferred_hashes = ('sha512', 'sha256', ...) was a set not a tuple. My bad.

tests/installation/test_executor.py Outdated Show resolved Hide resolved
tests/installation/test_executor.py Outdated Show resolved Hide resolved
src/poetry/utils/helpers.py Show resolved Hide resolved
Copy link
Member

@radoering radoering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor nitpicks.

src/poetry/installation/executor.py Outdated Show resolved Hide resolved
src/poetry/installation/executor.py Outdated Show resolved Hide resolved
src/poetry/utils/helpers.py Outdated Show resolved Hide resolved
src/poetry/utils/helpers.py Show resolved Hide resolved
@radoering radoering merged commit a67c0ec into python-poetry:master Jan 20, 2024
20 checks passed
Copy link

github-actions bot commented Mar 3, 2024

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable multiple file hashing methods for backwards and forwards compatibility
3 participants