Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement PEP-458: Secure PyPI downloads with signed repository metadata #8585

Open
Tracked by #10672
jku opened this issue Jul 15, 2020 · 23 comments
Open
Tracked by #10672

Implement PEP-458: Secure PyPI downloads with signed repository metadata #8585

jku opened this issue Jul 15, 2020 · 23 comments
Labels
PEP implementation Involves some PEP state: blocked Can not be done until something else is done type: feature request Request for a new feature

Comments

@jku
Copy link
Contributor

jku commented Jul 15, 2020

What's the problem this feature will solve?

PEP-458:

protect users of PyPI from compromises of the integrity, consistency, and freshness properties of PyPI packages, and enhance compromise resilience by mitigating key risk and providing mechanisms to recover from a compromise of PyPI or its signing keys

This will allow pip to be more secure against attacks on PyPI mirrors and PyPI's content distribution network. The implementation ("the minimum security model") supports verification of PyPI distributions that are signed with keys stored on PyPI, but the pip client implementation should just continue working if/when Warehouse moves to the "maximum security model" (PEP-480) where both PyPI and the developers sign distributions.

original discussion on the PEP: https://discuss.python.org/t/pep-458-secure-pypi-downloads-with-package-signing/2648

Describe the solution you'd like

pip should use TUF reference client library to secure downloads from pypi.org (and 3rd party Warehouse instances that support TUF). This should happen without affecting the user experience in any major way (except of course in the event of TUF preventing downloads for security reasons). The implementation should allow using both TUF-enabled and non-TUF-enabled repositories at the same time: no existing functionality should break.

More information:

Additional context

I'm currently planning how to do this and am prepared to work on the actual implementation. The current state of things is:

  • the Warehouse implementation is being worked on by William Woodruff: it is not complete and not merged
  • I think I have identified the core issues that need to be solved and I've tested most of them in a throw-away draft implementation
  • there are still open issues with regards to the exact API promise that Warehouse will have to make and the actual TUF configuration that is required: so the pip implementation should not be rushed
  • I intend to document the identified issues and general design here before doing any major coding (I'm hoping for the main issues to get resolved first but I will document in any case). With summer holidays at hand, this will probably be in August
  • The warehouse implementation has been discussed in Zulipchat on #pep458-implementation (https://python.zulipchat.com/#narrow/stream/223926-pep458-implementation) and their biweekly meeting.
@triage-new-issues triage-new-issues bot added the S: needs triage Issues/PRs that need to be triaged label Jul 15, 2020
@chrahunt chrahunt added PEP implementation Involves some PEP type: feature request Request for a new feature labels Jul 15, 2020
@triage-new-issues triage-new-issues bot removed the S: needs triage Issues/PRs that need to be triaged label Jul 15, 2020
@chrahunt
Copy link
Member

The reference implementation for a tuf client transitively depends on a handful of libraries that aren't pure Python (cryptography, for example). Our current vendoring policy states:

Vendored libraries MUST function without any build steps such as 2to3 or compilation of C code, practically this limits to single source 2.x/3.x and pure Python.

No comment on how to address this, I just wanted to call it out since it seems like the biggest potential roadblock to me.

@jku
Copy link
Contributor Author

jku commented Jul 16, 2020

I've reviewed the dependencies and I believe the issue is not as bad: The crypto choices made in Warehouse mean that the client should not need cryptography or pynacl libraries. My understanding is that currently pip would need to vendor only tuf, securesystemslib and maybe iso8601 as new dependencies (iso8601 is tiny but we may be able to drop it as well: theupdateframework/python-tuf#1065)

EDIT: I'll doublecheck this just to be sure later this week ;)

@jku
Copy link
Contributor Author

jku commented Jul 16, 2020

Could not leave this nagging in the back of my head: I've now double checked the transitive dependencies. Warehouse is planning to use ed25519 signatures which can be verified using the python implementation included in securesystemslib (a vendored https://github.com/pyca/ed25519/). No C dependencies are required.

@jku
Copy link
Contributor Author

jku commented Sep 1, 2020

A bit more detail on how I plan to implement this.

Summary

  • Vendored TUF (https://github.com/theupdateframework/tuf/tree/develop/tuf) reference implementation client library will do the hard work
  • pip users will be protected from an attacker able to take over some pypi infrastructure
  • pip users will also get a reliable snapshot view of the repository: everything installed with one pip command will be from a specific repository snapshot (repository cannot change during the process)
  • repositories that do not support TUF will be supported as before (even mixed with TUF-enabled repositories)
  • pip user experience does not change in significant ways (but see the Open Issues section)

Status

I have a WIP design doc, feel free to have a look but please note that it is mostly a reference for myself. I also have a pip fork with some filed issues to keep track of things and some preliminary code: I intend to work there until I have something that can actually be tested. At that point I'll come back and get some early review -- that should be a good time to find any absolute blockers.
I believe I have identified all of the missing pieces and issues with my plans (and the Warehouse plans): not all of these issues have been 100% solved yet, see below. The Warehouse implementation is not yet ready (it does not support index file verification) so unfortunately very little can be properly/easily tested right now.
If someone wants to test this or work on it: let me know and I'll show my test setup.

Example package install flow
Everything in italics is unchanged from current pip

Let me know if you'd like more details or examples.

Other details

  • the local metadata will be stored in user data dir, e.g. on linux $HOME/.local/share/pip/tuf-repositories/HASH/, where HASH is a hash of canonicalized repository url (e.g. "https://pypi.org/simple/")
  • initial metadata for pypi.org will be shipped with pip
  • There is no user configuration: repositories will be TUF-enabled if local metadata is found
  • Some sort of --disable-repository-security switch might be needed
  • download cache (if one is needed, this needs some discussion) will be in user cache dir, e.g. on linux $HOME/.cache/pip/tuf/ -- the existing caches are not really usable for the purpose (but see also the issue about whether TUF should download files or not)

Known Open Issues

There are a few more issues in https://github.com/jku/pip/issues/ but these are the main ones I believe:

@pfmoore
Copy link
Member

pfmoore commented Sep 1, 2020

TUF downloads files (and does not just verify them): this is a visible issue as TUF currently provides no progress indication, but it might be an issue otherwise too if pip really wants low level control of the http request or the cache

We do have download progress bars, so I think we need to consider that question. While I think "show stopper" is a bit strong, I think that pip just sitting there for potentially quite a long time (some packages are huge) with no indication of progress, is far from ideal, and we wouldn't want to permanently block any chance of having some sort of "downloading, xx% complete" progress indicator.

Also, how does this integrate with the existing work on parallel downloads and partial downloads of wheels to extract metadata? That sounds like pip needing low-level control of the HTTP request.

This seems like it's turning file downloads into a black box that pip has no control over, which doesn't seem ideal given that's where we are currently spending a lot of development effort... I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

Resolving the distribution target name from URL requires knowledge of the hash implementation or the server configuration

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

@McSinyx
Copy link
Contributor

McSinyx commented Sep 1, 2020

I'm cross-linking pypi/warehouse#8254 while trying to understand this thread 😛

@jku
Copy link
Contributor Author

jku commented Sep 1, 2020

We do have download progress bars, so I think we need to consider that question. While I think "show stopper" is a bit strong, I think that pip just sitting there for potentially quite a long time (some packages are huge) with no indication of progress, is far from ideal

Agreed, and adding this to a future TUF release is almost certainly not an issue.

Also, how does this integrate with the existing work on parallel downloads and partial downloads of wheels to extract metadata? That sounds like pip needing low-level control of the HTTP request.

On the issue of extracting dependency metadata from wheels: I've seen Warehouse folks discussing this (and providing the metadata separately from the wheels) with regards to TUF but cannot remember right now where that happened -- I will try to find it.

Parallel downloads are not supported by TUF at the moment but I do not see why they could not be in the future.

This seems like it's turning file downloads into a black box that pip has no control over, which doesn't seem ideal given that's where we are currently spending a lot of development effort.

This is indeed the most important question -- if low level control of http requests is something pip does want to do then all the previous issues are moot: we should instead talk to TUF folks about providing alternative API for this use case.

This does not have to be decided right now (I'll do an implementation with current API in any case) but the discussion can be started.

I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

I'm a simple guy and can only keep track of so many moving parts in my head. I've been concentrating on the TUF and Warehouse tuf implementation details so far, trying to make sure an implementation is even possible.

Resolving the distribution target name from URL requires knowledge of the hash implementation or the server configuration

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

No, I don't think this is an issue for pip. The index file contents do not change: the issue is that we need to split the distribution URL into distribution mirror base url and target name recognised by TUF: in the example these would be https://files.pythonhosted.org/packages/ and b8/f7/dd9223b39f683690c30f759c876df0944815e47b588cb517e4b9e652bcf7/sampleproject-2.0.0-py3-none-any.whl respectively. To do that split we need a little bit of information (e.g. that the target name has enough directories to form a blake2 hash): the Warehouse folks did not see this as an issue so far.

The reason I mention it is that if Warehouse happened to change those details in the future, that could then become very annoying for a client that made the wrong assumptions about the details...

@pradyunsg
Copy link
Member

pradyunsg commented Sep 1, 2020

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

(I think) This basically means that we'd have the hash included in the simple index page we look for (which will use information provided from the TUF metadata). Basically something like https://pypi.org/simple/PROJECT.HASH. See pypi/warehouse#8487 (comment) for context.

@pradyunsg
Copy link
Member

pradyunsg commented Sep 1, 2020

Parallel downloads are not supported by TUF at the moment but I do not see why they could not be in the future.

TBH, that's not an issue, since I think a much better area to make improvement once we have a TUF-protected PyPI is to give ourselves a JSON API for dependency resolution that's also TUF-protected.

@pradyunsg
Copy link
Member

I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

See #7819, which has most of the relevant discussions. I think the first post there covers basically everything that we're thinking of doing?

@jku
Copy link
Contributor Author

jku commented Sep 1, 2020

pypi/warehouse#8254 is the Warehouse issue about serving wheel metadata separately. This implementation should just work with TUF verification (unlike partial downloads -- I don't see how those could be reasonably verified even to the standard that wheel files are verified right now)

@dstufft
Copy link
Member

dstufft commented Sep 1, 2020

Partial downloads are basically unsupportable with signed data. There are some schemes to make it possible, but those schemes are all super fragile and annoying, and also really subtle in how you use them, which I wouldn't feel comfortable relying on.

Parallel downloads should be fine to implement, there's no security related concerns with that.

One big possible reason for wanting to keep control of the actual downloading in pip, is to support better control of concurrency and concurrency primitives. If the TUF library is a black box that does I/O it makes it difficult to say, move pip to an async io paradigm (not that we currently have any plans to do that), or to ensure that multiple things that are handling concurrency are all cooperating to ensure we maintain certain levels of concurrency and don't exceed those. Other benefits are things like handling authentication, timeouts, etc. There is a lot of stuff there, and it feels like it might be better if TUF exposed APIs such that it didn't expect to "own" the downloading, but instead it could communicate out "hey fetch these URLs", and then pip could fetch them, and then pass the contents back in for verification (and of course, there might be multiple rounds of this to fully verify things.

That's a bit harder to implement, so maybe it's too out of scope, but a set up like that allows a lot more control and reusability I think. It still doesn't allow things like partial downloads (because a key requirement here is that no fetched content is trusted until it's been verified, and we can't verify until we have the entire file).

@dstufft
Copy link
Member

dstufft commented Sep 1, 2020

This is more on the TUF side ofthings, and honestly it's a non trivial amount of work so it might not be worth it, but what I described above is basically writing the library in the sans-io style. If there's any appetite for doing that, I have some experience with that, so can advise if needed.

@pfmoore
Copy link
Member

pfmoore commented Sep 1, 2020

(I think) This basically means that we'd have the hash included in the simple index page we look for (which will use information provided from the TUF metadata). Basically something like https://pypi.org/simple/PROJECT.HASH. See pypi/warehouse#8487 (comment) for context.

Isn't that in violation of PEP 503, which says "Below the root URL is another URL for each individual project contained within a repository. The format of this URL is // where the is replaced by the normalized name for that project, so a project named "HolyGrail" would have a URL like /holygrail/"?

@dstufft
Copy link
Member

dstufft commented Sep 1, 2020

Those URLs still work, the implication here is that the TUF PEP updates the simple repository format to have additional URLs, if that repository opts into TUF.

@pfmoore
Copy link
Member

pfmoore commented Sep 1, 2020

So there's an update needed to PEP 503, then? Or is this covered in the TUF PEP(s) (which I haven't followed, so I've no idea what state they are in)?

At a minimum I'd expect there to be a statement somewhere that the PEP 503 URLs and the "other" ones MUST contain the same content... (Sorry, I'm putting my standards hat on now...)

@dstufft
Copy link
Member

dstufft commented Sep 1, 2020

The other ones are snapshots of the 503 URLs, so they'll contain the same content... at some specific point in time. While the 503 URLs are basically the latest URLs.

@pfmoore
Copy link
Member

pfmoore commented Sep 1, 2020

OK. So all covered in the TUF PEPs then I guess? (I have no idea about snapshot stuff).

My only real concern is the extent to which this is not transparent to the client, so pip maintainers need to know this stuff. Pip's source code already makes my head hurt, and it's only being able to check the standards that keeps me sane sometimes 🙁

(I'm not joking here - I think we have a real problem with pip's codebase becoming so complex that even the core maintainers struggle to follow it. We need to reduce complexity, not increase it).

@dstufft
Copy link
Member

dstufft commented Sep 1, 2020

Yea it's covered in the TUF PEP, and ideally abstracted away in the TUF client.

@jku
Copy link
Contributor Author

jku commented Oct 23, 2020

There is now a draft PR available: I'm hoping for high-level review and maybe opinions on the discussion items listed in the PR.

The PR is from the branch with vendored sources so it's quite massive: there's another branch that only includes the code changes without the vendoring: master...jku:tuf-mvp

If you think I should make a PR with the code changes only, let me know: I did it this way to show that lint and and unit tests pass (I'm not sure what the HTTPTooManyRequests errors in some of the integration tests are).

@pradyunsg
Copy link
Member

(I'm not sure what the HTTPTooManyRequests errors in some of the integration tests are).

#9030

@di
Copy link
Sponsor Member

di commented Feb 1, 2022

FYI, we now have a roadmap for PEP 458 for PyPI: pypi/warehouse#10672

@pradyunsg
Copy link
Member

I've labelled this as blocked, since this needs the entire checklist on pypi/warehouse#10672 to be completed first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PEP implementation Involves some PEP state: blocked Can not be done until something else is done type: feature request Request for a new feature
Projects
None yet
Development

No branches or pull requests

7 participants