Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get hashes from PyPI JSON API #1109

Merged
merged 1 commit into from
Apr 21, 2020

Conversation

atugushev
Copy link
Member

@atugushev atugushev commented Apr 19, 2020

Resolves #543.

Before
$ time echo web3 | pip-compile - -qo- --generate-hashes > /dev/null

real	1m32.754s
user	0m10.959s
sys	0m3.083s
After
$ time echo web3 | pip-compile - -qo- --generate-hashes > /dev/null

real	0m4.824s
user	0m2.603s
sys	0m0.141s

Changelog-friendly one-liner: Get hashes from PyPI JSON API.

Contributor checklist
  • Provided the tests for the changes.
  • Gave a clear one-line description in the PR (that the maintainers can add to CHANGELOG.md on release).
  • Assign the PR to an existing or new milestone for the target version (following Semantic Versioning).

@atugushev atugushev added the enhancement Improvements to functionality label Apr 19, 2020
@codecov
Copy link

codecov bot commented Apr 19, 2020

Codecov Report

Merging #1109 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1109      +/-   ##
==========================================
+ Coverage   99.44%   99.46%   +0.01%     
==========================================
  Files          36       36              
  Lines        2540     2631      +91     
  Branches      312      318       +6     
==========================================
+ Hits         2526     2617      +91     
  Misses          8        8              
  Partials        6        6              
Impacted Files Coverage Δ
piptools/repositories/pypi.py 94.52% <100.00%> (+1.26%) ⬆️
tests/test_repository_pypi.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a33e653...82d093f. Read the comment docs.

@atugushev atugushev marked this pull request as draft April 19, 2020 11:14
@graingert
Copy link
Member

it might be worth always getting the hashes now rather than skipping them if the hashes are already in the requirements.txt (this behaviour causes me problems, eg when new wheels are uploaded to an old version I don't get the hash of the new wheel

@atugushev
Copy link
Member Author

We could ignore current hashes using --rebuild. What do you think?

@atugushev
Copy link
Member Author

atugushev commented Apr 19, 2020

But if think it deserves a discussion in a separate issue.

for package_index in package_indexes:
url = "{url}/{name}/json".format(url=package_index.pypi_url, name=ireq.name)
try:
data = self.session.get(url).json()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably remember which URLs support the PyPI JSON api

Copy link
Member Author

@atugushev atugushev Apr 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we know that JSON API of a custom index URL is supported?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this URL doesn't 404

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it raises 404 then it falls back to the files hash checking.

Copy link
Member Author

@atugushev atugushev Apr 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps 404 should be handled explicitly because currently, it falls back on every HTTP/JSON error.

Copy link
Member Author

@atugushev atugushev Apr 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I might be missing your point, but I'll try to describe mine in detail.

Using this heuristic, we can't be sure that JSON API is not supported. Consider the following case:

Suppose we run the following command:

$ pip-compile --index-url=https://pypi1-server/ --extra-index-url=https://pypi2-server/

Step 1. When pip-tools processes package1 everything is okay because it hits https://pypi1-server/pypi/package1/json which responses 200.

Step2. After, pip-tools compiles package2 and hit the first index URL, which is https://pypi1-server/pypi/package2/json. But this URL raises 404 because the package is on pypi2-server. So it has to go to the second index URL, which is https://pypi2-server/pypi/package2/json, which responses 200.

Step3. Then pip-tools processes package3 the same as in step 1 with the first index URL.

How would we know that JSON API of a custom index URL is supported?
If this URL doesn't 404.

Considering the above statement the test case will be failed on step 3 because pypi1-server would be marked as "doesn't support JSON API" on step 2 because pip-tools couldn't find package2 on pypi1-server due to it responded 404.

Am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the simple index and the JSON API 404s then you don't know
if the simple index 200s and the JSON API 404s then you know the API is unsupported

Copy link
Member Author

@atugushev atugushev Apr 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yes, that would work. Assuming that suggestion is to hit less the PyPI servers now we have to hit the servers twice on 404. So what's the point of the optimization?

Copy link
Member Author

@atugushev atugushev Apr 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

graingert: You have to hit it twice to download the hashes from indexes that don't support the JSON API

Sorry, I don't understand what are you trying to achieve by that. Could you elaborate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think this should be fixed anyway please feel free to open an issue/PR. Thank you for the help!

@graingert
Copy link
Member

graingert commented Apr 19, 2020 via email

@atugushev atugushev marked this pull request as ready for review April 19, 2020 18:54
@atugushev
Copy link
Member Author

Okay, it's ready for review now.

@AndydeCleyre
Copy link
Contributor

I may be missing something in pip's internal vendored requests.session business, but AFAIU there's no timeout getting set during the self.session.get(url) call, which could be trouble.

If so, I notice there's one other case like that in pypi.py that predates this PR (from 627dbaf).

I'll take a closer look at this whole thing tonight, but it looks really nice, especially that timed example you've provided.

Thanks!

@atugushev
Copy link
Member Author

@AndydeCleyre

It respects pip's options.timeout, which is default 15 seconds.

@AndydeCleyre
Copy link
Contributor

Great! This is a much appreciated improvement.

If you like, minor English edits in the docstrings:

  • "by a given InstallRequirement" -> "for a given..." (x2)
  • "from every release files" -> "of all release files" or "for all release files"

@atugushev
Copy link
Member Author

Thanks @AndydeCleyre for reviewing this! Your suggestions addressed in a03f3fd17dafd0e8477f11785f03099e224f222c.

@atugushev atugushev force-pushed the hashes-from-pypi-json-api-543 branch from a03f3fd to 82d093f Compare April 21, 2020 18:36
@atugushev
Copy link
Member Author

Rebased onto the master and squashed commits.

@atugushev atugushev added this to the 5.1.0 milestone Apr 21, 2020
@atugushev atugushev merged commit 93454b7 into jazzband:master Apr 21, 2020
@atugushev atugushev deleted the hashes-from-pypi-json-api-543 branch April 21, 2020 20:17
@atugushev
Copy link
Member Author

@AndydeCleyre @auvipy @graingert

Thanks for reviewing this! 🙏

@graingert
Copy link
Member

graingert commented Apr 27, 2020

@atugushev the speedup here is going to be incredible! ❤️ ❤️ ❤️ ❤️ 🎉

I was just rehashing a requirements that uses numpy pandas and lxml and thinking longingly of this PR.

@atugushev I think it's worth collecting some benchmarks for the changelog, like "generate-hashes in 5.1 will take 0.001x the time 5.0 used to"

@atugushev
Copy link
Member Author

pip-tools v5.1.0 is released. Thanks @graingert!

@graingert
Copy link
Member

@atugushev fyi this PR means you now get .egg hashes in the requirements.txt, eg:

 toml==0.10.0 \
     --hash=sha256:229f81c57791a41d65e399fc06bf0848bab550a9dfd5ed66df18ce5f05e73d5c \
     --hash=sha256:235682dd292d5899d361a811df37e04a8828a5b1da3115886b73cf81ebc9100e \
+    --hash=sha256:f1db651f9657708513243e61e6cc67d101a39bad662eaa9b5546f789338e07a3 \

@atugushev
Copy link
Member Author

Erm, I wonder should we skip egg hashes?

@graingert
Copy link
Member

@atugushev what mechanism skipped them before?

@atugushev
Copy link
Member Author

atugushev commented Apr 27, 2020

pip doesn't support installation from eggs, so it picks wheels and source dists.

@atugushev
Copy link
Member Author

atugushev commented Apr 27, 2020

So we could skip "packagetype": "bdist_egg" digests, since they aren't useful or leave it as is.

@graingert
Copy link
Member

@atugushev #1121

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvements to functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

use sha256 hashes from warehouse
4 participants