Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up pip list --outdated #7962

Merged
merged 18 commits into from
Apr 13, 2020
Merged

Speed up pip list --outdated #7962

merged 18 commits into from
Apr 13, 2020

Conversation

CrafterKolyan
Copy link
Contributor

@CrafterKolyan CrafterKolyan commented Apr 2, 2020

Fixes #7964

This thing is working way too slow now. The problem is that it fetches all versions in one thread.

Copy link
Contributor

@McSinyx McSinyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to be a bit lazier, e.g.

something like this

    def iter_packages_latest_infos(self, packages, options):
        with self._build_session(options) as session:
            finder = self._build_package_finder(options, session)

            def latest_infos(dist):
                typ = 'unknown'
                all_candidates = finder.find_all_candidates(dist.key)
                if not options.pre:
                    # Remove prereleases
                    all_candidates = [candidate for candidate in all_candidates
                                      if not candidate.version.is_prerelease]

                evaluator = finder.make_candidate_evaluator(
                    project_name=dist.project_name,
                )
                best_candidate = evaluator.sort_best_candidate(all_candidates)
                if best_candidate is None:
                    return None

                remote_version = best_candidate.version
                if best_candidate.link.is_wheel:
                    typ = 'wheel'
                else:
                    typ = 'sdist'
                # This is dirty but makes the rest of the code much cleaner
                dist.latest_version = remote_version
                dist.latest_filetype = typ
                return dist

            pool = Pool(DEFAULT_POOLSIZE)
            for dist in pool.imap_unordered(latest_infos, packages):
                if dist is not None:
                    yield dist

We still need to consider the pool size and the chunk size and whether to optionize them.

@McSinyx
Copy link
Contributor

McSinyx commented Apr 2, 2020

I don't think we need to close/join the pool though, since after the iterations all jobs are finished.

@CrafterKolyan
Copy link
Contributor Author

CrafterKolyan commented Apr 2, 2020

I don't think we need to close/join the pool though, since after the iterations all jobs are finished.

Even though the jobs are done it's better to clean all resources. For example there are extra threads which may become zombie and unaccessible. Also there are not only worker threads but some extra utility threads.

https://github.com/python/cpython/blob/ea9e240aa02372440be8024acb110371f69c9d41/Lib/multiprocessing/pool.py#L662-L666

And I had one test failed without this closing code.

@pfmoore
Copy link
Member

pfmoore commented Apr 2, 2020

I'm concerned about the complexity vs benefit tradeoff here. Adding multithreaded code adds a significant possibility for extra failures/issues, and you haven't really given any details as to what the time savings are like. Can you give some examples of before/after timings (preferably with examples that others could reproduce)? I have never personally found pip list --outdated too slow, so I'm lukewarm about this change.

What do other @pypa/pip-committers think?

@CrafterKolyan
Copy link
Contributor Author

CrafterKolyan commented Apr 2, 2020

@pfmoore Here is reproducible MINIMAL example. You need python3, pip and virtualenv installed.

python3 -m venv venv
source venv/bin/activate
pip install -U setuptools
pip install -U pip
pip install numpy scipy pandas matplotlib seaborn statsmodels sklearn jupyter tqdm requests six
git clone https://github.com/CrafterKolyan/pip.git --depth=1 --branch=patch-1
cd pip
time pip list --outdated
pip install .
time pip list --outdated
cd .. && deactivate

Cleaning:

rm -rf venv pip

Output example:
image

Here we have only about 67 packages installed with just a few popular Python packages and this fix already saves 2 seconds of real time. In Data Science problems I usually have at least 100 packages installed and I get about two times boost with this fix. Also I don't see any problem in multi-threading programs especially parts with requesting something from the Internet.

@CrafterKolyan
Copy link
Contributor Author

CrafterKolyan commented Apr 2, 2020

Some real results (no toy problem like in my example) from the person I know (and who wanted to stay anonymized):
image

All text is saved in original language. If you want I can translate it for you into English.

@nikitabobko
Copy link

nikitabobko commented Apr 2, 2020

Also I believe that all performance fixes should be documented right in the code via comment. So if it's going to be decided that performance is more important in this performance vs simplicity tradeoff then it's worth to sum up @CrafterKolyan benchmarks and write them into comment (or at least in commit message)

@CrafterKolyan
Copy link
Contributor Author

One more example from Mac OS X:
Before:
image
After:
image

@CrafterKolyan CrafterKolyan changed the title Speed up pip list --outdated Speed up pip list --outdated (Fix #7964) Apr 2, 2020
@CrafterKolyan
Copy link
Contributor Author

@nikitabobko I've added comment in code and also created an issue with example which is referenced in pull request name and eventually will be in commit name.

@McSinyx
Copy link
Contributor

McSinyx commented Apr 3, 2020

I can also confirm the speedup on my Debian testing system with 286 Python packages installed:

On master

real	0m36.089s
user	0m15.676s
sys	0m0.306s

On this branch

real	0m11.451s
user	0m11.653s
sys	0m0.369s

@CrafterKolyan, could you please switch back to multiprocessing.dummy.Pool? multiprocessing.pool.ThreadPool is an undocumented feature. Also, to close an issue when this gets merged, the fix keyword needs to be in the PR's description (not title).

@uranusjr
Copy link
Member

uranusjr commented Apr 3, 2020

This will need a test to ensure the indentation thread safty thing is not accidentally broken in the future.

@CrafterKolyan
Copy link
Contributor Author

CrafterKolyan commented Apr 5, 2020

@uranusjr added two tests. Actually they are more about working the same in created threads as in main thread of IndentingFormatter. But they are failing without this fix. Or should I test get_indentation and indent_log as independent functions (without any context for what they are actually were written) and think of some tests where some race conditions may happen?

@BrownTruck
Copy link
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Apr 9, 2020
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Apr 9, 2020
Copy link
Member

@pradyunsg pradyunsg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

With my 20.1 release manager hat on, this can be included in this month's release, if once the news fragment is updated.

I'll wait for a few days to let other @pypa/pip-committers chime in. :)

news/7962.bugfix Outdated Show resolved Hide resolved
@pradyunsg pradyunsg changed the title Speed up pip list --outdated (Fix #7964) Speed up pip list --outdated Apr 9, 2020
Co-Authored-By: Pradyun Gedam <pradyunsg@gmail.com>
@pradyunsg
Copy link
Member

Thanks @CrafterKolyan for the PR and for the prompt responses to review comments. :)

news/7962.bugfix Outdated Show resolved Hide resolved
Co-Authored-By: Pradyun Gedam <pradyunsg@gmail.com>
@pradyunsg pradyunsg added this to the 20.1 milestone Apr 10, 2020
Copy link
Member

@xavfernandez xavfernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally confident that pip's codebase is thread safe but this seems simple enough 👍

@pradyunsg
Copy link
Member

Yea I'm OK because this seems to not be touching anything other than logging and the specific code at hand.

@pradyunsg
Copy link
Member

Looks like at least our download progress indicator isn't thread safe.

https://github.com/pradyunsg/pip/pull/5/checks?check_run_id=579142888#step:6:2757

@CrafterKolyan
Copy link
Contributor Author

But this code doesn't use any progress bars

@pfmoore
Copy link
Member

pfmoore commented Apr 11, 2020

But this code doesn't use any progress bars

Understood. But you had to patch logging to fix thread safety. This PR could mean that future changes (for example, adding a progress bar to pip list!) need to do further thread safety patching in the same way.

The fact is, that we don't guarantee pip's code is thread safe. Thread safety has always been an unnecessary maintenance burden to require, given that pip doesn't use threads itself, This PR changes that, meaning that we now do have to guarantee thread safety, at least for the parts of pip's code called by this command.

As I said above, I'm concerned about the cost vs benefits of this change. It seems to me that we could be taking on a non-trivial maintenance cost to fix an issue that doesn't appear to be more than an inconvenience to a relatively small number of users.

@CrafterKolyan
Copy link
Contributor Author

CrafterKolyan commented Apr 11, 2020

My changes are not about thread safety but about not throwing exception that _log_state doesn't have indentation field in another thread. This code should have been there before. And if it is not intended to use threads at pip (and you actually don't use them) then why are there threading.local() variables?

Copy link
Contributor

@McSinyx McSinyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CrafterKolyan, I think once upon a time pip planned to be multithreading, but then other works got in the way and such speed enhancement is not yet implemented. BTW could you please remove _log_state.indentation = 0 since IMHO it is a bit misleading?

@pfmoore, with GH-825 under consideration, I think we should gradually prepare the codebase to be thread-safe, since Python 2 support AFAIK is not going to be dropped any time soon so asyncio is not a immediate option. I applied for GSoC with parallel download as the goal and would try to make the display-relevant part thread-safe this summer, if you pip fellows accept the proposal. That being said, there are considerations over the cost of such change and I would love to be given it a try. I cannot be 100% sure that I will stick around forever to be responsible of the thread-safety, but if the project succeed, it's very likely that I will have a stronger bond with pip and be able to help for a long-term.

The above is just my opinion and I hope you can make a decision on this matter in the way that works out best for both the maintainers and the users.

@pradyunsg
Copy link
Member

pradyunsg commented Apr 13, 2020

@pfmoore's concerns are reasonable and appropriate. Thinking about this a bunch more, I personally still feel that the benefits here could be worth it... at least enough to make it worth giving this a shot.


I think this will be a reasonable thing to include in pip 20.1-beta1 and to see if we get any user reports of problems due to this. That gives us the opportunity to communicate "hey, this is the first instance of pip actually utilizing parallelization within itself, so there may be rough edges". If we don't have any user reports of issues due to this, that's great. If we do, we'd treat it like any other regression in a new functionality -- if it's bad-enough, we'll revert the change and then decide if/how to reintroduce it.

@pfmoore
Copy link
Member

pfmoore commented Apr 13, 2020

+1 on the plan to have it available in 20.1 beta, and see if we get any issues in the wild.

@pradyunsg pradyunsg added the type: enhancement Improvements to functionality label Apr 13, 2020
@pradyunsg
Copy link
Member

Awesome. Looks like everyone is on board with the plan, so I'll go ahead and merge this in. :)

@pradyunsg pradyunsg merged commit 471bc0e into pypa:master Apr 13, 2020
@pradyunsg
Copy link
Member

Thanks for the PR @CrafterKolyan! ^>^

bors bot referenced this pull request in duckinator/emanate May 13, 2020
118: Update pip to 20.1 r=duckinator a=pyup-bot


This PR updates [pip](https://pypi.org/project/pip) from **20.0.2** to **20.1**.



<details>
  <summary>Changelog</summary>
  
  
   ### 20.1
   ```
   =================

Process
-------

- Document that pip 21.0 will drop support for Python 2.7.

Features
--------

- Add ``pip cache dir`` to show the cache directory. (`7350 &lt;https://github.com/pypa/pip/issues/7350&gt;`_)

Bug Fixes
---------

- Abort pip cache commands early when cache is disabled. (`8124 &lt;https://github.com/pypa/pip/issues/8124&gt;`_)
- Correctly set permissions on metadata files during wheel installation,
  to permit non-privileged users to read from system site-packages. (`8139 &lt;https://github.com/pypa/pip/issues/8139&gt;`_)
   ```
   
  
  
   ### 20.1b1
   ```
   ===================

Deprecations and Removals
-------------------------

- Remove emails from AUTHORS.txt to prevent usage for spamming, and only populate names in AUTHORS.txt at time of release (`5979 &lt;https://github.com/pypa/pip/issues/5979&gt;`_)
- Remove deprecated ``--skip-requirements-regex`` option. (`7297 &lt;https://github.com/pypa/pip/issues/7297&gt;`_)
- Building of local directories is now done in place, instead of a temporary
  location containing a copy of the directory tree. (`7555 &lt;https://github.com/pypa/pip/issues/7555&gt;`_)
- Remove unused ``tests/scripts/test_all_pip.py`` test script and the ``tests/scripts`` folder. (`7680 &lt;https://github.com/pypa/pip/issues/7680&gt;`_)

Features
--------

- pip now implements PEP 610, so ``pip freeze`` has better fidelity
  in presence of distributions installed from Direct URL requirements. (`609 &lt;https://github.com/pypa/pip/issues/609&gt;`_)
- Add ``pip cache`` command for inspecting/managing pip&#39;s wheel cache. (`6391 &lt;https://github.com/pypa/pip/issues/6391&gt;`_)
- Raise error if ``--user`` and ``--target`` are used together in ``pip install`` (`7249 &lt;https://github.com/pypa/pip/issues/7249&gt;`_)
- Significantly improve performance when ``--find-links`` points to a very large HTML page. (`7729 &lt;https://github.com/pypa/pip/issues/7729&gt;`_)
- Indicate when wheel building is skipped, due to lack of the ``wheel`` package. (`7768 &lt;https://github.com/pypa/pip/issues/7768&gt;`_)
- Change default behaviour to always cache responses from trusted-host source. (`7847 &lt;https://github.com/pypa/pip/issues/7847&gt;`_)
- An alpha version of a new resolver is available via ``--unstable-feature=resolver``. (`988 &lt;https://github.com/pypa/pip/issues/988&gt;`_)

Bug Fixes
---------

- Correctly freeze a VCS editable package when it is nested inside another VCS repository. (`3988 &lt;https://github.com/pypa/pip/issues/3988&gt;`_)
- Correctly handle ``%2F`` in URL parameters to avoid accidentally unescape them
  into ``/``. (`6446 &lt;https://github.com/pypa/pip/issues/6446&gt;`_)
- Reject VCS URLs with an empty revision. (`7402 &lt;https://github.com/pypa/pip/issues/7402&gt;`_)
- Warn when an invalid URL is passed with ``--index-url`` (`7430 &lt;https://github.com/pypa/pip/issues/7430&gt;`_)
- Use better mechanism for handling temporary files, when recording metadata
  about installed files (RECORD) and the installer (INSTALLER). (`7699 &lt;https://github.com/pypa/pip/issues/7699&gt;`_)
- Correctly detect global site-packages availability of virtual environments
  created by PyPA’s virtualenv&gt;=20.0. (`7718 &lt;https://github.com/pypa/pip/issues/7718&gt;`_)
- Remove current directory from ``sys.path`` when invoked as ``python -m pip &lt;command&gt;`` (`7731 &lt;https://github.com/pypa/pip/issues/7731&gt;`_)
- Stop failing uninstallation, when trying to remove non-existent files. (`7856 &lt;https://github.com/pypa/pip/issues/7856&gt;`_)
- Prevent an infinite recursion with ``pip wheel`` when ``$TMPDIR`` is within the source directory. (`7872 &lt;https://github.com/pypa/pip/issues/7872&gt;`_)
- Significantly speedup ``pip list --outdated`` by parallelizing index interaction. (`7962 &lt;https://github.com/pypa/pip/issues/7962&gt;`_)
- Improve Windows compatibility when detecting writability in folder. (`8013 &lt;https://github.com/pypa/pip/issues/8013&gt;`_)

Vendored Libraries
------------------

- Update semi-supported debundling script to reflect that appdirs is vendored.
- Add ResolveLib as a vendored dependency.
- Upgrade certifi to 2020.04.05.1
- Upgrade contextlib2 to 0.6.0.post1
- Upgrade distro to 1.5.0.
- Upgrade idna to 2.9.
- Upgrade msgpack to 1.0.0.
- Upgrade packaging to 20.3.
- Upgrade pep517 to 0.8.2.
- Upgrade pyparsing to 2.4.7.
- Remove pytoml as a vendored dependency.
- Upgrade requests to 2.23.0.
- Add toml as a vendored dependency.
- Upgrade urllib3 to 1.25.8.

Improved Documentation
----------------------

- Emphasize that VCS URLs using git, git+git and git+http are insecure due to
  lack of authentication and encryption (`1983 &lt;https://github.com/pypa/pip/issues/1983&gt;`_)
- Clarify the usage of --no-binary command. (`3191 &lt;https://github.com/pypa/pip/issues/3191&gt;`_)
- Clarify the usage of freeze command in the example of Using pip in your program (`7008 &lt;https://github.com/pypa/pip/issues/7008&gt;`_)
- Add a &quot;Copyright&quot; page. (`7767 &lt;https://github.com/pypa/pip/issues/7767&gt;`_)
- Added example of defining multiple values for options which support them (`7803 &lt;https://github.com/pypa/pip/issues/7803&gt;`_)
   ```
   
  
</details>


 

<details>
  <summary>Links</summary>
  
  - PyPI: https://pypi.org/project/pip
  - Changelog: https://pyup.io/changelogs/pip/
  - Homepage: https://pip.pypa.io/
</details>



Co-authored-by: pyup-bot <github-bot@pyup.io>
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label May 20, 2020
@lock lock bot locked as resolved and limited conversation to collaborators May 20, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation type: enhancement Improvements to functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve speed of pip list --outdated via parallelization
9 participants