Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: simplify python dependencies management #27465

Closed
wants to merge 35 commits into from

Conversation

mistercrunch
Copy link
Member

@mistercrunch mistercrunch commented Mar 11, 2024

SUMMARY

POC around simplifying our python dependency setup. Goal ultimately is to:

  • restore dependabot support
  • untangle complex nested dependencies / simplify
  • get out of the current impasse with pip-compile-multi - where we may have to bissect 30+ upgrades to figure out which one breaks the build

Find more context here -> #27413

Simplify from 6 to 2 bundles -> base.txt and development.txt

Currently we have 6 bundles -> base/development/testing/integration/docker/local
Here I'm aiming at bringing it down to 2 ->

  • a base/lean bundle
  • a development bundle (with everything)

While the benefits of simplifying are clear, the tradeoff is having a more bloated setup for development. My stance here that having a more bloated dev setup is better than having 3-4 that are confusing and half-broken at times. Also note that while it's super important for the production dependencies to be pinned, it matters a bit less going into the longer tail of dependencies.

I'm certain the current setup with 6 requirements files is overly engineered/complex and should be simplified. Is 2 the sweet spot? Do we need 3 to strike the right balance? Let's POC with 2 and see how it impacts CI / dev enviornments

This is a bit of a POC, if successful, next step could be to move away from
pip-compile-multi and onto a simpler pip-compile setup, and restore
dependatot support.

What's in this PR

I got pulled deeper than I expected in this PR, resulting in related
side-quests that I took on and want to list out here so you understand
what's in the 40 files touched by this PR:

  • Moved towards pyproject.toml the now-standard way to manage deps and project semantics.
    This is compatible with dependabot and the ecosystem in general
  • created requirements/pip-compile-superset.py a simple utility/CLI that wraps pip-compile
    commands and features around python dependency management, sure you can use pip-compile
    directly, but this utility is a bit more opinionated (which files to generate with what input)
  • documented and spread the use of pip-compile-superset.py
  • Moved all requirements/*.in files into pyproject.toml
  • introducing a setup-backend github action to DRY the python-setup logic in one place.
    now that is used consistently across a dozen+ actions, we have more consistency and can
    ultimately tweak for more perf
  • moved all actions to use this reusable action

Mechanics

  • Moving to using the now standard pyproject.toml - as specified in PEP-517 & PEP-518
  • Using simple, basic pip-compile to
    • build the simple requirements/base.txt
    • build a more complex requirements/development.txt, by specifying many --extra that point to group of "extra_requires". This file is a superset of the base file, and contrarily to before, re-itemizes each dependency (pip-compile-multi was able to point one to another - so it'd be DRYer)

Composition in specific envs

How to use the existing constructs to build upon them in your own environment, like we do at Preset and Airbnb? What's the pattern to follow given these new mechanics. Listing out the needs/requirements around this:

  • allow "extending" the requirements with extra/external/arbitrary new libraries while specifying ranges
  • allow "overriding" rules for any library specified or inherited in the tree
  • reuse as much of the pinned list from the tested list, meaning we want to be as close to possible as base.txt except where specfied

Step by step:

  • create a new preset-prod-requirements.in that references the extra you want, first line should be something like -e .[hive,presto,mysql]
  • add your rules on top, things like cryptography>=3.0.0,<4.0.0 or some-lib-only-me-needs
  • copy the rich/tested development.txt output file as your target. cp requirements/development.txt requirements/preset-prod-requirements.txt so it becomes the basis for your environment.
  • compile with constraints pip-compile -o preset-prod-requirements.txt preset-prod-requirements.in -c requirements/base.txt. Note that the output here acts as an input as well, and that we're counting on the fact that our input file + the latest output file leads to a new deterministic output that fits the criteria above

TESTING INSTRUCTIONS

Let CI play its course. This is a draft for now.

@github-actions github-actions bot added doc Namespace | Anything related to documentation github_actions Pull requests that update GitHub Actions code labels Mar 11, 2024
Copy link

codecov bot commented Mar 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.71%. Comparing base (85d0d88) to head (28188d4).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #27465      +/-   ##
==========================================
+ Coverage   67.30%   69.71%   +2.40%     
==========================================
  Files        1909     1909              
  Lines       74678    74665      -13     
  Branches     8325     8325              
==========================================
+ Hits        50262    52051    +1789     
+ Misses      22366    20564    -1802     
  Partials     2050     2050              
Flag Coverage Δ
hive 53.71% <100.00%> (?)
mysql 78.01% <100.00%> (?)
postgres 78.10% <100.00%> (-0.02%) ⬇️
presto 53.66% <100.00%> (?)
python 83.12% <100.00%> (+4.99%) ⬆️
sqlite 77.56% <100.00%> (?)
unit 56.58% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

POC around simplifying our python dependency setup. Goal ultimately is
to restore dependabot support and overall untangle complex nested dependencies.

Here I'm aiming at having two bundles: a base/lean bundle, and a
development bundle. We used to have testing/integration/docker/local and
we're merging all this into development.

While the benefits of simplifying are clear, the tradeoff is having a
more bloated setup for development. My stance here that having a more
bloated dev setup is better than having 3-4 that are confusing and
half-broken at times.
BASE_REQS = "requirements/base.txt"
DEV_REQS = "requirements/development.txt"

DEV_EXTRAS = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this CLI essentially add another layer that we need to work though? I would have thought that the old compile-multi would have just had a include file for these deps, no? I'm a little concerned that folks will need to come and update this python file whenever they need to tune a dep, so you'd need to add here AND in the toml, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, originally I just had a bash scripts that would run pip-compile --extra {whole_list_here}, as we need those semantics somewhere. Those semantics used to live in .in files and needed a new place.

Now I realized I needed a few things on tip of the bash script and thought a tiny CLI would be useful so that :

  • I wanted a way to make sure development.txt is in sync with base.txt for the librairies that they share, and prevent any drift. In the past, that was insured by having the .txt reference one another. I think I should add a CI check that runs pip-compile-superset.py compare-versions to make sure things don't drift for whatever reason. Happy to add this check part of this PR
  • compile-deps this subcommand should insure that consistency when changing inputs and/or bumping libs. I mean you could run both pip-compile manually. Basically that means we can update both environments pins atomically
  • I wanted to offer a way to expand upon what's here. Sure you could pip install -r requirements/base.txt and pip install whatever else on top, but I wanted to offer a way to pip-compile your stuff alongside what's in here. I added docs as part of this PR both for CONTRIBUTING.md and in the producitonizing-related docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but I think we should try to keep a single source of truth for this list. Would it not suffice to add it to the toml file under the development "optional" req's? Just trying to avoid the need to update dependency lists in multiple locations going forward. If there end up being multiple requirement sources in the toml file, perhaps this CLI would perform a linting function which would ensure version consistency.

Copy link
Member Author

@mistercrunch mistercrunch Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, clarifying here, but the list of --extra here is referencing named GROUPs of requirements as specified in the pyproject.toml, these groups are useful in their own rights as someone might want to pick one or compose a collection of those for their envrionments. So those are not individual libs, but named groups of libs pre-defined in pyproject.toml, and listed in as required for our development.txt bundle

So this little array represent the list of groups we want to compose into our requirements/development.txt. Now why not having one big group called development with everything? Well people in the community might want to install different subsets of these groups, say mysql, snowflake and trino. Another company may want redshift and hive.

To define our own development.txt, I picked everything that's needed in some form or fashion by CI, plus the stuff used in common dev workflows filed under a development group. This contains libs like pylint / docker / mypy ...

"docker",
"flask-testing",
"freezegun",
"greenlet>=2.0.2",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't greenlet a requirement for gevent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, as requirements/development.txt compile, greenlet gets imported via all of these ->

greenlet==3.0.3
    # via
    #   apache-superset (setup.py)
    #   gevent
    #   playwright
    #   shillelagh

the reason why it made this development group is that somehow it was listed here https://github.com/apache/superset/blob/master/requirements/docker.in#L19 , so I merged that under development

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems related to #26396

@mistercrunch
Copy link
Member Author

@john-bodley this matured quite a bit since you look, would love to get your feedback on all this. I think it's viable as is and should address most of the origin goals while also addressing or mitigating concerns.

@mistercrunch
Copy link
Member Author

Problems with the current setup:

  • keeping base.txt and develpment.txt in locksteps - will dependabot know to submit PRs that update both base.txt and development.txt? likely not, might require post-processing
  • ?

@mistercrunch
Copy link
Member Author

Now starting to think re-writing something like dependabot might be easier than all this ... I think I may factor out everything that's good/clear about this PR that's not related to pip-compile-multi (reducing to 2 .in files, improvements in GHA action re-use, ...)

@mistercrunch
Copy link
Member Author

Closing this for now in favor of simpler/less disruptive #27505 - might reconsider in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Namespace | Anything related to documentation github_actions Pull requests that update GitHub Actions code size/XXL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants