Improve dependency management #1979

ZanSara · 2022-01-10T13:30:18Z

The current handling of dependencies is quite monolithic: users must install them all regardless of the subset of features they want to use. We should make Haystack more modular at install time.

Options

Nowadays there are several ways to properly handle dependency groups:

several requirements.txt files: quite old-fashioned by now and a bit harder to manage
extras_require in setup.py: "traditional" way, safe and widely used
pyproject.toml: the new way, as recommended by PEP 517 and PEP 660.
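
As a rough illustration of the extras_require option, the groups could be declared as a plain dictionary and the aggregate groups derived from the smaller ones. This is only a sketch: the package names are placeholders chosen for illustration, not Haystack's actual dependency list, and the group names are taken from the proposal in this issue.

```python
# Illustrative only: the package names are placeholders, not Haystack's real dependencies.
extras_require = {
    "rest": ["fastapi", "uvicorn"],
    "ui": ["streamlit"],
    "faiss": ["faiss-cpu"],
    "elasticsearch": ["elasticsearch"],
    "dev": ["pytest", "black"],
}

# Aggregate groups are derived from the smaller ones so they cannot drift apart.
extras_require["all_doc_stores"] = sorted(
    set(extras_require["faiss"] + extras_require["elasticsearch"])
)
extras_require["all"] = sorted(
    {dep for deps in extras_require.values() for dep in deps}
)

# In a real setup.py this dict would be passed to setuptools:
#     setup(..., extras_require=extras_require)
```

Building all from the other groups means a new group only has to be added in one place.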

Proposed dependency groups

minimal: basic Haystack on CPU with one single document store (inMemory maybe)
gpu: for running Haystack on GPU
rest: install also the REST server API deps
ui: install Streamlit deps
demo: rest + ui
ci: for GitHub runners
win: for Windows installs (if possible)
colab: to workaround Colab specific issues when necessary
One group for each document store
all_doc_stores: install all the dependencies of all document stores
test: for the test dependencies
docs: for building documentation
code: black, linter and possible extra tools if/when we introduce them
all (or dev): complete dependency list for development and contributing. Includes all of the above.

We can also consider adding smaller groups for special components with exotic dependencies, like crawler, ocr, etc.

Default install

It's up for debate what the default install (pip install haystack) should look like.

The important point is that the dependencies installed in this case are effectively mandatory. This is at least the case for extras_require in setup.py, although it might have changed with pyproject.toml. If it still holds, the default install should effectively be a minimal install. For example, if we include GPU deps in this group, they will become mandatory, and a pure CPU install will be impossible.
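
A small sketch of why this matters, assuming the setup.py model where install_requires is always installed and extras can only add to it. The resolve_install helper and the package names are hypothetical and only mimic the top-level selection; pip's real resolver also handles versions and transitive dependencies.

```python
def resolve_install(install_requires, extras_require, requested_extras=()):
    """Return the top-level packages selected: mandatory deps plus requested extras.

    Simplified model: extras can only add packages, never remove mandatory ones.
    """
    packages = set(install_requires)
    for extra in requested_extras:
        packages.update(extras_require[extra])
    return sorted(packages)


# Placeholder package names, not Haystack's real dependency list.
install_requires = ["transformers", "torch"]
extras_require = {"rest": ["fastapi", "uvicorn"], "ui": ["streamlit"]}

default_install = resolve_install(install_requires, extras_require)
with_rest = resolve_install(install_requires, extras_require, ["rest"])
```

Whatever ends up in install_requires appears in every result, which is why anything GPU-specific placed there could not be opted out of.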

I will investigate the options and update this section with new information.

Related issues

Related to #1291, #1716, #1826, #1806

Closes #1070

Next steps

Learn more about what's currently possible with pyproject.toml and whether all of our dependencies can actually work with it. As of last year there were still some issues with large libraries that needed complex build steps.
Finalize dependency groups list
Define what a default install should look like
Investigate how to properly handle failed imports for unmet dependencies
Fix dependency-related issues (like #1806, "Improve Colab setup experience by simplifying dependencies")
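
For the failed-imports step, one possible approach is a small wrapper that turns a bare ImportError into a message naming the dependency group to install. This is a minimal sketch, not a decided design: safe_import is a hypothetical helper, and the package name in the hint is an assumption passed in as a parameter.

```python
import importlib


def safe_import(module_name, extra, package="farm-haystack"):
    """Import a module, or raise an ImportError that names the extra to install.

    `extra` is the dependency group that provides `module_name` (hypothetical mapping).
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install it with: pip install {package}[{extra}]"
        ) from exc
```

A component with optional dependencies could call this at construction time rather than at module import time, so that importing the package never fails just because an unused group is missing.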

ZanSara · 2022-01-10T13:40:26Z

@tholor @oryx1729 @julian-risch @tstadel @askainet @brandenchan @bogdankostic @lalitpagaria: Let me know what you think about the dependency groups and if you have any opinions on the topic 🙂

lalitpagaria · 2022-01-11T13:21:52Z

I feel that many groups may confuse users. A small set of options might be better, like core, minimal, all, colab, etc. Anyone who wants their own customization, for example the one mentioned in #1716, can use minimal and then install whatever they like.

Not related to this task, but how about a CLI utility that takes a pipeline YAML as input and lists or installs the dependencies required to run that pipeline smoothly?
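
The core of such a utility could look roughly like the sketch below. Everything here is an assumption made for illustration: the COMPONENT_EXTRAS mapping is hypothetical, the config is taken as an already-parsed dictionary (YAML loading is left out), and the layout with a components list of name/type entries is assumed rather than confirmed in this thread.

```python
# Hypothetical mapping from component type names to dependency groups.
COMPONENT_EXTRAS = {
    "ElasticsearchDocumentStore": "elasticsearch",
    "FAISSDocumentStore": "faiss",
    "Crawler": "crawler",
}


def required_extras(pipeline_config):
    """Collect the extras needed by the components of a parsed pipeline config."""
    extras = set()
    for component in pipeline_config.get("components", []):
        extra = COMPONENT_EXTRAS.get(component.get("type"))
        if extra:
            extras.add(extra)
    return sorted(extras)


def install_command(pipeline_config, package="farm-haystack"):
    """Build the pip command that would install the required extras."""
    extras = required_extras(pipeline_config)
    suffix = f"[{','.join(extras)}]" if extras else ""
    return f"pip install {package}{suffix}"


# Example config, as it might look after parsing the YAML (assumed layout).
config = {
    "components": [
        {"name": "Store", "type": "FAISSDocumentStore"},
        {"name": "Retriever", "type": "DensePassageRetriever"},
    ]
}
```

Components without an entry in the mapping are treated as covered by the base install.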

tholor · 2022-01-11T13:36:09Z

I agree with @lalitpagaria and would reduce the groups a bit:

minimal: basic Haystack on CPU with one single document store (inMemory maybe)
gpu: for running Haystack on GPU
rest: install also the REST server API deps
ui: install Streamlit deps
demo: rest + ui
~~ci: for GitHub runners~~ => Not sure, but we can probably use dev or all in the CI
win: for Windows installs (if possible)
colab: to workaround Colab specific issues when necessary
One group for each document store
all_doc_stores: install all the dependencies of all document stores
test: for the test dependencies
docs: for building documentation
code: black, linter and possible extra tools if/when we introduce them
=> group the above three as dev
all (or dev): complete dependency list for development and contributing. Includes all of the above.

Not sure how many dependencies we have for "preprocessing / conversion", but it might also be worth an extra category, "preprocessing".

Thinking about which version is the "default", we might also consider something between "minimal" and "all". Maybe really just calling it internally default or standard.

Also: I think once we have the basic structure implemented it will be rather easy to extend the list of options here if we see the need.

ZanSara · 2022-01-11T15:17:23Z

Thank you both for the feedback! I'm ok reducing the groups of course. I like @tholor's list except for the demo related deps: I think many people would like to use the REST API with their own frontend, so I'd rather keep rest and ui and remove demo.
Regarding preprocessing, I agree many weird deps will go here so probably it makes sense. On the other hand, if it ends up being fundamental for most users, making a separate group could be confusing.

Proposed list:

minimal: basic Haystack on CPU with one single document store (inMemory maybe)
gpu: for running Haystack on GPU
rest: install also the REST server API deps
ui: install Streamlit deps
win: for Windows installs (if possible)
colab: to workaround Colab specific issues when necessary
elasticsearch, faiss, milvus, milvus2, weaviate, graphdb, all_doc_stores
dev: test, docs and dev tools
all: complete dependency list for development and contributing. Includes all of the above.
preprocessing: to verify

That makes for 15(16) categories. Many indeed, but most users will not need to know about them anyway.

Note also that the syntax pip install haystack[rest,ui] is totally valid, so having a few more groups helps granularity without making the installation too complex.

hugoperrin · 2022-01-21T09:52:32Z

Hello, it would make sense to me for transformers and torch to be optional dependencies. One could want to use their own code to test certain embeddings with an incompatible version of torch, while still respecting the fact that you need to pass numpy arrays to the document store query, for instance.

ZanSara · 2022-01-25T09:14:48Z

Unfortunately I don't think it is an option. This doesn't have to do with your idea (in principle it's not bad), but with the way pip and Python's dependency management work right now.

The issue here is that, currently, there is no way to specify "opt-out" dependencies in a setup.py or setup.cfg file. This implies that every optional dependency has to be explicitly opt-in, and I believe that asking every user to explicitly ask for pytorch and transformers alongside their Haystack install would be a bit awkward. We want pip install farm-haystack to come with sensible defaults; unfortunately, because of this limitation, those sensible defaults also have to be part of the "mandatory" requirements. That's why, for example, Elasticsearch is also included in the mandatory dependencies even though it's clearly optional: to our knowledge it's what most people use, and it would be odd to have to opt in explicitly.

If by any chance you know a good way to implement opt-out dependencies, I'll be glad to learn about it! Unfortunately, after a few days of research I came back empty-handed (see pypa/setuptools#1503, pypa/setuptools#1139). I even experimented with custom extra markers based on environment variables, to no avail. But as soon as the feature is implemented in pip, your idea will become viable, so let us know if you find a way to handle this 🙂

ZanSara added the epic label Jan 10, 2022

ZanSara self-assigned this Jan 10, 2022

ZanSara added the breaking change, topic:dependencies, topic:installation and type:refactor labels and removed the breaking change label Jan 10, 2022

lalitpagaria mentioned this issue Jan 11, 2022

YAML Pipeline Validation #1981

Closed

4 tasks

ZanSara mentioned this issue Jan 12, 2022

Improve dependency management #1994

Merged

ZanSara linked a pull request Jan 12, 2022 that will close this issue

Improve dependency management #1994

Merged

tholor mentioned this issue Jan 19, 2022

Standardization of dependency management #1070

Closed

ZanSara closed this as completed in #1994 Jan 26, 2022
