Improve dependency management #1979

Closed
5 tasks done
ZanSara opened this issue Jan 10, 2022 · 6 comments · Fixed by #1994

Comments

@ZanSara
Contributor

ZanSara commented Jan 10, 2022

The current handling of dependencies is quite monolithic: users must install them all regardless of the subset of features they want to use. We should make Haystack more modular at install time.

Options

Nowadays there are several ways to properly handle dependency groups:

  • several requirements.txt files: quite old-fashioned by now and a bit harder to manage
  • extras_require in setup.py: "traditional" way, safe and widely used
  • pyproject.toml: the new way, as recommended by PEP517 and PEP660.
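
For illustration, the extras_require option could look roughly like the sketch below. The group names are taken from this issue; the package names inside each group and the package name itself are placeholder assumptions, not Haystack's actual dependency list.

```python
# Hypothetical setup.py fragment; all package lists are illustrative only.

# Installed on every plain `pip install <package>`.
INSTALL_REQUIRES = ["requests", "pydantic"]

# Opt-in groups, selected with `pip install <package>[rest,ui]`.
EXTRAS_REQUIRE = {
    "rest": ["fastapi", "uvicorn"],
    "ui": ["streamlit"],
    "docs": ["sphinx"],
}

# Keyword arguments that a real setup.py would pass on as
# `setuptools.setup(**SETUP_KWARGS)`.
SETUP_KWARGS = {
    "name": "example-package",  # placeholder name
    "install_requires": INSTALL_REQUIRES,
    "extras_require": EXTRAS_REQUIRE,
}
```

The setup() call itself is left out so that the fragment can be read and run on its own.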

Proposed dependency groups

  • minimal: basic Haystack on CPU with one single document store (inMemory maybe)
  • gpu: for running Haystack on GPU
  • rest: install also the REST server API deps
  • ui: install Streamlit deps
  • demo: rest + ui
  • ci: for GitHub runners
  • win: for Windows installs (if possible)
  • colab: to work around Colab-specific issues when necessary
  • One group for each document store
  • all_doc_stores: install the dependencies of all document stores
  • test: for the test dependencies
  • docs: for building documentation
  • code: black, linter and possible extra tools if/when we introduce them
  • all (or dev): complete dependency list for development and contributing. Includes all of the above.

We can also consider adding smaller groups for special components with exotic dependencies, like crawler, ocr, etc.
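
As a rough sketch of how aggregate groups such as demo, all_doc_stores and all could be derived from the smaller ones instead of being maintained by hand (the helper name `merge` and every package name below are assumptions made for the example):

```python
# Hypothetical sketch: building aggregate groups from smaller ones.
# Group names follow the proposal above; package names are placeholders.

def merge(*groups):
    """Return the sorted, de-duplicated union of several dependency lists."""
    return sorted({dep for group in groups for dep in group})

extras = {
    "rest": ["fastapi", "uvicorn"],
    "ui": ["streamlit", "requests"],
    "elasticsearch": ["elasticsearch"],
    "faiss": ["faiss-cpu"],
    "test": ["pytest"],
    "docs": ["sphinx"],
}

# Aggregates are defined once, in terms of the smaller groups.
extras["demo"] = merge(extras["rest"], extras["ui"])
extras["all_doc_stores"] = merge(extras["elasticsearch"], extras["faiss"])
# `all` is computed last so that it picks up every group defined so far.
extras["all"] = merge(*extras.values())
```

With this layout, adding a dependency to one small group automatically propagates to every aggregate that includes it.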

Default install

It's up for debate what the default install (pip install haystack) should look like.

The important point is that whatever is installed by default must be declared as a mandatory dependency. This is at least the case for extras_require in setup.py, though it might have changed with pyproject.toml. If it still holds, the default install should effectively be a minimal install. For example, if we include GPU deps in this group, they will become mandatory, and having a pure CPU install will be impossible.
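
A simplified model of this behaviour, as a sketch only: the function and package names are hypothetical, and real resolution in pip also deals with versions and environment markers.

```python
# Simplified model of which requirements are selected for `package[extra1,extra2]`.

def requirements_for(install_requires, extras_require, requested_extras=()):
    """Mandatory dependencies are always included; extras are added only on request."""
    selected = list(install_requires)
    for extra in requested_extras:
        for dep in extras_require[extra]:
            if dep not in selected:
                selected.append(dep)
    return selected

install_requires = ["numpy"]  # placeholder mandatory dependency
extras_require = {
    "gpu": ["example-gpu-lib"],  # placeholder package name
    "rest": ["fastapi"],
}

default_install = requirements_for(install_requires, extras_require)
rest_install = requirements_for(install_requires, extras_require, ["rest"])
```

Note that extras can only add to the selection: nothing the user passes can remove an entry from install_requires, which is why anything placed in the default install becomes mandatory.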

I will investigate the options and update this section with new information.

Related issues

Related to #1291, #1716, #1826, #1806

Closes #1070

Next steps

  • Learn more about what's currently possible with pyproject.toml and whether all of our dependencies can actually work with it. As of last year there were still some issues with large libraries that needed complex build steps.
  • Finalize dependency groups list
  • Define what a default install should look like
  • Investigate how to properly handle failed imports for unmet dependencies
  • Fix dependency-related issues (like "Improve Colab setup experience by simplifying dependencies", #1806)
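
One common pattern for the failed-imports item, shown here only as a possible starting point: wrap the import of an optional module and re-raise with a hint about which extra to install. The helper name `optional_import` and the extra names are assumptions; farm-haystack is the package name used later in this thread.

```python
import importlib

def optional_import(module_name, extra, package="farm-haystack"):
    """Import an optional module, or raise an ImportError naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Try: pip install {package}[{extra}]"
        ) from exc
```

A component with exotic dependencies could call, for example, `optional_import("streamlit", "ui")` at the point where the module is first needed, so that users of other features never hit the error.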
@ZanSara ZanSara added the epic label Jan 10, 2022
@ZanSara ZanSara self-assigned this Jan 10, 2022
@ZanSara
Contributor Author

ZanSara commented Jan 10, 2022

@tholor @oryx1729 @julian-risch @tstadel @askainet @brandenchan @bogdankostic @lalitpagaria: Let me know what you think about the dependency groups and if you have any opinions about the topic 🙂

@lalitpagaria
Contributor

lalitpagaria commented Jan 11, 2022

I feel that many groups may confuse users. A small set of options like core, minimal, all, colab, etc. might be better. Then anyone who wants their own customization, for example the one mentioned in #1716, can use minimal and install whatever else they like.

Not related to this task, but how about a CLI utility that takes a pipeline YAML as input and lists or installs the dependencies required to run that pipeline smoothly?
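
The core of such a utility might be sketched as follows. Everything here is an assumption made for illustration: the mapping from component type to extra, the extra names, and the config layout (a top-level components list whose entries have a type key). YAML parsing is left out, so the function takes an already parsed dictionary.

```python
# Hypothetical mapping from component type to the extra that provides it.
COMPONENT_EXTRAS = {
    "ElasticsearchDocumentStore": "elasticsearch",
    "FAISSDocumentStore": "faiss",
    "Crawler": "crawler",
}

def extras_for_pipeline(config):
    """Return the sorted extras needed by the components of a parsed pipeline config."""
    needed = set()
    for component in config.get("components", []):
        extra = COMPONENT_EXTRAS.get(component.get("type"))
        if extra is not None:
            needed.add(extra)
    return sorted(needed)

# Example input, standing in for the result of parsing a pipeline YAML file.
config = {
    "components": [
        {"name": "Store", "type": "FAISSDocumentStore"},
        {"name": "Retriever", "type": "EmbeddingRetriever"},
    ]
}
needed_extras = extras_for_pipeline(config)
```

Components without an entry in the mapping are treated as part of the base install in this sketch.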

@tholor
Member

tholor commented Jan 11, 2022

I agree with @lalitpagaria and would reduce the groups a bit:

minimal: basic Haystack on CPU with one single document store (inMemory maybe)
gpu: for running Haystack on GPU
rest: install also the REST server API deps
ui: install Streamlit deps

demo: rest + ui
ci: for GitHub runners => Not sure, but we can probably use dev or all in the CI
win: for Windows installs (if possible)
colab: to work around Colab-specific issues when necessary
One group for each document store
all_doc_stores: install the dependencies of all document stores
test: for the test dependencies
docs: for building documentation
code: black, linter and possible extra tools if/when we introduce them

=> group the above three as dev
all (or dev): complete dependency list for development and contributing. Includes all of the above.

Not sure how many dependencies we have for "preprocessing / conversion", but there might also be room for an extra category "preprocessing".

Thinking about which version is the "default", we might also consider something between "minimal" and "all". Maybe really just calling it internally default or standard.

Also: I think once we have the basic structure implemented it will be rather easy to extend the list of options here if we see the need.

@ZanSara
Contributor Author

ZanSara commented Jan 11, 2022

Thank you both for the feedback! I'm ok with reducing the groups, of course. I like @tholor's list except for the demo-related deps: I think many people would like to use the REST API with their own frontend, so I'd rather keep rest and ui and remove demo.
Regarding preprocessing, I agree that many weird deps will go there, so it probably makes sense. On the other hand, if it ends up being fundamental for most users, making a separate group could be confusing.

Proposed list:

  • minimal: basic Haystack on CPU with one single document store (inMemory maybe)
  • gpu: for running Haystack on GPU
  • rest: install also the REST server API deps
  • ui: install Streamlit deps
  • win: for Windows installs (if possible)
  • colab: to work around Colab-specific issues when necessary
  • elasticsearch, faiss, milvus, milvus2, weaviate, graphdb, all_doc_stores
  • dev: test, docs and dev tools
  • all: complete dependency list for development and contributing. Includes all of the above.
  • preprocessing: to verify

That makes for 15(16) categories. Many indeed, but most users will not need to know about them anyway.

Note also that the syntax pip install haystack[rest,ui] is totally valid, so having a few more groups helps granularity without making the installation too complex.

@hugoperrin

Hello, it would make sense to me for transformers and torch to be optional dependencies. One might want to use their own code to test certain embeddings with an incompatible version of torch, while still respecting the fact that the document store query expects numpy arrays, for instance.

@ZanSara
Contributor Author

ZanSara commented Jan 25, 2022

Unfortunately I don't think it is an option. The problem is not your idea (in principle it's not bad), but the way pip and Python's dependency management work right now.

The issue here is that, currently, there is no way to specify "opt-out" dependencies in a setup.py or setup.cfg file. This implies that every optional dependency has to be explicitly opt-in, and I believe that asking every user to explicitly request pytorch and transformers alongside their Haystack install would be a bit awkward. We want pip install farm-haystack to come with sensible defaults; unfortunately, because of this limitation, those sensible defaults also have to be part of the "mandatory" requirements. That's why, for example, Elasticsearch is also included in the mandatory dependencies even though it's clearly optional: to our knowledge it's what most people use, and it would be odd to have to opt in explicitly.

If by any chance you know a good way to implement opt-out dependencies, I'll be glad to learn about it! Unfortunately, after a few days of research I came back empty-handed (see pypa/setuptools#1503, pypa/setuptools#1139). I even experimented with custom extra markers based on environment variables, to no avail. But as soon as the feature is implemented in pip, your idea will become viable, so let us know if you find a way to handle this 🙂
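
To make the limitation concrete, here is a sketch of one possible environment-variable opt-out inside setup.py. It is not necessarily the experiment mentioned above; the variable name EXAMPLE_SKIP_DEFAULTS and the dependency lists are hypothetical.

```python
import os

# Placeholder dependency lists; not Haystack's real requirements.
BASE_REQUIRES = ["numpy"]
DEFAULT_REQUIRES = ["torch", "transformers"]

def build_install_requires(environ=os.environ):
    """Drop the heavy default dependencies when the hypothetical opt-out variable is set."""
    if environ.get("EXAMPLE_SKIP_DEFAULTS") == "1":
        return list(BASE_REQUIRES)
    return BASE_REQUIRES + DEFAULT_REQUIRES

# In a real setup.py this list would be passed as `install_requires=...`.
install_requires = build_install_requires()
```

Code like this runs when the package is built, so it can only affect installs from source. A prebuilt wheel carries a fixed requirement list and never sees the environment of the user running pip install, which is one reason this approach does not give a real opt-out.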


4 participants