Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement proper FK in MetaDocumentORM and MetaLabelORM to work on PostgreSQL #1990

Merged
merged 25 commits into from
Jan 14, 2022

Conversation

ZanSara
Copy link
Contributor

@ZanSara ZanSara commented Jan 11, 2022

Implement a proper composite FK key for DocumentORM and LabelORM in MetaDocumentORM and MetaLabelORM respectively, to allow for the same ID in two different indices. This was not enforced by SQLite, but crashes "strict" databases like PostgreSQL.

Also modifies the tests to make them runnable on PostgreSQL, although for now the tests will still run on SQLite on the CI.


It's a breaking change because it modifies the SQL schema. Previously saved SQLite databases might not work with the current implementation. It's recommended to re-create them from scratch.

@ZanSara ZanSara changed the title Change PK for DocumentORM and LabelORM Add unique to id in ORMBase to prevent PostgreSQL issues Jan 11, 2022
@ZanSara ZanSara linked an issue Jan 11, 2022 that may be closed by this pull request
@ZanSara ZanSara changed the title Add unique to id in ORMBase to prevent PostgreSQL issues Implement proper FK in MetaDocumentORM and MetaLabelORM to work on PostgreSQL Jan 11, 2022
@ZanSara ZanSara requested review from tholor and tstadel and removed request for tholor January 12, 2022 11:25
@ZanSara ZanSara mentioned this pull request Jan 12, 2022
@ZanSara
Copy link
Contributor Author

ZanSara commented Jan 12, 2022

Changes needed to fix the bug:

  • Modify the small ORM classes in haystack/document_stores/sql.py
  • update_document_meta() was identifying documents only by index. This is now fixed by defaulting to the default index and adding a parameter that allows users to select a different one if needed.

Changes required to make the tests runnable:

  • SQLDocumentStore accepts a isolation_level parameter, which forced me to add it to FAISSDocumentStore and both MilvusDocumentStores
  • Many tests and fixtures had to be modified to remove the explicit sqlite:/// path and replace it with a proper URL that depends on an env var placed in tests/conftest.py
  • Many tests had to include an explicit isolation_level="AUTOCOMMIT" param with initializing their docstore
  • In some occasions the function get_document_store was used directly in tests instead of being used as a fixture, but to allow the tests to run on PostgreSQL, I needed to ensure a teardown was executed. The easiest way to do it was to convert the return into a yield, with ugly consequences (see diff). I think this might be tolerable for now, and will definitely be dealt with as part of the DocumentStore's refactoring (Rethink Document Stores #1897).

In addition, a few tests were relocated from test_faiss_and_milvus to test_document_stores as they were generic enough to stay there.

@ZanSara ZanSara marked this pull request as ready for review January 12, 2022 11:40
…-ai/haystack into fix_postgresql_composite_index_2
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL part looks good! Just wondering if we can simplify the tests a bit. Seems to me that we are convoluting things even more with this PR and make the tests harder to read.

haystack/document_stores/faiss.py Show resolved Hide resolved
haystack/document_stores/sql.py Outdated Show resolved Hide resolved
test/conftest.py Outdated Show resolved Hide resolved
@pytest.mark.skipif(sys.platform in ['win32', 'cygwin'], reason="Test with tmp_path not working on windows runner")
def test_faiss_index_save_and_load(tmp_path):
def test_faiss_index_save_and_load(tmp_path, sql_url):
document_store = FAISSDocumentStore(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it simplify things if we use get_document_store() here instead? Maybe we can get rid of the sql_url fixture then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use a fixture than a direct function call. What could be the advantage of using get_document_store() here? In general direct use of get_document_store() in the tests is something that gave me a lot of pain in this PR, and left me quite puzzled, so I might be missing something crucial...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. In general, a fixture would be nicer.
I think the history of this goes like this: some tests require passing quite special parameters to the document store (here: progress_bar=False). Instead of adding this as an "optional parameter" to the document_store fixture, we instantiated the document store just within the test. The get_document_store is probably something in the middle of the two approaches that avoids duplicate code across a few tests that require the same instantiation without polluting the document_store fixture.
However, if we find an elegant way to add all those optional params to the document store fixture itself, this would make things probably cleaner.

Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me. However I would prefer the global init/teardown of postgres instead of getting a lot of cluttered/ugly places in the tests.

haystack/document_stores/elasticsearch.py Outdated Show resolved Hide resolved
haystack/document_stores/weaviate.py Outdated Show resolved Hide resolved
test/conftest.py Outdated Show resolved Hide resolved
@ZanSara ZanSara merged commit e28bf61 into master Jan 14, 2022
@ZanSara ZanSara deleted the fix_postgresql_composite_index_2 branch January 14, 2022 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error when using other SQL backend than SQLite, e.g. Postgres
3 participants