Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Property id_hash_keys of documents never set #1920

Closed
ArzelaAscoIi opened this issue Dec 22, 2021 · 1 comment · Fixed by #2342
Closed

Property id_hash_keys of documents never set #1920

ArzelaAscoIi opened this issue Dec 22, 2021 · 1 comment · Fixed by #2342
Labels
type:bug Something isn't working

Comments

@ArzelaAscoIi
Copy link
Member

ArzelaAscoIi commented Dec 22, 2021

Describe the bug
The attributes of objects of type Document always have id_hash_keys = None.

Context
Documents have a property called id_hash_keys which is used to generate the id. Before merging this pr Haystack used strings to generate the id. Now we pass attributes on which the hash should be generated, i.e. content, meta, ...
If we create two documents with the same content and generate the ids from two different keys, we get different ids, but in both cases the "id_hash_keys" field is "None".

from haystack import Document
doc = Document(content="some text", content_type="text", id_hash_keys=["key1"])
doc2 = Document(content="some text", content_type="text", id_hash_keys=["key2"])

The parameter id_hash_keys is passed to the init function but never set by self.id_hash_keys = id_hash_keys.

There are now two options:
Remove the parameter from the Document primitive.
Since this paramerter is already not set, it will not cause any errors as long as the creation of the id works as expected. However, we may need to discuss whether a document needs to store these values.

Store the id_hash_keys by self.id_hash_keys = id_hash_keys
This will cause some tests to fail:

  1. Weaviate will complain, since the list of strings can not be parsed.
  2. Document.to_dict will need some fixes

General question:
How do we want to define document similarity in general? At the document store level, the id is used to determine whether a document is replaced on insert. Since this id is based on context, this is basically equivalent to comparing only the context. The __eq__ function of documents additionally takes the metadata and other fields to compare the documents. So there is a possibility that doc1==doc2 returns False, but when both documents are inserted into a document store, only the second document is inserted, which is passed to write_documents.

@ArzelaAscoIi ArzelaAscoIi added the type:bug Something isn't working label Dec 22, 2021
@julian-risch
Copy link
Member

julian-risch commented Mar 22, 2022

@bogdankostic @ArzelaAscoIi As of now we don't need Document to store id_hash_keys as an attribute. I can't come up with a potential use case that requires that. In the unlikely event that we need a Document to tell how its id was generated in future, we can add that functionality later. Until then, I'd say we remove id_hash_keys as an attribute of Document and just use id_hash_keys as a parameter to decide how to generate the id.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants