Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using multiple indices in FAISS seems broken #3327

Closed
1 task
einarbmag opened this issue Oct 5, 2022 · 3 comments · Fixed by #3383
Closed
1 task

Using multiple indices in FAISS seems broken #3327

einarbmag opened this issue Oct 5, 2022 · 3 comments · Fixed by #3383
Labels
Contributions wanted! Looking for external contributions topic:document_store topic:faiss type:bug Something isn't working

Comments

@einarbmag
Copy link

einarbmag commented Oct 5, 2022

Describe the bug
I'm trying to use FAISS document store and want to store multiple use cases in the same document store, i.e. specify different "index" during writing and querying. Everything works as expected for the first index I create, but for the subsequent ones, the query results are unexpectedly limited.

I looked into it, and I think what's happening is:

in the FAISS db, documents in each new index are keyed by vector_id, and this gets reset to 0 in each index. However, in the SQL db, the "vector_id" key isn't reset to 0 for each index. So when get_documents_by_vector_ids gets called with the vector_id's from the FAISS lookup, you get the wrong results, and most likely no results at all, when querying any of the subsequently created indices.

Error message
No error

Expected behavior
The vector ids in the SQL db should match those in the FAISS db for a given index.

Additional context
Add any other context about the problem here, like document types / preprocessing steps / settings of reader etc.

To Reproduce

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline
document_store = FAISSDocumentStore(sql_url="sqlite://", similarity='cosine', embedding_dim=384)
retriever = EmbeddingRetriever(document_store=document_store, embedding_model='sentence-transformers/all-MiniLM-L6-v2')
pipeline = DocumentSearchPipeline(retriever)                   

#vector_id's in the SQL db vs FAISS db (maybe not exact order)
document_store.write_documents([{"content":"dogs are great"},  #vector_id:0 vs 0
                                {"content":"dogs are OK"},      #vector_id:1 vs 1
                                {"content":"cats are awful"}],    #vector_id:2 vs 2
                                    index="index1")
document_store.write_documents([{"content":"I love cake"},       #vector_id:3 vs 0
                                {"content":"I like cake"},     #vector_id:4 vs 1
                                {"content":"I hate cake"}],    #vector_id:5 vs 2
                                index="index2")

document_store.update_embeddings(retriever, index="index1")
document_store.update_embeddings(retriever, index="index2")

print(len(pipeline.run("I love dogs", params={"Retriever": {"index": "index1", "top_k":1}})['documents']))
# => 1 
print(len(pipeline.run("I like cake", params={"Retriever": {"index": "index2", "top_k":1}})['documents']))
# => 0

FAQ Check

System:

  • OS: MacOS 12.6
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 1.9.0
  • DocumentStore: FAISSDocumentStore
  • Reader: None
  • Retriever: EmbeddingRetriever
@julian-risch
Copy link
Member

Hi @einarbmag I was able to reproduce the bug that you described. Thank you for the detailed description. I think you are right about why the bug occurs, which already points to a way to fix it. 🙂 If you want, you would be more than welcome to try to create a pull request with a fix yourself. Otherwise, I'll discuss with the team how to address this issue in our next sprint.

@julian-risch julian-risch added type:bug Something isn't working topic:faiss topic:document_store Contributions wanted! Looking for external contributions labels Oct 5, 2022
@anakin87
Copy link
Member

Hey guys, as you know, I like to ⛏️ dig deep into code history.

vector_id is unique

In #556, vector_id was defined as unique:

vector_id = Column(String(100), unique=True, nullable=True)

Is this behavior still desired? I don't see why vector_id with the same value cannot exist in different indexes. But I'm not aware of all the internals...

Anyway, this uniqueness generated bugs, as reported in #1954.
So, in #1961, in update_embeddings, vector_id = self.faiss_indexes[index].ntotal was replaced by line 347:

if update_existing_embeddings is True:
if filters is None:
self.faiss_indexes[index].reset()
self.reset_vector_ids(index)
else:
raise Exception("update_existing_embeddings=True is not supported with filters.")
if not self.faiss_indexes.get(index):
raise ValueError("Couldn't find a FAISS index. Try to init the FAISSDocumentStore() again ...")
document_count = self.get_document_count(index=index)
if document_count == 0:
logger.warning("Calling DocumentStore.update_embeddings() on an empty index")
return
logger.info("Updating embeddings for %s docs...", document_count)
vector_id = sum(index.ntotal for index in self.faiss_indexes.values())

Current bug

It is caused by vector_id = sum(index.ntotal for index in self.faiss_indexes.values()) .
In fact, in update_embeddings, index2 is reset, but vector_id starts from 3, making the retrieval ineffective.

Possible solutions

If you agree that vector_id may not be unique, a quite easy solution is to revert #1961.

In any case, I am eager to hear your opinion and I am ready for the challenge of opening a PR 😄
@ZanSara any thoughts?

@ZanSara
Copy link
Contributor

ZanSara commented Oct 13, 2022

Hey @anakin87! 😊 Thank you once more for this research! I believe your analysis is correct and I don't see why we should have such mismatch between the vector_id in FAISS and in SQL. I'd imagine this was an early oversight and, as you discovered, it was later worked around.

So:

  1. Yes, let's remove the UNIQUE constraint on vector_id and revert Fix vector_id collision in FAISS #1961

  2. Then, let's test. Can we add one or few tests making sure the vector_id behave correctly?

  3. We should also run these tests on something else than SQLite to make sure it all works there. Feel free to use either PostgreSQL or MySQL or anything else similar, just not SQLite. I want to underline this because unfortunately SQLite does not really care about foreign key relationships, so if this uniqueness was used somehow, SQLite won't always complain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Contributions wanted! Looking for external contributions topic:document_store topic:faiss type:bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants