distinguish types of entries in the vector DB #23

Open
jba opened this issue Sep 17, 2024 · 0 comments
jba commented Sep 17, 2024

Currently we put every type of document into one vector DB:

  • GitHub issues
  • sections of Go documentation
  • Gerrit CLs
  • and so on.

Our Related Entities API (#22) may want to (a) let users ask for a subset of the possible types, and (b) classify results by type.

As for classifying the results: all the IDs are currently URLs, and it is easy to tell the type of doc from the form of the URL. I think we can continue that indefinitely, so we don't need separate namespaces or metadata to identify the type of doc.
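To make the URL-based classification concrete, here is a minimal sketch. The type names, the `TypeOf` function, and the specific URL shapes it matches (github.com issue paths, go.dev/doc paths, Gerrit `-review.googlesource.com/c/` paths) are assumptions for illustration, not what the code base actually uses.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// DocType is a hypothetical label for the kind of document behind an ID.
type DocType string

const (
	GitHubIssue DocType = "github-issue"
	GoDoc       DocType = "go-doc"
	GerritCL    DocType = "gerrit-cl"
	Unknown     DocType = "unknown"
)

// TypeOf guesses the document type from the form of its URL.
// The URL patterns below are assumed, not taken from the project.
func TypeOf(id string) DocType {
	u, err := url.Parse(id)
	if err != nil {
		return Unknown
	}
	switch {
	case u.Host == "github.com" && strings.Contains(u.Path, "/issues/"):
		return GitHubIssue
	case u.Host == "go.dev" && strings.HasPrefix(u.Path, "/doc/"):
		return GoDoc
	case strings.HasSuffix(u.Host, "-review.googlesource.com") && strings.HasPrefix(u.Path, "/c/"):
		return GerritCL
	}
	return Unknown
}

func main() {
	ids := []string{
		"https://github.com/golang/go/issues/123",
		"https://go.dev/doc/effective_go#names",
		"https://go-review.googlesource.com/c/go/+/12345",
		"https://example.com/other",
	}
	for _, id := range ids {
		fmt.Printf("%s -> %s\n", id, TypeOf(id))
	}
}
```

A function like this covers both needs: it can label results for (b) and serve as the filter predicate for (a).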

To support asking for a subset of types, we can just search for more documents and throw out the ones that don't match. That can be expensive, though, since we might have to do multiple searches with increasing limits until we get the docs we want. If we only let the user provide a threshold (max distance from the query) instead of a limit (number of documents), then a single call will do.

An alternative is to use a separate namespace for each type. Advantages are that the type of document would be more evident, and we could query different types concurrently. Disadvantages are that we'd have to rewrite everything, and we'd have to perform N queries instead of one and merge the results.
