
[QST] Does cuvs support IVFFlat for dataset sizes larger than GPU memory? #353

VoVAllen opened this issue Sep 26, 2024 · 8 comments


VoVAllen commented Sep 26, 2024

What is your question?
Does cuVS support building an index when the dataset size is larger than GPU memory?

Also, does cuVS support multi-GPU index building?

VoVAllen added the question (Further information is requested) label on Sep 26, 2024
cjnolet (Member) commented Sep 26, 2024

Hi @VoVAllen,

Thank you for opening an issue with your question. Some of the cuVS indexes do support out-of-core builds. For example, a CAGRA index can be built and then converted to an HNSW index without having to store all of the vectors on the GPU, as sketched below.
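
A minimal sketch of that CAGRA-to-HNSW flow, assuming the cuvs Python package (exact parameter names and signatures vary by release, so treat this as illustrative rather than authoritative):

```python
import numpy as np
from cuvs.neighbors import cagra, hnsw

dataset = np.random.random_sample((100_000, 768)).astype(np.float32)

# Build the CAGRA graph index on the GPU.
cagra_index = cagra.build(cagra.IndexParams(graph_degree=64), dataset)

# Convert to an HNSW index that is searched on the host, so the vectors
# do not all need to remain resident in GPU memory afterwards.
hnsw_index = hnsw.from_cagra(hnsw.IndexParams(), cagra_index)

queries = np.random.random_sample((10, 768)).astype(np.float32)
distances, neighbors = hnsw.search(hnsw.SearchParams(ef=128), hnsw_index, queries, 10)
```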

Can you provide a little more info about what you are aiming to do?

VoVAllen (Author) commented Sep 26, 2024

@cjnolet Thanks, we're developing a vector database that combines a form of quantization with an IVF index. We need the KMeans results first and then do the quantization part after that, and most of the time is currently spent on the KMeans.

Thus we're thinking of using the GPU to generate the KMeans result from the dataset, which could greatly accelerate the process. The typical dataset size we're targeting is 50M–100M vectors at 768–1536 dimensions, i.e. roughly 150 GB–600 GB in total. So we're wondering whether we can leverage cuVS to get the KMeans results. We plan to use an L4 (like a g5 instance on AWS) with 24 GB of GPU memory.

cjnolet (Member) commented Sep 26, 2024

Ah ha, that makes sense. Thanks for providing this info.

I think the usual suggested way of doing this would be to bring a subsample of your vectors into GPU device memory to train the k-means (you probably want to use a balanced k-means to guarantee a somewhat uniform distribution of points). Once the k-means centroids are trained, you can batch the remaining vectors through GPU memory and use the trained k-means to predict the centroid for each data point, as in the sketch below.
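
A sketch of that subsample-then-batch pattern, with cuML's `KMeans` standing in for a balanced k-means trainer; the file name `vectors.f32`, the subsample size, and the batch size are hypothetical placeholders:

```python
import numpy as np
from cuml.cluster import KMeans

dim, n_lists = 768, 4096
data = np.memmap("vectors.f32", dtype=np.float32, mode="r").reshape(-1, dim)

# 1. Train on a random subsample that comfortably fits in GPU memory.
sample_idx = np.random.choice(len(data), size=2_000_000, replace=False)
km = KMeans(n_clusters=n_lists).fit(np.ascontiguousarray(data[sample_idx]))

# 2. Stream the full dataset through the GPU in batches, predicting the
#    nearest centroid (i.e. the IVF list) for every vector.
assignments = np.empty(len(data), dtype=np.int32)
batch = 1_000_000
for start in range(0, len(data), batch):
    chunk = np.ascontiguousarray(data[start:start + batch])
    assignments[start:start + batch] = km.predict(chunk)
```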

Once you have all the centroid assignments, you have your IVF lists and you can do the quantization as needed. Are you planning to do product quantization here, or a different kind of quantization? We have a scalable version of IVF-PQ, which can be trained on a subset of the vectors; the remaining vectors can then be added to the index in batches without having to copy them all to device memory at the same time.
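
A sketch of that scalable IVF-PQ pattern (train on a subset, then extend in batches), based on the cuvs `ivf_pq` Python API; exact parameter names vary by release, and `vectors.f32` is again a placeholder:

```python
import numpy as np
from cuvs.neighbors import ivf_pq

dim = 768
data = np.memmap("vectors.f32", dtype=np.float32, mode="r").reshape(-1, dim)

# Train the IVF lists and PQ codebooks on a subsample only.
params = ivf_pq.IndexParams(n_lists=4096, add_data_on_build=False)
index = ivf_pq.build(params, np.ascontiguousarray(data[:2_000_000]))

# Add the full dataset in device-sized batches.
batch = 1_000_000
for start in range(0, len(data), batch):
    chunk = np.ascontiguousarray(data[start:start + batch])
    ids = np.arange(start, start + len(chunk), dtype=np.int64)
    index = ivf_pq.extend(index, chunk, ids)
```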

VoVAllen (Author) commented

@cjnolet We only need the centroids from the GPU KMeans part. The assignment and quantization parts will be done on the database side on the CPU, due to data-consistency issues with our current design. We can subsample for KMeans, but we feel the L4's 24 GB is too small for the sizes we're targeting, and the other GPU instances on AWS all come with 8 GPUs, which is too expensive and exceeds our needs. It would be nice if cuVS could support mini-batch KMeans or out-of-core KMeans, something like the sketch below.
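
For illustration, an out-of-core mini-batch k-means can be hand-rolled on a single GPU with CuPy, keeping only one batch resident on the device at a time (a Sculley-style update; this is a workaround sketch, not an existing cuVS API):

```python
import cupy as cp
import cupyx
import numpy as np

def minibatch_kmeans(data, k, batch=50_000, epochs=3, seed=0):
    """data: host array (e.g. np.memmap); only one batch lives on the GPU."""
    rng = np.random.default_rng(seed)
    centroids = cp.asarray(data[rng.choice(len(data), k, replace=False)])
    counts = cp.zeros(k, dtype=cp.float32)
    for _ in range(epochs):
        for start in range(0, len(data), batch):
            x = cp.asarray(data[start:start + batch])  # host -> device copy
            # Squared distances via ||x||^2 - 2 x.c + ||c||^2 (no huge broadcast).
            d2 = (cp.sum(x * x, 1)[:, None] - 2.0 * x @ centroids.T
                  + cp.sum(centroids * centroids, 1)[None, :])
            assign = d2.argmin(axis=1)
            # Per-centroid running-mean update with learning rate 1/counts.
            bc = cp.bincount(assign, minlength=k).astype(cp.float32)
            sums = cp.zeros_like(centroids)
            cupyx.scatter_add(sums, assign, x)
            counts += bc
            nz = bc > 0
            lr = (bc[nz] / counts[nz])[:, None]
            centroids[nz] += lr * (sums[nz] / bc[nz][:, None] - centroids[nz])
    return cp.asnumpy(centroids)
```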

cjnolet (Member) commented Sep 26, 2024

We have talked about supporting those. We have a pretty ambitious roadmap for the remainder of the year, but we definitely plan on supporting a more scalable k-means solution in the new year.

cjnolet (Member) commented Sep 27, 2024

@VoVAllen we do have a multi-GPU API being merged very soon that will allow for multi-GPU index building and search. It operates in two different modes: replication mode improves search throughput by replicating shards across multiple GPUs and load-balancing queries across them, while sharding mode improves scale by training different indexes on different GPUs and broadcasting each query to all of them during search.

Here is the PR for awareness: #231
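
Conceptually, sharding mode amounts to searching every shard and merging the per-shard top-k results. A hand-rolled illustration (not the multi-GPU API from #231; `search_fn` is a placeholder for any per-shard search call):

```python
import numpy as np

def search_sharded(shards, queries, k):
    """shards: list of (search_fn, id_offset) pairs, where search_fn(queries, k)
    returns (distances, local_ids) for that shard's slice of the dataset."""
    dists, ids = [], []
    for search_fn, offset in shards:
        d, i = search_fn(queries, k)      # broadcast the queries to this shard
        dists.append(d)
        ids.append(i + offset)            # map shard-local ids to global ids
    d = np.concatenate(dists, axis=1)     # shape: (n_queries, k * n_shards)
    i = np.concatenate(ids, axis=1)
    order = np.argsort(d, axis=1)[:, :k]  # keep the global top-k per query
    rows = np.arange(d.shape[0])[:, None]
    return d[rows, order], i[rows, order]
```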

VoVAllen (Author) commented

@cjnolet Thanks. In terms of my question, I think the better way to solve it would be some kind of mini-batch KMeans algorithm, or leveraging CPU memory via CUDA unified memory. There may be some performance trade-offs, but it scales more easily. And in real-world cases, an H100 is usually overkill for the vector-search scenario; an inference-class GPU like the L4 is more favorable due to cost and cloud availability.

cjnolet (Member) commented Oct 1, 2024

Hi @VoVAllen,

We have found managed/unified memory for oversubscription to not work particularly well in the ML space in general, because of the thrashing that occurs when memory is constantly moved back and forth between host and device. In fact, when oversubscribed by 2x or more, we have even seen deadlocks in many of these algorithms, simply because by their very nature they need the whole dataset on the device at the same time. This is one of the reasons we don't suggest this approach for these types of algorithms, and the main reason we prefer batching techniques instead.

We have received fairly overwhelming interest in a more scalable batched k-means method over the course of the year, so we are going to prioritize it for the next release or two.
