
[QST] Does cuvs support IVFFlat for dataset sizes larger than GPU memory? #353

VoVAllen opened this issue Sep 26, 2024 · 8 comments


VoVAllen commented Sep 26, 2024

What is your question?
Does cuVS support building an index when the dataset size is larger than GPU memory?

Also, does cuVS support multi-GPU index building?

VoVAllen added the question (Further information is requested) label on Sep 26, 2024
cjnolet (Member) commented Sep 26, 2024

Hi @VoVAllen,

Thank you for opening an issue with your question. Some of the cuVS indexes do support out-of-core builds. For example, a CAGRA index can be built and then converted to an HNSW index without having to store all of the vectors on the GPU, as sketched below.
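
A minimal sketch of that CAGRA-to-HNSW flow, assuming the cuvs Python package (exact parameter names and signatures vary by release, so treat this as illustrative rather than authoritative):

```python
import numpy as np
from cuvs.neighbors import cagra, hnsw

dataset = np.random.random_sample((100_000, 768)).astype(np.float32)

# Build the CAGRA graph index on the GPU.
cagra_index = cagra.build(cagra.IndexParams(graph_degree=64), dataset)

# Convert to an HNSW index that is searched on the host, so the vectors
# do not all need to remain resident in GPU memory afterwards.
hnsw_index = hnsw.from_cagra(hnsw.IndexParams(), cagra_index)

queries = np.random.random_sample((10, 768)).astype(np.float32)
distances, neighbors = hnsw.search(hnsw.SearchParams(ef=128), hnsw_index, queries, 10)
```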

Can you provide a little more info about what you are aiming to do?

VoVAllen (Author) commented Sep 26, 2024

@cjnolet Thanks, we're developing a vector database that combines a form of quantization with an IVF index. We need the KMeans results first and then do the quantization part after that, and most of the time is currently spent on the KMeans.

Thus we're thinking of using the GPU to generate the KMeans result from the dataset, which could greatly accelerate the process. The typical dataset size we're targeting is 50M–100M vectors at 768–1536 dimensions, i.e. roughly 150 GB–600 GB in total. So we're wondering whether we can leverage cuVS to get the KMeans results. We plan to use an L4 (like a g5 instance on AWS) with 24 GB of GPU memory.

cjnolet (Member) commented Sep 26, 2024

Ah ha, that makes sense. Thanks for providing this info.

I think the usual suggested way of doing this would be to bring a subsample of your vectors into GPU device memory to train the k-means (you probably want to use a balanced k-means to guarantee a somewhat uniform distribution of points). Once the k-means centroids are trained, you can batch the remaining vectors through GPU memory and use the trained k-means to predict the centroid for each data point, as in the sketch below.
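
A sketch of that subsample-then-batch pattern, with cuML's `KMeans` standing in for a balanced k-means trainer; the file name `vectors.f32`, the subsample size, and the batch size are hypothetical placeholders:

```python
import numpy as np
from cuml.cluster import KMeans

dim, n_lists = 768, 4096
data = np.memmap("vectors.f32", dtype=np.float32, mode="r").reshape(-1, dim)

# 1. Train on a random subsample that comfortably fits in GPU memory.
sample_idx = np.random.choice(len(data), size=2_000_000, replace=False)
km = KMeans(n_clusters=n_lists).fit(np.ascontiguousarray(data[sample_idx]))

# 2. Stream the full dataset through the GPU in batches, predicting the
#    nearest centroid (i.e. the IVF list) for every vector.
assignments = np.empty(len(data), dtype=np.int32)
batch = 1_000_000
for start in range(0, len(data), batch):
    chunk = np.ascontiguousarray(data[start:start + batch])
    assignments[start:start + batch] = km.predict(chunk)
```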

Once you have all the centroid assignments, you have your IVF lists and you can do the quantization as needed. Are you planning to do product quantization here, or a different kind of quantization? We have a scalable version of IVF-PQ, which can be trained on a subset of the vectors; the remaining vectors can then be added to the index in batches without having to copy them all to device memory at the same time.
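
A sketch of that scalable IVF-PQ pattern (train on a subset, then extend in batches), based on the cuvs `ivf_pq` Python API; exact parameter names vary by release, and `vectors.f32` is again a placeholder:

```python
import numpy as np
from cuvs.neighbors import ivf_pq

dim = 768
data = np.memmap("vectors.f32", dtype=np.float32, mode="r").reshape(-1, dim)

# Train the IVF lists and PQ codebooks on a subsample only.
params = ivf_pq.IndexParams(n_lists=4096, add_data_on_build=False)
index = ivf_pq.build(params, np.ascontiguousarray(data[:2_000_000]))

# Add the full dataset in device-sized batches.
batch = 1_000_000
for start in range(0, len(data), batch):
    chunk = np.ascontiguousarray(data[start:start + batch])
    ids = np.arange(start, start + len(chunk), dtype=np.int64)
    index = ivf_pq.extend(index, chunk, ids)
```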

VoVAllen (Author) commented

@cjnolet We only need the centroids from the GPU KMeans part. The assignment and quantization parts will be done on the database side on the CPU, due to data-consistency issues with our current design. We can subsample for KMeans, but we feel the L4's 24 GB is too small for the sizes we're targeting, and the other GPU instances on AWS all come with 8 GPUs, which is too expensive and exceeds our needs. It would be nice if cuVS could support mini-batch KMeans or out-of-core KMeans, something like the sketch below.
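
For illustration, an out-of-core mini-batch k-means can be hand-rolled on a single GPU with CuPy, keeping only one batch resident on the device at a time (a Sculley-style update; this is a workaround sketch, not an existing cuVS API):

```python
import cupy as cp
import cupyx
import numpy as np

def minibatch_kmeans(data, k, batch=50_000, epochs=3, seed=0):
    """data: host array (e.g. np.memmap); only one batch lives on the GPU."""
    rng = np.random.default_rng(seed)
    centroids = cp.asarray(data[rng.choice(len(data), k, replace=False)])
    counts = cp.zeros(k, dtype=cp.float32)
    for _ in range(epochs):
        for start in range(0, len(data), batch):
            x = cp.asarray(data[start:start + batch])  # host -> device copy
            # Squared distances via ||x||^2 - 2 x.c + ||c||^2 (no huge broadcast).
            d2 = (cp.sum(x * x, 1)[:, None] - 2.0 * x @ centroids.T
                  + cp.sum(centroids * centroids, 1)[None, :])
            assign = d2.argmin(axis=1)
            # Per-centroid running-mean update with learning rate 1/counts.
            bc = cp.bincount(assign, minlength=k).astype(cp.float32)
            sums = cp.zeros_like(centroids)
            cupyx.scatter_add(sums, assign, x)
            counts += bc
            nz = bc > 0
            lr = (bc[nz] / counts[nz])[:, None]
            centroids[nz] += lr * (sums[nz] / bc[nz][:, None] - centroids[nz])
    return cp.asnumpy(centroids)
```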

cjnolet (Member) commented Sep 26, 2024

We have talked about supporting those. We have a pretty ambitious roadmap for the remainder of the year, but we definitely plan on supporting a more scalable k-means solution in the new year.

cjnolet (Member) commented Sep 27, 2024

@VoVAllen we do have a multi-GPU API being merged very soon that will allow for multi-GPU index building and search. It operates in two different modes: replication mode improves search throughput by replicating shards across multiple GPUs and load-balancing queries across them, while sharding mode improves scale by training different indexes on different GPUs and broadcasting each query to all of them during search.

Here is the PR for awareness: #231
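
Conceptually, sharding mode amounts to searching every shard and merging the per-shard top-k results. A hand-rolled illustration (not the multi-GPU API from #231; `search_fn` is a placeholder for any per-shard search call):

```python
import numpy as np

def search_sharded(shards, queries, k):
    """shards: list of (search_fn, id_offset) pairs, where search_fn(queries, k)
    returns (distances, local_ids) for that shard's slice of the dataset."""
    dists, ids = [], []
    for search_fn, offset in shards:
        d, i = search_fn(queries, k)      # broadcast the queries to this shard
        dists.append(d)
        ids.append(i + offset)            # map shard-local ids to global ids
    d = np.concatenate(dists, axis=1)     # shape: (n_queries, k * n_shards)
    i = np.concatenate(ids, axis=1)
    order = np.argsort(d, axis=1)[:, :k]  # keep the global top-k per query
    rows = np.arange(d.shape[0])[:, None]
    return d[rows, order], i[rows, order]
```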

VoVAllen (Author) commented

@cjnolet Thanks. In terms of my question, I think the better way to solve it would be some kind of mini-batch KMeans algorithm, or leveraging CPU memory via CUDA unified memory. There may be some performance trade-offs, but it scales more easily. And in real-world cases, an H100 is usually overkill for the vector-search scenario; an inference-class GPU like the L4 is more favorable due to cost and cloud availability.

cjnolet (Member) commented Oct 1, 2024

Hi @VoVAllen,

We have found managed/unified memory for oversubscription to not work particularly well in the ML space in general, because of the thrashing that occurs when memory is constantly moved back and forth between host and device. In fact, when oversubscribed by 2x or more, we have even seen deadlocks in many of these algorithms, simply because by their very nature they need the whole dataset on the device at the same time. This is one of the reasons we don't suggest this approach for these types of algorithms, and the main reason we prefer batching techniques instead.

We have received fairly overwhelming interest in a more scalable batched k-means method over the course of the year, so we are going to prioritize it for the next release or two.
