Move away from file watcher for releasing memory #1885

Open
jmazanec15 opened this issue Jul 25, 2024 · 18 comments

@jmazanec15
Member

Description

Right now, we free memory for segments whose files get deleted by having a resource watcher watch for those files:

  1. Init of resource watcher - https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/plugin/KNNPlugin.java#L196
  2. File change watcher initialization - https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/memory/NativeMemoryLoadStrategy.java#L84
  3. File watcher - https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/memory/NativeMemoryLoadStrategy.java#L97

This file watching can actually be removed entirely, which would help simplify the code (less static initialization, decoupling the cache manager from the native memory load strategy, moving away from the tight coupling to FSDirectory, etc.). Additionally, the way we do it now is buggy - see #1012.

To fix this, we just need to ensure that for both our DocValuesProducer and our KnnVectorsReader, when close is called, we evict any indices from memory (similar to what we do now with the file watcher). We would need to implement a really lightweight DocValuesProducer that delegates everything.
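A minimal sketch of what such a delegating producer might look like (the class name, the cache key, and the invalidate call are illustrative assumptions, not the plugin's actual API):

import java.io.IOException;

import org.apache.lucene.codecs.DocValuesProducer;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.index.SortedSetDocValues;

// Hypothetical sketch: delegate every read, and evict this segment's native
// indices from the cache when the producer is closed.
public class EvictingDocValuesProducer extends DocValuesProducer {
    private final DocValuesProducer delegate;
    private final String cacheKey; // hypothetical: whatever key the cache manager uses for this segment

    public EvictingDocValuesProducer(DocValuesProducer delegate, String cacheKey) {
        this.delegate = delegate;
        this.cacheKey = cacheKey;
    }

    @Override
    public NumericDocValues getNumeric(FieldInfo field) throws IOException {
        return delegate.getNumeric(field);
    }

    @Override
    public BinaryDocValues getBinary(FieldInfo field) throws IOException {
        return delegate.getBinary(field);
    }

    @Override
    public SortedDocValues getSorted(FieldInfo field) throws IOException {
        return delegate.getSorted(field);
    }

    @Override
    public SortedNumericDocValues getSortedNumeric(FieldInfo field) throws IOException {
        return delegate.getSortedNumeric(field);
    }

    @Override
    public SortedSetDocValues getSortedSet(FieldInfo field) throws IOException {
        return delegate.getSortedSet(field);
    }

    @Override
    public void checkIntegrity() throws IOException {
        delegate.checkIntegrity();
    }

    @Override
    public void close() throws IOException {
        try {
            delegate.close();
        } finally {
            // Hypothetical eviction call; the real cache manager API may differ.
            NativeMemoryCacheManager.getInstance().invalidate(cacheKey);
        }
    }
}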

Related issues

#1012

@luyuncheng
Collaborator

luyuncheng commented Jul 26, 2024

Hi @jmazanec15, I am writing a DocValuesProducer for KNN80Codec to save storage. I like this issue very much - it can release memory efficiently. I can contribute my DocValuesProducer if you like.

@heemin32
Collaborator

Is the lifecycle of the native memory (the native k-NN index) the same as the lifecycle of the DocValuesProducer or KnnVectorsReader? I thought the native memory stays loaded until either we clear the cache or the index becomes unusable (deleted or closed).

Is close called on either the DocValuesProducer or the KnnVectorsReader when the index is closed or deleted?

@jmazanec15
Member Author

> Hi @jmazanec15, I am writing a DocValuesProducer for KNN80Codec to save storage. I like this issue very much - it can release memory efficiently. I can contribute my DocValuesProducer if you like.

Awesome! What custom functionality did you add in DVProducer to reduce storage? I think the change for this issue would be quite small so I do not see it conflicting.

> Is the lifecycle of the native memory (the native k-NN index) the same as the lifecycle of the DocValuesProducer or KnnVectorsReader?

DocValuesProducer and KnnVectorsReader are both per-segment. The two producer/reader classes are opened (or shared) in the SegmentReader constructor: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentReader.java

On close, the close methods will eventually be called:

  1. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentReader.java#L220-L231
  2. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentCoreReaders.java#L172-L179

This reader will be required in the search path: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNWeight.java#L206

@luyuncheng
Collaborator

> Awesome! What custom functionality did you add in DVProducer to reduce storage? I think the change for this issue would be quite small so I do not see it conflicting.

@jmazanec15

In #1571 we can save the storage for the _source stored field.

And I think we can save the docValues storage as described in #1571 (comment).

We create two types of docValues in the Lucene engine. We can also get docValues from the native engines: in Faiss we can use the reconstruct_n function to get the i-th vector values from the flat storage, which would keep them in memory after the index is opened.
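For illustration, a rough sketch of reading vectors back through a JNI binding over Faiss's reconstruct_n; the Java class and native method here are hypothetical, not an existing k-NN API:

// Hypothetical sketch: read docIds [first, first + count) back from a Faiss
// index that is already loaded in native memory, instead of keeping a
// separate flat-vector file in the segment.
public class FaissVectorReconstructor {
    // Assumed binding over faiss::Index::reconstruct_n(idx_t i0, idx_t ni, float* recons);
    // the real k-NN JNI layer would need a method like this added.
    private static native void reconstructN(long indexPointer, long first, long count, float[] out);

    public static float[] reconstruct(long indexPointer, long first, long count, int dimension) {
        float[] out = new float[(int) count * dimension];
        reconstructN(indexPointer, first, count, out);
        return out;
    }
}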

@heemin32
Collaborator

@jmazanec15 Right. However, are the producer/reader classes singletons that stay open until the index is closed or deleted?
Otherwise, the close method might get called even when we don't want the native memory cache to be evicted. Or do we want to reload the cache when new producer/reader classes are created?

@jmazanec15
Member Author

@luyuncheng that's interesting. I think @navneet1v may have been thinking about that. We are in the process of migrating away from custom doc values to KNNVectorFormat - see #1855. Would you be able to do it with the KNNVectorReader instead of doc values?

@jmazanec15
Member Author

@heemin32 Oh, I think I see what you are saying - will there be multiple readers/producers per segment/search? I think the purpose of SegmentCoreReaders is to ensure that they can be shared.

Worth noting that for Lucene, the input is closed as well (see https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L325).

@heemin32
Collaborator

> @heemin32 Oh, I think I see what you are saying - will there be multiple readers/producers per segment/search? I think the purpose of SegmentCoreReaders is to ensure that they can be shared.

Sharing is good, but the other question is whether the reader gets closed and opened repeatedly regardless of whether the segment file still exists.

@jmazanec15
Member Author

> closed and opened repeatedly regardless of whether the segment file still exists

I'm not sure why this would happen, but we need to be certain before making the change.

@heemin32
Collaborator

I don't know how it works internally, but based on this comment, it seems like multiple instances of SegmentReader can be created: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentReader.java#L128

/**
   * Create new SegmentReader sharing core from a previous SegmentReader and using the provided
   * liveDocs, and recording whether those liveDocs were carried in ram (isNRT=true).
   */

@navneet1v
Collaborator

@jmazanec15 thanks for opening the issue - I already see some great ideas on it. My take:

  1. @luyuncheng I would really like to see how we can use the vectors present in the Faiss index rather than having a separate file in the segment for flat vector storage, and also how much latency there would be in mapping those vectors back to the heap for, say, exact search. As @jmazanec15 already mentioned, we are in the process of moving to KNNVectorFormat by 2.17 (RFC: [RFC] Integrating KNNVectorsFormat in Native Vector Search Engine #1853), so please have a look at that reader code. The skeleton was added by the PR Reuse KNNVectorFieldData for reduce disk usage #1571; I will be adding more in the coming weeks.
  2. @heemin32 on whether there will be multiple instances of the DVP, I think the answer is no. Lucene maintains a readerPool on its end to ensure that it hands out the same reader to whoever asks for it (check the code here, here). But if that turns out not to be true, we can handle the case easily via refCount. We can run a small experiment to validate, since the Lucene code gives mixed signals.
  3. On opening and closing the DVP multiple times, the answer is again no. A DVP gets opened when its segment is created and closed when the segment is deleted; a DVP also gets opened/closed when you do a refresh.

@jmazanec15
Member Author

> But if that turns out not to be true, we can handle the case easily via refCount

Right, refCounting would prevent issues, but make removal a little trickier.
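For illustration, a sketch of the kind of ref-counting being discussed (all names here are hypothetical); the removal trickiness shows up in decRef, which must free the native memory only when the last reference is released:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: guard a native allocation with a reference count so a
// concurrent search holding a reference keeps the memory alive across close().
public class RefCountedNativeIndex {
    private final AtomicInteger refCount = new AtomicInteger(1); // creator holds the initial ref
    private final long nativeAddress;

    public RefCountedNativeIndex(long nativeAddress) {
        this.nativeAddress = nativeAddress;
    }

    public void incRef() {
        refCount.incrementAndGet();
    }

    public void decRef() {
        if (refCount.decrementAndGet() == 0) {
            // Last reference released: now it is safe to free the native memory.
            freeNativeMemory(nativeAddress); // hypothetical JNI call
        }
    }

    private static void freeNativeMemory(long address) {
        // would call into the k-NN JNI layer; omitted here
    }
}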

@kotwanikunal
Member

kotwanikunal commented Aug 12, 2024

Read through the issue here - refCount would be a solution, but it is not a direct solution to this problem.
Going through the code, it looks like we want the cache manager to process index lifecycle events.

My proposal would be to separate the concerns and instead follow this pattern to handle lifecycle events (the method names are hypothetical; we can see which events fit best for k-NN):

public void onIndexModule(IndexModule module) {
    module.addSettingsUpdateConsumer(INDEX_KNN_ALGO_PARAM_EF_SEARCH_SETTING, newVal -> {
        logger.debug("The value of [KNN] setting [{}] changed to [{}]", KNN_ALGO_PARAM_EF_SEARCH, newVal);
        // TODO: replace cache-rebuild with index reload into the cache
        NativeMemoryCacheManager.getInstance().rebuildCache();
    });
    module.addIndexEventListener(new NativeCacheIndexEventListener());
}

static class NativeCacheIndexEventListener implements IndexEventListener {
    @Override
    public void beforeIndexShardClosed(ShardId shardId, IndexShard indexShard, Settings indexSettings) {
        NativeMemoryCacheManager.getInstance().shardClosed();
    }

    @Override
    public void shardRoutingChanged(IndexShard indexShard, ShardRouting oldRouting, ShardRouting newRouting) {
        NativeMemoryCacheManager.getInstance().shardMovedAway();
    }

    @Override
    public void afterIndexRemoved(Index index, IndexSettings indexSettings, IndicesClusterStateService.AllocatedIndices.IndexRemovalReason reason) {
        NativeMemoryCacheManager.getInstance().indexCleanup();
    }
}

Thoughts?

@jmazanec15
Member Author

@kotwanikunal My issue with the index lifecycle approach is that we would also need to keep the file watcher code, which I don't like because it is a direct dependency on the file system and the Directory implementation. If we can properly ref-count and free in the segment DocValuesProducer code, then we can solve both issues in one go.

@kotwanikunal
Member

> @kotwanikunal My issue with the index lifecycle approach is that we would also need to keep the file watcher code, which I don't like because it is a direct dependency on the file system and the Directory implementation. If we can properly ref-count and free in the segment DocValuesProducer code, then we can solve both issues in one go.

I see the problem now. We have to track at the file level, not the shard level. I was wondering if we could simply filter cache entries on receiving shard lifecycle events and evict them (skipping the file watcher altogether) - but the cache operates at a more granular level.
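For what it's worth, a sketch of that filtering idea, assuming (hypothetically) that cache entries are keyed by segment file path and that a shard's data path prefix can be derived from the lifecycle event:

import com.google.common.cache.Cache;

// Hypothetical sketch: evict every cache entry belonging to a shard by
// filtering keys, assuming keys are absolute segment file paths.
public class ShardEvictionHelper {
    private final Cache<String, Object> cache; // stand-in for the native memory cache

    public ShardEvictionHelper(Cache<String, Object> cache) {
        this.cache = cache;
    }

    public void evictShard(String shardDataPathPrefix) {
        for (String key : cache.asMap().keySet()) {
            if (key.startsWith(shardDataPathPrefix)) {
                cache.invalidate(key);
            }
        }
    }
}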

@luyuncheng
Collaborator

> I see the problem now. We have to track at the file level, not the shard level. I was wondering if we could simply filter cache entries on receiving shard lifecycle events and evict them (skipping the file watcher altogether) - but the cache operates at a more granular level.

@kotwanikunal when we get shard close, file remove, or segment merge events, the memory can be released at the producer level by calling close. And I think we need refCount, as #1885 (comment) and #1885 (comment) say: when a shard event happens before a segment query in concurrent search scenarios, the refCount helps avoid a segmentation fault.

I think a DocValuesProducer is a simple way to do this, as the issue says:

> We would need to implement a really lightweight DocValuesProducer that delegates everything.

@0ctopus13prime
Contributor

Before cutting out the absolute file path dependency in KNN, #2033 should be resolved first, ahead of this issue.

@navneet1v added the Enhancements label (Increases software capabilities beyond original client specifications) on Sep 19, 2024
Labels: Enhancements (Increases software capabilities beyond original client specifications)
Projects: Backlog (Hot)
7 participants