
Create a prototype/design for a hybrid local/fetch-on-demand directory #7331

Closed
andrross opened this issue Apr 29, 2023 · 4 comments
Labels: discuss, distributed framework, enhancement

Comments

@andrross
Member

More context here: #6528

Create a prototype/design for a hybrid directory that behaves as a local directory when complete files are present on disk, but can fall back to block-based on-demand fetching when data is requested that is not present locally. One of the key goals of this task is to determine the delta between a read-only variant and a fully writable variant, so we can make a call as to whether the read-only variant is a worthwhile incremental step. (I think the writable variant will increase the scope by requiring reconciling the separate Store approach used by remote-backed indexes, as well as bringing in complexities around notifying replicas of changes.)

@andrross added the enhancement, untriaged, and distributed framework labels and removed the untriaged label on Apr 29, 2023
@neetikasinghal
Contributor

I will be working on this.

@neetikasinghal
Contributor

neetikasinghal commented Jun 22, 2023

Create a prototype/design for a hybrid local/fetch-on-demand directory

Overview

Create a prototype/design for a directory that behaves as a local directory when complete files are present on disk, but can fall back to block-based on-demand fetching when data is requested that is not present locally.
Issue: #7331

The goal is to add an option to remote-backed indices that lets the system intelligently offload index data before hitting disk watermarks: more frequently used and more-likely-to-be-used data is kept in faster local storage, while less frequently used data can be removed from local storage since the authoritative copy lives in the remote store. Any time a request is encountered for data not in local storage, it is re-hydrated into local storage on demand to serve the request. A further goal is an implementation that intelligently manages resources and provides a performant "warm" tier that can serve both reads and writes.
RFC: #6528

Common Terminologies

WHOLE files

A single Lucene file (see Lucene File Format Basics).

BLOCK file

Each Lucene file is mapped to logical blocks of constant size (currently 8 MB). A block file is the part of a Lucene file that has been downloaded to local disk. For example, say we have a 1 GB Lucene file 0001.cfs in S3 and the search node has only downloaded a few "parts" of this file. The node's local disk will then hold block files such as 0001.cfs._block.
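To make the mapping concrete, here is a minimal sketch of how a byte offset in a Lucene file could map to a block index and a block file name, assuming the 8 MB block size mentioned above; the class name and the exact block-file naming scheme are illustrative, not the actual implementation.

```java
// Sketch of the 8 MB block mapping described above; names are illustrative.
public final class BlockMathSketch {
    static final long BLOCK_SIZE = 8L * 1024 * 1024; // 8 MB logical blocks

    // Index of the block that contains the given byte offset of the Lucene file.
    static long blockIndex(long fileOffset) {
        return fileOffset / BLOCK_SIZE;
    }

    // On-disk name for a downloaded block, e.g. "0001.cfs._block.3"
    // (the exact naming scheme is an assumption).
    static String blockFileName(String luceneFileName, long blockIndex) {
        return luceneFileName + "._block." + blockIndex;
    }
}
```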

[Figure: Lucene file, logical mapping of the Lucene file to blocks, and the actual block files on disk]

FileCache

An in-memory key-value data structure that manages the lifecycle of all files on disk.
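As a rough illustration of that role, the sketch below models the FileCache as a path-keyed map with reference-counted entries; the interface and method names are assumptions, not the actual FileCache API.

```java
// Minimal sketch of the FileCache role: an in-memory key/value structure
// keyed by path that manages the lifecycle of on-disk files.
import java.nio.file.Path;

interface FileCacheSketch {
    CachedFile get(Path key);          // null if the file is not resident
    void put(Path key, CachedFile value);
    void remove(Path key);             // evict the entry and delete the backing file
    long prune(long bytesToFree);      // evict least-recently-used, unreferenced entries
}

interface CachedFile {
    long length();
    void incRef();                     // pin the file while a reader is using it
    void decRef();                     // unpinned entries become evictable
}
```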

Design

CompositeDirectory: a directory that has references to the FsDirectory, RemoteSegmentStoreDirectory and HybridDirectory. The hybrid directory implements all Directory APIs and knows how to read a block file as well as a non-block/whole file. The HybridDirectory is also responsible for downloading blocks/whole files from the remote store when a read targets a file that is not present in the FileCache.
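The skeleton below sketches this composition under the assumption that the CompositeDirectory simply holds the three directories and routes reads between the local and hybrid paths; the real class would extend Lucene's Directory and implement every API listed later in this design.

```java
// Rough skeleton of the composition described above (a sketch, not the actual class).
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class CompositeDirectorySketch {
    private final Directory fsDirectory;                 // local whole files not yet uploaded
    private final Directory remoteSegmentStoreDirectory; // authoritative remote copy (backs downloads)
    private final Directory hybridDirectory;             // block/whole-file on-demand reads

    public CompositeDirectorySketch(Directory fs, Directory remote, Directory hybrid) {
        this.fsDirectory = fs;
        this.remoteSegmentStoreDirectory = remote;
        this.hybridDirectory = hybrid;
    }

    // Reads are served locally when the file is untracked (still purely local);
    // otherwise they fall through to the HybridDirectory, which fetches on demand.
    public IndexInput openInput(String name, boolean trackedInFileTracker, IOContext context) throws IOException {
        return trackedInFileTracker ? hybridDirectory.openInput(name, context) : fsDirectory.openInput(name, context);
    }
}
```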


A shard file of a writable warm index can exist in the following places, with the corresponding file format:

  • FsDirectory (in NON_BLOCK format)
  • Remote (in NON_BLOCK format)
  • FileCache (in BLOCK/NON_BLOCK format)

Hence, a shard file can be tracked with the help of a FileTracker, which is updated when the file moves from one state to another. The FileTracker can also store metadata for the files it tracks, such as the file size and the most recent IOContext or file operation, to support multiple use cases.
The FileTracker can be reinitialized after a node reboot or node failure by checking the files in the FsDirectory and the files in the FileCache.
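A minimal sketch of such a FileTracker, assuming a concurrent map from file name to a small metadata entry (state, file type, length, last IOContext); the field and method names are illustrative, not the actual implementation.

```java
// Sketch of a FileTracker: file name -> state and metadata, updated as the
// file moves between disk, cache and remote.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.store.IOContext;

public class FileTrackerSketch {
    public enum State { DISK, CACHE, REMOTE_ONLY }
    public enum FileType { BLOCK, NON_BLOCK }

    public static final class Entry {
        volatile State state;
        volatile FileType fileType;
        volatile long fileLength;
        volatile IOContext lastContext;   // most recent IOContext seen for the file
    }

    private final Map<String, Entry> files = new ConcurrentHashMap<>();

    public boolean contains(String name) { return files.containsKey(name); }
    public Entry get(String name) { return files.get(name); }
    public void put(String name, Entry entry) { files.put(name, entry); }
    public void remove(String name) { files.remove(name); }
}
```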

Now, let's look at the different states of the FileTracker:

[Figure: FileTracker states]

A more detailed FSM of the FileTracker is as follows:

[Figure: Detailed FileTracker FSM]

The states of the FileTracker are as follows:
DISK: the file is present on disk and has not been backed up to the remote store.
CACHE: the file is present on disk and backed up to the remote store.
REMOTE_ONLY: the file has been evicted from the cache and is only present in the remote store.
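The transitions implied by the FSM could look roughly like the sketch below; the event names (upload complete, eviction, download) are assumptions used only for illustration.

```java
// Hedged sketch of the FileTracker state transitions implied by the FSM above.
enum FileState { DISK, CACHE, REMOTE_ONLY }

final class FileStateMachineSketch {
    // File fully uploaded to the remote store: DISK -> CACHE.
    static FileState onRemoteUploadComplete(FileState current) {
        return current == FileState.DISK ? FileState.CACHE : current;
    }

    // File evicted from the local FileCache: CACHE -> REMOTE_ONLY.
    static FileState onEvictedFromCache(FileState current) {
        return current == FileState.CACHE ? FileState.REMOTE_ONLY : current;
    }

    // File re-downloaded (blocks or whole file) to serve a read: REMOTE_ONLY -> CACHE.
    static FileState onDownloadedFromRemote(FileState current) {
        return current == FileState.REMOTE_ONLY ? FileState.CACHE : current;
    }
}
```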

Write to the disk
Initially, when a new shard is created, the corresponding shard files are present in the FsDirectory as whole files.
No entry is added to the FileTracker; these are the files that have not yet been backed up to the remote store.

FileTracker Write flow

[Figure: FileTracker write flow]

A file of a writable warm index is added to the file tracker with the CACHE state and file type NON_BLOCK only once it has been completely uploaded to the remote store. During the transient period between the file being created and being uploaded to the remote store, the file should be read from the FSDirectory and is therefore not added to the file tracker.
Hence, only files that have been added to the file tracker with the CACHE state are eligible for eviction. This prevents files that have not been uploaded to the remote store, and that hold no reference count, from being evicted.

How do we track partial file uploads?
Files are moved to the CACHE state only when they have been completely uploaded to the remote store; otherwise they are not added to the file tracker.

What happens on a failed upload to the remote store?
The file is not added to the file tracker.

What about deletes from the remote store?
When a file is deleted from the remote store, the file needs to be removed from the file tracker.
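Putting the three answers together, a hedged sketch of the upload/delete bookkeeping might look as follows; the listener and callback names are hypothetical, not the actual remote-store hooks.

```java
// Sketch of the upload/delete bookkeeping described above.
final class UploadListenerSketch {

    interface TrackerOps {
        void markCache(String fileName, long length);   // state = CACHE, type = NON_BLOCK
        void remove(String fileName);
    }

    private final TrackerOps fileTracker;

    UploadListenerSketch(TrackerOps fileTracker) { this.fileTracker = fileTracker; }

    // Called only after the remote store confirms the whole file is uploaded.
    void onUploadSuccess(String fileName, long length) {
        fileTracker.markCache(fileName, length);   // now eligible for eviction
    }

    // Failed or partial uploads leave the file out of the tracker, so it can
    // only be read from the FSDirectory and can never be evicted.
    void onUploadFailure(String fileName) {
        // intentionally a no-op: no tracker entry is created
    }

    // A delete in the remote store drops the tracker entry as well.
    void onRemoteDelete(String fileName) {
        fileTracker.remove(fileName);
    }
}
```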


FileTracker Read flow

Search
After eviction from disk, a search request on the shard can trigger a download of files from the remote store. The entries for these files are updated with FileType as BLOCK and FileLocation as FileCache.
The next search request for the same shard is then served by the FileCache.

Indexing
After eviction from disk, an indexing request on the shard can trigger a download of files from the remote store. The downloaded files can be whole files or block files depending on the type of indexing request. Since merges operate on complete files, whole files can be downloaded for merges, whereas for updates/append-only requests block downloads should suffice. The IOContext can be used to determine the type of operation and hence the type of file download. The corresponding entries in the file tracker can be updated with FileType as BLOCK/NON_BLOCK (depending on the operation) and FileLocation as IndexingFileCache.
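Since the IOContextDecider is only proposed here, the following is just a minimal sketch of the idea, assuming Lucene's IOContext.Context is available to distinguish merges from other reads.

```java
// Minimal sketch of the IOContextDecider idea: merges read complete files,
// so download the whole file; everything else can be served from 8 MB blocks.
import org.apache.lucene.store.IOContext;

final class IOContextDeciderSketch {
    enum FileType { BLOCK, NON_BLOCK }

    static FileType decide(IOContext context) {
        // Whole-file download for merges; block downloads otherwise.
        return context.context == IOContext.Context.MERGE ? FileType.NON_BLOCK : FileType.BLOCK;
    }
}
```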

We can always download block files initially; downloading whole files depending on the operation and/or file type can be added later as an optimization.
Let's look at the read flow that covers all the use cases for search and indexing:

[Figure: Read flow]

When a read request lands on the hybrid directory, it checks the following for the file:

  • Not present in the file tracker: add the file to the file tracker and to the file cache, and return it from the file cache.
  • Present in the file tracker:
    • If the file tracker state is REMOTE_ONLY:
      • Use the IOContextDecider to decide the type of file to download and set the file type to BLOCK/NON_BLOCK.
    • If the file tracker file type is NON_BLOCK:
      • Read via the OnDemandNonBlockSearchIndexInput: read from the cache, else download the whole file from the remote store.
    • Else, if the file tracker file type is BLOCK:
      • Read via the OnDemandBlockSearchIndexInput: read from the cache, else download block files from the remote store.

On eviction from the file cache, the file tracker state is updated to REMOTE_ONLY.
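A compact sketch of this read-path decision logic is shown below; the Tracker and Readers interfaces are simplified stand-ins for the FileTracker, the two IndexInput variants, and the IOContextDecider.

```java
// Hedged sketch of the read path described above.
import java.io.IOException;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

final class ReadPathSketch {
    enum State { CACHE, REMOTE_ONLY }    // untracked (purely local) files never reach this path
    enum FileType { BLOCK, NON_BLOCK }

    interface Tracker {
        boolean contains(String name);
        void add(String name, State state, FileType type);
        State state(String name);
        FileType fileType(String name);
        void setState(String name, State state);
        void setFileType(String name, FileType type);
    }

    interface Readers {
        IndexInput openNonBlock(String name, IOContext ctx) throws IOException; // OnDemandNonBlockSearchIndexInput
        IndexInput openBlock(String name, IOContext ctx) throws IOException;    // OnDemandBlockSearchIndexInput
        FileType decide(IOContext ctx);                                         // IOContextDecider
    }

    static IndexInput openInput(String name, IOContext ctx, Tracker tracker, Readers readers) throws IOException {
        if (!tracker.contains(name)) {
            // Not tracked yet: add it to the tracker (and the file cache) and serve from the cache.
            tracker.add(name, State.CACHE, readers.decide(ctx));
        } else if (tracker.state(name) == State.REMOTE_ONLY) {
            // Previously evicted: decide how to rehydrate (blocks vs. whole file) before reading.
            tracker.setFileType(name, readers.decide(ctx));
            tracker.setState(name, State.CACHE);
        }
        return tracker.fileType(name) == FileType.NON_BLOCK
            ? readers.openNonBlock(name, ctx)   // reads from the cache, else downloads the whole file
            : readers.openBlock(name, ctx);     // reads from the cache, else downloads blocks on demand
    }
}
```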

The FileTracker can be part of the CompositeDirectory, which contains the FSDirectory, HybridDirectory, and RemoteSegmentStoreDirectory. On the read flow, the call lands on the CompositeDirectory and is delegated to the HybridDirectory, which initializes an OnDemandNonBlockSearchIndexInput or OnDemandBlockSearchIndexInput based on the FileTracker file type. OnDemandBlockSearchIndexInput is the IndexInput subclass used by the RemoteSearchDirectory (defined in "Implement prototype remote store directory/index input for search"); it serves read requests by checking whether the required files are present in the FileCache and returning them from there, else downloading the files block by block from the remote store and then returning them.
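For illustration only, the sketch below shows the core of such a block read path: locate the block covering the requested offset, use the cached copy if present, and otherwise download just that block. The RemoteStore interface and the block naming are assumptions, not the actual OnDemandBlockSearchIndexInput implementation.

```java
// Simplified sketch of the block read path described above.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class BlockReaderSketch {
    static final long BLOCK_SIZE = 8L * 1024 * 1024;

    interface RemoteStore {
        // Downloads bytes [offset, offset + length) of the remote file to dest
        // (the final block of a file may be shorter; elided here).
        void downloadRange(String fileName, long offset, long length, Path dest) throws IOException;
    }

    // Returns the local path of the block containing fileOffset, fetching it on a cache miss.
    static Path blockFor(String fileName, long fileOffset, Path cacheDir, RemoteStore remote) throws IOException {
        long blockIndex = fileOffset / BLOCK_SIZE;
        Path blockFile = cacheDir.resolve(fileName + "._block." + blockIndex); // naming is illustrative
        if (!Files.exists(blockFile)) {
            remote.downloadRange(fileName, blockIndex * BLOCK_SIZE, BLOCK_SIZE, blockFile);
        }
        return blockFile;
    }
}
```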

The class diagram looks as follows:

[Figure: Class diagram]

CompositeDirectory would expose the following APIs and delegate calls to the FSDirectory/RemoteSegmentStoreDirectory/HybridDirectory:
listAll
Lists the names of all distinct segment files present in the FileTracker with FileState as CACHE or REMOTE_ONLY.

deleteFile
Deletes the file depending on its location as recorded in the file tracker.
If the file is not present in the file tracker, the call is delegated to the FSDirectory.
If the file tracker state is REMOTE_ONLY, the call is delegated to the RemoteSegmentStoreDirectory.
If the file tracker state is CACHE, the call is delegated to the HybridDirectory.

fileLength
Returns the actual length of the file by checking the file tracker.

createOutput/ createTempOutput
Call is always delegated to the FSDirectory.

afterRefresh
Call is delegated to the RemoteSegmentStoreDirectory to upload files to the remote store. Once the upload succeeds, the call is delegated to the HybridDirectory's createOutput function to add the file to the file cache. The file is then also added to the FileTracker with FileState as CACHE.

sync
Call is delegated to FSDirectory for NON_BLOCK files; this is a no-op for BLOCK files.

syncMetaData
Call is delegated to FSDirectory for NON_BLOCK files; this is a no-op for BLOCK files.

rename
Call is delegated to FSDirectory for NON_BLOCK files; for BLOCK files an exception should be thrown. Rename is expected to be called only for new writes.

openInput
Opens a file for read.
If the file is not present in the file tracker, the call is delegated to the FSDirectory.
Else, if the file tracker state is REMOTE_ONLY, the IOContextDecider is used to update the FileTracker entry with FileType as BLOCK/NON_BLOCK.
The call is then delegated to the HybridDirectory.

obtainLock
For NON_BLOCK files the call is delegated to FSDirectory; for BLOCK files this is a no-op.

close
If the file is not present in the file tracker, the call is delegated to the FSDirectory.
If the file tracker state is CACHE, the call is delegated to the HybridDirectory.

getPendingDeletions
Call is always delegated to FSDirectory.
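As a quick illustration of the delegation rules above, here is a hedged sketch of deleteFile; the Tracker interface is a simplified stand-in for the FileTracker, and the Directory handles stand in for the concrete directory classes.

```java
// Sketch of the deleteFile delegation rules listed above.
import java.io.IOException;
import org.apache.lucene.store.Directory;

final class DeleteDelegationSketch {
    enum State { DISK, CACHE, REMOTE_ONLY }

    interface Tracker {
        boolean contains(String name);
        State state(String name);
    }

    static void deleteFile(String name, Tracker tracker, Directory fsDirectory,
                           Directory remoteSegmentStoreDirectory, Directory hybridDirectory) throws IOException {
        if (!tracker.contains(name)) {
            fsDirectory.deleteFile(name);                 // never uploaded: the file is purely local
        } else if (tracker.state(name) == State.REMOTE_ONLY) {
            remoteSegmentStoreDirectory.deleteFile(name); // only the remote copy exists
        } else {
            hybridDirectory.deleteFile(name);             // CACHE: evict the cached blocks/whole file
        }
    }
}
```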

HybridDirectory can use the FileTracker to decide whether to initialize an OnDemandNonBlockSearchIndexInput or an OnDemandBlockSearchIndexInput.

HybridDirectory would expose the following APIs:

listAll
Lists the names of all the distinct segment files that are present in the FileTracker with FileState as CACHE.

deleteFile
Deletes the file depending on its location in the file tracker.
If the file is present in any of the file caches, it is deleted from the file cache and the FileTracker entry is then updated with FileState as REMOTE_ONLY.

fileLength
Returns the actual length of the file by checking the file tracker.

createOutput/createTempOutput
Adds the file to the file cache. Expected to be called for files that have been uploaded to the remote store.

openInput
Opens a file for read.
If the file tracker file type is BLOCK, open an OnDemandBlockSearchIndexInput; else open an OnDemandNonBlockSearchIndexInput.

close
Deletes this directory's files from the FileTracker and also initiates a complete cleanup (in-memory state, file handles, disk) of the FileCache.

@neetikasinghal
Contributor

Prototype: main...neetikasinghal:OpenSearch:new-CD

@anasalkouz added the discuss label on Jun 22, 2023
@andrross
Member Author

This work is being tracked in #8446
