Skip to content

Commit

Permalink
Restore CacheFileRegion refcounting for writes (elastic#102843)
Browse files Browse the repository at this point in the history
In elastic#98241 we removed the refcounting around write handler 
in SharedBytes.IO. But recently we saw wrong bytes being read 
from the snapshot file under heavy evictions and investigation 
shows that the bytes belonged to another cached file.

Low level logging (hard to reproduce) shows that writes and 
reads using the same SharedBytes.IO instance but for different 
cache file region could be interleaved, so that bytes in shared 
cache could be overwritten and the last read would read (and 
store in internal index input buffers) bytes from a different file:

Thread[elasticsearch[node_t0][stateless_shard][T#4],5,TGRP-IndexCorruptionIT]: 10485760 bytes written using SharedBytes$IO@dc07632 (230716978) for FileCacheKey[shardId=[index-0][0], primaryTerm=1, fileName=stateless_commit_26]
Thread[elasticsearch[node_t0][stateless_shard][T#3],5,TGRP-IndexCorruptionIT]: 10485760 bytes written using  SharedBytes$IO@dc07632 (230716978) for FileCacheKey[shardId=[index-0][0], primaryTerm=1, fileName=stateless_commit_16]
Thread[elasticsearch[node_t0][stateless_shard][T#4],5,TGRP-IndexCorruptionIT]: 375 bytes read using SharedBytes$IO@dc07632 (230716978) for key FileCacheKey[shardId=[index-0][0], primaryTerm=1, fileName=stateless_commit_26]

This change fixes resfcounting around the write handler so 
that the IO instance is decref after bytes are fully written.

Relates elastic#98241
  • Loading branch information
tlrx committed Dec 8, 2023
1 parent 479fcef commit 619fd64
Show file tree
Hide file tree
Showing 2 changed files with 24 additions and 13 deletions.
5 changes: 5 additions & 0 deletions docs/changelog/102843.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 102843
summary: Restore `SharedBytes.IO` refcounting on reads & writes
area: Snapshot/Restore
type: bug
issues: []
Original file line number Diff line number Diff line change
Expand Up @@ -942,25 +942,31 @@ void populateAndRead(
}

private void fillGaps(Executor executor, RangeMissingHandler writer, List<SparseFileTracker.Gap> gaps) {
final var cacheFileRegion = CacheFileRegion.this;
for (SparseFileTracker.Gap gap : gaps) {
executor.execute(new AbstractRunnable() {

@Override
protected void doRun() throws Exception {
assert CacheFileRegion.this.hasReferences();
ensureOpen();
final int start = Math.toIntExact(gap.start());
var ioRef = io;
assert regionOwners.get(ioRef) == CacheFileRegion.this;
writer.fillCacheRange(
ioRef,
start,
start,
Math.toIntExact(gap.end() - start),
progress -> gap.onProgress(start + progress)
);
writeCount.increment();

if (cacheFileRegion.tryIncRef() == false) {
throw new AlreadyClosedException("File chunk [" + cacheFileRegion.regionKey + "] has been released");
}
try {
final int start = Math.toIntExact(gap.start());
var ioRef = io;
assert regionOwners.get(ioRef) == cacheFileRegion;
writer.fillCacheRange(
ioRef,
start,
start,
Math.toIntExact(gap.end() - start),
progress -> gap.onProgress(start + progress)
);
writeCount.increment();
} finally {
cacheFileRegion.decRef();
}
gap.onCompletion();
}

Expand Down

0 comments on commit 619fd64

Please sign in to comment.