Skip to content

Commit

Permalink
Steps toward deprecating implicit prefix seek, related fixes (#13026)
Browse files Browse the repository at this point in the history
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.

Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.

Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)

Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress

Pull Request resolved: #13026

Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.

Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).

Reviewed By: ltamasi

Differential Revision: D63147378

Pulled By: pdillinger

fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
  • Loading branch information
pdillinger authored and facebook-github-bot committed Sep 20, 2024
1 parent 5f4a8c3 commit a1a102f
Show file tree
Hide file tree
Showing 26 changed files with 325 additions and 128 deletions.
21 changes: 12 additions & 9 deletions db/arena_wrapped_db_iter.cc
Original file line number Diff line number Diff line change
Expand Up @@ -45,20 +45,23 @@ void ArenaWrappedDBIter::Init(
const SequenceNumber& sequence, uint64_t max_sequential_skip_in_iteration,
uint64_t version_number, ReadCallback* read_callback,
ColumnFamilyHandleImpl* cfh, bool expose_blob_index, bool allow_refresh) {
auto mem = arena_.AllocateAligned(sizeof(DBIter));
db_iter_ = new (mem) DBIter(
env, read_options, ioptions, mutable_cf_options, ioptions.user_comparator,
/* iter */ nullptr, version, sequence, true,
max_sequential_skip_in_iteration, read_callback, cfh, expose_blob_index);
sv_number_ = version_number;
read_options_ = read_options;
allow_refresh_ = allow_refresh;
memtable_range_tombstone_iter_ = nullptr;

if (!CheckFSFeatureSupport(env->GetFileSystem().get(),
FSSupportedOps::kAsyncIO)) {
read_options_.async_io = false;
}
read_options_.total_order_seek |= ioptions.prefix_seek_opt_in_only;

auto mem = arena_.AllocateAligned(sizeof(DBIter));
db_iter_ = new (mem) DBIter(env, read_options_, ioptions, mutable_cf_options,
ioptions.user_comparator,
/* iter */ nullptr, version, sequence, true,
max_sequential_skip_in_iteration, read_callback,
cfh, expose_blob_index);

sv_number_ = version_number;
allow_refresh_ = allow_refresh;
memtable_range_tombstone_iter_ = nullptr;
}

Status ArenaWrappedDBIter::Refresh() { return Refresh(nullptr); }
Expand Down
4 changes: 3 additions & 1 deletion db/column_family.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1201,8 +1201,10 @@ Status ColumnFamilyData::RangesOverlapWithMemtables(
read_opts.total_order_seek = true;
MergeIteratorBuilder merge_iter_builder(&internal_comparator_, &arena);
merge_iter_builder.AddIterator(super_version->mem->NewIterator(
read_opts, /*seqno_to_time_mapping=*/nullptr, &arena));
read_opts, /*seqno_to_time_mapping=*/nullptr, &arena,
/*prefix_extractor=*/nullptr));
super_version->imm->AddIterators(read_opts, /*seqno_to_time_mapping=*/nullptr,
/*prefix_extractor=*/nullptr,
&merge_iter_builder,
false /* add_range_tombstone_iter */);
ScopedArenaPtr<InternalIterator> memtable_iter(merge_iter_builder.Finish());
Expand Down
202 changes: 157 additions & 45 deletions db/db_bloom_filter_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,7 @@ TEST_P(DBBloomFilterTestDefFormatVersion, KeyMayExist) {
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
CreateAndReopenWithCF({"pikachu"}, options);

ASSERT_TRUE(!db_->KeyMayExist(ropts, handles_[1], "a", &value));
ASSERT_FALSE(db_->KeyMayExist(ropts, handles_[1], "a", &value));

ASSERT_OK(Put(1, "a", "b"));
bool value_found = false;
Expand All @@ -187,7 +187,7 @@ TEST_P(DBBloomFilterTestDefFormatVersion, KeyMayExist) {
uint64_t cache_added = TestGetTickerCount(options, BLOCK_CACHE_ADD);
ASSERT_TRUE(
db_->KeyMayExist(ropts, handles_[1], "a", &value, &value_found));
ASSERT_TRUE(!value_found);
ASSERT_FALSE(value_found);
// assert that no new files were opened and no new blocks were
// read into block cache.
ASSERT_EQ(numopen, TestGetTickerCount(options, NO_FILE_OPENS));
Expand All @@ -197,7 +197,7 @@ TEST_P(DBBloomFilterTestDefFormatVersion, KeyMayExist) {

numopen = TestGetTickerCount(options, NO_FILE_OPENS);
cache_added = TestGetTickerCount(options, BLOCK_CACHE_ADD);
ASSERT_TRUE(!db_->KeyMayExist(ropts, handles_[1], "a", &value));
ASSERT_FALSE(db_->KeyMayExist(ropts, handles_[1], "a", &value));
ASSERT_EQ(numopen, TestGetTickerCount(options, NO_FILE_OPENS));
ASSERT_EQ(cache_added, TestGetTickerCount(options, BLOCK_CACHE_ADD));

Expand All @@ -207,15 +207,15 @@ TEST_P(DBBloomFilterTestDefFormatVersion, KeyMayExist) {

numopen = TestGetTickerCount(options, NO_FILE_OPENS);
cache_added = TestGetTickerCount(options, BLOCK_CACHE_ADD);
ASSERT_TRUE(!db_->KeyMayExist(ropts, handles_[1], "a", &value));
ASSERT_FALSE(db_->KeyMayExist(ropts, handles_[1], "a", &value));
ASSERT_EQ(numopen, TestGetTickerCount(options, NO_FILE_OPENS));
ASSERT_EQ(cache_added, TestGetTickerCount(options, BLOCK_CACHE_ADD));

ASSERT_OK(Delete(1, "c"));

numopen = TestGetTickerCount(options, NO_FILE_OPENS);
cache_added = TestGetTickerCount(options, BLOCK_CACHE_ADD);
ASSERT_TRUE(!db_->KeyMayExist(ropts, handles_[1], "c", &value));
ASSERT_FALSE(db_->KeyMayExist(ropts, handles_[1], "c", &value));
ASSERT_EQ(numopen, TestGetTickerCount(options, NO_FILE_OPENS));
ASSERT_EQ(cache_added, TestGetTickerCount(options, BLOCK_CACHE_ADD));

Expand Down Expand Up @@ -2177,24 +2177,146 @@ TEST_F(DBBloomFilterTest, MemtableWholeKeyBloomFilterMultiGet) {

db_->ReleaseSnapshot(snapshot);
}
namespace {
std::pair<uint64_t, uint64_t> GetBloomStat(const Options& options, bool sst) {
if (sst) {
return {options.statistics->getAndResetTickerCount(
NON_LAST_LEVEL_SEEK_FILTER_MATCH),
options.statistics->getAndResetTickerCount(
NON_LAST_LEVEL_SEEK_FILTERED)};
} else {
auto hit = std::exchange(get_perf_context()->bloom_memtable_hit_count, 0);
auto miss = std::exchange(get_perf_context()->bloom_memtable_miss_count, 0);
return {hit, miss};
}
}

std::pair<uint64_t, uint64_t> HitAndMiss(uint64_t hits, uint64_t misses) {
return {hits, misses};
}
} // namespace

TEST_F(DBBloomFilterTest, MemtablePrefixBloomOutOfDomain) {
constexpr size_t kPrefixSize = 8;
const std::string kKey = "key";
assert(kKey.size() < kPrefixSize);
TEST_F(DBBloomFilterTest, MemtablePrefixBloom) {
Options options = CurrentOptions();
options.prefix_extractor.reset(NewFixedPrefixTransform(kPrefixSize));
options.prefix_extractor.reset(NewFixedPrefixTransform(4));
options.memtable_prefix_bloom_size_ratio = 0.25;
Reopen(options);
ASSERT_OK(Put(kKey, "v"));
ASSERT_EQ("v", Get(kKey));
std::unique_ptr<Iterator> iter(dbfull()->NewIterator(ReadOptions()));
iter->Seek(kKey);
ASSERT_FALSE(options.prefix_extractor->InDomain("key"));
ASSERT_OK(Put("key", "v"));
ASSERT_OK(Put("goat1", "g1"));
ASSERT_OK(Put("goat2", "g2"));

// Reset from other tests
GetBloomStat(options, false);

// Out of domain (Get)
ASSERT_EQ("v", Get("key"));
ASSERT_EQ(HitAndMiss(0, 0), GetBloomStat(options, false));

// In domain (Get)
ASSERT_EQ("g1", Get("goat1"));
ASSERT_EQ(HitAndMiss(1, 0), GetBloomStat(options, false));
ASSERT_EQ("NOT_FOUND", Get("goat9"));
ASSERT_EQ(HitAndMiss(1, 0), GetBloomStat(options, false));
ASSERT_EQ("NOT_FOUND", Get("goan1"));
ASSERT_EQ(HitAndMiss(0, 1), GetBloomStat(options, false));

ReadOptions ropts;
if (options.prefix_seek_opt_in_only) {
ropts.prefix_same_as_start = true;
}
std::unique_ptr<Iterator> iter(db_->NewIterator(ropts));
// Out of domain (scan)
iter->Seek("ke");
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(kKey, iter->key());
iter->SeekForPrev(kKey);
ASSERT_EQ("key", iter->key());
iter->SeekForPrev("kez");
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(kKey, iter->key());
ASSERT_EQ("key", iter->key());
ASSERT_EQ(HitAndMiss(0, 0), GetBloomStat(options, false));

// In domain (scan)
iter->Seek("goan");
ASSERT_OK(iter->status());
ASSERT_FALSE(iter->Valid());
ASSERT_EQ(HitAndMiss(0, 1), GetBloomStat(options, false));
iter->Seek("goat");
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ("goat1", iter->key());
ASSERT_EQ(HitAndMiss(1, 0), GetBloomStat(options, false));

// Changing prefix extractor should affect prefix query semantics
// and bypass the existing memtable Bloom filter
ASSERT_OK(db_->SetOptions({{"prefix_extractor", "fixed:5"}}));
iter.reset(db_->NewIterator(ropts));
// Now out of domain (scan)
iter->Seek("goan");
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ("goat1", iter->key());
ASSERT_EQ(HitAndMiss(0, 0), GetBloomStat(options, false));
// In domain (scan)
iter->Seek("goat2");
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ("goat2", iter->key());
ASSERT_EQ(HitAndMiss(0, 0), GetBloomStat(options, false));
// In domain (scan)
if (ropts.prefix_same_as_start) {
iter->Seek("goat0");
ASSERT_OK(iter->status());
ASSERT_FALSE(iter->Valid());
ASSERT_EQ(HitAndMiss(0, 0), GetBloomStat(options, false));
} else {
// NOTE: legacy prefix Seek may return keys outside of prefix
}

// Start a fresh new memtable, using new prefix extractor
ASSERT_OK(SingleDelete("key"));
ASSERT_OK(SingleDelete("goat1"));
ASSERT_OK(SingleDelete("goat2"));
ASSERT_OK(Flush());

ASSERT_OK(Put("key", "_v"));
ASSERT_OK(Put("goat1", "_g1"));
ASSERT_OK(Put("goat2", "_g2"));

iter.reset(db_->NewIterator(ropts));

// Still out of domain (Get)
ASSERT_EQ("_v", Get("key"));
ASSERT_EQ(HitAndMiss(0, 0), GetBloomStat(options, false));

// Still in domain (Get)
ASSERT_EQ("_g1", Get("goat1"));
ASSERT_EQ(HitAndMiss(1, 0), GetBloomStat(options, false));
ASSERT_EQ("NOT_FOUND", Get("goat11"));
ASSERT_EQ(HitAndMiss(1, 0), GetBloomStat(options, false));
ASSERT_EQ("NOT_FOUND", Get("goat9"));
ASSERT_EQ(HitAndMiss(0, 1), GetBloomStat(options, false));
ASSERT_EQ("NOT_FOUND", Get("goan1"));
ASSERT_EQ(HitAndMiss(0, 1), GetBloomStat(options, false));

// Now out of domain (scan)
iter->Seek("goan");
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ("goat1", iter->key());
ASSERT_EQ(HitAndMiss(0, 0), GetBloomStat(options, false));
// In domain (scan)
iter->Seek("goat2");
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ("goat2", iter->key());
ASSERT_EQ(HitAndMiss(1, 0), GetBloomStat(options, false));
// In domain (scan)
iter->Seek("goat0");
ASSERT_OK(iter->status());
ASSERT_FALSE(iter->Valid());
ASSERT_EQ(HitAndMiss(0, 1), GetBloomStat(options, false));
}

class DBBloomFilterTestVaryPrefixAndFormatVer
Expand Down Expand Up @@ -2507,7 +2629,11 @@ TEST_P(BloomStatsTestWithParam, BloomStatsTestWithIter) {
ASSERT_OK(Put(key1, value1, WriteOptions()));
ASSERT_OK(Put(key3, value3, WriteOptions()));

std::unique_ptr<Iterator> iter(dbfull()->NewIterator(ReadOptions()));
ReadOptions ropts;
if (options_.prefix_seek_opt_in_only) {
ropts.prefix_same_as_start = true;
}
std::unique_ptr<Iterator> iter(dbfull()->NewIterator(ropts));

// check memtable bloom stats
iter->Seek(key1);
Expand All @@ -2526,13 +2652,13 @@ TEST_P(BloomStatsTestWithParam, BloomStatsTestWithIter) {

iter->Seek(key2);
ASSERT_OK(iter->status());
ASSERT_TRUE(!iter->Valid());
ASSERT_FALSE(iter->Valid());
ASSERT_EQ(1, get_perf_context()->bloom_memtable_miss_count);
ASSERT_EQ(2, get_perf_context()->bloom_memtable_hit_count);

ASSERT_OK(Flush());

iter.reset(dbfull()->NewIterator(ReadOptions()));
iter.reset(dbfull()->NewIterator(ropts));

// Check SST bloom stats
iter->Seek(key1);
Expand All @@ -2550,7 +2676,7 @@ TEST_P(BloomStatsTestWithParam, BloomStatsTestWithIter) {

iter->Seek(key2);
ASSERT_OK(iter->status());
ASSERT_TRUE(!iter->Valid());
ASSERT_FALSE(iter->Valid());
ASSERT_EQ(1, get_perf_context()->bloom_sst_miss_count);
ASSERT_EQ(expected_hits, get_perf_context()->bloom_sst_hit_count);
}
Expand Down Expand Up @@ -2659,9 +2785,14 @@ TEST_F(DBBloomFilterTest, PrefixScan) {
PrefixScanInit(this);
count = 0;
env_->random_read_counter_.Reset();
iter = db_->NewIterator(ReadOptions());
ReadOptions ropts;
if (options.prefix_seek_opt_in_only) {
ropts.prefix_same_as_start = true;
}
iter = db_->NewIterator(ropts);
for (iter->Seek(prefix); iter->Valid(); iter->Next()) {
if (!iter->key().starts_with(prefix)) {
ASSERT_FALSE(ropts.prefix_same_as_start);
break;
}
count++;
Expand Down Expand Up @@ -3397,23 +3528,6 @@ class FixedSuffix4Transform : public SliceTransform {

bool InDomain(const Slice& src) const override { return src.size() >= 4; }
};

std::pair<uint64_t, uint64_t> GetBloomStat(const Options& options, bool sst) {
if (sst) {
return {options.statistics->getAndResetTickerCount(
NON_LAST_LEVEL_SEEK_FILTER_MATCH),
options.statistics->getAndResetTickerCount(
NON_LAST_LEVEL_SEEK_FILTERED)};
} else {
auto hit = std::exchange(get_perf_context()->bloom_memtable_hit_count, 0);
auto miss = std::exchange(get_perf_context()->bloom_memtable_miss_count, 0);
return {hit, miss};
}
}

std::pair<uint64_t, uint64_t> HitAndMiss(uint64_t hits, uint64_t misses) {
return {hits, misses};
}
} // anonymous namespace

// This uses a prefix_extractor + comparator combination that violates
Expand Down Expand Up @@ -3520,9 +3634,8 @@ TEST_F(DBBloomFilterTest, WeirdPrefixExtractorWithFilter2) {
if (flushed) { // TODO: support auto_prefix_mode in memtable?
read_options.auto_prefix_mode = true;
} else {
// TODO: why needed?
get_perf_context()->bloom_memtable_hit_count = 0;
get_perf_context()->bloom_memtable_miss_count = 0;
// Reset from other tests
GetBloomStat(options, flushed);
}
EXPECT_EQ(GetBloomStat(options, flushed), HitAndMiss(0, 0));
{
Expand Down Expand Up @@ -3664,9 +3777,8 @@ TEST_F(DBBloomFilterTest, WeirdPrefixExtractorWithFilter3) {
if (flushed) { // TODO: support auto_prefix_mode in memtable?
read_options.auto_prefix_mode = true;
} else {
// TODO: why needed?
get_perf_context()->bloom_memtable_hit_count = 0;
get_perf_context()->bloom_memtable_miss_count = 0;
// Reset from other tests
GetBloomStat(options, flushed);
}
EXPECT_EQ(GetBloomStat(options, flushed), HitAndMiss(0, 0));
{
Expand Down
Loading

0 comments on commit a1a102f

Please sign in to comment.