
Track deletes only in the tombstone map instead of maintaining as copy #27868

Merged
s1monw merged 21 commits into elastic:master from track_deletes_only_in_tombstones on Feb 19, 2018

Conversation

s1monw
Contributor

@s1monw s1monw commented Dec 18, 2017

Today we maintain a copy of every delete in the live version maps. This is unnecessary
and might add quite some overhead if maps grow large. This change moves out the deletes
tracking into the tombstone map only and relies on the cleaning of tombstones when deletes
are collected.

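The change described above can be sketched in miniature. This is an illustrative model, not the real `LiveVersionMap` (which uses concurrent maps, `BytesRef` keys, and per-map RAM accounting); the point it demonstrates is that a delete now lives only in the tombstone map, instead of being copied into the live map as well, and disappears once GC-deletes pruning collects it:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (NOT the Elasticsearch implementation): a delete is recorded
// only in the tombstone map; the "current" live map just drops the id instead
// of holding a copy of the delete.
class TombstoneOnlyVersionMap {
    static final class VersionValue {
        final long version;
        final long time; // delete timestamp, used for GC-deletes pruning
        VersionValue(long version, long time) { this.version = version; this.time = time; }
    }

    private final Map<String, VersionValue> current = new HashMap<>();
    private final Map<String, VersionValue> tombstones = new HashMap<>();

    void putIndexUnderLock(String id, VersionValue v) {
        current.put(id, v);
        tombstones.remove(id); // indexing over a delete clears its tombstone
    }

    void putDeleteUnderLock(String id, VersionValue delete) {
        current.remove(id);         // no copy kept in the live map any more
        tombstones.put(id, delete); // the tombstone map is the single owner
    }

    VersionValue getUnderLock(String id) {
        VersionValue v = current.get(id);
        return v != null ? v : tombstones.get(id);
    }

    void pruneTombstones(long nowMillis, long gcDeletesMillis) {
        tombstones.values().removeIf(v -> nowMillis - v.time > gcDeletesMillis);
    }
}
```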
@s1monw
Contributor Author

s1monw commented Dec 20, 2017

@bleskes I opened #27920 for the last commit but it's needed here.

@colings86 colings86 added v6.3.0 and removed v6.2.0 labels Jan 22, 2018
@s1monw
Contributor Author

s1monw commented Jan 23, 2018

@bleskes can you take a look at this if you have time?

Contributor

@bleskes bleskes left a comment


Production code LGTM (with suggestions that can be rejected if you don't like them). I like the approach. My only worry is that we now don't properly track the memory that will be used by deletes for refresh purposes. Say for example that someone disables refresh (or uses AUTO) and they run a delete by query. The tombstone will grow and we don't see a refresh. WDYT?

I left a question about a potential testing gap.

@@ -54,6 +55,7 @@
// that will prevent concurrent updates to the same document ID and therefore we can rely on the happens-before guarantee of the
// map reference itself.
private boolean unsafe;
private final AtomicLong minDeleteTimestamp = new AtomicLong(Long.MAX_VALUE);
Contributor


can you add a comment/javadoc to what this means? (minimum timestamp of delete operations that were made while this map was active. this is used to make sure they are kept in the tombstone)
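The suggested documentation, plus a lock-free way to maintain such a minimum, could look like the sketch below. The field name comes from the diff; the surrounding class and the `trackDelete` helper are simplified stand-ins, not the real code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: shows the javadoc suggested in the review and a lock-free
// monotonic-min update for the minDeleteTimestamp field from the diff.
class MinDeleteTimestampSketch {
    /**
     * Minimum timestamp of delete operations that were made while this map
     * was active. Used to make sure tombstones for those deletes are not
     * pruned while the map is still in use.
     */
    private final AtomicLong minDeleteTimestamp = new AtomicLong(Long.MAX_VALUE);

    void trackDelete(long timestamp) {
        // accumulateAndGet with Math::min only ever lowers the stored value
        minDeleteTimestamp.accumulateAndGet(timestamp, Math::min);
    }

    long minDeleteTimestamp() {
        return minDeleteTimestamp.get();
    }
}
```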

@@ -97,15 +107,20 @@ void markAsUnsafe() {
// have the volatile read of the Maps reference to make it visible even across threads.
boolean needsSafeAccess;
final boolean previousMapsNeededSafeAccess;
/** Tracks bytes used by current map, i.e. what is freed on refresh. For deletes, which are also added to tombstones, we only account
* for the CHM entry here, and account for BytesRef/VersionValue against the tombstones, since refresh would not clear this RAM. */
final AtomicLong ramBytesUsed;
Contributor


+1 to move it here.

Contributor


Just an idea - maybe move it to VersionLookup, and use a VersionLookup for the tombstones as well? Then all accounting is in one place.
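The idea of keeping all accounting in one place might look like the sketch below. The entry-overhead constant and the map shape are illustrative assumptions, not the real Elasticsearch accounting; what it shows is each lookup owning its own `ramBytesUsed` counter so callers never have to split accounting across maps:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the reviewer's suggestion (names and sizes are illustrative):
// each lookup accounts for its own RAM, so the current map and the tombstone
// map each keep their bookkeeping in one place.
class AccountingVersionLookup {
    private static final long ENTRY_OVERHEAD = 100; // assumed per-entry map cost
    private final Map<String, Long> map = new HashMap<>(); // id -> payload bytes
    private final AtomicLong ramBytesUsed = new AtomicLong();

    void put(String id, long payloadBytes) {
        Long prev = map.put(id, payloadBytes);
        long delta = ENTRY_OVERHEAD + payloadBytes;
        if (prev != null) {
            delta -= ENTRY_OVERHEAD + prev; // replacing reuses the entry cost
        }
        ramBytesUsed.addAndGet(delta);
    }

    void remove(String id) {
        Long prev = map.remove(id);
        if (prev != null) {
            ramBytesUsed.addAndGet(-(ENTRY_OVERHEAD + prev));
        }
    }

    long ramBytesUsed() {
        return ramBytesUsed.get();
    }
}
```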

// Also enroll the delete into tombstones, and account for its RAM too:
final VersionValue prevTombstone = tombstones.put(uid, version);
// We initially account for BytesRef/VersionValue RAM for a delete against the tombstones, because this RAM will not be freed up
// on refresh. Later, in removeTombstoneUnderLock, if we clear the tombstone entry but the delete remains in current, we shift
Contributor


this comment is wrong now.

map.beforeRefresh();
assertEquals(new VersionValue(1, 1, 1), map.getUnderLock(uid("test")));
assertEquals(new VersionValue(1,1,1), map.getUnderLock(uid("test")));
Contributor


space fanatic 👅 ;)

lastDeleteVersionPruneTimeMSec = timeMSec;
}

// testing
void clearDeletedTombstones() {
versionMap.clearTombstones();
versionMap.pruneTombstones(-1, 0);
Contributor


wait - this feels weird - current time is -1 and interval is 0? Shouldn't current time be Long.MAX_VALUE?

Contributor Author


a time of -1 means we clean everything, I think that is correct?

@@ -137,24 +142,25 @@ public void testConcurrently() throws IOException, InterruptedException {
}
if (isDelete == false && rarely()) {
versionValue = new DeleteVersionValue(versionValue.version + 1, versionValue.seqNo + 1,
versionValue.term, Long.MAX_VALUE);
versionValue.term, clock.incrementAndGet());
Contributor


+++++

} else {
versionValue = new VersionValue(versionValue.version + 1, versionValue.seqNo + 1, versionValue.term);
}
values.put(bytesRef, versionValue);
map.putUnderLock(bytesRef, versionValue);
}
if (rarely()) {
map.pruneTombstones(0, 0);
Contributor


can you add a comment that we explicitly don't do any pruning? otherwise this looks odd.

Contributor Author


not sure I understand

@@ -137,24 +142,25 @@ public void testConcurrently() throws IOException, InterruptedException {
}
if (isDelete == false && rarely()) {
versionValue = new DeleteVersionValue(versionValue.version + 1, versionValue.seqNo + 1,
versionValue.term, Long.MAX_VALUE);
versionValue.term, clock.incrementAndGet());
Contributor


I can't comment two lines above, so commenting here: map.removeTombstoneUnderLock(bytesRef); isn't meant for public usage, right? if we index on a delete, the delete should go away.

Contributor Author


it's not meant for public use

}
do {
Map<BytesRef, VersionValue> valueMap = new HashMap<>(map.getAllCurrent());
final Map<BytesRef, VersionValue> valueMap = new HashMap<>(map.getAllCurrent());
Contributor


I think this is now less strong as we don't track deletes here any more... we should somehow show that we capture deletes correctly in the various phases of refresh and that we don't forget them until we prune. Please correct me if I'm missing something. This is tricky ;)

@s1monw
Contributor Author

s1monw commented Feb 13, 2018

My only worry is that we now don't properly track the memory that will be used by deletes for refresh purposes. Say for example that someone disables refresh (or uses AUTO) and they run a delete by query. The tombstone will grow and we don't see a refresh. WDYT?

I must be missing something; a refresh doesn't clean tombstones, they are cleaned over time with the GC deletes. I am not sure I understand why you would want to refresh based on them.

@bleskes
Contributor

bleskes commented Feb 13, 2018

must be missing something; a refresh doesn't clean tombstones, they are cleaned over time with the GC deletes. I am not sure I understand why you would want to refresh based on them.

We prune deletes on each delete and on refresh. If you have a burst of deletes, we can't prune them when they're done and we have to wait for the refresh cycles to clean them up. We also currently don't refresh every 1s but wait for the memory signature of the shard to grow (as reported by InternalEngine#getIndexBufferRAMBytesUsed, including the version map size). That one now will ignore the delete signature. The more I look at this the more I think we should come up with some other mechanism to prune deletes. This is too implicit imo.

@s1monw
Contributor Author

s1monw commented Feb 13, 2018

We prune deletes on each delete and on refresh. If you have a burst of deletes, we can't prune them when they're done and we have to wait for the refresh cycles to clean them up. We also currently don't refresh every 1s but wait for the memory signature of the shard to grow (as reported by InternalEngine#getIndexBufferRAMBytesUsed, including the version map size). That one now will ignore the delete signature. The more I look at this the more I think we should come up with some other mechanism to prune deletes. This is too implicit imo.

we do call maybePruneDeletedTombstones on every delete, I think that is enough? We can still call it on a regular basis but I am not sure that is needed.

@bleskes
Contributor

bleskes commented Feb 13, 2018

we do call maybePruneDeletedTombstones on every delete, I think that is enough?

I tried to explain it with:

If you have a burst of deletes, we can't prune them when they're done and we have to wait for the refresh cycles to clean them up.

We can't clean them up because of gc deletes. We stick to them for a minute (by default).

@s1monw
Contributor Author

s1monw commented Feb 13, 2018

We can't clean them up because of gc deletes. We stick to them for a minute (by default).

but we only keep the ones for 1/4 of the GC deletes interval. I am not sure if we are chasing corner cases here, and if we don't get any changes (which would trigger no refresh), the eventual flush will be enough.

@bleskes
Contributor

bleskes commented Feb 13, 2018

I think there's some confusion. We check for gc pruning every 1/4 of the gc interval:

        if (engineConfig.isEnableGcDeletes() && engineConfig.getThreadPool().relativeTimeInMillis() - lastDeleteVersionPruneTimeMSec > getGcDeletesInMillis() * 0.25) {
            pruneDeletedTombstones();
        }

but the actual pruning keeps deletes for at least the gc deletes interval:

        if (timeMSec - versionValue.time > getGcDeletesInMillis()) {
            versionMap.removeTombstoneUnderLock(uid);
        }
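The timing interplay in these two snippets can be isolated in a small sketch (simplified; the real logic lives in InternalEngine and uses the thread pool's relative clock): pruning is attempted every quarter of the GC-deletes interval, but each tombstone is retained for the full interval, so a burst of deletes can stay resident well past its last delete:

```java
// Sketch only: separates the "how often do we try to prune" check from the
// "is this tombstone old enough to remove" check quoted above.
class PruneCadence {
    private final long gcDeletesMillis;
    private long lastPruneMillis;

    PruneCadence(long gcDeletesMillis) {
        this.gcDeletesMillis = gcDeletesMillis;
    }

    // mirrors: relativeTimeInMillis() - lastDeleteVersionPruneTimeMSec > gcDeletes * 0.25
    boolean shouldAttemptPrune(long nowMillis) {
        return nowMillis - lastPruneMillis > gcDeletesMillis * 0.25;
    }

    // mirrors: timeMSec - versionValue.time > getGcDeletesInMillis()
    boolean shouldRemoveTombstone(long nowMillis, long deleteTimeMillis) {
        return nowMillis - deleteTimeMillis > gcDeletesMillis;
    }

    void markPruned(long nowMillis) {
        lastPruneMillis = nowMillis;
    }
}
```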

@s1monw
Contributor Author

s1monw commented Feb 13, 2018

I do understand what that means but what difference does it make? If we can't collect, we can't collect / prune. I don't see the issue compared to what we do today.

@clintongormley clintongormley added :Distributed/Distributed, :Distributed/Engine and removed :Engine, :Distributed/Distributed labels Feb 13, 2018
@bleskes
Contributor

bleskes commented Feb 14, 2018

I don't see the issue compared to what we do today

I think this is getting to be a bigger discussion than what I meant. There is a difference, but we can choose to live with it and/or fix it in a follow-up (hence my comment that the production code is LGTM). I'll reach out to discuss via another channel.

@s1monw
Contributor Author

s1monw commented Feb 16, 2018

@bleskes i pushed changes to address the pending issues

@s1monw s1monw requested a review from bleskes February 16, 2018 13:51
Contributor

@bleskes bleskes left a comment


LGTM. Thanks @s1monw

@s1monw s1monw merged commit 56edb5e into elastic:master Feb 19, 2018
@s1monw s1monw deleted the track_deletes_only_in_tombstones branch February 19, 2018 11:25
s1monw added a commit that referenced this pull request Feb 19, 2018
Track deletes only in the tombstone map instead of maintaining as copy (#27868)

Today we maintain a copy of every delete in the live version maps. This is unnecessary
and might add quite some overhead if maps grow large. This change moves out the deletes
tracking into the tombstone map only and relies on the cleaning of tombstones when deletes
are collected.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 19, 2018
* master:
  Enable selecting adaptive selection stats
  Remove leftover mention of file-based scripts
  Fix threading issue on listener notification (elastic#28730)
  Revisit deletion policy after release the last snapshot (elastic#28627)
  Remove unused method
  Track deletes only in the tombstone map instead of maintaining as copy (elastic#27868)
  [Docs] Correct typo in README.textile (elastic#28716)
  Fix AdaptiveSelectionStats serialization bug (elastic#28718)
  TEST: Fix InternalEngine#testAcquireIndexCommit
  Add note on temporary directory for Windows service
  Added coming annotation and breaking changes link to release notes script
  Remove leftover PR link for previously disabled bwc tests
  Separate acquiring safe commit and last commit (elastic#28271)
  Fix BWC issue of the translog last modified age stats
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.3.0 v7.0.0-beta1