
mvcc: txns and r/w views #7105

Merged (6 commits) on Mar 9, 2017

Conversation

@heyitsanthony (author):

Clean-up of the mvcc interfaces to use txn interfaces instead of an id.

Adds support for concurrent read-only mvcc transactions.

Fixes #7083
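
For orientation, a minimal sketch of the read/write txn view split, with hypothetical method names and signatures (not the exact mvcc/kv.go interfaces):

// Sketch only; names and signatures are illustrative assumptions.
type TxnRead interface {
	// Range reads keys in [key, end) at the txn's snapshot.
	Range(key, end []byte, limit int64) (keys, vals [][]byte, err error)
	// End marks the read txn complete; read txns may run concurrently.
	End()
}

type TxnWrite interface {
	TxnRead
	Put(key, value []byte) (rev int64)
	DeleteRange(key, end []byte) (n, rev int64)
}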

@xiang90 commented Jan 5, 2017:

@heyitsanthony Even with the change, a write will still block all reads?

@heyitsanthony (author):

@xiang90 yes, still WIP

@xiang90 commented Jan 5, 2017:

@heyitsanthony ACK

@heyitsanthony (author):

seemingly stable; PTAL /cc @xiang90

k, v := rtx.UnsafeRange([]byte("test"), tt.key, tt.end, tt.limit)
rtx.Unlock()
if !reflect.DeepEqual(tt.wkey, k) || !reflect.DeepEqual(tt.wval, v) {
t.Errorf("#%d: want k=%+v, v=%d; got k=%+v, v=%+v", i, tt.wkey, tt.wval, k, v)
Reviewer:

%+v for tt.wval to be consistent?

mvcc/kv.go (outdated)
}

// txnWriteReadOnly promotes a read txn to a write, panicking if any reads are called.
type txnReadWrite struct{ TxnRead }
Reviewer:

the comment does not match the struct naming or its implementation.

@heyitsanthony (author):

yeah, it should be coerce, not promote
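
For illustration, a coercing wrapper along these lines lets a read txn be passed where a write txn is expected, panicking on any actual write; signatures are assumed to match the sketch earlier, not the exact diff:

// txnReadWrite coerces a read txn to a write txn; any write call panics.
type txnReadWrite struct{ TxnRead }

func (trw *txnReadWrite) Put(key, value []byte) int64 {
	panic("unexpected Put on read-only txn")
}

func (trw *txnReadWrite) DeleteRange(key, end []byte) (int64, int64) {
	panic("unexpected DeleteRange on read-only txn")
}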

rt.txmu.Unlock()

// FIXME: may have dup keys?
return append(k2, keys...), append(v2, vals...)
Reviewer:

if we only enable the write buffer for the keys bucket, dedup should be easy: we never overwrite in the keys bucket, we simply append. i need to check whether it is possible to have dups between the buffer and the current read tx, though.

if we enable the write buffer for buckets where key overwriting might occur, we have to merge the results.

@heyitsanthony (author):

I think this can be safe without needing to hardcode the "keys" bucket, so long as it can be guaranteed that range fetches of overwritten keys only happen in write txns. I believe this is the case for the current codebase, but I will have to do an audit.
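
As a sketch of the dedup being discussed (function name and parameters hypothetical), overlaying buffered results on the base txn's results might look like:

// mergeRange overlays buffered results (k2, v2) on the base txn's results
// (keys, vals); buffered entries shadow base entries with the same key.
// Sketch only; assumes small buffers so a map is cheap to build.
func mergeRange(k2, v2, keys, vals [][]byte) ([][]byte, [][]byte) {
	buffered := make(map[string]struct{}, len(k2))
	for _, k := range k2 {
		buffered[string(k)] = struct{}{}
	}
	for i, k := range keys {
		if _, ok := buffered[string(k)]; ok {
			continue // shadowed by a pending write in the buffer
		}
		k2, v2 = append(k2, k), append(v2, vals[i])
	}
	return k2, v2
}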

return append(k2, keys...), append(v2, vals...)
}

func (rt *readTx) UnsafeForEach(bucketName []byte, visitor func(k, v []byte) error) error {
Reviewer:

did we have an ordering assumption previously? now we visit the write buffer first, which might break ordering.

@heyitsanthony (author):

everything using foreach (alarms, hash) makes no ordering assumptions, but it may change the hash values if foreach expects keys to go from minimum to maximum. The dup map can potentially get huge if it's done base txn first.
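
A hedged sketch of the buffer-first visit with a dup map (names hypothetical); because the write buffer is small, visiting it first keeps the dup map bounded by the buffer size rather than the whole base txn:

// forEachMerged visits buffered entries first, then base entries not
// shadowed by the buffer. Sketch only; makes no ordering guarantees.
func forEachMerged(buf, base map[string][]byte, visitor func(k, v []byte) error) error {
	seen := make(map[string]struct{}, len(buf))
	for k, v := range buf {
		seen[k] = struct{}{}
		if err := visitor([]byte(k), v); err != nil {
			return err
		}
	}
	for k, v := range base {
		if _, ok := seen[k]; ok {
			continue // already visited via the write buffer
		}
		if err := visitor([]byte(k), v); err != nil {
			return err
		}
	}
	return nil
}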

return nil
}

func (txr *txReadBuffer) writeback(txw *txWriteBuffer) {
Reviewer:

this is a little bit confusing. i would expect we write back from txw to txr:

func (txw *txWriteBuffer) writeback(txr *txReadBuffer) {
...
}

@heyitsanthony (author):

the thinking here was that txReadBuffer is the receiver, so it receives the data from the write buffer, but OK...
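
Under the suggested direction, the write buffer flushes itself into the read buffer; a sketch with assumed field and method names:

// writeback flushes txw's buffered buckets into the read buffer and
// resets the write buffer. Sketch; buckets/merge/reset are assumptions.
func (txw *txWriteBuffer) writeback(txr *txReadBuffer) {
	for name, wb := range txw.buckets {
		rb, ok := txr.buckets[name]
		if !ok {
			// no read buffer for this bucket yet; hand over the write buffer
			txr.buckets[name] = wb
			delete(txw.buckets, name)
			continue
		}
		rb.merge(wb)
	}
	txw.reset()
}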


func (t *batchTxBuffered) Unlock() {
if t.pending != 0 {
t.backend.readTx.mu.Lock()
Reviewer:

A read tx might block a batch Tx from finishing here?

Reviewer:

nvm. i guess it is fine.
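
To make the locking concern concrete, a simplified sketch of this unlock path (field names assumed from the snippet above): the read txn's lock is held only for the in-memory writeback, so a reader delays the batch txn only briefly, never across disk I/O.

func (t *batchTxBuffered) Unlock() {
	if t.pending != 0 {
		// Briefly lock out readers while overlaying pending writes
		// onto the read buffer; no disk I/O happens under this lock.
		t.backend.readTx.mu.Lock()
		t.buf.writeback(&t.backend.readTx.buf)
		t.backend.readTx.mu.Unlock()
		if t.pending >= t.backend.batchLimit {
			t.commit(false)
		}
	}
	t.batchTx.Unlock()
}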

@xiang90 commented Mar 2, 2017:

@heyitsanthony Looks reasonable overall. Have you tried to re-run the read bench to make sure this does solve the contention?

@heyitsanthony (author):

Running on my desktop:

master:

$ ./tools/benchmark/benchmark --conns=128 --clients=128 range --total=100000 --consistency=s  a
bench with serializable range
 100000 / 100000  100.00%  3s

Summary:
  Total:        3.1045 secs.
  Slowest:      0.0533 secs.
  Fastest:      0.0001 secs.
  Average:      0.0039 secs.
  Stddev:       0.0048 secs.
  Requests/sec: 32210.9553

Response time histogram:
  0.0001 [1]
  0.0054 [73169]
  0.0108 [17721]
  0.0161 [6135]
  0.0214 [2078] 
  0.0267 [631]
  0.0320 [182]
  0.0374 [58]
  0.0427 [17]
  0.0480 [6]
  0.0533 [2]

Latency distribution:
  10% in 0.0002 secs.
  25% in 0.0002 secs.
  50% in 0.0019 secs.
  75% in 0.0058 secs.
  90% in 0.0103 secs.
  95% in 0.0136 secs.
  99% in 0.0209 secs.
  99.9% in 0.0313 secs.

readtx patches:

$ ./tools/benchmark/benchmark --conns=128 --clients=128 range --total=100000 --consistency=s  a
bench with serializable range
 100000 / 100000  100.00%  2s

Summary:
  Total:        2.3548 secs.
  Slowest:      0.0298 secs.
  Fastest:      0.0001 secs.
  Average:      0.0028 secs.
  Stddev:       0.0019 secs.
  Requests/sec: 42466.7292

Response time histogram:
  0.0001 [1]
  0.0031 [65868]
  0.0061 [27617]
  0.0090 [5071]
  0.0120 [1064]
  0.0150 [295]
  0.0179 [54]
  0.0209 [16]
  0.0239 [6]
  0.0268 [4]
  0.0298 [4]

Latency distribution:
  10% in 0.0009 secs.
  25% in 0.0015 secs.
  50% in 0.0024 secs.
  75% in 0.0036 secs.
  90% in 0.0051 secs.
  95% in 0.0067 secs.
  99% in 0.0096 secs.
  99.9% in 0.0145 secs.

@heyitsanthony force-pushed the mvcc-txn branch 3 times, most recently from fed846b to f7d5ca1 (March 4, 2017 00:42)
return
}
switch {
case tw.ranges == 1:
@gyuho (Mar 6, 2017):

do these uint fields get initialized after counter.Inc? Can't find where in this commit somehow.

@heyitsanthony (author):

They initialize to 0 in newMetricsTxn* and are incremented if there's a range/delete/put. If there's only one operation, the metric is treated like a singleton range/delete/put. If there's more than one operation, it only increments the txn counter. The Txn should be thrown away after calling End so there's no need to reset the values.

@gyuho:

Ok makes sense now. Thanks!
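
A sketch of the counting scheme heyitsanthony describes above (counter and field names hypothetical):

// metricsTxnWrite counts operations; the fields are zero-valued at
// construction, so no explicit initialization is needed.
type metricsTxnWrite struct {
	TxnWrite
	ranges, puts, deletes uint
}

func (tw *metricsTxnWrite) End() {
	defer tw.TxnWrite.End()
	switch ops := tw.ranges + tw.puts + tw.deletes; {
	case ops > 1:
		txnCounter.Inc() // multi-op txn: count once as a txn
	case tw.ranges == 1:
		rangeCounter.Inc() // single op: count as a plain range
	case tw.puts == 1:
		putCounter.Inc()
	case tw.deletes == 1:
		deleteCounter.Inc()
	}
}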

widx++
}
bb.buf[widx] = bb.buf[ridx]

Reviewer:

unnecessary blank line?

if bb.used == bbsrc.used {
return
}
if bytes.Compare(bb.buf[bb.used-bbsrc.used].key, bbsrc.buf[0].key) > 0 {
Reviewer:

When does this happen?

Let's say

bb buf ['x','y','z'], used 3
bbsrc buf ['a','b'], used 2

bb.merge(bbsrc), and now

bb.buf ['x','y','z','a','b'], used 5

Now

bb.buf[bb.used-bbsrc.used]==bb.buf[5-2]==bb.buf[3]=='a'

bb.buf[3]=='a'==bbsrc.buf[0]

Seems like bytes.Compare(bb.buf[bb.used-bbsrc.used].key, bbsrc.buf[0].key) is always 0, no?

@heyitsanthony (author):

Yea, this is off by 1 and the comparison should be <. The index should be (bb.used-bbsrc.used)-1.
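
With that fix, the fast-path check would read roughly as follows (a sketch against the snippet above):

// bb.buf[(bb.used-bbsrc.used)-1] is the last pre-existing entry after the
// append; if it sorts before bbsrc's first key, the buffer is still ordered.
if bytes.Compare(bb.buf[(bb.used-bbsrc.used)-1].key, bbsrc.buf[0].key) < 0 {
	return
}
sort.Stable(bb) // otherwise restore key ordering before use

In gyuho's example, bb.buf[2] == 'z' does not sort before 'a', so the sort runs, as expected.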

@gyuho commented Mar 7, 2017:

@xiang90 @heyitsanthony Benchmark with 3 nodes

ETCDCTL_API=3 ${GOPATH}/src/${GIT_PATH}/bin/etcdctl \
    --endpoints 10.240.0.7:2379,10.240.0.8:2379,10.240.0.12:2379 \
    put a b

${GOPATH}/bin/benchmark \
    --endpoints=10.240.0.7:2379,10.240.0.8:2379,10.240.0.12:2379 \
    --conns=128 --clients=128 \
    range --total=100000 --consistency=s a


bench with serializable range
 100000 / 100000  100.00%  1s

Summary:
  Total:	1.2248 secs.
  Slowest:	0.0146 secs.
  Fastest:	0.0002 secs.
  Average:	0.0014 secs.
  Stddev:	0.0012 secs.
  Requests/sec:	81642.6844

Response time histogram:
  0.0002 [1]	|
  0.0017 [69640]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0031 [20870]	|∎∎∎∎∎∎∎∎∎∎∎
  0.0045 [6591]	|∎∎∎
  0.0060 [1987]	|∎
  0.0074 [580]	|
  0.0088 [159]	|
  0.0103 [107]	|
  0.0117 [47]	|
  0.0132 [15]	|
  0.0146 [3]	|

Latency distribution:
  10% in 0.0004 secs.
  25% in 0.0005 secs.
  50% in 0.0011 secs.
  75% in 0.0019 secs.
  90% in 0.0030 secs.
  95% in 0.0039 secs.
  99% in 0.0059 secs.
  99.9% in 0.0098 secs.

With patch

ETCDCTL_API=3 ${GOPATH}/src/${GIT_PATH}/bin/etcdctl \
    --endpoints 10.240.0.21:2379,10.240.0.22:2379,10.240.0.23:2379 \
    put a b

${GOPATH}/bin/benchmark \
    --endpoints=10.240.0.21:2379,10.240.0.22:2379,10.240.0.23:2379 \
    --conns=128 --clients=128 \
    range --total=100000 --consistency=s a

bench with serializable range
 100000 / 100000  100.00%  0s

Summary:
  Total:	0.9367 secs.
  Slowest:	0.0352 secs.
  Fastest:	0.0002 secs.
  Average:	0.0009 secs.
  Stddev:	0.0012 secs.
  Requests/sec:	106754.2142

Response time histogram:
  0.0002 [1]	|
  0.0037 [98528]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0072 [955]	|
  0.0107 [240]	|
  0.0142 [102]	|
  0.0177 [87]	|
  0.0212 [34]	|
  0.0247 [11]	|
  0.0282 [37]	|
  0.0317 [2]	|
  0.0352 [3]	|

Latency distribution:
  10% in 0.0003 secs.
  25% in 0.0004 secs.
  50% in 0.0006 secs.
  75% in 0.0009 secs.
  90% in 0.0015 secs.
  95% in 0.0023 secs.
  99% in 0.0045 secs.
  99.9% in 0.0171 secs.

@gyuho commented Mar 7, 2017:

If we run the same tests with more clients (same configuration as in golang/go#18020), we get the best performance ever!

${GOPATH}/bin/benchmark     --endpoints=10.240.0.21:2379,10.240.0.22:2379,10.240.0.23:2379     --conns=100 --clients=1000     range --total=200000 --consistency=s foo
bench with serializable range
 200000 / 200000  100.00%  1s

Summary:
  Total:	1.4225 secs.
  Slowest:	0.0330 secs.
  Fastest:	0.0002 secs.
  Average:	0.0025 secs.
  Stddev:	0.0025 secs.
  Requests/sec:	140593.6432

Response time histogram:
  0.0002 [1]	|
  0.0035 [157259]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0068 [32554]	|∎∎∎∎∎∎∎∎
  0.0101 [7233]	|∎
  0.0133 [1537]	|
  0.0166 [385]	|
  0.0199 [122]	|
  0.0232 [851]	|
  0.0265 [56]	|
  0.0297 [1]	|
  0.0330 [1]	|

Latency distribution:
  10% in 0.0006 secs.
  25% in 0.0009 secs.
  50% in 0.0016 secs.
  75% in 0.0030 secs.
  90% in 0.0055 secs.
  95% in 0.0068 secs.
  99% in 0.0117 secs.
  99.9% in 0.0222 secs.

Anthony Romano added 6 commits March 8, 2017 20:52
ReadTxs are designed for read-only accesses to the backend using a read-only boltDB transaction. Since BatchTxs are long-running transactions, all writes to a BatchTx are written back to the ReadTx, overlaying the base read-only transaction.
Clean-up of the mvcc interfaces to use txn interfaces instead of an id.

Adds support for concurrent read-only mvcc transactions.

Fixes etcd-io#7083
@codecov-io:
Codecov Report

Merging #7105 into master will decrease coverage by 0.1%.
The diff coverage is 85.93%.

@@            Coverage Diff            @@
##           master    #7105     +/-   ##
=========================================
- Coverage   70.47%   70.38%   -0.1%     
=========================================
  Files         237      244      +7     
  Lines       21238    21342    +104     
=========================================
+ Hits        14968    15021     +53     
- Misses       5136     5186     +50     
- Partials     1134     1135      +1
Impacted Files Coverage Δ
mvcc/kv.go 0% <0%> (ø)
etcdserver/server.go 78.8% <0%> (-0.47%)
mvcc/watchable_store_txn.go 100% <100%> (ø)
mvcc/metrics_txn.go 100% <100%> (ø)
mvcc/kv_view.go 100% <100%> (ø)
lease/lessor.go 86.75% <100%> (+1.45%)
mvcc/backend/read_tx.go 45.16% <45.16%> (ø)
etcdserver/v3_server.go 62.5% <50%> (+0.43%)
etcdserver/apply_auth.go 61.11% <66.66%> (ø)
mvcc/backend/backend.go 73.75% <84.61%> (-0.58%)
... and 27 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39dc531...d1dcc82.
