
mvcc: txns and r/w views #7105

Merged (6 commits) on Mar 9, 2017

Conversation

@heyitsanthony (author):

Clean-up of the mvcc interfaces to use txn interfaces instead of an id.

Adds support for concurrent read-only mvcc transactions.

Fixes #7083
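
For orientation, a minimal sketch of the read/write txn view split, with hypothetical method names and signatures (not the exact mvcc/kv.go interfaces):

// Sketch only; names and signatures are illustrative assumptions.
type TxnRead interface {
	// Range reads keys in [key, end) at the txn's snapshot.
	Range(key, end []byte, limit int64) (keys, vals [][]byte, err error)
	// End marks the read txn complete; read txns may run concurrently.
	End()
}

type TxnWrite interface {
	TxnRead
	Put(key, value []byte) (rev int64)
	DeleteRange(key, end []byte) (n, rev int64)
}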

@xiang90 commented Jan 5, 2017:

@heyitsanthony Even with the change, a write will still block all reads?

@heyitsanthony (author):

@xiang90 yes, still WIP

@xiang90 commented Jan 5, 2017:

@heyitsanthony ACK

@heyitsanthony (author):

seemingly stable; PTAL /cc @xiang90

k, v := rtx.UnsafeRange([]byte("test"), tt.key, tt.end, tt.limit)
rtx.Unlock()
if !reflect.DeepEqual(tt.wkey, k) || !reflect.DeepEqual(tt.wval, v) {
t.Errorf("#%d: want k=%+v, v=%d; got k=%+v, v=%+v", i, tt.wkey, tt.wval, k, v)
Reviewer:

%+v for tt.wval to be consistent?

mvcc/kv.go (outdated)
}

// txnWriteReadOnly promotes a read txn to a write, panicking if any reads are called.
type txnReadWrite struct{ TxnRead }
Reviewer:

the comment does not match the struct naming or its implementation.

@heyitsanthony (author):

yeah, it should be coerce, not promote
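
For illustration, a coercing wrapper along these lines lets a read txn be passed where a write txn is expected, panicking on any actual write; signatures are assumed to match the sketch earlier, not the exact diff:

// txnReadWrite coerces a read txn to a write txn; any write call panics.
type txnReadWrite struct{ TxnRead }

func (trw *txnReadWrite) Put(key, value []byte) int64 {
	panic("unexpected Put on read-only txn")
}

func (trw *txnReadWrite) DeleteRange(key, end []byte) (int64, int64) {
	panic("unexpected DeleteRange on read-only txn")
}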

rt.txmu.Unlock()

// FIXME: may have dup keys?
return append(k2, keys...), append(v2, vals...)
Reviewer:

if we only enable the write buffer for the keys bucket, dedup should be easy: we never overwrite in the keys bucket, we simply append. i need to check whether it is possible to have dups between the buffer and the current read tx, though.

if we enable the write buffer for buckets where key overwriting might occur, we have to merge the results.

@heyitsanthony (author):

I think this can be safe without needing to hardcode the "keys" bucket, so long as it can be guaranteed that range fetches of overwritten keys only happen in write txns. I believe this is the case for the current codebase, but I will have to do an audit.
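
As a sketch of the dedup being discussed (function name and parameters hypothetical), overlaying buffered results on the base txn's results might look like:

// mergeRange overlays buffered results (k2, v2) on the base txn's results
// (keys, vals); buffered entries shadow base entries with the same key.
// Sketch only; assumes small buffers so a map is cheap to build.
func mergeRange(k2, v2, keys, vals [][]byte) ([][]byte, [][]byte) {
	buffered := make(map[string]struct{}, len(k2))
	for _, k := range k2 {
		buffered[string(k)] = struct{}{}
	}
	for i, k := range keys {
		if _, ok := buffered[string(k)]; ok {
			continue // shadowed by a pending write in the buffer
		}
		k2, v2 = append(k2, k), append(v2, vals[i])
	}
	return k2, v2
}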

return append(k2, keys...), append(v2, vals...)
}

func (rt *readTx) UnsafeForEach(bucketName []byte, visitor func(k, v []byte) error) error {
Reviewer:

did we have an ordering assumption previously? now we visit the write buffer first, which might break ordering.

@heyitsanthony (author):

everything using foreach (alarms, hash) makes no ordering assumptions, but it may change the hash values if foreach expects keys to go from minimum to maximum. The dup map can potentially get huge if it's done base txn first.
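
A hedged sketch of the buffer-first visit with a dup map (names hypothetical); because the write buffer is small, visiting it first keeps the dup map bounded by the buffer size rather than the whole base txn:

// forEachMerged visits buffered entries first, then base entries not
// shadowed by the buffer. Sketch only; makes no ordering guarantees.
func forEachMerged(buf, base map[string][]byte, visitor func(k, v []byte) error) error {
	seen := make(map[string]struct{}, len(buf))
	for k, v := range buf {
		seen[k] = struct{}{}
		if err := visitor([]byte(k), v); err != nil {
			return err
		}
	}
	for k, v := range base {
		if _, ok := seen[k]; ok {
			continue // already visited via the write buffer
		}
		if err := visitor([]byte(k), v); err != nil {
			return err
		}
	}
	return nil
}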

return nil
}

func (txr *txReadBuffer) writeback(txw *txWriteBuffer) {
Reviewer:

this is a little bit confusing. i would expect we write back from txw to txr:

func (txw *txWriteBuffer) writeback(txr *txReadBuffer) {
...
}

@heyitsanthony (author):

the thinking here was that txReadBuffer is the receiver, so it receives the data from the write buffer, but OK...
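
Under the suggested direction, the write buffer flushes itself into the read buffer; a sketch with assumed field and method names:

// writeback flushes txw's buffered buckets into the read buffer and
// resets the write buffer. Sketch; buckets/merge/reset are assumptions.
func (txw *txWriteBuffer) writeback(txr *txReadBuffer) {
	for name, wb := range txw.buckets {
		rb, ok := txr.buckets[name]
		if !ok {
			// no read buffer for this bucket yet; hand over the write buffer
			txr.buckets[name] = wb
			delete(txw.buckets, name)
			continue
		}
		rb.merge(wb)
	}
	txw.reset()
}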


func (t *batchTxBuffered) Unlock() {
if t.pending != 0 {
t.backend.readTx.mu.Lock()
Reviewer:

A read tx might block a batch Tx from finishing here?

Reviewer:

nvm. i guess it is fine.
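
To make the locking concern concrete, a simplified sketch of this unlock path (field names assumed from the snippet above): the read txn's lock is held only for the in-memory writeback, so a reader delays the batch txn only briefly, never across disk I/O.

func (t *batchTxBuffered) Unlock() {
	if t.pending != 0 {
		// Briefly lock out readers while overlaying pending writes
		// onto the read buffer; no disk I/O happens under this lock.
		t.backend.readTx.mu.Lock()
		t.buf.writeback(&t.backend.readTx.buf)
		t.backend.readTx.mu.Unlock()
		if t.pending >= t.backend.batchLimit {
			t.commit(false)
		}
	}
	t.batchTx.Unlock()
}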

@xiang90 commented Mar 2, 2017:

@heyitsanthony Looks reasonable overall. Have you tried to re-run the read bench to make sure this does solve the contention?

@heyitsanthony (author):

Running on my desktop:

master:

$ ./tools/benchmark/benchmark --conns=128 --clients=128 range --total=100000 --consistency=s  a
bench with serializable range
 100000 / 100000  100.00%  3s

Summary:
  Total:        3.1045 secs.
  Slowest:      0.0533 secs.
  Fastest:      0.0001 secs.
  Average:      0.0039 secs.
  Stddev:       0.0048 secs.
  Requests/sec: 32210.9553

Response time histogram:
  0.0001 [1]
  0.0054 [73169]
  0.0108 [17721]
  0.0161 [6135]
  0.0214 [2078] 
  0.0267 [631]
  0.0320 [182]
  0.0374 [58]
  0.0427 [17]
  0.0480 [6]
  0.0533 [2]

Latency distribution:
  10% in 0.0002 secs.
  25% in 0.0002 secs.
  50% in 0.0019 secs.
  75% in 0.0058 secs.
  90% in 0.0103 secs.
  95% in 0.0136 secs.
  99% in 0.0209 secs.
  99.9% in 0.0313 secs.

readtx patches:

$ ./tools/benchmark/benchmark --conns=128 --clients=128 range --total=100000 --consistency=s  a
bench with serializable range
 100000 / 100000  100.00%  2s

Summary:
  Total:        2.3548 secs.
  Slowest:      0.0298 secs.
  Fastest:      0.0001 secs.
  Average:      0.0028 secs.
  Stddev:       0.0019 secs.
  Requests/sec: 42466.7292

Response time histogram:
  0.0001 [1]
  0.0031 [65868]
  0.0061 [27617]
  0.0090 [5071]
  0.0120 [1064]
  0.0150 [295]
  0.0179 [54]
  0.0209 [16]
  0.0239 [6]
  0.0268 [4]
  0.0298 [4]

Latency distribution:
  10% in 0.0009 secs.
  25% in 0.0015 secs.
  50% in 0.0024 secs.
  75% in 0.0036 secs.
  90% in 0.0051 secs.
  95% in 0.0067 secs.
  99% in 0.0096 secs.
  99.9% in 0.0145 secs.

@heyitsanthony force-pushed the mvcc-txn branch 3 times, most recently from fed846b to f7d5ca1 (March 4, 2017 00:42)
return
}
switch {
case tw.ranges == 1:
@gyuho (Mar 6, 2017):

do these uint fields get initialized after counter.Inc? Can't find where in this commit somehow.

@heyitsanthony (author):

They initialize to 0 in newMetricsTxn* and are incremented if there's a range/delete/put. If there's only one operation, the metric is treated like a singleton range/delete/put. If there's more than one operation, it only increments the txn counter. The Txn should be thrown away after calling End so there's no need to reset the values.

@gyuho:

Ok makes sense now. Thanks!
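
A sketch of the counting scheme heyitsanthony describes above (counter and field names hypothetical):

// metricsTxnWrite counts operations; the fields are zero-valued at
// construction, so no explicit initialization is needed.
type metricsTxnWrite struct {
	TxnWrite
	ranges, puts, deletes uint
}

func (tw *metricsTxnWrite) End() {
	defer tw.TxnWrite.End()
	switch ops := tw.ranges + tw.puts + tw.deletes; {
	case ops > 1:
		txnCounter.Inc() // multi-op txn: count once as a txn
	case tw.ranges == 1:
		rangeCounter.Inc() // single op: count as a plain range
	case tw.puts == 1:
		putCounter.Inc()
	case tw.deletes == 1:
		deleteCounter.Inc()
	}
}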

widx++
}
bb.buf[widx] = bb.buf[ridx]

Reviewer:

unnecessary blank line?

if bb.used == bbsrc.used {
return
}
if bytes.Compare(bb.buf[bb.used-bbsrc.used].key, bbsrc.buf[0].key) > 0 {
Reviewer:

When does this happen?

Let's say

bb buf ['x','y','z'], used 3
bbsrc buf ['a','b'], used 2

bb.merge(bbsrc), and now

bb.buf ['x','y','z','a','b'], used 5

Now

bb.buf[bb.used-bbsrc.used]==bb.buf[5-2]==bb.buf[3]=='a'

bb.buf[3]=='a'==bbsrc.buf[0]

Seems like bytes.Compare(bb.buf[bb.used-bbsrc.used].key, bbsrc.buf[0].key) is always 0, no?

@heyitsanthony (author):

Yea, this is off by 1 and the comparison should be <. The index should be (bb.used-bbsrc.used)-1.
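
With that fix, the fast-path check would read roughly as follows (a sketch against the snippet above):

// bb.buf[(bb.used-bbsrc.used)-1] is the last pre-existing entry after the
// append; if it sorts before bbsrc's first key, the buffer is still ordered.
if bytes.Compare(bb.buf[(bb.used-bbsrc.used)-1].key, bbsrc.buf[0].key) < 0 {
	return
}
sort.Stable(bb) // otherwise restore key ordering before use

In gyuho's example, bb.buf[2] == 'z' does not sort before 'a', so the sort runs, as expected.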

@gyuho commented Mar 7, 2017:

@xiang90 @heyitsanthony Benchmark with 3 nodes

ETCDCTL_API=3 ${GOPATH}/src/${GIT_PATH}/bin/etcdctl \
    --endpoints 10.240.0.7:2379,10.240.0.8:2379,10.240.0.12:2379 \
    put a b

${GOPATH}/bin/benchmark \
    --endpoints=10.240.0.7:2379,10.240.0.8:2379,10.240.0.12:2379 \
    --conns=128 --clients=128 \
    range --total=100000 --consistency=s a


bench with serializable range
 100000 / 100000  100.00%  1s

Summary:
  Total:	1.2248 secs.
  Slowest:	0.0146 secs.
  Fastest:	0.0002 secs.
  Average:	0.0014 secs.
  Stddev:	0.0012 secs.
  Requests/sec:	81642.6844

Response time histogram:
  0.0002 [1]	|
  0.0017 [69640]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0031 [20870]	|∎∎∎∎∎∎∎∎∎∎∎
  0.0045 [6591]	|∎∎∎
  0.0060 [1987]	|∎
  0.0074 [580]	|
  0.0088 [159]	|
  0.0103 [107]	|
  0.0117 [47]	|
  0.0132 [15]	|
  0.0146 [3]	|

Latency distribution:
  10% in 0.0004 secs.
  25% in 0.0005 secs.
  50% in 0.0011 secs.
  75% in 0.0019 secs.
  90% in 0.0030 secs.
  95% in 0.0039 secs.
  99% in 0.0059 secs.
  99.9% in 0.0098 secs.

With patch

ETCDCTL_API=3 ${GOPATH}/src/${GIT_PATH}/bin/etcdctl \
    --endpoints 10.240.0.21:2379,10.240.0.22:2379,10.240.0.23:2379 \
    put a b

${GOPATH}/bin/benchmark \
    --endpoints=10.240.0.21:2379,10.240.0.22:2379,10.240.0.23:2379 \
    --conns=128 --clients=128 \
    range --total=100000 --consistency=s a

bench with serializable range
 100000 / 100000  100.00%  0s

Summary:
  Total:	0.9367 secs.
  Slowest:	0.0352 secs.
  Fastest:	0.0002 secs.
  Average:	0.0009 secs.
  Stddev:	0.0012 secs.
  Requests/sec:	106754.2142

Response time histogram:
  0.0002 [1]	|
  0.0037 [98528]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0072 [955]	|
  0.0107 [240]	|
  0.0142 [102]	|
  0.0177 [87]	|
  0.0212 [34]	|
  0.0247 [11]	|
  0.0282 [37]	|
  0.0317 [2]	|
  0.0352 [3]	|

Latency distribution:
  10% in 0.0003 secs.
  25% in 0.0004 secs.
  50% in 0.0006 secs.
  75% in 0.0009 secs.
  90% in 0.0015 secs.
  95% in 0.0023 secs.
  99% in 0.0045 secs.
  99.9% in 0.0171 secs.

@gyuho commented Mar 7, 2017:

If we run the same tests with more clients (same configuration as in golang/go#18020), we get the best performance ever!

${GOPATH}/bin/benchmark     --endpoints=10.240.0.21:2379,10.240.0.22:2379,10.240.0.23:2379     --conns=100 --clients=1000     range --total=200000 --consistency=s foo
bench with serializable range
 200000 / 200000  100.00%  1s

Summary:
  Total:	1.4225 secs.
  Slowest:	0.0330 secs.
  Fastest:	0.0002 secs.
  Average:	0.0025 secs.
  Stddev:	0.0025 secs.
  Requests/sec:	140593.6432

Response time histogram:
  0.0002 [1]	|
  0.0035 [157259]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0068 [32554]	|∎∎∎∎∎∎∎∎
  0.0101 [7233]	|∎
  0.0133 [1537]	|
  0.0166 [385]	|
  0.0199 [122]	|
  0.0232 [851]	|
  0.0265 [56]	|
  0.0297 [1]	|
  0.0330 [1]	|

Latency distribution:
  10% in 0.0006 secs.
  25% in 0.0009 secs.
  50% in 0.0016 secs.
  75% in 0.0030 secs.
  90% in 0.0055 secs.
  95% in 0.0068 secs.
  99% in 0.0117 secs.
  99.9% in 0.0222 secs.

Anthony Romano added 6 commits March 8, 2017 20:52
ReadTxs are designed for read-only accesses to the backend using a read-only boltDB transaction. Since BatchTxs are long-running transactions, all writes to a BatchTx are written back to the ReadTx, overlaying the base read-only transaction.
Clean-up of the mvcc interfaces to use txn interfaces instead of an id.

Adds support for concurrent read-only mvcc transactions.

Fixes etcd-io#7083
@codecov-io:
Codecov Report

Merging #7105 into master will decrease coverage by 0.1%.
The diff coverage is 85.93%.

@@            Coverage Diff            @@
##           master    #7105     +/-   ##
=========================================
- Coverage   70.47%   70.38%   -0.1%     
=========================================
  Files         237      244      +7     
  Lines       21238    21342    +104     
=========================================
+ Hits        14968    15021     +53     
- Misses       5136     5186     +50     
- Partials     1134     1135      +1
Impacted Files Coverage Δ
mvcc/kv.go 0% <0%> (ø)
etcdserver/server.go 78.8% <0%> (-0.47%)
mvcc/watchable_store_txn.go 100% <100%> (ø)
mvcc/metrics_txn.go 100% <100%> (ø)
mvcc/kv_view.go 100% <100%> (ø)
lease/lessor.go 86.75% <100%> (+1.45%)
mvcc/backend/read_tx.go 45.16% <45.16%> (ø)
etcdserver/v3_server.go 62.5% <50%> (+0.43%)
etcdserver/apply_auth.go 61.11% <66.66%> (ø)
mvcc/backend/backend.go 73.75% <84.61%> (-0.58%)
... and 27 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39dc531...d1dcc82.
