share a pool of (un)marshallers #38

Merged (6 commits, Sep 7, 2018)
Conversation

@Stebalien (Member) commented Jul 17, 2018

Also, use a cloner instead of unmarshalling when wrapping an object.

This gives about a 2.5x speedup in my simple benchmark and probably saves the GC a lot of trouble.

Also, use a cloner instead of unmarshalling when wrapping an object.
@ghost ghost assigned Stebalien Jul 17, 2018
@ghost ghost added the status/in-progress In progress label Jul 17, 2018
Increasing `numWorkers` doesn't seem to help at all (not surprising). Really, we
could probably drop to NumCPUs (instead of 2x that) and we'd still be fine.
More than that doesn't help. +1 helps for the case where a goroutine holding a
worker gets unscheduled.
// PooledCloner is a thread-safe pooled object cloner.
type PooledCloner struct {
Count int
cloners chan refmt.Cloner
Member:
is this better than a sync.Pool?

Member Author:
I was under the impression that the elements in a pool get GCed every GC cycle. However, it does appear faster in the benchmarks and we really care more about throughput than latency so I've switched to pools.

}
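The `sync.Pool` variant the thread settles on might look roughly like this. It is a sketch under assumed names: `Cloner` here is a stand-in interface for illustration, not refmt's actual cloner type, and `PooledCloner` below is not the real implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// Cloner is a stand-in for refmt's cloner (hypothetical interface
// used only in this sketch).
type Cloner interface {
	Clone(src, dst interface{}) error
}

// funcCloner adapts a function to the Cloner interface.
type funcCloner func(src, dst interface{}) error

func (f funcCloner) Clone(src, dst interface{}) error { return f(src, dst) }

// PooledCloner hands out cloners from a sync.Pool. Idle cloners can
// be reclaimed at GC time rather than pinning memory forever, and
// since the goal here is throughput rather than latency, paying to
// rebuild one occasionally is acceptable.
type PooledCloner struct {
	pool sync.Pool
}

func NewPooledCloner(mk func() Cloner) *PooledCloner {
	return &PooledCloner{pool: sync.Pool{
		New: func() interface{} { return mk() },
	}}
}

func (p *PooledCloner) Clone(src, dst interface{}) error {
	c := p.pool.Get().(Cloner)
	defer p.pool.Put(c)
	return c.Clone(src, dst)
}

func main() {
	p := NewPooledCloner(func() Cloner {
		return funcCloner(func(src, dst interface{}) error {
			*dst.(*int) = *src.(*int)
			return nil
		})
	})
	a, b := 42, 0
	if err := p.Clone(&a, &b); err != nil {
		panic(err)
	}
	fmt.Println(b)
}
```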

// SetAtlas set sets the pool's atlas. It is *not* safe to call this
// concurrently.
Member:
If it's not safe to call concurrently, make it only possible to do at construction time?

Member Author:
Unfortunately, we need to update the atlas each time the user registers a new type (which must happen at init). We could lock everything in place with a sync.Once on first use but that'd add a bit of overhead and I'm not sure if it's worth it.
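The "lock everything in place with a sync.Once on first use" idea can be sketched like this. All names are illustrative assumptions, not the go-ipld-cbor implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// atlasRegistry allows registrations until first use, then freezes.
// Registrations happen at init time; the first Get locks them in
// via sync.Once, after which further Registers are rejected.
type atlasRegistry struct {
	mu     sync.Mutex
	once   sync.Once
	types  []string
	frozen bool
}

// Register adds a type name; it fails once the registry is frozen.
func (r *atlasRegistry) Register(t string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.frozen {
		return fmt.Errorf("atlas frozen: register %s at init time", t)
	}
	r.types = append(r.types, t)
	return nil
}

// Get freezes the registry on first call and returns the type list.
func (r *atlasRegistry) Get() []string {
	r.once.Do(func() {
		r.mu.Lock()
		r.frozen = true
		r.mu.Unlock()
	})
	return r.types
}

func main() {
	r := &atlasRegistry{}
	_ = r.Register("Foo")
	_ = r.Get() // first use: registration is now frozen
	if err := r.Register("Bar"); err != nil {
		fmt.Println("late registration rejected")
	}
}
```

The overhead Stebalien mentions is the `once.Do` check on every `Get`; it is cheap (an atomic load after the first call) but not free.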

Member:
So make the caller register all types before starting to use it, and if they change their mind, they have to make a new instance of the whole pool.

I struggle with how else I'd use this in a semantically race-free way short of either A) doing the all-setup-before-use flow anyway, or B) mutexing the whole pool.

Member Author:

> So make the caller register all types before starting to use it, and if they change their mind, they have to make a new instance of the whole pool.

That's what we do.

Although, yeah, we should just replace the entire pool instead of having this method (it exists because it used to do more).
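Replacing the entire pool instead of mutating it could look roughly like this, publishing a fresh sync.Pool through an atomic.Value. All names here are illustrative assumptions, not the actual code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// currentPool holds a *sync.Pool. Reconfiguration stores a brand-new
// pool; callers pick it up on their next Get, and workers from the
// old pool are simply garbage collected.
var currentPool atomic.Value

// replacePool swaps in a fresh pool whose workers are built by newFn.
func replacePool(newFn func() interface{}) {
	currentPool.Store(&sync.Pool{New: newFn})
}

// getWorker fetches a worker from whichever pool is current.
func getWorker() interface{} {
	return currentPool.Load().(*sync.Pool).Get()
}

func main() {
	replacePool(func() interface{} { return "v1 worker" })
	fmt.Println(getWorker()) // prints "v1 worker"

	// Reconfigure by replacement, not mutation: the new pool starts
	// empty, so its New function supplies the next worker.
	replacePool(func() interface{} { return "v2 worker" })
	fmt.Println(getWorker()) // prints "v2 worker"
}
```

This keeps the hot path lock-free while confining the "not safe to call concurrently" hazard to a single atomic store.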

IIRC, pools are cleaned every GC cycle. However, my benchmarks aren't
actually *faster* with this change so I figured we might as well go for it. Our
issue here is really throughput, not latency, so even *if* these pools are
cleaned on GC, it's still useful.
@warpfork (Member):
So I had a long, deep teatime think about this, and a couple of convolutions came to mind that existed before this branch and aren't getting better. (That said, I'm also about to lament a bigger design issue with more inertia than what's changing here, so this whole comment shouldn't be construed as blocking this changeset if you want to merge it.)

  • We're trying to use pools to shave off memory allocations (good) and improve performance;
  • This gets a little complicated because we end up with an interface for reconfiguring the {un,}marshallers and cloners on the fly;
  • And that issue comes into being because RegisterCborType is (I assume purely from the function type, not even looking at the implementation) doing something stateful in global vars.

Therefore I'd like to revisit the RegisterCborType stuff and see if we can migrate that to another way of doing business. Passing down an atlas per WrapObject call seems like it might be a good direction. In addition to unkinking some statefulness and thus making a lot of other code flow easier, atlas parameters can even let you do fancy things like using the atlas as a self-documenting whitelist of what types are valid in a protocol.

Doing that change might make it harder to pool the {un,}marshallers and cloners again -- I'm not sure refmt exposes the ability to provide an atlas argument while also reusing the rest of the memory in a marshaller. But if that's a problem, that's a bug in refmt, and should be fixed rather than worked around.

(But again, this is commentary about bigger scope than this changeset. Not something that has to be hashed out in this branch.)
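The atlas-as-whitelist idea can be made concrete with a toy model. This `Atlas` is just a set of type names, a deliberate simplification for the sketch; refmt's real Atlas carries per-type transform entries, and every name here is hypothetical:

```go
package main

import "fmt"

// Atlas is a toy whitelist of type names allowed in a protocol.
type Atlas map[string]bool

// MergeAtlas combines a base atlas (e.g. one carrying an
// always-needed transformation like the CID one) with user entries.
func MergeAtlas(base, extra Atlas) Atlas {
	out := Atlas{}
	for t := range base {
		out[t] = true
	}
	for t := range extra {
		out[t] = true
	}
	return out
}

// Marshal refuses any value whose dynamic type isn't whitelisted,
// making the atlas a self-documenting allowlist.
func Marshal(atl Atlas, v interface{}) ([]byte, error) {
	t := fmt.Sprintf("%T", v)
	if !atl[t] {
		return nil, fmt.Errorf("type %s not in atlas", t)
	}
	return []byte(fmt.Sprintf("%v", v)), nil
}

func main() {
	atl := MergeAtlas(Atlas{"cid.Cid": true}, Atlas{"int": true})
	if _, err := Marshal(atl, "not whitelisted"); err != nil {
		fmt.Println(err)
	}
	out, _ := Marshal(atl, 42)
	fmt.Println(string(out)) // prints "42"
}
```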

We don't need to mutate, just replace.
@Stebalien (Member, Author):

> Therefore I'd like to revisit the RegisterCborType stuff and see if we can migrate that to another way of doing business. Passing down an atlas per WrapObject call seems like it might be a good direction. In addition to unkinking some statefulness and thus making a lot of other code flow easier, atlas parameters can even let you do fancy things like using the atlas as a self-documenting whitelist of what types are valid in a protocol.

Is there any way to merge/extend an atlas? We always need the CID transformation one.

IMO, we actually don't want the current system as it allows users to break, e.g., Resolve by registering invasive transformations. However, this starts getting into an interface overhaul. Ideally, we'd do very little when deserializing a block, lazily pathing and/or deserializing into some user-provided struct as needed.

@warpfork (Member) left a comment:
Okay, I had a chance to review this again, and I think basically we should go ahead and merge this: it's strictly better.

I made a few comments on benchmarks that need quick fixin'. 👼

The laments about some of the interface stand, but I think we're agreed on that. And after looking at it quite a while, yeah, it's gonna be a multi-step roadmap to getting everything I wish for, and this should definitely merge rather than wait :)

wrap_test.go Outdated
foo string
bar []byte
baz []int
}
Member:
Slightly consequential typo... These should all be exported...!

Right now, the len(node.RawData) on this is one, because it's serializing to an empty map.

}()
}
wg.Wait()
}
Member:
I'd like to propose one more benchmark here, for going the other direction, obj->cbor:

func BenchmarkDumpObject(b *testing.B) {
	obj := testStruct()
	for i := 0; i < b.N; i++ {
		bytes, err := DumpObject(obj)
		if err != nil {
			b.Fatal(err, bytes)
		}
	}
}

@warpfork (Member) commented Sep 7, 2018

So (once the fixture is fixed to be exported) here are some raw numbers "from my machine":

master:

BenchmarkWrapObject-8                      30000             46033 ns/op           50302 B/op        461 allocs/op
BenchmarkDecodeBlock-8                     50000             35428 ns/op           48237 B/op        395 allocs/op
BenchmarkWrapObjectParallel-8               2000           1194207 ns/op         5030224 B/op      46100 allocs/op
BenchmarkDecodeBlockParallel-8              2000            870421 ns/op         4823786 B/op      39500 allocs/op
BenchmarkDumpObject-8                     200000              5461 ns/op            1872 B/op         61 allocs/op

feat/refmt-pool branch:

BenchmarkWrapObject-8                      50000             30250 ns/op            9533 B/op        300 allocs/op
BenchmarkDecodeBlock-8                    100000             23534 ns/op            8058 B/op        306 allocs/op
BenchmarkWrapObjectParallel-8               2000            773224 ns/op          953807 B/op      30000 allocs/op
BenchmarkDecodeBlockParallel-8              2000            590684 ns/op          806067 B/op      30600 allocs/op
BenchmarkDumpObject-8                     300000              4993 ns/op            1272 B/op         31 allocs/op

These benchmarks were run with -benchmem (obviously) and with GOGC=off (to reduce the noise from inter-benchmark GC influence). The speedups are even larger when GC is on (not surprising, since the alloc count is better), and remain when -benchmem is off as well (also not surprising, but I like to sanity-check these things).

I'm both convinced this can be further improved, and simultaneously, it looks like it's already beating the non-refmt code on master in time, allocation size, and allocation count. More corpus would make for more confidence, but as it stands... That's cool.

Make sure we *actually* serialize/deserialize. Addresses @warpfork's CR.
@Stebalien (Member, Author):
Yeah, you're right. They're beating the current CBOR by a large margin. My benchmarks were completely broken.

@warpfork (Member) commented Sep 7, 2018

Benchmarking is a brutal thing. 🤕 It always seems to be roughly...

10: write benchmark
20: find interesting result
30: discover benchmark is actually wrong
40: goto 10

Fwiw, I did find a terribly interesting result from the old one-byte ones. All the results seemed to be equal-or-better, except for DecodeBlockParallel which was mysteriously worse on time, despite having less mem and the same alloc count. Something which disappeared in runs with the garbage collector disabled. So that's "fascinating"... apparently it's evidence that some kinds of gc load can be heavier than others. Isn't that neat? This is of course hypersuper irrelevant to our actual work, just thought I'd mention for giggles.

@Stebalien Stebalien merged commit cb4b1f0 into refmt Sep 7, 2018
@ghost ghost removed the status/in-progress In progress label Sep 7, 2018
@Stebalien Stebalien deleted the feat/refmt-pool branch September 7, 2018 17:37