share a pool of (un)marshallers #38

Merged (6 commits, Sep 7, 2018)
Conversation

@Stebalien (Member) commented Jul 17, 2018

Also, use a cloner instead of unmarshalling when wrapping an object.

This gives about a 2.5x speedup in my simple benchmark and probably saves the GC a lot of trouble.

Also, use a cloner instead of unmarshalling when wrapping an object.
@ghost ghost assigned Stebalien Jul 17, 2018
@ghost ghost added the status/in-progress In progress label Jul 17, 2018
Increasing `numWorkers` doesn't seem to help at all (not surprising). Really, we
could probably drop to NumCPUs (instead of 2x that) and we'd still be fine.
More than that doesn't help. +1 helps for the case where a goroutine holding a
worker gets unscheduled.
// PooledCloner is a thread-safe pooled object cloner.
type PooledCloner struct {
Count int
cloners chan refmt.Cloner
Member:
is this better than a sync.Pool?

Member Author:
I was under the impression that the elements in a pool get GCed every GC cycle. However, it does appear faster in the benchmarks and we really care more about throughput than latency so I've switched to pools.

}
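The `sync.Pool` variant the thread settles on might look roughly like this. It is a sketch under assumed names: `Cloner` here is a stand-in interface for illustration, not refmt's actual cloner type, and `PooledCloner` below is not the real implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// Cloner is a stand-in for refmt's cloner (hypothetical interface
// used only in this sketch).
type Cloner interface {
	Clone(src, dst interface{}) error
}

// funcCloner adapts a function to the Cloner interface.
type funcCloner func(src, dst interface{}) error

func (f funcCloner) Clone(src, dst interface{}) error { return f(src, dst) }

// PooledCloner hands out cloners from a sync.Pool. Idle cloners can
// be reclaimed at GC time rather than pinning memory forever, and
// since the goal here is throughput rather than latency, paying to
// rebuild one occasionally is acceptable.
type PooledCloner struct {
	pool sync.Pool
}

func NewPooledCloner(mk func() Cloner) *PooledCloner {
	return &PooledCloner{pool: sync.Pool{
		New: func() interface{} { return mk() },
	}}
}

func (p *PooledCloner) Clone(src, dst interface{}) error {
	c := p.pool.Get().(Cloner)
	defer p.pool.Put(c)
	return c.Clone(src, dst)
}

func main() {
	p := NewPooledCloner(func() Cloner {
		return funcCloner(func(src, dst interface{}) error {
			*dst.(*int) = *src.(*int)
			return nil
		})
	})
	a, b := 42, 0
	if err := p.Clone(&a, &b); err != nil {
		panic(err)
	}
	fmt.Println(b)
}
```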

// SetAtlas set sets the pool's atlas. It is *not* safe to call this
// concurrently.
Member:
If it's not safe to call concurrently, make it only possible to do at construction time?

Member Author:
Unfortunately, we need to update the atlas each time the user registers a new type (which must happen at init). We could lock everything in place with a sync.Once on first use but that'd add a bit of overhead and I'm not sure if it's worth it.
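The "lock everything in place with a sync.Once on first use" idea can be sketched like this. All names are illustrative assumptions, not the go-ipld-cbor implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// atlasRegistry allows registrations until first use, then freezes.
// Registrations happen at init time; the first Get locks them in
// via sync.Once, after which further Registers are rejected.
type atlasRegistry struct {
	mu     sync.Mutex
	once   sync.Once
	types  []string
	frozen bool
}

// Register adds a type name; it fails once the registry is frozen.
func (r *atlasRegistry) Register(t string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.frozen {
		return fmt.Errorf("atlas frozen: register %s at init time", t)
	}
	r.types = append(r.types, t)
	return nil
}

// Get freezes the registry on first call and returns the type list.
func (r *atlasRegistry) Get() []string {
	r.once.Do(func() {
		r.mu.Lock()
		r.frozen = true
		r.mu.Unlock()
	})
	return r.types
}

func main() {
	r := &atlasRegistry{}
	_ = r.Register("Foo")
	_ = r.Get() // first use: registration is now frozen
	if err := r.Register("Bar"); err != nil {
		fmt.Println("late registration rejected")
	}
}
```

The overhead Stebalien mentions is the `once.Do` check on every `Get`; it is cheap (an atomic load after the first call) but not free.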

Member:
So make the caller register all types before starting to use it, and if they change their mind, they have to make a new instance of the whole pool.

I struggle with how else I'd use this in a semantically race-free way short of either A) doing the all-setup-before-use flow anyway, or B) mutexing the whole pool.

Member Author:

> So make the caller register all types before starting to use it, and if they change their mind, they have to make a new instance of the whole pool.

That's what we do.

Although, yeah, we should just replace the entire pool instead of having this method (it exists because it used to do more).
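Replacing the entire pool instead of mutating it could look roughly like this, publishing a fresh sync.Pool through an atomic.Value. All names here are illustrative assumptions, not the actual code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// currentPool holds a *sync.Pool. Reconfiguration stores a brand-new
// pool; callers pick it up on their next Get, and workers from the
// old pool are simply garbage collected.
var currentPool atomic.Value

// replacePool swaps in a fresh pool whose workers are built by newFn.
func replacePool(newFn func() interface{}) {
	currentPool.Store(&sync.Pool{New: newFn})
}

// getWorker fetches a worker from whichever pool is current.
func getWorker() interface{} {
	return currentPool.Load().(*sync.Pool).Get()
}

func main() {
	replacePool(func() interface{} { return "v1 worker" })
	fmt.Println(getWorker()) // prints "v1 worker"

	// Reconfigure by replacement, not mutation: the new pool starts
	// empty, so its New function supplies the next worker.
	replacePool(func() interface{} { return "v2 worker" })
	fmt.Println(getWorker()) // prints "v2 worker"
}
```

This keeps the hot path lock-free while confining the "not safe to call concurrently" hazard to a single atomic store.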

IIRC, pools are cleaned every GC cycle. However, my benchmarks aren't
actually *faster* with this change so I figured we might as well go for it. Our
issue here is really throughput, not latency, so even *if* these pools are
cleaned on GC, it's still useful.
@warpfork (Member):
So I had a long, deep teatime think about this, and a couple of convolutions came to mind that existed before this branch and aren't getting better. (That said, I'm also about to lament a bigger design issue with more inertia than what's changing here, so this whole comment shouldn't be construed as blocking this changeset if you want to merge it.)

  • We're trying to use pools to shave off memory allocations (good) and improve performance;
  • This gets a little complicated because we end up with an interface for reconfiguring the {un,}marshallers and cloners on the fly;
  • And that issue comes into being because RegisterCborType is (I assume purely from the function type, not even looking at the implementation) doing something stateful in global vars.

Therefore I'd like to revisit the RegisterCborType stuff and see if we can migrate that to another way of doing business. Passing down an atlas per WrapObject call seems like it might be a good direction. In addition to unkinking some statefulness and thus making a lot of other code flow easier, atlas parameters can even let you do fancy things like using the atlas as a self-documenting whitelist of what types are valid in a protocol.

Doing that change might make it harder to pool the {un,}marshallers and cloners again -- I'm not sure refmt exposes the ability to provide an atlas argument while also reusing the rest of the memory in a marshaller. But if that's a problem, that's a bug in refmt, and should be fixed rather than worked around.

(But again, this is commentary about bigger scope than this changeset. Not something that has to be hashed out in this branch.)
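The atlas-as-whitelist idea can be made concrete with a toy model. This `Atlas` is just a set of type names, a deliberate simplification for the sketch; refmt's real Atlas carries per-type transform entries, and every name here is hypothetical:

```go
package main

import "fmt"

// Atlas is a toy whitelist of type names allowed in a protocol.
type Atlas map[string]bool

// MergeAtlas combines a base atlas (e.g. one carrying an
// always-needed transformation like the CID one) with user entries.
func MergeAtlas(base, extra Atlas) Atlas {
	out := Atlas{}
	for t := range base {
		out[t] = true
	}
	for t := range extra {
		out[t] = true
	}
	return out
}

// Marshal refuses any value whose dynamic type isn't whitelisted,
// making the atlas a self-documenting allowlist.
func Marshal(atl Atlas, v interface{}) ([]byte, error) {
	t := fmt.Sprintf("%T", v)
	if !atl[t] {
		return nil, fmt.Errorf("type %s not in atlas", t)
	}
	return []byte(fmt.Sprintf("%v", v)), nil
}

func main() {
	atl := MergeAtlas(Atlas{"cid.Cid": true}, Atlas{"int": true})
	if _, err := Marshal(atl, "not whitelisted"); err != nil {
		fmt.Println(err)
	}
	out, _ := Marshal(atl, 42)
	fmt.Println(string(out)) // prints "42"
}
```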

We don't need to mutate, just replace.
@Stebalien (Member, Author):

> Therefore I'd like to revisit the RegisterCborType stuff and see if we can migrate that to another way of doing business. Passing down an atlas per WrapObject call seems like it might be a good direction. In addition to unkinking some statefulness and thus making a lot of other code flow easier, atlas parameters can even let you do fancy things like using the atlas as a self-documenting whitelist of what types are valid in a protocol.

Is there any way to merge/extend an atlas? We always need the CID transformation one.

IMO, we actually don't want the current system as it allows users to break, e.g., Resolve by registering invasive transformations. However, this starts getting into an interface overhaul. Ideally, we'd do very little when deserializing a block, lazily pathing and/or deserializing into some user-provided struct as needed.

@warpfork (Member) left a comment:
Okay, I had a chance to review this again, and I think basically we should go ahead and merge this: it's strictly better.

I made a few comments on benchmarks that need quick fixin'. 👼

The laments about some of the interface stand, but I think we're agreed on that. And after looking at it quite a while, yeah, it's gonna be a multi-step roadmap to getting everything I wish for, and this should definitely merge rather than wait :)

wrap_test.go Outdated
foo string
bar []byte
baz []int
}
Member:
Slightly consequential typo... These should all be exported...!

Right now, the len(node.RawData) on this is one, because it's serializing to an empty map.

}()
}
wg.Wait()
}
Member:
I'd like to propose one more benchmark here, for going the other direction, obj->cbor:

func BenchmarkDumpObject(b *testing.B) {
	obj := testStruct()
	for i := 0; i < b.N; i++ {
		bytes, err := DumpObject(obj)
		if err != nil {
			b.Fatal(err, bytes)
		}
	}
}

@warpfork (Member) commented Sep 7, 2018

So (once the fixture is fixed to be exported) here are some raw numbers "from my machine":

master:

BenchmarkWrapObject-8                      30000             46033 ns/op           50302 B/op        461 allocs/op
BenchmarkDecodeBlock-8                     50000             35428 ns/op           48237 B/op        395 allocs/op
BenchmarkWrapObjectParallel-8               2000           1194207 ns/op         5030224 B/op      46100 allocs/op
BenchmarkDecodeBlockParallel-8              2000            870421 ns/op         4823786 B/op      39500 allocs/op
BenchmarkDumpObject-8                     200000              5461 ns/op            1872 B/op         61 allocs/op

feat/refmt-pool branch:

BenchmarkWrapObject-8                      50000             30250 ns/op            9533 B/op        300 allocs/op
BenchmarkDecodeBlock-8                    100000             23534 ns/op            8058 B/op        306 allocs/op
BenchmarkWrapObjectParallel-8               2000            773224 ns/op          953807 B/op      30000 allocs/op
BenchmarkDecodeBlockParallel-8              2000            590684 ns/op          806067 B/op      30600 allocs/op
BenchmarkDumpObject-8                     300000              4993 ns/op            1272 B/op         31 allocs/op

These benchmarks were run with -benchmem (obviously) and with GOGC=off (to reduce the noise from inter-benchmark GC influence). The speedups are even larger when GC is on (not surprising, since the alloc count is better), and remain when -benchmem is off as well (also not surprising, but I like to sanity-check these things).

I'm both convinced this can be further improved, and simultaneously, it looks like it's already beating the non-refmt code on master in time, allocation size, and allocation count. More corpus would make for more confidence, but as it stands... That's cool.

Make sure we *actually* serialize/deserialize. Addresses @warpfork's CR.
@Stebalien (Member, Author):
Yeah, you're right. They're beating the current CBOR by a large margin. My benchmarks were completely broken.

@warpfork (Member) commented Sep 7, 2018

Benchmarking is a brutal thing. 🤕 It always seems to be roughly...

10: write benchmark
20: find interesting result
30: discover benchmark is actually wrong
40: goto 10

Fwiw, I did find a terribly interesting result from the old one-byte ones. All the results seemed to be equal-or-better, except for DecodeBlockParallel which was mysteriously worse on time, despite having less mem and the same alloc count. Something which disappeared in runs with the garbage collector disabled. So that's "fascinating"... apparently it's evidence that some kinds of gc load can be heavier than others. Isn't that neat? This is of course hypersuper irrelevant to our actual work, just thought I'd mention for giggles.

@Stebalien Stebalien merged commit cb4b1f0 into refmt Sep 7, 2018
@ghost ghost removed the status/in-progress In progress label Sep 7, 2018
@Stebalien Stebalien deleted the feat/refmt-pool branch September 7, 2018 17:37