
Fresh take on codec APIs, and some tokenization utilities. #101

Merged: 7 commits into master from codectools-tokenizers, Nov 14, 2020

Conversation

warpfork (Collaborator):

The tokenization system may look familiar to refmt's tokens -- and
indeed it surely is inspired by and in the same pattern -- but it
hews a fair bit closer to the IPLD Data Model definitions of kinds,
and it also includes links as a token kind. Presence of link as
a token kind means that if we build codecs around these, the handling
of links will be better and more consistently abstracted (the
current dagjson and dagcbor implementations are instructive for what
an odd mess it is when you have most of the tokenization happen
before you get to the level that figures out links; I think we can
improve on that code greatly by moving the barriers around a bit).

I made both all-at-once and pumpable versions of both the token
producers and the token consumers. Each is useful in different
scenarios. The pumpable versions are probably generally a bit slower,
but they're also more composable. (The all-at-once versions can't
be glued to each other; only to pumpable versions.)

Some new and much reduced contracts for codecs are added,
but not yet implemented by anything in this commit.
The comments on them are lengthy and detail the ways I'm thinking
that codecs should be (re)implemented in the future to maximize
usability and performance and also allow some configurability.
(The current interfaces "work", but irritate me a great deal every
time I use them; to be honest, I just plain guessed badly at what
the API here should be the first time I did it. Configurability
should be both easy to not engage in, but also easier if you do
(and in particular, not require reaching to another library's
packages to do it!).) More work will be required to bring this
to fruition.

It may be particularly interesting to notice that the tokenization
systems also allow complex keys -- maps and lists can show up as the
keys to maps! This is something not allowed by the data model (and
for dare I say obvious reasons)... but it's something that's possible
at the schema layer (e.g. structs with representation strategies that
make them representable as strings can be used as map keys), so,
these functions support it.
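
As a rough illustration of the all-at-once versus pumpable distinction (a sketch only; the type and method names below are hypothetical stand-ins, not the interfaces added in this PR):

```go
package tokensketch

// Token stands in for the real token type; only the shape of the two styles matters here.
type Token struct {
	Kind byte
}

// All-at-once: the producer owns the loop and pushes every token at a callback.
// Convenient, but it can't pause partway, so it can only be glued to a consumer
// that's willing to be driven (i.e. a pumpable one).
func WalkAllTokens(tokens []Token, visit func(Token) error) error {
	for _, tk := range tokens {
		if err := visit(tk); err != nil {
			return err
		}
	}
	return nil
}

// Pumpable: the caller owns the loop and pulls one token per call, so this can
// be composed with another pumpable gadget (or fed by an all-at-once producer).
type TokenReader struct {
	tokens []Token
	i      int
}

func (tr *TokenReader) ReadToken() (Token, bool) {
	if tr.i >= len(tr.tokens) {
		return Token{}, false
	}
	tk := tr.tokens[tr.i]
	tr.i++
	return tk, true
}
```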

// and is the equivalent of creating a zero-value ReusableEncoder (aka, default config)
// and using its Encode method.
// This package-scope function will typically also internally use a sync.Pool
// to keep some ReusableEncoder values on hand to avoid unnecessary allocations.

Contributor:

Why use a sync.Pool instead of an Encoder implementation type with an Encode method, turning the type Encoder here into an interface?

That would also make ReusableEncoder non-necessary, as far as I can tell.

warpfork (Collaborator, Author):

The idea of having a sync.Pool is mostly in service to the goal of convenience for that package-scope function. Some applications have logic where keeping one Encoder around and reusing it is easy; some don't. I want a handy function that will usually DTRT and reuse an Encoder in an alloc-efficient way, without bothering me about it and making me carry something around. (And this was found to be useful before in one of the generations of libraries I consider a learning experience to feed into this one. [‡])

Could we make a package-scope convenience method that's convenient and just ignores the alloc cost of creating a new ReusableEncoder every time? Well, yeah. Don't wanna though :) At least, I think I don't. sync.Pool is supposed to be pretty cheap, is my understanding. I guess this'll deserve re-checking when more of the code that this comment suggests actually gets implemented, though.

Regardless, ReusableEncoder will still need to be a concrete type for a couple of other reasons: namely, because it's often going to export codec-specific configuration fields, and because those things are codec-specific, it's not really possible to totally hide it behind any general interface.

(Suggesting that this concrete type should have "Reusable" right in the type name might be a little distracting? Maybe I should drop that word. It's something I'm concerned about -- ReusableEncoder tends to have at least one userland "stack" in it; I want them to be resettable and reusable in a way that zeros the length but retains the capacity of any such "stack" slice or other working memory buffers -- but maybe that detail should stay in my head or in docs rather than shouted in the type name.)

[‡] - there's probably a half-dozen red herrings in that older library PR too, admittedly. For example, I'm pretty sure the new encoders and decoders I'll be writing to go with this package will have many fewer allocations to initialize than that older library did, which may result in the performance impact of the pooling being less starkly visible. Still, I think overall I'm going to start by assuming the older library had a point.
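
For concreteness, here's a minimal sketch of how such a package-scope function could sit on top of a sync.Pool. The ReusableEncoder body and the exact Encode signature are placeholders rather than the API from this PR; only the get/reset/put pattern is the point.

```go
package codecexample

import (
	"io"
	"sync"

	ipld "github.com/ipld/go-ipld-prime"
)

// ReusableEncoder stands in for a codec's concrete, configurable encoder type.
// Its zero value means "default config".
type ReusableEncoder struct {
	// codec-specific configuration fields would live here.
}

func (e *ReusableEncoder) Encode(n ipld.Node, w io.Writer) error {
	// real encoding logic would go here.
	return nil
}

func (e *ReusableEncoder) Reset() {
	// zero the lengths of any internal working buffers while keeping their capacity.
}

var encoderPool = sync.Pool{
	New: func() interface{} { return &ReusableEncoder{} },
}

// Encode is the package-scope convenience function: equivalent to using a
// zero-value ReusableEncoder, but it reuses pooled encoders to avoid
// unnecessary allocations.
func Encode(n ipld.Node, w io.Writer) error {
	e := encoderPool.Get().(*ReusableEncoder)
	defer func() {
		e.Reset()
		encoderPool.Put(e)
	}()
	return e.Encode(n, w)
}
```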

"github.com/ipld/go-ipld-prime"
)

type Token struct {

Contributor:

I'm not a fan of the package name, because then the qualified names are a bit weird; codectools.Token sounds to me like it should be codec.Token.

I guess this is less important if the end user shouldn't see codectools, but then I would argue this should be an internal package.

warpfork (Collaborator, Author):

Agree, this is way too much of a mouthful, especially for the Token stuff.

Gonna run with it a while longer, because I've stacked up so much more code atop this already, but some rename or reshuffle here would be an improvement.

TokenKind_Float = 'f'
TokenKind_String = 's'
TokenKind_Bytes = 'x'
TokenKind_Link = '/'

Contributor:

do we really need the TokenKind_ prefixes? I don't think codectools.ListOpen or codectools.Link would be ambiguous or conflict with anything else.

warpfork (Collaborator, Author):

I tend to write enums with prefixes like this for two reasons:

  • I like having a consistent rule about how enums look (especially since there's not much compiler support around them)
  • I lean heavily on autocomplete to keep my attention rolling in the right direction as I type, so a lack of shared prefix is fairly derailing to me.

Maybe shifting these into a package with fewer things in it would make shaving the name down more palatable. Part of the shared-prefix concern is that things that have shared prefixes and are at radically different abstraction levels are especially derailing to my brain, so if the prefix is only the package name, in this case that hits hard; if the package had fewer residents, it would hit less hard.

warpfork (Collaborator, Author):

Alternatively, I did this with a much shorter prefix -- just 'T' -- previously. Is that more pleasing?

warpfork (Collaborator, Author):

In standard lib, reflect.Kind is readily comparable, and goes for no prefix. But I'd also count this as an example of an API that did this that I'm personally not at all fond of even after prolonged use.

Contributor:

Splitting an enum into its own package definitely feels worse :) Otherwise we'd end up with hundreds of packages in ipld-prime alone.

I generally disagree that the language/compiler has little support for enums. Giving the constants a specific type, like const Bytes TokenKind = 'x', is plenty in my opinion. Then any IDE should be able to autocomplete Bytes for a value of type TokenKind.

I guess a T prefix is at least shorter, but I'm still opposed to a prefix in general unless it's strictly needed.
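
For illustration, the prefix-free typed-constant style being suggested might look like this. Only the four values visible in the diff above ('f', 's', 'x', '/') come from the PR; the other names and values are placeholders.

```go
package codectools

// TokenKind is the discriminator field of Token.
type TokenKind uint8

// Giving each constant an explicit TokenKind type lets the compiler reject
// mix-ups and lets an IDE autocomplete the values wherever a TokenKind is
// expected, without a shared name prefix.
const (
	MapOpen   TokenKind = '{'
	MapClose  TokenKind = '}'
	ListOpen  TokenKind = '['
	ListClose TokenKind = ']'
	Null      TokenKind = '0'
	Bool      TokenKind = 'b'
	Int       TokenKind = 'i'
	Float     TokenKind = 'f'
	String    TokenKind = 's'
	Bytes     TokenKind = 'x'
	Link      TokenKind = '/'
)
```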

stk nodeTokenizerStack
}

func (nt *NodeTokenizer) Initialize(n ipld.Node) {

Contributor:

why not just have a constructor instead? you seem to pretty much always declare a variable and then initialize it, which is a roundabout way to construct a tokenizer

warpfork (Collaborator, Author):

I'm aiming for all of these things to be embeddable. Initializer patterns seem to work better than constructors for things that are embedded.

One of my major lessons learned from refmt is that when you want to compose a bunch of these gadgets -- and you almost always want to pair up at least two things, one to produce and one to consume -- it's very desirable to be able to put them both in one struct, so you can allocate them together in one step (and reset them together in one step, later, too). Refmt didn't do super well at this: because I didn't design from the start with that goal, I ended up with gadgets that did "just one" allocation when they were created... but by the time you put everything together that was needed to serve a typical end-user purpose, several such "just one" allocations had snuck in, somehow. So, I'm trying to hew really hard to embeddable gadgets as much as possible from the start, this time around.

I guess it's possible to have constructors that return a value (and not the pointer), and assign those returned values over embedded fields, and get the desired effect.

But somehow things just seem to shake out better when you use explicit Init and Reset methods on a pointer receiver when aiming to do embed-friendly stuff, IME.
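
A minimal sketch of the embed-friendly Init/Reset pattern being described; aside from the NodeTokenizer.Initialize signature quoted above, the types and fields here are hypothetical.

```go
package codectools

import ipld "github.com/ipld/go-ipld-prime"

type NodeTokenizer struct {
	node ipld.Node
	// the working "stack" would live here.
}

func (nt *NodeTokenizer) Initialize(n ipld.Node) { nt.node = n }

type TokenSink struct {
	buf []byte
}

func (ts *TokenSink) Initialize() { ts.buf = ts.buf[:0] }

// Because both gadgets expose Initialize on pointer receivers, a composite can
// embed them by value: constructing the composite is one allocation, and a
// single Reset re-initializes every part in one step.
type Composite struct {
	tokenizer NodeTokenizer
	sink      TokenSink
}

func (c *Composite) Reset(n ipld.Node) {
	c.tokenizer.Initialize(n)
	c.sink.Initialize()
}
```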

)

var tokenFixtures = []struct {
value ipld.Node

Contributor:

it would be good to include name string here, like name: "ScalarString", so that these can be run as properly named subtests later on.

a codebase with lots of table-driven tests is much harder to maintain and test if the test cases aren't named and run via sub-tests :)
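
A sketch of what that could look like, assuming a hypothetical name field is added to the fixture table and the cases are run via t.Run:

```go
package codectools

import (
	"testing"

	ipld "github.com/ipld/go-ipld-prime"
)

var tokenFixtures = []struct {
	name  string // lets each case run as a properly named subtest
	value ipld.Node
}{
	// e.g. {name: "ScalarString", value: ...},
}

func TestTokenFixtures(t *testing.T) {
	for _, tt := range tokenFixtures {
		tt := tt
		t.Run(tt.name, func(t *testing.T) {
			_ = tt.value // exercise the tokenizer against tt.value here.
		})
	}
}
```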

type Token struct {
Kind TokenKind

Length int // Present for MapOpen or ListOpen. May be -1 for "unknown" (e.g. a json tokenizer will yield this).

Member:

Can someone talk me through the Go memory model decision process around structs like this that are essentially unions. If you make a new Token you usually only want to set Kind and one of the other properties, yet you have a memory layout that includes space for all of these things. An alternative might be to make everything a pointer, but then you allocate pointer space for each of them and you end up using the heap for the values. Is that the essential trade-off here, and where you have such a choice you should just choose this style? I've run into this a number of times and using pointers always feels more correct to me because I'm not putting anything into the fields I don't care about. But I suppose that's not the best way of thinking about it?

warpfork (Collaborator, Author):

Yep, I think you've got it already.

If Go had unions -- either discriminated or even nasty dirty unsafe undiscriminated -- I'd use 'em in a heartbeat. Unfortunately, it does not.

If one tries to unify the space cost by using any kind of interface{}, you end up forcing allocations to box up the values. (Interfaces have to be two words: one word is a pointer to the type info, and the other word is a pointer to the value. That means you have to get a pointer from somewhere. And at present: yes, for everything: It's a pointer to the value even if the value is a single word and could just as well fit without a pointer indirection. Uuffah. (I've heard talk future versions of Go might improve this situation for word-sized types, but it hasn't happened yet. And wouldn't entirely absolve our issues here either, since byte slices and strings are both multi-word too.))

("Forcing" allocations might be a little strong: if you can get a pointer to data that's already in the heap (or inside some other data that's already on the heap), then you can get that pointer without a new allocation. But this generally requires some careful logic... and while sometimes that careful logic is worth the effort, it turns out to be pretty hard to apply in practice around how these tokens get used, IME.)

If one rewrote this structure with the same fields but all pointers (and not trying to unify stuff into a single interface{} field), the struct would get about three words shorter ([]byte is three words while *[]byte is one; string is two words while *string is one; the rest are all already one word and would still be one word as a pointer; except, okay, Link is also still two words and would remain two words because you wouldn't turn it into a pointer-to-an-interface since interfaces can already contain pointers). So a lot of space would still be reserved. And you'd indeed still have the same issues with getting a pointer tending to provoke an allocation.

You'll see people still use pointers for optional fields in golang fairly often, but usually the reason for this is that they're wanting to be able to use nil as a distinct sentinel value. (And when you see this, implicitly, the author also either didn't know or just didn't care about the allocations provoked. Which -- sometimes, is fair! But in this context, we really do care; this structure is getting used in hot loops in parsers, so allocations are extremely, extremely noticeable.) A lot of other programming languages try to encapsulate this concept with a type called Maybe or Optional or Either or something like that just to make it clear when the goal is to have a sentinel value -- and we've done it ourselves with the optional keyword in IPLD Schemas -- but Golang doesn't have any of these, so people frequently overload this semantic onto making something into a pointer even though it otherwise wouldn't be.

(Sorry for the wall of text; you asked about the one thing that touches all three of my biggest beefs about Golang -- lack of unions, avoidable heap allocations for word-sized residents of interfaces, and the pointer/optional bamboozle -- at once, ahheh.)
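
A sketch of the trade-off in code. The field names beyond Kind and Length are guesses for illustration, and the word counts in the comments assume a 64-bit platform.

```go
package codectools

import ipld "github.com/ipld/go-ipld-prime"

type TokenKind uint8

// Union-style: every variant gets its own inline field. The struct is several
// words wider than any single variant needs, but filling it in never allocates,
// so a decoder can overwrite the same Token value on every pass of a hot loop.
type Token struct {
	Kind   TokenKind
	Length int       // for MapOpen / ListOpen; may be -1 for "unknown"
	Bool   bool
	Int    int64
	Float  float64
	Str    string    // two words
	Bytes  []byte    // three words
	Link   ipld.Link // an interface, so already two words
}

// Boxed alternative: one interface{} field keeps the struct narrow, but storing
// a string, int, or float in it forces the value onto the heap -- exactly the
// per-token allocation the parser's hot loop needs to avoid.
type boxedToken struct {
	Kind  TokenKind
	Value interface{}
}
```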

@rvagg (Member) commented Nov 11, 2020:

I really like the TokenAssemble() and TokenWalk() patterns in here and would love to try and use them in the dag-pb code. I've essentially already manually written custom versions of these and custom tokens just for dag-pb to do tokenised decode and encode. But it'd be great to consume something directly from ipld-prime and not have to bother with all of the excessively verbose assembling and error checking (oh man that's verbose).

@warpfork (Collaborator, Author):

If this can save you work in implementing a fresh new dag-pb codec, that will be awesome. I'm pretty confident I know where I'm going for making something reusable between cbor and json, but if this makes it to reusable for dag-pb this time too? Whewww that'll be cool.

@warpfork (Collaborator, Author) commented Nov 13, 2020:

Thinking out loud about the package name/tree:

What we've got here is a bunch of transformers which are useful for turning node trees into linear (but not encoded!) sequences, and vice versa. What's interesting about that is it's not just applicable to codecs (though it certainly and often is -- one gets a codec when combining these things with the wire encoder and decoder): one could also imagine building generalized data transform functions which operate by streaming application on a sequence like this. Those functions might have no intention of doing anything relating to encoding or decoding and still find this abstraction useful.

So maybe that's a hint towards what would make a good package name: it's not codectools at all (or at least, none of these current diff contents are), and it doesn't belong under the codec/* tree; these are...

Okay, this intuition still doesn't answer the naming question concretely. treestreamers? foldunfolds? I can't think of a good word that describes both directions at once. ("codec" is a portmanteau of "encode" and "decode", which is a fabulous word... but doesn't quite fit for this situation, if what we want to emphasize about these functions is that they know nothing about actual encoded bytes.)

Anyway, maybe this insight will help name this better in the future, and it does seem that it would make sense as a top level package, and not tucked under codec/*. I'll leave it where it is for now rather than get hung up on this naming (and also because I'll have another PR coming down the pipeline shortly that already built on this), but we'll probably expect to move it in the future.

We definitely did make a TokenWalker, heh.

The other naming marsh (heh, see what I did there?) is still unresolved
but can stay unresolved a while longer.
You can write a surprising amount of code where the compiler will shrug
and silently coerce things for you.  Right up until you can't.
(Some test cases that'll be coming down the commit queue shortly
happened to end up checking the type of the constants, and, well.
Suddenly this was noticeable.)
There were already comments about how this would be "probably"
necessary; I don't know why I wavered, it certainly is.
This is far too useful in testing to reproduce in each package that
needs something like it.  It's already shown up as desirable again
as soon as I start implementing even a little bit of even one codec
tokenizer, and that's gonna keep happening.

This might be worth moving to some kind of a 'tokentest' or
'codectest' package instead of cluttering up this one, but...
we'll see; I've got a fair amount more code to flush into commits,
and after that we can reshake things and see if packages settle
out differently.
Useful for tests that do deep equality tests on structures.

Same caveat about current placement of this method as in the previous
commit: this might be worth detaching and shifting to a 'codectest'
or 'tokentest' package.  But let's see how it shakes out.
These aren't exercised yet -- and this is accordingly still highly
subject to change -- but so far in developing this package, the pattern
has been "if I say maybe this should have X", it's always turned out
it indeed should have X.  So let's just do that and then try it out,
and have the experimental code instead of the comments.

@warpfork (Collaborator, Author):

Okay -- this is still improvable (and the comments about naming and package path also stand; these features will probably move at least once in the near future before settling!) -- but I'm going to go ahead and merge this. I've got at least one more PR to launch which will build upon this, and I think by reviewing that one and carrying this codec revamp further there, we'll get a lot more insight into the final shape we want for all of these packages. So: Onward!

@warpfork warpfork merged commit 624fae0 into master Nov 14, 2020
@warpfork warpfork deleted the codectools-tokenizers branch November 14, 2020 15:35
@warpfork warpfork mentioned this pull request Nov 14, 2020
warpfork added a commit that referenced this pull request Dec 1, 2020
The original idea of this branch was to explore some reusable
components for codecs; maybe to actually get a useful
prettyprinter; and... it actually turned largely into a bit of
practical discovery about string encoding and escaping.

Some of those goals were partially accomplished, but in general,
this seems better to chalk up as a learning experience.
#89 (comment)
already does a good job of discussing why, and what was learned.

A lot of the reusable codec components stuff has also now
shaken out, just... in other PRs that were written after the learnings
here.  Namely, #101
was able to introduce some tree transformers; and then
#112 demonstrates how those
can compose into a complete codec end to end.
There's still work to go on these, too, but they seem to have already
grabbed the concept of reusable parts I was hoping for here and gotten
farther with it, so.

These diffs are interesting enough I want to keep them referencable
in history.  But I'm merging them with the "-s ours" strategy, so that
the diffs don't actually land any impact on master.  These commits
are for reference only.
@aschmahmann aschmahmann mentioned this pull request May 14, 2021