Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce pretty printing tool. #89

Closed
wants to merge 0 commits into from
Closed

Introduce pretty printing tool. #89

wants to merge 0 commits into from

Conversation

warpfork
Copy link
Collaborator

@warpfork warpfork commented Oct 1, 2020

Introduce a pretty printing tool.

I think this, or a variant of it, may be reasonable to rig up as a Stringer on the basicnode types, and recommend for other Node implementations to use as their Stringer too.

It's a fairly verbose output: I'm mostly aiming to use it in examples. There's a lot of information that's redundant.

Bytes in particular are fun: I decided to make them use the hex.Dump format. (Why not?)

I've put this in a codec sub-package, because in some ways it feels like a codec (I've corrected this, it was a huge mistake; it's now its own packge outside of the codec subtree) -- it's something you can apply to any node, and it streams data out to an io.Writer -- but it's also worth noting it's not meant to be a multicodec or generally written with an intention of use anywhere outside of debug printf sorts of uses.

The codectools package also introduced here, although it only has this one user, is a reaction to previous scenarios where I've wanted a quick debug method and desperately wanted something that gives me reasonable quoted strings... without reaching for a json package. We'll see if it's worth it over time; I'm betting yes, but not with infinite confidence. (This particular string escaping function it provides also has some particular details that make it distinctive from using json: it has the benefit of encoding even non-utf-8 strings without loss of information -- which is noteworthy, because I've recently noticed JSON does not; and I feel that's a bit yikes.)

This could be a solution to #88 . However, I've not yet hooked anything up to use this as a Stringer.

You can look at the fluent/reflect_example_test.go file for what this looks like to use.

@willscott
Copy link
Member

I wonder if you couldn't manage to fit something semantically into a json or toml or such format, while still achieving these semantics. perhaps by sticking the extra type information in comments or in a _type key or some-such. inventing a new format, even just for display, seems potentially hazardous

@warpfork
Copy link
Collaborator Author

warpfork commented Oct 1, 2020

I think the answer to "can I cram X into Y" is, generally, "yes".

The question is whether it results in ergonomics that are pleasant, and (in this case) whether it communicates the thing I want to communicate.

(Could we reserve magic keys? Sure. But it would be broken for some data which conflicts with the magic keys, despite that data being valid, so it would communicate horribly backwards things about how seriously we take isomorphism and completeness in this project.)

(Could we double the depth of the whole tree, and cram metadata in ever odd-numbered depth? Definitely. But it would start to suck horribly for human reading in demos and debug printf'ing, which is a primary objective here.)

And technicalities aside, a goal I have for a pretty printer is to be very obviously not aimed for being parseable again. It should scream "I'm a debug dump!" and ooze "This is not a serious codec!" from every single line.

I dunno, maybe I'll regret this in the light of the morning... but at the moment I'm feeling that using a real codec as the debug printf format would be a mistake even if it wasn't technically problematic. It might convince an incautious user that doing fmt.Printf("%v", theNode) and assuming the result is parseable is a valid assumption -- or worse, that the codec used for that is somehow the "natural" or "preferred" codec, which is an antigoal for IPLD -- and that seems like a potential misunderstanding that should be vanquished as early as possible.

@warpfork
Copy link
Collaborator Author

warpfork commented Oct 1, 2020

I should also add that this currently doesn't implement elision of deep nodes, or elisions of the middle of long lists, long bytes or strings, or big maps, but I think it probably should. This would be useful for when you want a quick check of "wtf is this variable" as you're hacking, but don't want to overwhelm your terminal.

I don't know whether we'd want to default any of those elisions to "on" in if we started using this (or something like it) as Stringers for nodes, but... maybe, yes.

@warpfork
Copy link
Collaborator Author

warpfork commented Oct 1, 2020

I'm not at all attached to this particular set of concatenations, either. (Saying "map node {" every time vs just saying { is 100% redundant, purely arbitrary, and utterly indefensible.) I just don't want to move the goalposts for this to "full, specified, bidirectional, widely support codec". That's... a bigger thing than than I'd toss out in a one-afternoon PR. :)

@rvagg
Copy link
Member

rvagg commented Oct 2, 2020

This is quite nice for %v, but I think I'm going to be reaching for a dag-json more than this tbh. What I'd really like is a solid dag-json and maybe a dag-json-pretty option too, that'd be really sweet for test fixtures and debugging. Maybe something I can work on. I know you have a stalled/dagjson-bytes branch on the go, perhaps I could/should pick that up?

Copy link
Contributor

@mvdan mvdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong opinions on "plaintext" vs json, but I do agree that reusing something that already exists might be nicer. I'm especially worried about all the extra code this seems to be adding.

codec/tools/strings.go Outdated Show resolved Hide resolved
case false:
st.write("bool node: false")
return st.checkErr()
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this case is quite literally unreachable though, so you can just remove it

or even better, use an if/else

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to stick to uses switch cases when doing marshal/unmarshal stuff.

But it encounters a problem on bools: the golang compiler doesn't currently seem to recognize that booleans are exhausted after these two cases, and insists on this panic. If the panic isn't present, the whole function is considered to have a branch that doesn't have a return statement!

I suppose I might as well change to if, though, yes.

_, st.err = st.w.Write(b)
}
}
func (st *encodeState) checkErr() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I get the point of this func

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I should remove this. I must've expected it would do more work at first, and then didn't "gc" when it didn't turn out to be so.

@mvdan
Copy link
Contributor

mvdan commented Oct 2, 2020

It might convince an incautious user that doing fmt.Printf("%v", theNode) and assuming the result is parseable is a valid assumption

I think that's a valid point, but you're not implementing any String method here, so %v won't be affected. I also think that, whatever we do, %v shouldn't result in a multiline, indented, pretty-printing of a node. At most it should be some encoding/printing without any prettifying.

@mvdan
Copy link
Contributor

mvdan commented Oct 2, 2020

What I'd really like is a solid dag-json and maybe a dag-json-pretty option too, that'd be really sweet for test fixtures and debugging. Maybe something I can work on.

To add one more idea - perhaps we could export a DebugPrint somewhere, which would become the sort-of-official recommended way to pretty-print a node for debugging.

@warpfork
Copy link
Collaborator Author

warpfork commented Oct 4, 2020

In the light of the morning, I'll admit I'm not entirely a fan of everything I did here. I'm going to try to break down some lessons learned and comment on those here, and what happens to the code, well, we'll see.


First off, This is not actually particularly pretty.

Nuff said. I'm not attached to the output here.

The structure of the code might be fine, but the actual series of glyphs that get output are not very visually appealing.

The observation that we might want single-liner outputs for many debug purposes is probably true. That's sort of possible with this format, but not entirely unambiguous. (The output for Bytes in particular would look ridiculous if the Line []byte config field was set to nothing.)

Generally speaking, if we keep this code, I'd probably expect to iterate on the output.


I do still think the idea of a non-codec is useful -- both for the {it sometimes might be bad to use a codec because that communicates the wrong thing} reason, and also for the {i'd like to have debug output that tells me more than a data model tree would, because it's a debug output} reason.

But that's not to say this format is really nailing that. Or rather, it is hitting those two goals, but there's a lot of other outputs that would also hit those goals, and be better on other axis at the same time.

I've also put some further thought into "Well, what if we made a codec that just supported comment fields or something ignored when parsing back, and used that for the extra debug info?" and I'm still not feeling warm towards that -- the odds that such a codec would miscommunicate critical concepts about what's datamodel vs what's implementation detail looms big in my concerns. (Crockford's opinions about comments in JSON also loom sizable in my mind, hereabouts -- I think it had some solid points.) I dunno. Despite my misgivings, maybe it's a least-bad sort of thing.


I was hoping this would be interesting research for build-a-codec... but it either failed to be useful input, or, the answer is no. (I'm still not entirely clear on which; perhaps a bit of both.)

The overall structure of the code certainly resembles the same sort of switch statements that appear in marshaller functions. But extracting some sort of gadget which would be constructed with a bunch of callbacks for all those case bodies... didn't seem at all valuable, once I got going. The amount of additional code that would've added would be substantial for both the dreaming-of-reusability part and for the actual use of it; there's really no win visible there.

I'm not confident the lack of desirability of a build-a-codec gadget here is a downvote for the general case, though. This prettyprint function is exceptional in at least a couple of ways that are relevant: it doesn't make use of any tokenization abstractions (whereas codecs typically (if not always) do); and it doesn't have any resource limits attached to it (at least, again, not in the same way codecs do), both because those are less applicable during marshal (and we didn't build an unmarshal!), and also simply because... why would you bother with those for a debug function? So, the opportunity for sharable defaults was limited; and the tricky bits that make for a desire to have reusable core components also weren't a concern.

We did end up with a codectools package, though, and that's potentially interesting. It doesn't contain much, yet. But I think a freestanding function with a task like "here's human-readable string escaping that we recommend if you need a human readable string escaping" seems like it might have virtue. I could see growing more things of this sort. (If this was my first time writing a codec, I'd tell myself to chill, and not attempt to extract things prematurely; however, it is not. There's a list in my head of things that I think to myself "oh, dangit; I just want a widget for that" on a regular basis, and I think I may want to start acting on it.)


The string encoding function... is actually the one part here that I am extremely enthusiastic about. (Believe it or not.)

The docs on that function may seem a little flippant, but I did not make that function flippantly.
It's crafted to do its job losslessly, for all inputs. This is subtle, and important.

The key detail of this string escaping function is that when it encounters non-UTF sequences, it escapes the bytes as it finds them.
It sounds simple, right? And so it is. But simple is not the same as common.

If we compare this to the JSON string encode function found in Golang's standard library, for example, it's consequentially different.
In particular, observe these lines which throw away data and replace it with \ufffd:

https://github.com/golang/go/blob/846dce9d05f19a1f53465e62a304dea21b99f910/src/encoding/json/encode.go#L1065-L1070

The tests relating to that feature are also pretty clear that the behavior is lossy, and this is considered intentional: https://github.com/golang/go/blob/114719e16e9681bd1001326598ededa719c17944/src/encoding/json/decode_test.go#L670-L680

Now, mind, I'm not particularly mad at the golang implementation referenced here.
I think I'm actually mad at JSON itself.
JSON appears to specify that the only form of escaping valid in strings is the short list of \r, \n, etc, and the rest is all \u.

A restriction to '\u' escapes is necessarily[‡] lossy. The \u escape ranges can only describe unicode characters; it cannot describe byte sequences which are invalid unicode.

I do not think that an escaping function that cannot describe all byte sequences is an acceptable string escaping function; and therefore I don't think what JSON specifies about escaping is good. I realize that's a powerful statement. I am indeed making it, though. It is not 8-bit clean, and I do not like that.

This is something I feel fairly strongly about. After staring at the string situation in various guises over the last couple of years -- in the context of filesystems and the unixfs specs, in the context of serialization determinism, in the context of unicode canonicalization schemes, but most importantly and simply in the context and goal of not having lossy functions -- I've come to believe fairly adamantly that a string escaping function that's unable to encode all 8-bit bytes without loss is a catastrophic mistake.
The inability to losslessly regard all 8-bit bytes has massive knock-on repercussions that echo into other adjacent systems, and produces a swath of destruction, both in terms of increased complexity and in outright incorrectness when interfacing with systems which have different domains.

The string encoding function here simply encodes non-unicode bytes as \x followed by the two hex characters describing the byte. Simple.

[‡] - It may be that the claim "\u escapes are necessarily lossy" is a little strong. Some other folks have created a specification called "Clean 8" UTF ("UTF8-C8" for short) which also addresses this problem and successfully avoids loss: https://docs.raku.org/language/unicode#UTF8-C8 . The details are interesting, and I won't go into it further here, but suffice to say I think the goal of UTF8-C8 is certainly onto something. However: if one can simply escape individual bytes without a fuss, doing so is much more direct than UTF8-C8, and I suspect that more direct approach should be the preferred choice where possible.

So! That was probably more text than you expected about that string escaping function. Some of this I probably shouldn't leave buried in a comment on this PR, because it's fairly consequential for IPLD specifications as a whole. Expect to hear more about this later and elsewhere; but for now it is at least here.

@rvagg
Copy link
Member

rvagg commented Oct 5, 2020

I don't really want to make this all about UTF8 but I wouldn't mind hearing more on why you think \x.. is better than Clean8, especially when you're having to add extra escaping for \ in your scheme which Clean8 doesn't have to do with their mechanism. Clean8 is not as easy to read plaintext but is not unclear if you know what you're looking at. Is it simply the extreme novelty of the scheme that's the problem? That you don't know what you're looking at if you're not told about Clean8?


I was wondering about how much you could extend this debugging stringifier to schema types. Can you see a logical path from what you have here to a format that also embeds extra data gathered through the schema lens? That might get kind of interesting if you can start to hang additional metadata on top of it. As it is now, I can't really see what else you can add that an existing codec (like dag-json) couldn't since it's all just data model.

@warpfork
Copy link
Collaborator Author

warpfork commented Oct 5, 2020

We should probably open a discussion over in the specs repo for this string discussion, but in brief:

I wrote most of that screed before looking at UTF8-C8 again in detail, just to confess that first of all :) But now that I have looked at it more...

I think "\xXX" escaping is still preferable where it's possible, yes.

  • UTF8-C8 was concerned with readability, and did reasonably well on that goal. But I think "\x" is more readable than "{unencodable placeholderglyph}x" (or more literally, if your browser renders it well: "􏿽x").
  • The encoding size:
    • UTF8-C8's encoding for this is simply large:
      • three bytes for the 0x10FFFD
      • three more bytes for the 'x' literal and then the two hex chars
      • six bytes in total
    • The "\xXX" strategy is four bytes in total.
  • They're probably about equally slow to work with, in orders of magnitude; not something I'd worry about much, probably. But I'd bet there's a few more CPU cycles involved in UTF8-C8. I haven't measured; just betting.
  • I haven't figured out what UTF8-C8 does if it encounters another 0x10FFFD in the data. Does it escape it? Need to check this.

So it makes sense to use UTF8-C8 if there's no other choice -- as would appear to be the case in JSON, for example, since the JSON spec says \u is the only general purpose escaping allowed (\x is notably absent). But in contexts where it's just possible to make \x a thing, doing so seems preferable to me.

@warpfork
Copy link
Collaborator Author

warpfork commented Oct 5, 2020

I was wondering about how much you could extend this debugging stringifier to schema types. Can you see a logical path from what you have here to a format that also embeds extra data gathered through the schema lens?

That's... a really good question. Something like that would take us way out of codec territory, and also probably be very useful.

I don't know if I can visualize anything at the moment though. It would get really tricky anytime there's a representation strategy involved that goes from recursive structure on one side to scalar on the other (like structs with stringjoin representations, for example). Just printing the two things side by side might be one of the clearer options. It would be cool to make a debug format for this though, agree. I just don't know if I'm clever enough!

As it is now, I can't really see what else you can add that an existing codec (like dag-json) couldn't since it's all just data model.

Well, the part of this PR that makes it "not data model" is:

https://github.com/ipld/go-ipld-prime/pull/89/files#diff-706c344ef233e62869dc1e3ad7d3e0b0R75-R79

It's a tiny thing, but "not data model" is a rubicon line, and encoding that little bit of extra info crosses it.

@ribasushi
Copy link
Contributor

Also don't want to make it all about utf8, but this claim doesn't stand on its own

I think "\x" is more readable than "{unencodable placeholderglyph}x" (or more literally, if your browser renders it well: "􏿽x").

Consider that bytes will show up in pathing selectors. The difference there would be:
foo\xNN\\sl/\xMM\\bar
vs
foo􏿽N􏿽N\sl/􏿽M􏿽M\bar

The top one looks like 😎 perl from the turn of the century
The bottom one is ugly, but unambiguously eyeball-able

@mvdan
Copy link
Contributor

mvdan commented Oct 5, 2020

Are you trying to print any byte sequence string, or only valid unicode (utf-8) strings? If you only care about valid unicode, JSON would be fine for you.

Since you seem to want any byte sequence, then go for https://golang.org/pkg/strconv/#Quote. This will just work, because Go strings can contain anything, including null bytes. And, similarly to Clean8, it will only use character escapes when necessary.

This would be Go-specific, but you're writing a pretty-printer for Go nodes, so I think that's fine. If you do want it to be cross-language and standard, then I agree with the others that you should pick an existing string encoding standard, like Clean8 or JSON or whatever works for you. Coming up with a new standard seems unnecessary.

@warpfork
Copy link
Collaborator Author

warpfork commented Oct 5, 2020

Any. Definitely any.

https://golang.org/pkg/strconv/#Quote

Reading now. Looks valid aka lossless.

The most relevant sections of source appear to be approximately:

Interesting. Tries to use \u and \U escapes, and only falls back to \x when it can't. Appears to spend more time considering if unicode characters are printable, also, which I don't know what to make of -- it's interesting to me that the JSON encoder in golang stdlib doesn't regard this, but this quoting function does.

I'm... indifferent to this. This may deserve more reflection, but my first impression is: I don't actually care for quoting with \u characters at all in most cases.


To carry on with that "in most cases" thought: There are roughly three scenarios I can see being concerned with:

  1. I want to assume UTF8 and a good renderer, but still don't tolerate loss when my assumption is not met:
    • so, I want to emit anything that's valid UTF8 ranges without escaping; and if it's not UTF8, I want to \x escape.
  2. I am straightjacketed into UTF and can only use \uXXXX escapes:
    • I must use UTF8-C8, due to the straightjacket.
  3. I don't want to assume UTF8 at all, and would rather escape all non-ASCII.
    • so, I'd lean very heavily on \x.

Situation 1 is most of IPLD.

Situation 2 is what happens if something like the JSON spec straightjackets the possible escaping.

Situation 3 doesn't really come up in IPLD (or if it does, we'd rather disregard it anyway because can't really entirely predict where it will come up, because it's based on user data and moods, and we're not in the business of guessing on those things because it tends to wreck make things more complicated).

I'm not really sure I understand what situations exist where I'd want to escape some parts of unicode out of distrust for the thing rendering it. If there are some examples of these situations, I'd be interested. (I know golang JSON again provides interesting reference here: it specialcases \u2028 and \u2029. However, that seems to be due to things that seem to me like weird JSON/javascript historical baggage, and should not be uncritically ported to other situations, if indeed it belongs in JSON at all -- so, I don't think this is a good example.)

@warpfork
Copy link
Collaborator Author

warpfork commented Oct 5, 2020

I should clarify: I am not intending to push this to merge anytime soon, at this point. This PR has become the fuse-lighter for strings vs unicode discussions for IPLD, and I'm okay with that outcome. But we should move it out of this implementation PR and into our discussions during the community call tonight and in the specs repo in the future. I'll consider this PR roughly blocked until that conversation advances in some direction.

@warpfork
Copy link
Collaborator Author

warpfork commented Dec 1, 2020

I'm going to do a merge-ignore on this (more literally, git merge --no-ff -s ours: the commits will be referenced as ancestors of master, but the diffs won't actually have an effect because of the merge strategy).

The content explored here (and especially, the discussions generated) have had some merit and are worth keeping. But most of the good ideas have been picked up and carried further by other PRs that were written with the lessons learned here. So, I don't see any reason to try to carry this further. (If anything, it can be rebooted with a fresh take in the future.)

warpfork added a commit that referenced this pull request Dec 1, 2020
The original idea of this branch was to explore some reusable
components for codecs; maybe to actually get a useful
prettyprinter; and... actually turned a lot into being a bit of
practical discovery about string encoding and escaping.

Some of those goals were partially accomplished, but in general,
this seems better to chalk up as a learning experience.
#89 (comment)
already does a good job of discussing why, and what as learned.

A lot of the reusable codec components stuff has also now
shaken out, just... in other PRs that were written after the learnings
here.  Namely, #101
was able to introduce some tree transformers; and then
#112 demonstrates how those
can compose into a complete codec end to end.
There's still work to go on these, too, but they seem to have already
grabbed the concept of reusable parts I was hoping for here and gotten
farther with it, so.

These diffs are interesting enough I want to keep them referencable
in history.  But I'm merging them with the "-s ours" strategy, so that
the diffs don't actually land any impact on master.  These commits
are for reference only.
@warpfork warpfork closed this Dec 1, 2020
@warpfork warpfork deleted the prettyprinter branch December 1, 2020 03:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants