
[json.exception.type_error.317] cannot serialize binary data to text JSON #2067

Closed
xamix opened this issue Apr 27, 2020 · 29 comments
Labels: aspect: binary formats (BSON, CBOR, MessagePack, UBJSON) · confirmed · kind: bug · solution: proposed fix (a fix for the issue has been proposed and waits for confirmation)

xamix commented Apr 27, 2020

  • Describe what you want to achieve.
    I need to get a string representation of my JSON object containing binary type

  • Describe what you tried.
    Use function dump on my JSON object

  • Describe which system (OS, compiler) you are using.
    Linux Ubuntu GCC 9

  • Describe which version of the library you are using (release version, develop branch).
    Current develop d9d1279

In my current application, I receive MessagePack-serialized objects containing the binary array type (0xC4).
In the current develop branch I saw that you added handling for this.

The conversion from MessagePack data to JSON works:

// Unpack message
json j = json::from_msgpack(data, data_len);

But calling j.dump() or streaming it into std::cout is not possible and throws the following exception:
[json.exception.type_error.317] cannot serialize binary data to text JSON

However I need to save the JSON and also print it in a LOG.

I understand that it cannot be serialized, because the binary type would be "lost" when saved as an array and read back, since in the JSON format nothing differentiates it from a basic array.
The type will change to value_t::array, but at least I can save and print my object.

Do you have a solution?


xamix commented Apr 27, 2020

I saw that the dump function has a serialize_binary option.

It does not work, however, since at line 14763 and also at other places like 14774, serialize_binary is not passed down in the dump function...

However, this will also not allow us to save and reload the JSON object, will it?

I don't need the binary type:
reloading the object as value_t::array would be sufficient.

@nlohmann (Owner)

This looks like a bug. I will check.

Can you provide your input as a quick example to trigger the issue?


xamix commented Apr 27, 2020

Also at lines 17726 and 17730:
they should be changed to pass serialize_binary in the right place:

if (indent >= 0)
{
    s.dump(*this, true, ensure_ascii, static_cast<unsigned int>(indent), 0, serialize_binary);
}
else
{
    s.dump(*this, false, ensure_ascii, 0, 0, serialize_binary);
}

I added the 0 just before serialize_binary.

> This looks like a bug. I will check.
>
> Can you provide your input as a quick example to trigger the issue?

I need some time to generate a short example to reproduce the issue.
I will make it and put it here


xamix commented Apr 27, 2020

Here is a minimal example:

#include <iostream>

#include <nlohmann/json.hpp>
using json = nlohmann::json;

int main()
{
    const unsigned char data[] = {0x81, 0xA4, 0x64, 0x61, 0x74, 0x61, 0xC4, 0x0F, 0x33, 0x30, 0x30, 0x32, 0x33, 0x34, 0x30, 0x31, 0x30, 0x37, 0x30, 0x35, 0x30, 0x31, 0x30};
    json j = json::from_msgpack(data, sizeof(data) / sizeof(data[0]));
    std::cout << j.dump(4,                              // Indent
                        ' ',                            // Indent char
                        false,                          // Ensure ascii
                        json::error_handler_t::strict,  // Error
                        true );                         // Print binary
    return 0;
}

The output should be something like:

{
    "data": b[51, 48, 48, 50, 51, 52, 48, 49, 48, 55, 48, 53, 48, 49, 48]
}
  1. Removing the b before the array would also allow us to parse the JSON object back.
  2. Why not activate the printing of binary by default, so that even with binary data we can easily dump objects and parse them back?
  3. Maybe adding the binary type inside json is not a good idea; the data from MessagePack could be parsed to an array instead. The conversion to MessagePack would not be as efficient as with binary, but it would work anyway.

@nlohmann added the aspect: binary formats (BSON, CBOR, MessagePack, UBJSON) and kind: bug labels and removed the kind: question label on Apr 27, 2020
@nlohmann (Owner)

/cc @OmnipotentEntity

@nlohmann nlohmann added this to the Release 3.8.0 milestone Apr 27, 2020
@nlohmann (Owner)

As the whole binary feature has not yet been released, we are free to experiment with the behavior here. We may want to replace the bool serialize_binary with an enum that allows for different behaviors:

  • dump as array
  • dump as b-prefixed array
  • throw an error in case of binary data

What do you think?
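To make the proposal concrete, here is a minimal std-only sketch of what the three behaviors could look like (the enum and function names are invented for illustration and are not the library's API):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Invented names for illustration only; the library's final API may differ.
enum class binary_mode { as_array, as_prefixed_array, error };

// Serialize a raw byte vector under each of the three proposed behaviors.
std::string dump_binary(const std::vector<std::uint8_t>& bytes, binary_mode mode)
{
    if (mode == binary_mode::error)
    {
        throw std::invalid_argument("cannot serialize binary data to text JSON");
    }

    std::string out = (mode == binary_mode::as_prefixed_array) ? "b[" : "[";
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        if (i > 0)
        {
            out += ", ";
        }
        out += std::to_string(bytes[i]);
    }
    out += "]";
    return out;
}
```

With this, dump_binary({1, 2, 3}, binary_mode::as_array) yields "[1, 2, 3]", while binary_mode::error reproduces the current type_error.317 behavior.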

@nlohmann (Owner)

(Good catch, I could confirm the bug and start a branch with a fix. Meanwhile, we can discuss #2067 (comment))

@nlohmann nlohmann added solution: proposed fix a fix for the issue has been proposed and waits for confirmation state: please discuss please discuss the issue or vote for your favorite option labels Apr 27, 2020
@nlohmann nlohmann self-assigned this Apr 27, 2020

xamix commented Apr 27, 2020

Seems like a good idea!

But what worries me is the default setting, if it is one of the following two:

  • Throw an exception
  • Dump as b-prefixed array.

I have many places where I dump in my code: to a LOG, but also to save as a string and parse later.

To my understanding, whatever the JSON object contains, a call to the default dump() function should allow us to serialize the object and parse it back later, so the default should be dump as array.

Otherwise I need to know whether the current JSON contains binary in order to call dump() with the full parameter list, which is very long. I can add a wrapper to always call with the right parameters, but maybe this should be the default, as it works in any case (serializing to MessagePack and dumping to a string):

dump(4,                             // Indent
     ' ',                           // Indent char
     false,                         // Ensure ascii
     json::error_handler_t::strict, // Error
     true)                          // Print binary

@nlohmann (Owner)

I understand your concern about the default value. Let's put that aside and think about the possible values first; then let's choose the default. Furthermore, I would like to get @OmnipotentEntity's ideas on this, as they came up with the original implementation and may share some insights into the choice of the default value.


xamix commented Apr 27, 2020

At first I was just wondering whether a new binary type was a good idea...
Since it's an array of bytes, doesn't it overlap with the array type already defined in json?
Maybe the array could be annotated as binary (or another type) instead of introducing a whole new type.

Is the binary type useful only for packing to some other format like MessagePack (which will compress better)?
If you receive data in MessagePack format, you could store it as an array instead of binary and annotate the array as binary data.

Then it would be the transforming function, to_msgpack for example, which could be tuned to process an array as binary or not.

EDIT:
As a side note, the new binary type will also force other libraries to implement this new type, such as the schema validator here.

@nlohmann (Owner)

  • The idea of having a dedicated type for binary data is that it is just a sequence of bytes, whereas an array is a sequence of basic_json objects.
  • A binary value also has a type to help interpret the data. That type is set by the different parsers (CBOR or MessagePack). How to interpret it may be up to your specific application, so we just store the data internally for further processing.
  • The normal use case is either to (1) parse the data from a binary format and process it afterward, or (2) create it in your application and serialize it using a binary format. In these use cases, dumping binary data to JSON is an edge case and rather suited for debugging. This is the reason for the current default behavior.
  • I do not think that the change affects JSON Schema. JSON data does not contain binary data. The binary data type is just for convenience to be able to read/write binary data in the use cases described above.


xamix commented Apr 27, 2020

Also I have another cosmetic bug when pretty printing binary:

{
 "data": b[50,
48, 48, 50, 51, 52, 48, 49, 48, 55, 48, 53, 48, 49, 48]
}

As you can see, a new line is inserted at the beginning of the array.

At line 14895

it tries to add a new line, but it seems the condition is wrong for the first byte.

Can be changed to:

auto distance = std::distance(val.m_value.binary->cbegin(), i);
if (distance != 0 && (distance % 15) == 0)
{
    o->write_character('\n');
}
else
{
    o->write_character(' ');
}

EDIT:
Thinking about it, the condition to print a new line is not really useful, since it does not take the current indent value into account, so the new line will be shifted.
Also, values can have different digit lengths and would not be aligned...

EDIT2:
In fact, just take the same formatting logic from the array type, and all will be good.

@OmnipotentEntity (Contributor)

Ugh, how embarrassing about that serialization bug. The code is apparently in a rougher state than I thought. I'm still away from my computer on vacation, so I can't get a patch in until at least mid-next week.

So... my original thought about serialization of binary values is, frankly, that there's no good way to do it that everyone will agree with. Various ideas were thrown around: integer array, base64 string, optionally prefixed or not with a \0, etc. And so it seems like the only "correct" way to serialize binary would be to, essentially, make the user do it.

The b[] syntax was something that I came up with on a whim and justified as still LL(1)-parseable by observing that barewords are disallowed in JSON, so you can see the b and expect an array of integers afterward. But I didn't even write the deserialization code (or, clearly, finish the serialization code; I plead having to rewrite both the patch and the serialization code a few times, and I wound up not actually needing it at all, since it was originally only written to stop the previous testing harness from crashing when trying to output debugging data, so I completely forgot about keeping it up to date with the changes. It was a very large patch).

Probably a better way of implementing it would be to take an optional binary value formatting function along with the serializer and a matching binary value reading function along with the deserializer. I can leave in the b[] syntax as a default, mostly because it's easy, if not particularly efficient, and add a default deserializer for it. It's probably also good to provide a way to simply omit serialization of binary types as well.

Anyway, I seem to have chosen a horrible time to take a vacation, and I apologize. I also apologize for missing this thread for a few days. I'll have a patch to you as soon as I am able.

@nlohmann @xamix

@nlohmann (Owner)

@OmnipotentEntity No need to apologize! You did the heavy lifting of adding a binary type in the first place! It's in the develop branch now and we can spend some time polishing it and finding a nice API for it. I think I'll push the fix for passing down serialize_binary later today. I'll make a proposal for how to deal with binary data during serialization during the weekend.

Please enjoy your vacation!

@nlohmann (Owner)

The original issue is now fixed.


OmnipotentEntity commented Apr 30, 2020

Thanks for taking care of that.

Thinking a little bit about the b[] syntax: in order to robustly round-trip these values, you'd also have to bring along the subtype. This means another delimiter to carry it along. So it would be possible to optionally use ()s or {}s to denote this. For instance, b(1)[3, 0, 2] could make a JSON binary value of subtype 1 with the bytes 3, 0, 2. Using {}s might be bad, because existing pretty printers that don't perform full parsing of the JSON might become confused or mangle the input. (But, of course, full compatibility is probably impossible, because it is an extension.)

Is this a good place to hash this out? Or is there a better one? (RFC process?)
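To illustrate that the proposed notation would stay easy to parse, here is a toy std-only parser for the hypothetical b(subtype)[bytes] syntax (no such syntax exists in the library; all names are invented):

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy parser for the proposed b(subtype)[bytes] notation (illustration only).
// Accepts e.g. "b(1)[3, 0, 2]" or "b[3, 0, 2]"; returns -1 for "no subtype".
std::pair<int, std::vector<std::uint8_t>> parse_b_syntax(const std::string& s)
{
    std::istringstream in(s);
    char c;
    int subtype = -1;
    std::vector<std::uint8_t> bytes;

    in >> c;                      // leading 'b'
    in >> c;                      // '(' or '['
    if (c == '(')
    {
        in >> subtype >> c;       // subtype, then ')'
        in >> c;                  // '['
    }
    while (in >> c && c != ']')   // read values until the closing bracket
    {
        in.putback(c);
        int value;
        in >> value;
        bytes.push_back(static_cast<std::uint8_t>(value));
        in >> c;                  // ',' or ']'
        if (c == ']')
        {
            break;
        }
    }
    return {subtype, bytes};
}
```

For instance, parse_b_syntax("b(1)[3, 0, 2]") yields subtype 1 and bytes {3, 0, 2}.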


nlohmann commented May 3, 2020

I thought about this topic a bit:

  • I am not really convinced that we should actually round-trip binary data via JSON. I do not like non-conforming additions to the parser, so reading b[] would always be a parse error. I also do not like encoding binary values via some convention, like an object such as

     {
       "type": 12,
       "data": [1, 2, 3]
     }

    where only this library can derive that this is a binary value of subtype 12 with the bytes 1, 2, 3. There is just no binary type in JSON, and any work in that direction would break conformance.

  • As I tried to express in [json.exception.type_error.317] cannot serialize binary data to text JSON #2067 (comment), I think dumping a binary value as JSON is only useful during debugging. In that scenario, any format that makes it easy to verify that binary data was correctly received, and maybe also allows quickly copy/pasting it into another tool for processing, would do.

  • Then the question is: what are the best formats to digest binary data during debugging? I think a byte array would be the easiest, maybe also a base64-encoded string. In either case, the subtype should also be added. Right now, I am torn between using an object like the one above or non-conforming JSON output like a b-prefix. The former has the advantage of being easy to digest (in particular in tools that can display JSON files), but has the disadvantage that binary data may be missed when inspecting the file. Furthermore, expressing binary data as valid JSON brings back the idea of round-tripping, which I am not willing to do. The latter (b-prefix) makes binary data easy to spot and makes sure the result is not JSON. The disadvantage is that tools may choke on that data.

  • So I can think of having an enum to choose between different options, and right now I would favor throwing an exception as the default. Skipping binary values (i.e., replacing them with null) would also be nice.

What do you think?


xamix commented May 3, 2020

I'm in favour of the enum solution in the current state.

Maybe the enum can have meaning not only for the binary type but also for future types:
something like DumpAsValidJson, which would produce valid JSON output regardless of the embedded datatype, and DumpWithType, which could prepend the 'b' for binary, for example.

Also I want to bring up some problems I'm facing:

  1. I can receive packed data from an unknown source; how can I easily skip data containing such a non-standard JSON type? By throwing an exception when such data is unpacked, for example?

     For now I only become aware that such data was embedded when I later call the dump function, which throws the exception...

  2. Also, on the exception: is it possible to display the name of the field that caused it?

     [json.exception.type_error.317] cannot serialize binary data to text JSON misses the name of the field that triggered this exception.

Regards.


nlohmann commented May 6, 2020

FYI, I am currently working on #2082 to add missing tests and thereby make some minor refactorings or fixes to the code for binary values. Note the public API may still change as the feature is not released yet. Once I am done, I will be happy to discuss the PR.

@OmnipotentEntity (Contributor)

Ok, I'm finally back.

So it sounds like there are two major wants for the binary API:

  1. The ability to detect, convert, and/or exclude binary values from an arbitrary JSON tree.
  2. Different methods of dumping binary values to string.

For the first, utility functions would be useful to automatically walk the JSON tree and perform these actions, but my understanding is that, in principle, this is already possible. I'm not sure whether that should be in the library or not. You could make a similar (but less robust) argument about all sorts of carry-along data that might be serialized with the data you want.

@nlohmann you seem to favor an enum-based approach with a limited list of options. But I'm really not sure that's flexible enough. This is the same interface used for general serialization to JSON, and different applications might have different requirements for their packed binary data, which is why I initially recommended passing a serialization function (and having some example/default serialization functions available).

That being said, I'm not certain that dumping binary values to JSON has much utility outside of debugging, so if the reasoning behind a simple enum is that few people would use it anyway, then point taken. It's also not as if these two options are mutually exclusive. It's entirely possible to have some default functions enumerated, and have the ability to pass a general serialization function via an overload.
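The callback idea can be sketched in a few lines (std-only; all names are invented, not a proposal for the actual interface):

```cpp
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Invented types for illustration; not the library's actual interface.
using byte_vector      = std::vector<std::uint8_t>;
using binary_formatter = std::function<std::string(const byte_vector&)>;

// A dump-like routine that delegates binary values to a user-supplied formatter
// instead of choosing from a fixed list of behaviors.
std::string format_binary(const byte_vector& bytes, const binary_formatter& fmt)
{
    return fmt(bytes);
}

// One possible default: the b[]-style output discussed above.
std::string b_prefixed(const byte_vector& bytes)
{
    std::ostringstream os;
    os << "b[";
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        os << (i ? ", " : "") << static_cast<int>(bytes[i]);
    }
    os << "]";
    return os.str();
}
```

A caller could pass b_prefixed for debugging output or swap in, say, a base64 formatter, without the library having to enumerate every option.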

@nlohmann (Owner)

I polished the binary types, see #2125. I decided against putting too much logic into the JSON serialization - I implemented a simple serialization as an object like

{"bytes": [1,2,3], "subtype": 42}

or

{"bytes": [1,2,3], "subtype": null}

if no subtype is given. The result is valid JSON without throwing an exception.

@OmnipotentEntity @xamix Please take a look! From my side, this would be the last remaining PR before we can release 3.8.0.


xamix commented May 19, 2020

  1. What are the values for subtype and what do they mean?
  2. When subtype is equal to null, what does this mean?
  3. When JSON containing binary data is dumped and parsed back, the object now changes its internal type to standard JSON types?
  4. When parsing binary from MessagePack (from my small example above), the subtype seems to be null; is this intended?
  5. Do we have a function to check whether a JSON value contains a binary value? As of now, with the current release, I catch the exception when using json::from_msgpack with data containing the MessagePack binary value type.
     But now the function will not throw any exception for this type, so how do I check whether the parsed JSON contains a "non-dumpable" or binary type in order to drop the parsed message?


nlohmann commented May 19, 2020

  1. Formats like BSON use subtypes to define the type of the binary value. For instance, 0x05 denotes an MD5 hash, see http://bsonspec.org/spec.html.
  2. CBOR, for instance, does not have subtypes. As 0 is a proper subtype, we use null to denote the absence of a subtype.
  3. Yes, binary data cannot roundtrip via JSON. This is intentional.
  4. I did not check the value, but it may well be that in that case, no subtype was set. Update: checked the value, see below:
  5. For a value, you can call is_binary() to check if it is binary. The library does not have a means to make a recursive check that traverses arrays and objects, but it should be easy to write. You are right, dumping JSON will not throw due to a binary value. Maybe I need to better understand what you want to achieve.
#include <iostream>
#include "json.hpp"

using json = nlohmann::json;

int main()
{
    std::vector<std::uint8_t> v = {{0x81, 0xA4, 0x64, 0x61, 0x74, 0x61, 0xC4, 0x0F, 0x33, 0x30, 0x30, 0x32, 0x33, 0x34, 0x30, 0x31, 0x30, 0x37, 0x30, 0x35, 0x30, 0x31, 0x30}};
    json j = json::from_msgpack(v);
    std::cout << j.dump(2) << std::endl;
}

Output:

{
  "data": {
    "bytes": [51, 48, 48, 50, 51, 52, 48, 49, 48, 55, 48, 53, 48, 49, 48],
    "subtype": null
  }
}

The binary value is of type 0xC4 (bin 8) and has no subtype.


xamix commented May 20, 2020

@nlohmann Thank you very much for all your answers!

Concerning 4):
the subtype when parsing MessagePack should maybe be filled with the parsed type (see the doc):

bin 8  11000100 0xc4
bin 16 11000101 0xc5
bin 32 11000110 0xc6

In your example, maybe the subtype should be 0xC4, i.e. 196?

Concerning 5):
in the current application I receive MessagePack data from an external source which I have no control over.
I need to save the received data in a database in JSON text format.

As of today, the current library throws on data which contains binary, because it is not possible to save it as text, parse the data back, and show exactly the contained fields, i.e.:

if I use the dump function with the new future release, it now generates fields that do not exist in the JSON (in your example bytes and subtype).

So with the new release, I must completely parse the received MessagePack data, then loop over each field to check the type, before I can dump() without generating non-existing fields.
This is not as efficient as the current release, which stops parsing when it finds a binary type.

Another solution is that I implement a dump() function which outputs binary data as a simple array of integers (no subtype or bytes fields), so that I can also save binary data as text.

Maybe something to specialize the dump behavior is possible, like the pseudo code below:

namespace nlohmann {
    template<>
    struct serializer<BasicJsonType::binary_t> {
        static void dump( output_adapter_t<char>& o,
                          BasicJsonType::binary_t& b,
                          const bool pretty_print,
                          const unsigned int indent_step,
                          const unsigned int current_indent = 0) {
            [...]
        }
    };
}

@nlohmann (Owner)

bin 8, bin 16, and bin 32 are not subtypes, but just different versions of the bin family which differ in the number of length bytes. The difference is not related to the content.

I understand your use case, but when you receive data that you have no control over, you must process it anyway.

Why would it be better to have binary data as an array of integers or text rather than the current solution?
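The bin-family point can be made concrete: per the MessagePack specification, the headers differ only in how the payload length is encoded (the helper below is an invented illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: build the MessagePack bin-family header for a payload of `len` bytes.
// Per the MessagePack spec, bin 8/16/32 differ only in how the length is
// encoded (1, 2, or 4 big-endian bytes), not in the meaning of the content.
std::vector<std::uint8_t> msgpack_bin_header(std::size_t len)
{
    if (len <= 0xFF)
    {
        return {0xC4, static_cast<std::uint8_t>(len)};
    }
    if (len <= 0xFFFF)
    {
        return {0xC5, static_cast<std::uint8_t>(len >> 8),
                      static_cast<std::uint8_t>(len & 0xFF)};
    }
    return {0xC6, static_cast<std::uint8_t>(len >> 24),
                  static_cast<std::uint8_t>((len >> 16) & 0xFF),
                  static_cast<std::uint8_t>((len >> 8) & 0xFF),
                  static_cast<std::uint8_t>(len & 0xFF)};
}
```

For the 15-byte payload from the example above, the header is {0xC4, 0x0F} (bin 8, length 15); a 300-byte payload would get {0xC5, 0x01, 0x2C} instead, with identical meaning for the content.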


xamix commented May 20, 2020

> Why would it be better to have binary data as an array of integers or text rather than the current solution?

The user/interpreter expects to see only existing fields:

{
  "desc": "hello",
  "data": [51, 48, 48, 50, 51, 52, 48, 49, 48, 55, 48, 53, 48, 49, 48]
}

With dump, it creates other fields:

{
  "desc": "hello",
  "data": {
    "bytes": [51, 48, 48, 50, 51, 52, 48, 49, 48, 55, 48, 53, 48, 49, 48],
    "subtype": null
  }
}

So there is no way to determine that data is only an array after using dump().
Here the interpreter used will think that data is an object with bytes and subtype fields.

And if I create an object which has exactly the same format as your binary dump, I cannot determine whether data was a binary array or a real object:

"data": {
    "bytes": [51, 48, 48, 50, 51, 52, 48, 49, 48, 55, 48, 53, 48, 49, 48],
    "subtype": null
  }

@nlohmann (Owner)

So in that case, you need to write a cleanup function that replaces binary values with their byte vector:

#include <iostream>
#include "json.hpp"

using json = nlohmann::json;

template<class UnaryFunction>
void recursive_apply(json& j, UnaryFunction f)
{
    for(auto it = j.begin(); it != j.end(); ++it)
    {
        if (it->is_structured())
        {
            recursive_apply(*it, f);
        }
        else
        {
            f(*it);
        }
    }
}

int main()
{
    json j;
    j[0]["desc"] = "an integer";
    j[0]["data"] = 1;
    j[1]["desc"] = "a binary";
    j[1]["data"] = json::binary({1,2,3});
    j[2]["desc"] = "an object";
    j[2]["data"] = {{"foo", "bar"}};

    std::cout << j << std::endl;

    recursive_apply(j, [](json& j) {
        if (j.is_binary()) {
            // replace binary value by integer vector
            j = static_cast<std::vector<std::uint8_t>>(j.get_binary());
        }
    });
    
    std::cout << j << std::endl;
}

Output:

[{"data":1,"desc":"an integer"},{"data":{"bytes":[1,2,3],"subtype":null},"desc":"a binary"},{"data":{"foo":"bar"},"desc":"an object"}]
[{"data":1,"desc":"an integer"},{"data":[1,2,3],"desc":"a binary"},{"data":{"foo":"bar"},"desc":"an object"}]


xamix commented May 20, 2020

What a nice idea!
Thank you!
I hadn't thought of this at all!

As a side note, thank you very much for your library :-)

@nlohmann (Owner)

Closed by merging #2125.

@nlohmann removed the state: please discuss (please discuss the issue or vote for your favorite option) label on May 20, 2020