Optimize memory usage of json objects in combination with binary serialization #373

mwittgen · 2016-11-30T19:09:33Z

With binary serialization being implemented by msgpack and cbor, I wonder if some of the binary format optimizations can be extended to the json object memory representation by introducing more specific storage types for floating point numbers and integers, distinguishing between uint_8/16/32/64, int_8/16/32/64, float32 and float64. I am aware this does not reduce the size of a JSON object, but it would retain the additional type information from the binary formats. For optimizing memory size, what about introducing the concept of ArrayType<uint8_t> and so on, and allowing optimized storage for arrays consisting of one specific type? The json object would store a pointer to the optimized array. On a 64-bit system one element in an array of uint8_t uses 16 bytes instead of one byte. A downside is sacrificing the indexing capability: j["some_array"][0] would not work if j["some_array"] maps to a pointer to an array object.
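To make the 16-bytes-versus-one-byte point concrete, here is a minimal sketch. `Value`, `Tag`, `generic_bytes` and `packed_bytes` are hypothetical names for illustration only; the struct is a simplified stand-in for a tagged JSON value, not the library's actual definition, and the exact size depends on the platform's alignment rules.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a generic JSON value: a one-byte type tag
// followed by a union whose largest members are 8 bytes wide.
enum class Tag : std::uint8_t {
    null, boolean, number_integer, number_unsigned, number_float, array
};

struct Value {
    Tag tag;
    union {
        bool boolean;
        std::int64_t number_integer;
        std::uint64_t number_unsigned;
        double number_float;
        void* array;  // would point to heap storage for nested containers
    } payload;
};

// Bytes needed to hold n small integers as generic tagged values.
constexpr std::size_t generic_bytes(std::size_t n) { return n * sizeof(Value); }

// Bytes needed to hold the same n integers in a packed uint8_t buffer.
constexpr std::size_t packed_bytes(std::size_t n) { return n * sizeof(std::uint8_t); }
```

On a typical 64-bit ABI the tag is padded up to the union's 8-byte alignment, so `sizeof(Value)` comes out at 16 and `generic_bytes(20000)` is sixteen times `packed_bytes(20000)`, not counting the container's own bookkeeping.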

nlohmann · 2016-11-30T19:27:48Z

Some thoughts:

This would introduce a lot of complexity, because any additional integer type would need another variant of the code. Maybe there could be some template magic to achieve this, but I am skeptical about the advantages compared to added complexity.
As pointed out, adding an ArrayType would only make sense if we have arrays of the same type. I wonder if this is really a use case, and it would definitely break the library's goal of having a nice API if array[0] did not work.

mwittgen · 2016-11-30T19:44:24Z

Yes, that adds a lot of complexity. I took a brute-force approach to add these extra types: https://gitlab.cern.ch/slac_sandbox/ubjson/blob/ubjson/src/json.hpp
This is certainly not complete, but it basically adds what I needed.
Writing a nice memory storage class in C++ would actually end up with a product similar to yours. I have a use case for storing large arrays (20000 entries) of float32 and int16_t, which were previously stored in plain text. The original JSON parser turned out to be very slow, so I decided to explore ubjson and msgpack in combination with nlohmann::json as the storage class, before I became aware of the msgpack and CBOR effort.

TurpentineDistillery · 2016-12-01T04:06:14Z

I'd say JSON is not a format that is well-suited for storing blobs or arrays of numbers large enough so that parsing, writing, and memory utilization issues become important.

The API should target usability and common use-cases.

nlohmann · 2016-12-01T21:55:58Z

@mwittgen I had a look at your code. I understand your use case, but the code grows into an unmaintainable state by copy/pasting nearly the same code over and over again. Maybe some template magic could help, but I fear that this effort only serves a very particular edge case.

mwittgen · 2016-12-01T22:27:15Z

@nlohmann thanks for looking into it. I might continue to look into some template approach. I needed a fast straw man for proving some concepts. Certainly for my use case JSON parsing was much slower compared to other preferred storage solutions in the physics community like CERN ROOT trees. The rapidJSON parser was as fast as ROOT tree parsing. But with the new binary format support parsing should be very fast.
What is intriguing about this JSON library is the API. My thought was: when supporting binary JSON-like formats, why discard the extra stored information in the memory representation? Keeping it also takes away the guesswork when optimizing for the binary format.

nlohmann · 2016-12-01T22:37:02Z

So storing the exact type of integers (did I miss floating point types?) is a separate concern from the compact integer vectors?

mwittgen · 2016-12-01T22:50:49Z

Yes. float32/64 and all the (u)int types. Preserving this information is separate from compact integer/float vectors. Unfortunately, there is no silver bullet when it comes to existing binary formats. UBJSON has optimized array support for all types but lacks unsigned support; msgpack offers unsigned support but only limited optimized array support. I haven't looked into CBOR yet. With ubjson, parsing into an optimized vector is trivial; with msgpack the parser would need to do some guesswork or make use of user-defined types. msgpack only directly supports storage of vector<uint8_t> through its byte array.
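As a rough illustration of why the ubjson case is trivial: a strongly typed array declares its element type and count once in the header, so a reader knows the target vector type up front. The sketch below writes such an array for int16 values. `put_be32` and `encode_int16_array` are hypothetical helper names, not part of any library, and the layout follows my reading of the UBJSON optimized container format (`[`, `$`, type, `#`, count, then raw big-endian payload with no closing `]`).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value in big-endian order (UBJSON numbers are big-endian).
inline void put_be32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFF));
    }
}

// Hypothetical helper: encode int16 values as a UBJSON strongly typed array,
// i.e. "[$I#l<count>" followed by the raw big-endian elements.
inline std::vector<std::uint8_t> encode_int16_array(const std::vector<std::int16_t>& values) {
    std::vector<std::uint8_t> out;
    out.reserve(9 + 2 * values.size());
    out.push_back('[');
    out.push_back('$');
    out.push_back('I');  // element type marker: int16
    out.push_back('#');
    out.push_back('l');  // count type marker: int32
    put_be32(out, static_cast<std::uint32_t>(values.size()));
    for (std::int16_t v : values) {
        const auto u = static_cast<std::uint16_t>(v);
        out.push_back(static_cast<std::uint8_t>(u >> 8));    // high byte first
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));  // then low byte
    }
    return out;
}
```

With this layout 20000 int16 entries take 9 header bytes plus 40000 payload bytes, and no per-element type marker is needed.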

TurpentineDistillery · 2016-12-02T00:14:24Z

Maybe in binary mode relax the requirement that string must contain a UTF-8 string, and allow arbitrary bytes, so that the user can put any blob in there, e.g. an array of floats or any other POD. Serialization of such data as json should fail, of course.
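For comparison, MessagePack already has a byte-array family (bin 8/16/32) that can carry such a blob without going through a string. A minimal sketch, assuming the bin 32 layout of a 0xC6 marker followed by a 4-byte big-endian length; `pack_floats_as_bin32` is a hypothetical helper, and the payload is copied in host byte order, so both ends would have to agree on endianness and float format.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: wrap a float array in a MessagePack "bin 32" object.
// The payload bytes are copied as-is (host byte order), which is an
// assumption of this sketch, not something the format prescribes.
inline std::vector<std::uint8_t> pack_floats_as_bin32(const std::vector<float>& values) {
    const auto n = static_cast<std::uint32_t>(values.size() * sizeof(float));
    std::vector<std::uint8_t> out;
    out.reserve(5 + n);
    out.push_back(0xC6);  // bin 32 marker
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>((n >> shift) & 0xFF));  // big-endian length
    }
    const std::size_t header = out.size();
    out.resize(header + n);
    if (n != 0) {
        std::memcpy(out.data() + header, values.data(), n);
    }
    return out;
}
```

A reader that sees the bin marker still has to know out of band that the bytes are floats, which is the guesswork mentioned above.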

nlohmann · 2016-12-11T15:17:52Z

FYI: I now merged the MessagePack/CBOR implementation to the develop branch.

nlohmann · 2017-01-02T16:33:20Z

I don't think that supporting all kinds of numeric types would bring broad benefit to the users of the library. For the exchange of large vectors, CBOR/MessagePack should help.

mwittgen · 2017-03-20T16:49:00Z

@nlohmann Using std::variant for json_value in the long run, and for now a similar variant class implemented for C++11/14 such as eggs::variant, would make it possible to implement more numeric storage types without bloating the code. A lot of the switch(type)/case statements could become obsolete. I would still argue that conserving the type information stored in the binary formats in the json object has some benefits. I have started to play around with eggs::variant and nlohmann::json.
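A rough sketch of the idea, using C++17 std::variant rather than eggs::variant. The alias `Number` and the functions `encoded_size`, `is_floating` and `as_double` are hypothetical names for illustration; this is not how json_value is defined in the library.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <variant>

// Hypothetical numeric storage: one alternative per width a binary format can carry.
using Number = std::variant<std::uint8_t, std::uint16_t, std::uint32_t, std::uint64_t,
                            std::int8_t, std::int16_t, std::int32_t, std::int64_t,
                            float, double>;

// One generic visitor stands in for a switch over a type tag:
// the width of the stored alternative in bytes.
inline std::size_t encoded_size(const Number& n) {
    return std::visit([](auto v) { return sizeof(v); }, n);
}

// True if the stored alternative is float or double.
inline bool is_floating(const Number& n) {
    return std::visit([](auto v) { return std::is_floating_point<decltype(v)>::value; }, n);
}

// Widen any alternative to double, e.g. for text JSON output.
inline double as_double(const Number& n) {
    return std::visit([](auto v) { return static_cast<double>(v); }, n);
}
```

Adding another numeric type then means adding one alternative to `Number`; the generic lambdas pick it up without a new case label.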

TurpentineDistillery · 2017-03-20T23:46:35Z

@mwittgen
FYI there's https://github.com/mpark/variant, which implements the standard's std::variant for C++14.

nlohmann added the kind: enhancement/improvement and state: please discuss labels Nov 30, 2016

nlohmann modified the milestone: Release 2.0.9 Dec 2, 2016

nlohmann closed this as completed Jan 4, 2017

nlohmann added the aspect: binary formats label Mar 28, 2017

nlohmann mentioned this issue Jun 1, 2017

Use of the binary type in CBOR and Message Pack #601

Closed
