Optimize memory usage of json objects in combination with binary serialization #373

Closed
mwittgen opened this issue Nov 30, 2016 · 12 comments
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON kind: enhancement/improvement state: please discuss please discuss the issue or vote for your favorite option

Comments

@mwittgen

With binary serialization being implemented for msgpack and CBOR, I wonder if some of the binary format optimizations can be extended to the in-memory representation of json objects by introducing more specific storage types for floating point numbers and integers, distinguishing between uint8/16/32/64, int8/16/32/64, float32 and float64. I am aware this does not reduce the size of a JSON object, but it would retain the additional type information from the binary formats. For optimizing memory size, what about introducing the concept of ArrayType<uint8_t> and so on, allowing optimized storage for arrays consisting of one specific type? The json object would store a pointer to the optimized array. On a 64-bit system one element in an array of uint8_t uses 16 bytes instead of one byte. A downside is sacrificing the indexing capability: j["some_array"][0] would not work if j["some_array"] maps to a pointer to an array object.
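To make the 16-bytes-versus-one-byte figure concrete, here is a minimal sketch. `TaggedValue` is a hypothetical stand-in for a generic JSON value (a one-byte type tag plus an eight-byte payload union); it is not the library's actual class, and its exact size depends on the platform's alignment rules.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a generic JSON value: a one-byte type tag
// followed by an eight-byte payload union. Not the library's real class.
struct TaggedValue {
    enum class Kind : std::uint8_t { Null, Boolean, Integer, Unsigned, Float, Pointer };
    Kind kind;
    union Payload {
        bool boolean;
        std::int64_t integer;
        std::uint64_t unsigned_integer;
        double floating;
        void* pointer;
    } payload;
};

// Bytes needed when every element is stored as a full generic value.
inline std::size_t generic_array_bytes(std::size_t n) {
    return n * sizeof(TaggedValue);
}

// Bytes needed when the elements are stored as a packed uint8_t array.
inline std::size_t packed_array_bytes(std::size_t n) {
    return n * sizeof(std::uint8_t);
}
```

On common 64-bit ABIs the tag is padded up to the union's 8-byte alignment, so `sizeof(TaggedValue)` typically comes out as 16, and the generic array costs roughly sixteen times what the packed one does.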

@nlohmann
Owner

Some thoughts:

  • This would introduce a lot of complexity, because any additional integer type would need another variant of the code. Maybe there could be some template magic to achieve this, but I am skeptical about the advantages compared to added complexity.
  • As pointed out, adding an ArrayType would only make sense if we have arrays of the same type. I wonder if this is really a use case, and it would definitely break the library's goal of having a nice API if array[0] would not work.

@nlohmann nlohmann added kind: enhancement/improvement state: please discuss please discuss the issue or vote for your favorite option labels Nov 30, 2016
@mwittgen
Author

Yes, that adds a lot of complexity. I went through some brute force approach to add these extra types: https://gitlab.cern.ch/slac_sandbox/ubjson/blob/ubjson/src/json.hpp
This is certainly not complete but basically added what I needed.
Writing a nice memory storage class in C++ would actually end up with a product similar to yours. I have a use case for storing large arrays (20000 entries) of float32 and int16_t, which were previously stored in plain text. The original JSON parser turned out to be very slow, so I decided to explore ubjson and msgpack in combination with nlohmann::json as storage class before I learned about the msgpack and CBOR effort.

@TurpentineDistillery

I'd say JSON is not a format that is well-suited for storing blobs or arrays of numbers large enough so that parsing, writing, and memory utilization issues become important.

The API should target usability and common use-cases.

@nlohmann
Owner

nlohmann commented Dec 1, 2016

@mwittgen I had a look at your code. I understand your use case, but the code grows into an unmaintainable state by copy/pasting nearly the same code over and over again. Maybe some template magic could help, but I fear that this effort only serves a very particular edge case.

@mwittgen
Author

mwittgen commented Dec 1, 2016

@nlohmann thanks for looking into it. I might continue to look into some template approach. I needed a fast straw man for proving some concepts. Certainly for my use case JSON parsing was much slower compared to other preferred storage solutions in the physics community like CERN ROOT trees. The rapidJSON parser was as fast as ROOT tree parsing. But with the new binary format support parsing should be very fast.
What is intriguing about this JSON library is the API. My thought was: when supporting binary JSON-like formats, why discard the extra stored information in the memory representation? Keeping it also takes away the guesswork when optimizing for the binary format.

@nlohmann
Owner

nlohmann commented Dec 1, 2016

So storing the exact type of integers (did I miss floating point types?) is a separate concern from the compact integer vectors?

@mwittgen
Author

mwittgen commented Dec 1, 2016

Yes. float32/64 and all the (u)int types. Preserving this information is separate from compact integer/float vectors. Unfortunately, there is no silver bullet when it comes to existing binary formats. UBJSON has optimized array support for all types, but lacks unsigned support; msgpack offers unsigned support, but limited optimized array support. I haven't looked into CBOR yet. With ubjson, parsing into an optimized vector is trivial; with msgpack the parser would need to do some guesswork or make use of user defined types. msgpack only directly supports storage of vector<uint8_t> through its byte array.
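As a rough illustration of why parsing a UBJSON optimized array into a typed vector is straightforward, the sketch below writes a `std::vector<std::int16_t>` as a strongly typed UBJSON array: the `[` marker, `$` plus the element type marker `I` (int16), `#` plus the element count, then the raw big-endian payload with no per-element markers and no closing `]`. The helper names are hypothetical, and the count is always written as an int32 (`l`) for simplicity, which is valid but not the most compact choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 16-bit value in big-endian order (UBJSON numbers are big-endian).
inline void put_be16(std::vector<std::uint8_t>& out, std::uint16_t v) {
    out.push_back(static_cast<std::uint8_t>(v >> 8));
    out.push_back(static_cast<std::uint8_t>(v & 0xFF));
}

// Append a 32-bit value in big-endian order.
inline void put_be32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFF));
    }
}

// Hypothetical helper: encode an int16 vector as a UBJSON optimized array.
// Layout: '[' '$' 'I' '#' 'l' <count:int32> <payload: count * int16>
inline std::vector<std::uint8_t> encode_int16_array(const std::vector<std::int16_t>& values) {
    std::vector<std::uint8_t> out;
    out.reserve(9 + 2 * values.size());
    out.push_back('[');
    out.push_back('$');
    out.push_back('I');  // element type: int16
    out.push_back('#');
    out.push_back('l');  // count is written as an int32
    put_be32(out, static_cast<std::uint32_t>(values.size()));
    for (std::int16_t v : values) {
        put_be16(out, static_cast<std::uint16_t>(v));
    }
    return out;
}
```

Because the header states the element type and count up front, a reader can allocate the typed vector once and copy the payload with a byte swap, with no guessing about element types.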

@TurpentineDistillery

Maybe in binary mode relax the requirement that a string must contain valid UTF-8, and allow arbitrary bytes, so that the user can put any blob in there, e.g. an array of floats or any other POD. Serialization of such data as json should fail, of course.
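A minimal sketch of that blob idea, assuming the string type is allowed to carry arbitrary bytes: the hypothetical helpers below copy a `std::vector<float>` into a `std::string` and back. The bytes are stored in the host's native byte order and float format, so this is not portable between machines that differ in either.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: copy the raw bytes of a float vector into a string.
// The result is generally not valid UTF-8 and uses native byte order.
inline std::string pack_floats(const std::vector<float>& values) {
    std::string blob(values.size() * sizeof(float), '\0');
    if (!values.empty()) {
        std::memcpy(&blob[0], values.data(), blob.size());
    }
    return blob;
}

// Hypothetical helper: recover the floats from such a blob.
// Trailing bytes that do not form a whole float are ignored.
inline std::vector<float> unpack_floats(const std::string& blob) {
    std::vector<float> values(blob.size() / sizeof(float));
    if (!values.empty()) {
        std::memcpy(values.data(), blob.data(), values.size() * sizeof(float));
    }
    return values;
}
```

A blob packed this way could travel through a binary format's byte or string field, while text JSON output would have to reject it because the bytes are not guaranteed to be valid UTF-8.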

@nlohmann nlohmann modified the milestone: Release 2.0.9 Dec 2, 2016
@nlohmann
Owner

FYI: I now merged the MessagePack/CBOR implementation to the develop branch.

@nlohmann
Owner

nlohmann commented Jan 2, 2017

I don't think that supporting all kinds of numeric types would bring broad benefit to the users of the library. For the exchange of large vectors, CBOR/MessagePack should help.

@nlohmann nlohmann closed this as completed Jan 4, 2017
@mwittgen
Author

@nlohmann Using std::variant in the long run, and for now similar variant classes implemented for C++11/14 such as eggs::variant, for json_value would allow implementing more numeric storage types without bloating the code. A lot of the switch(type)/case statements could become obsolete. I would still argue that conserving the type information stored in the binary formats in the json object has some benefits. I have started to play around with eggs::variant and nlohmann::json.
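A small sketch of how a variant could replace per-type switch statements, assuming a C++17 compiler with `std::variant`. The `Number` alias and the helper functions are hypothetical and only illustrate the idea; they are not part of nlohmann::json.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <variant>

// Hypothetical numeric storage that remembers the exact width and signedness.
using Number = std::variant<std::uint8_t, std::int16_t, std::int32_t,
                            std::int64_t, std::uint64_t, float, double>;

// One generic lambda handles every alternative; no switch(type) is needed.
inline std::size_t payload_bytes(const Number& n) {
    return std::visit([](auto v) { return sizeof(v); }, n);
}

// Convert whichever alternative is stored to double.
inline double as_double(const Number& n) {
    return std::visit([](auto v) { return static_cast<double>(v); }, n);
}

// Report whether the stored alternative is float or double.
inline bool is_floating(const Number& n) {
    return std::visit(
        [](auto v) { return std::is_floating_point<decltype(v)>::value; }, n);
}
```

For example, a `Number` holding `std::int16_t{-7}` reports a two-byte payload, so a binary writer could pick the matching wire type without inspecting the value. Adding another numeric type means extending the alias rather than touching every visitor.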

@TurpentineDistillery

@mwittgen
FYI there's https://github.com/mpark/variant, which implements the standard variant in C++14.
