Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CBOR parser doesn't skip tags #1968

Closed
Timmmm opened this issue Mar 2, 2020 · 14 comments · Fixed by #2273
Closed

CBOR parser doesn't skip tags #1968

Timmmm opened this issue Mar 2, 2020 · 14 comments · Fixed by #2273
Assignees
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON kind: enhancement/improvement solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Milestone

Comments

@Timmmm
Copy link

Timmmm commented Mar 2, 2020

Using 456478b (3.7.3)

The following code should parse a CBOR value:

#include "json.hpp"

#include <sstream>

class NullSAXParser : public nlohmann::json_sax<nlohmann::json> {
public:
  bool null() override {
  	return true;
  }

  bool boolean(bool val) override {
    return true;
   }

  bool number_integer(number_integer_t val) override {
    return true;
  }

  bool number_unsigned(number_unsigned_t val) override {
    return true;
  }

  bool number_float(number_float_t val, const string_t &s) override {
    return true;
  }

  bool string(string_t &val) override {
    return true;
  }

  bool start_object(std::size_t elements) override {
    return true;
  }

  bool key(string_t &val) override {
    return true;
  }

  bool end_object() override {
    return true;
  }

  bool start_array(std::size_t elements) override {
    return true;
  }

  bool end_array() override {
    return true;
  }

  bool parse_error(std::size_t position, const std::string &last_token,
                   const nlohmann::detail::exception &ex) override {
    throw ex;
  }
};

int main() {
    std::stringstream in("\xd9\xd9\xf7");
    NullSAXParser parser;
	nlohmann::json::sax_parse(in, &parser, nlohmann::json::input_format_t::cbor);	
}

The value I'm parsing (0xd9d9f7) is simply the optimal "magic number" tag for CBOR documents. From the specification:

The serialization of this tag is 0xd9d9f7, which appears not to be in
use as a distinguishing mark for frequently used file types. In
particular, it is not a valid start of a Unicode text in any Unicode
encoding if followed by a valid CBOR data item.

Byte 0xd9 should be fine because it is equal to (6 << 5) | 25, in other words it has a major type of 6 (a tag), and lower 5 bits of 25, which for a tag means the actual tag value follows in a uint16, so it should just skip the following 2 bytes.

Some extra code needs to be added here. It doesn't understand tags at all.

@nlohmann
Copy link
Owner

nlohmann commented Mar 5, 2020

Can you provide the error message?

@nlohmann nlohmann added the aspect: binary formats BSON, CBOR, MessagePack, UBJSON label Mar 5, 2020
@Timmmm
Copy link
Author

Timmmm commented Mar 5, 2020

It's this one here: invalid byte: 0xD9.

@dota17
Copy link
Contributor

dota17 commented Mar 20, 2020

I‘v read the CBOR standard document and its implementation in the source code, it does not support data of major-type 6(Optional Tagging of Items). One of the reasons is that the other major-type data has satisfy various data types of JSON.
but user can parse to any kind of data, like 0xd9d9f7...
55799 is one of tags, it serialized to 0xd9d9f7 by CBOR spec. As written in the specification:

Tag 55799 is defined for this purpose. It does not impart any special semantics on the data item that follows; that is, the semantics of a data item tagged with tag 55799 is exactly identical to the semantics of the data item itself.

so can we add a case to skip it to avoid parsing error? @nlohmann

@nlohmann
Copy link
Owner

The documentation states that the library does not support tags at this point, so I would not speak of a bug. I am hesitant of skipping tags, because this would, to my understanding, silently change the type or intention of the encoding. Or, to put it differently, how would you propose mapping such read bytes to JSON values?

@stale

This comment has been minimized.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Apr 19, 2020
@Timmmm
Copy link
Author

Timmmm commented Apr 19, 2020

I am hesitant of skipping tags, because this would, to my understanding, silently change the type or intention of the encoding. Or, to put it differently, how would you propose mapping such read bytes to JSON values?

Just ignore them. This is perfectly acceptable according to the standard:

A decoder that comes across a tag (Section 2.4) that it does not
recognize, such as a tag that was added to the IANA registry after
the decoder was deployed or a tag that the decoder chose not to
implement, might issue a warning, might stop processing altogether,
might handle the error and present the unknown tag value together
with the contained data item to the application (as is expected of
generic decoders), might ignore the tag and simply present the
contained data item only to the application, or take some other type
of action.

@stale stale bot removed the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Apr 19, 2020
@stale

This comment has been minimized.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label May 19, 2020
@Timmmm
Copy link
Author

Timmmm commented May 19, 2020

It's a bit odd that you have a bot to close issues that nobody replies to after 1 month. I don't think it's going to magically fix itself!

@stale stale bot removed the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label May 19, 2020
@nlohmann
Copy link
Owner

This is a one-man-show, and I do not want to be overwhelmed by inactive issues. Since I will never be able to work on all of them, I'd rather keep the active ones open.

@stale

This comment has been minimized.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Jun 20, 2020
@stale stale bot closed this as completed Jun 28, 2020
@nlohmann nlohmann removed kind: bug state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated labels Jul 10, 2020
@nlohmann nlohmann reopened this Jul 10, 2020
@nlohmann
Copy link
Owner

Tags should be skipped to allow roundtripping after #2244 is merged.

@Ericson2314
Copy link

CC @matthewbauer

@nlohmann
Copy link
Owner

Alright, I finally had time to work on this. My idea would be to pass the from_cbor function an enum parameter tag_handler that would have two values:

  • error: The current behavior of the library - in case a tag is read, a parse_error exception is thrown. This is the default to avoid breaking existing client code.
  • ignore: The tag is ignored and the following value is parsed as before. On case of tags 0xD8-0xDB, the following 1-8 bytes are read, but also discarded. This would treat any tagged value as a "standard" CBOR value.

Also possible (but more difficult to implement would be):

  • binary: Tagged values are parsed to binary values. The subtype is the tag value and the following bytes are just copied to the binary value as is. I am not sure whether this is helpful.

Any comments on this?

@nlohmann nlohmann added kind: enhancement/improvement state: please discuss please discuss the issue or vote for your favorite option labels Jul 12, 2020
nlohmann added a commit that referenced this issue Jul 12, 2020
@nlohmann nlohmann added the solution: proposed fix a fix for the issue has been proposed and waits for confirmation label Jul 15, 2020
@Timmmm
Copy link
Author

Timmmm commented Jul 15, 2020

Sounds sensible to me.

@nlohmann nlohmann removed the state: please discuss please discuss the issue or vote for your favorite option label Jul 17, 2020
@nlohmann nlohmann self-assigned this Jul 17, 2020
@nlohmann nlohmann added this to the Release 3.8.1 milestone Jul 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aspect: binary formats BSON, CBOR, MessagePack, UBJSON kind: enhancement/improvement solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants