JSON to wstring? #1921

Closed
alfaunits opened this issue Jan 31, 2020 · 11 comments

Comments

@alfaunits

Is there a way to do:
wstring val = json["Name"];
?

The problem I am facing is that MSVC does not use UTF8 strings in std::string, and even something as simple as French accented letters turns into gibberish.
I know I can convert json["Name"] TO wstring, but the issue would remain :(
I tried both 3.1.1 and the latest 3.7.3, using MSVC 2017/2019.

Q2: I reckon u8string is not supported yet?

@nickaein
Contributor

nickaein commented Feb 1, 2020

As long as you are following the C++ standard, std::string should behave the same across compilers. Therefore, storing UTF8 in std::string on MSVC should be fine and the same as on GCC, as long as you are aware of its limitations (e.g. std::string::size() returns the number of bytes, not the number of characters). Can you give a minimal code example that causes the issue?

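You might also override the std::string type and use a different underlying string type such as std::wstring, but I don't think that's really necessary. The third template argument of basic_json is the string type; the example below plugs in QString there: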

using json = nlohmann::basic_json<std::map, std::vector, QString>;

@alfaunits
Author

alfaunits commented Feb 1, 2020

This is the output (part of it..) we get from OneDrive when enumerating the root for example:
"name":"Pièces jointes"
It should be:
Pièces jointes
Note that the body of the response, before using json::parse(body) is correct:
"name":"_9_Pi\u00e8ces jointes"
The pseudo code is along the lines of:
send_GET_request
receive_body // << _9_Pi\u00e8ces jointes here
values = json::parse(body)
values contains "Pièces jointes" now, which is incorrect. So the issue might be somewhere in the parse() routine.

When that is converted to a wstring, the output is actually fine.
When converted back to a string, it is also fine :D (we use the same logging to output the string).

I printed out (to a PLog debug output) the actual JSONs we got back from OneDrive (since we dig into it via response["values"].at(i)) and all JSONs up to the top output the incorrect characters.

@alfaunits
Author

alfaunits commented Feb 2, 2020

Simplest example:

std::string test_string = R"(
    {
        "happy": "_9_Pi\u00e8ces jointes"
    }
)";
json test_json = json::parse(test_string);
cout << test_json;
cout << test_json.dump(-1, ' ', true);

In MSVC this outputs:

{"happy":"_9_Pièces jointes"}

in both cout lines.

@nickaein
Contributor

nickaein commented Feb 2, 2020

The Windows console might not correctly handle UTF8 for std::string. Can you write the output string to a file instead, and check its contents with an editor that supports UTF8 (e.g. VS Code)?

Alternatively, you can print out the hex values with the following code (live code) and decode them (e.g. using this service) to check whether the characters are corrupted:

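#include <nlohmann/json.hpp>
#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    // How "\u00e8" is encoded in this narrow literal depends on the
    // compiler's execution character set.
    nlohmann::json j = {{"happy", "_9_Pi\u00e8ces jointes"}};

    // Serialize the JSON value to a std::string.
    std::string str = j.dump();

    // Print every byte of the serialized string as a hexadecimal number.
    std::cout << std::hex;

    for (const std::uint8_t ch : str)
    {
        std::cout << static_cast<int>(ch) << " ";
    }
    std::cout << std::endl;

    return 0;
}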

The output of the above code on GCC is:

7b 22 68 61 70 70 79 22 3a 22 5f 39 5f 50 69 c3 a8 63 65 73 20 6a 6f 69 6e 74 65 73 22 7d

@alfaunits
Author

alfaunits commented Feb 2, 2020 via email

@nickaein
Contributor

nickaein commented Feb 2, 2020

Also, MSVC might not interpret the source file as UTF8, so any string literal could be corrupted before the parsing kicks in. More info and solutions:

https://stackoverflow.com/questions/840065
https://stackoverflow.com/questions/47690822

Can you try the above code on MSVC with a string you have issue with and compare it to the output of live code?

@alfaunits
Author

Your code throws an exception in MSVC:
[json.exception.type_error.316] invalid UTF-8 byte at index 6: 0x63

The problem is definitely not the editor for the original issue - we get the string data from OneDrive as a JSON body, and use json::parse to get a JSON object.
It works with other characters, but it forms an invalid value string for the above sample ("_9_Pi\u00e8ces jointes")

@alfaunits
Author

The online compiler gives the proper string output (I added std::cout << str;):
7b 22 68 61 70 70 79 22 3a 22 5f 39 5f 50 69 c3 a8 63 65 73 20 6a 6f 69 6e 74 65 73 22 7d
{"happy":"_9_Pièces jointes"}

This is what MSVC outputs for the dump:
7b 22 68 61 70 70 79 22 3a 22 5f 39 5f 50 69 c3 a8 63 65 73 20 6a 6f 69 6e 74 65 73 22 7d

Now if I convert that to a wstring, there is no issue: I get the correct output.
But when I convert it back to a string, I get different bytes!
This is what MSVC dumps after the string is converted to wstring, then back to string:
7b 22 68 61 70 70 79 22 3a 22 5f 39 5f 50 69 e8 63 65 73 20 6a 6f 69 6e 74 65 73 22 7d
Notice the difference: c3 a8 vs e8. e8 is the Unicode code point of lowercase e-grave (U+00E8), while c3 a8 is its UTF-8 encoding.

I am not sure where the issue is now :)

@nickaein
Contributor

nickaein commented Feb 2, 2020

I got my hands on an MSVC compiler and tried the example code (#1921 (comment)) with the /utf-8 compiler flag set (see the links in #1921 (comment)). It generated the following output, which is exactly the same as with GCC:

7b 22 68 61 70 70 79 22 3a 22 5f 39 5f 50 69 c3 a8 63 65 73 20 6a 6f 69 6e 74 65 73 22 7d

In addition, to simulate receiving the JSON string, I replaced the following line,

nlohmann::json j = {{"happy", "_9_Pi\u00e8ces jointes"}};

with:

std::ifstream fin("input.txt");
auto j = nlohmann::json::parse(fin);

where input.txt content is:

{"happy": "_9_Pi\u00e8ces jointes"}

This also outputs exactly the same bytes as the case above. Note that in this case there is no need for the /utf-8 flag.


std::wstring is intended for wide-character encodings (e.g. UTF16 on Windows, where wchar_t is 16 bits). There is no straightforward conversion between std::string and std::wstring, except with the helpers in the codecvt header. Nevertheless, there is no need for std::wstring to deal with UTF8 strings with this library. This library supports (and even defaults to) storing strings in UTF8 format inside the std::string type.

@alfaunits
Author

You're right. It seems we need to convert the JSON strings to wstring ourselves :(

@eagle-dot

Please see the solution

#1592
