Unicode, and how it inter-plays with Regex, JSON and XML libraries #1

Closed
mristin opened this issue Jun 28, 2023 · 1 comment

mristin commented Jun 28, 2023

C++, being an old language, has many ways to deal with Unicode code points. How should we deal with Unicode in this library?

It seems that the easiest solution is to use std::wstring to represent all the strings: Question on StackOverflow regarding wstring and string.
Each wchar_t takes either 2 or 4 bytes depending on the platform: typically 2 on Windows (UTF-16 code units) and 4 on Linux (UTF-32).

Regular Expressions

This seems to play ok-ish with regular expressions: Question about Unicode ranges on StackOverflow.
We have to be careful here, though!
Regular expressions should be encoded as UTF-32 on Linux (or any other platform which stores wchar_t with 4 bytes). On Windows (or any platform with 2 bytes for wchar_t) they are UTF-16, so code points above U+FFFF have to be decomposed into surrogate pairs.
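
To make the platform difference concrete, here is a minimal sketch, not a settled design. The names `EncodeCodePoint` and `SupplementaryPlanePattern` are hypothetical helpers invented for this illustration. They pick the representation from `sizeof(wchar_t)`, and the pattern only covers the "any code point above U+FFFF" case.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as a wide string.
// With a 4-byte wchar_t this is a single UTF-32 unit; with a 2-byte wchar_t,
// code points above U+FFFF become a UTF-16 surrogate pair.
std::wstring EncodeCodePoint(char32_t cp) {
  // Precondition: a valid scalar value (in range, not a surrogate).
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

  std::wstring out;
  if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
    out.push_back(static_cast<wchar_t>(cp));
  } else {
    const char32_t v = cp - 0x10000;
    out.push_back(static_cast<wchar_t>(0xD800 + (v >> 10)));    // high surrogate
    out.push_back(static_cast<wchar_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
  }
  return out;
}

// Hypothetical helper: a pattern for "one code point in U+10000..U+10FFFF".
// The pattern is built character by character, since surrogates cannot be
// written as universal character names in a C++ literal.
std::wstring SupplementaryPlanePattern() {
  std::wstring p;
  if (sizeof(wchar_t) >= 4) {
    // UTF-32: a single bracket range is enough.
    p += L'[';
    p += static_cast<wchar_t>(0x10000);
    p += L'-';
    p += static_cast<wchar_t>(0x10FFFF);
    p += L']';
  } else {
    // UTF-16: a high surrogate followed by a low surrogate.
    p += L'[';
    p += static_cast<wchar_t>(0xD800);
    p += L'-';
    p += static_cast<wchar_t>(0xDBFF);
    p += L']';
    p += L'[';
    p += static_cast<wchar_t>(0xDC00);
    p += L'-';
    p += static_cast<wchar_t>(0xDFFF);
    p += L']';
  }
  return p;
}
```

The idea would be to pass `SupplementaryPlanePattern()` to a `std::wregex` and match it against `EncodeCodePoint(0x1F600)`. Whether the standard library handles such ranges correctly on every platform is exactly what we would still have to verify.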

Using UTF-8 with std::regex might be more trouble, since std::regex then matches on individual bytes rather than on code points: Question about UTF-8 and std::regex on StackOverflow
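
A small sketch of the kind of trouble meant here, assuming the default ECMAScript grammar. `MatchesBytes` and `kEAcuteUtf8` are names made up for this illustration. The point is that `.` consumes one `char`, so a single non-ASCII character encoded in UTF-8 needs as many dots as it has bytes.

```cpp
#include <regex>
#include <string>

// "é" (U+00E9) encoded as UTF-8 occupies two bytes.
const std::string kEAcuteUtf8 = "\xC3\xA9";

// Hypothetical helper: true if the whole text matches the pattern.
// std::regex over std::string compares char by char, i.e. byte by byte.
bool MatchesBytes(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}
```

With this, `MatchesBytes(kEAcuteUtf8, ".")` is false while `MatchesBytes(kEAcuteUtf8, "..")` is true, which is not what a user writing a pattern over characters would expect. Bracket ranges over non-ASCII characters run into the same byte-level problem.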

JSON

We are thinking about using nlohmann/json to parse JSON as of today (2023-06-28).
It represents strings as UTF-8 in std::string by default, so we have to be careful when converting them to and from std::wstring.
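
For illustration, a minimal sketch of the conversion we would need between the UTF-8 strings coming out of the JSON library and std::wstring. `Utf8ToWide` is a hypothetical helper, not part of nlohmann/json. It checks lead and continuation bytes but does not reject overlong encodings or encoded surrogates, so it is not a full validator.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 into a wide string (UTF-32 when wchar_t
// has 4 bytes, UTF-16 with surrogate pairs when it has 2 bytes).
std::wstring Utf8ToWide(const std::string& utf8) {
  std::wstring out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    char32_t cp = 0;
    std::size_t len = 0;
    if (lead < 0x80) { cp = lead; len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    if (i + len > utf8.size()) {
      throw std::invalid_argument("truncated UTF-8 sequence");
    }
    // Fold in the 6 payload bits of each continuation byte.
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
      if ((c & 0xC0) != 0x80) {
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp > 0x10FFFF) throw std::invalid_argument("code point out of range");
    i += len;

    if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
      out.push_back(static_cast<wchar_t>(cp));
    } else {
      const char32_t v = cp - 0x10000;
      out.push_back(static_cast<wchar_t>(0xD800 + (v >> 10)));
      out.push_back(static_cast<wchar_t>(0xDC00 + (v & 0x3FF)));
    }
  }
  return out;
}
```

With nlohmann/json, the input would presumably be the result of something like `value.get<std::string>()`. The opposite direction (std::wstring to UTF-8) would be needed before handing strings back to the library for serialization.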

Alternatively, if we ever want to use RapidJSON, we have to be equally careful and stick to UTF-8: Question about RapidJSON and Unicode on StackOverflow

XML

Given its light weight, we are thinking about using PugiXML to parse XML.
PugiXML seems to support Unicode well: Section in the PugiXML docs about Unicode.

mristin commented Jan 25, 2024

We decided to use nlohmann/json for JSON and expat for XML. Expat seems to be available on most platforms.

mristin closed this as completed Jan 25, 2024