Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue: UTF-8 decoder in libcore #33906

Closed
strake opened this issue May 27, 2016 · 11 comments
Closed

Tracking issue: UTF-8 decoder in libcore #33906

strake opened this issue May 27, 2016 · 11 comments
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@strake
Copy link
Contributor

strake commented May 27, 2016

Update (@SimonSapin): this is now the tracking issue for these items in both core::char and std::char:

  • decode_utf8() which takes an iterable of u8 and return DecodeUtf8
  • DecodeUtf8 which implements Iterator<Item=Result<char, InvalidSequence>>
  • InvalidSequence which is opaque

Original issue:

In libcore we have a facility to encode a character to UTF-8, i.e. char::EncodeUtf8, but no facility to decode a character from potentially-invalid UTF-8, and return 0xFFFD if it reads an invalid sequence, which seems a surprising omission to me as a libcore user, given in libstd we have string::String::from_utf8_lossy.

These options came to mind:

  • A function str::next_code_point_lossy or so which behaves as str::next_code_point but checks whether its input is valid and returns 0xFFFD if not
  • An iterator DecodeUtf8 which one can make from an arbitrary iterator of bytes, which decodes them
@alexcrichton alexcrichton added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Jul 14, 2016
bors added a commit that referenced this issue Jul 14, 2016
add core::char::DecodeUtf8

See [issue](#33906)
@SimonSapin
Copy link
Contributor

I’ve submitted #35947 to make this emit errors as specified in Unicode, like String::from_utf8_lossy does.

@jethrogb
Copy link
Contributor

jethrogb commented Nov 7, 2016

I think this is the tracking issue for this functionality?

DecodeUtf8, like DecodeUtf16, should have a way to recover invalid byte sequences encountered.

@bluss
Copy link
Member

bluss commented Nov 21, 2016

std::str::next_code_point being public is bad; it assumes valid UTF-8 input and libcore needs the agility to use unsafe code in this function if it turns out to be beneficial.

@SimonSapin
Copy link
Contributor

The initial message of this issue is somewhat misleading now that this is the tracking issue for the std::char::DecodUtf8 iterator adaptor.

It is not anymore about core::str::next_code_point which is marked #[unstable(feature = "str_internals", issue = "0")] and is only public so that std can use it.

@aturon aturon changed the title UTF-8 decoder in libcore Tracking issue: UTF-8 decoder in libcore Mar 3, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. label Jul 22, 2017
@SimonSapin
Copy link
Contributor

SimonSapin commented Mar 17, 2018

I’m inclined to not stabilize this.

Now that str::Utf8Error has an error_len method it is possible to solve the original issue (equivalent of String::from_utf8_lossy in libcore) by calling str::from_utf8 repeatedly. I’ve submitted #49105 to add an example showing how.

from_utf8 can optimize its inner loop for example to look at multiple bytes at a time. This API however is forced to only work on one char at a time.

If you want to build a &str-like (like ArrayString), with DecodeUtf8 you end up converting to chars (effectively UTF-32) only to then convert back to str (UTF-8). str::from_utf8 in comparison only does UTF-8 validation. (The chars method can then actually decode if that’s desired.)

It’s tempting to add new APIs to libcore for something like the example above, but there’s a lot of possible variation: returning an Iterator instead of taking a callback, using Result instead of hard-coding U+FFFD, supporting chunked input (e.g. where a single code point can be made of bytes that end up in separate network packets), …

https://docs.rs/utf-8/ tries to support all of these use cases (still on top of str::from_utf8), but ends up with a non-trivial API surface. Something like this is probably better off in a crate outside of the standard library.

@strake what do you think?

kennytm added a commit to kennytm/rust that referenced this issue Mar 22, 2018
… r=alexcrichton

Add an example of lossy decoding to str::Utf8Error docs

CC rust-lang#33906
@SimonSapin
Copy link
Contributor

The libs team discussed this and the consensus was to deprecate this feature. The use case motivating it can be handled by using str::from_utf8 with code similar to https://doc.rust-lang.org/nightly/std/str/struct.Utf8Error.html#examples

@rfcbot fcp close

@rfcbot
Copy link

rfcbot commented Mar 30, 2018

Team member @SimonSapin has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added the proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. label Mar 30, 2018
@rfcbot
Copy link

rfcbot commented Apr 3, 2018

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Apr 3, 2018
@strake
Copy link
Contributor Author

strake commented Apr 9, 2018

@SimonSapin No objection here — this should be the common use case, and for cases where one truly wants to operate on a single byte at a time from an iterator, the code need not be in libcore.

@rfcbot
Copy link

rfcbot commented Apr 13, 2018

The final comment period is now complete.

kennytm added a commit to kennytm/rust that referenced this issue Apr 24, 2018
@SimonSapin
Copy link
Contributor

Deprecated in #49970

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

9 participants