Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add core::char::DecodeUtf8 #33907

Merged
merged 1 commit into from
Jul 14, 2016
Merged

add core::char::DecodeUtf8 #33907

merged 1 commit into from
Jul 14, 2016

Conversation

strake
Copy link
Contributor

@strake strake commented May 27, 2016

See issue

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@tbu-
Copy link
Contributor

tbu- commented May 30, 2016

It would be more flexible to make the iterator's item type Option<char>, so the user of this API can decide what to do on bad characters: Erroring out, replacing it with the replacement character (.unwrap_or(replacement)), ...

type Item = char;
#[inline] fn next(&mut self) -> Option<char> {
let repl = '\u{FFFD}';
self.0.next().map(|b| if b & 0x80 == 0 { unsafe { from_u32_unchecked(b as u32) } } else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be b as char, no need for the unsafe from_u32_unchecked.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label May 31, 2016

#[unstable(feature = "decode_utf8", issue = "33906")]
impl<I: Iterator<Item = u8>> Iterator for DecodeUtf8<I> {
type Item = Option<char>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idiomatically we tend to prefer to return a Result rather than an Option to indicate an error. This helps us give an entry point for an opaque struct to attach error information to.

@alexcrichton
Copy link
Member

Thanks @strake! Could you also reexport this type from librustc_unicode? That way it'll make its way into std::char

@strake
Copy link
Contributor Author

strake commented Jun 1, 2016

@alexcrichton All done. Not sure whether github notified you when i pushed...

@tbu-
Copy link
Contributor

tbu- commented Jun 1, 2016

(@strake Github doesn't notify on pushes.)

/// `<DecodeUtf8 as Iterator>::next` returns this for an invalid input sequence.
#[unstable(feature = "decode_utf8", issue = "33906")]
#[derive(Debug)]
pub struct InvalidSequence;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be struct InvalidSequence(()) so we can backwards-compatibly add more variants? You may want to also derive PartialEq for it so assert_eq! can work.

@alexcrichton
Copy link
Member

Thanks @strake!

@strake
Copy link
Contributor Author

strake commented Jun 2, 2016

@alexcrichton Done. The check failed on Travis but passes on my machine; the failure seems to be in APT, not due to my modifications.

@alexcrichton
Copy link
Member

Ah yeah travis is unfortunately having issues as of late, but otherwise looks good to me. This is tagged with T-libs to ensure it comes up during triage, and we'll discuss there!

@alexcrichton
Copy link
Member

Thanks again for the PR @strake! The libs team got a chance to discuss this today and there were a few concerns, but the overall feeling was somewhat positive.

  • Would it be possible to reuse code between std::str and here somehow to avoid open-coding decoding utf-8 twice?
  • Do you have a particular use case in mind for this functionality beyond simply consistency with other APIs? (which is of course still good, however)

@strake
Copy link
Contributor Author

strake commented Jun 21, 2016

Would it be possible to reuse code between std::str and here somehow to avoid open-coding decoding utf-8 twice?

All the methods i see in str take a &[u8] or a slice::Iter rather than an arbitrary Iterator, and i'd prefer to take an arbitrary Iterator.

Do you have a particular use case in mind for this functionality beyond simply consistency with other APIs? (which is of course still good, however)

Well, i'm writing a text editor which must deal with strings (e.g. file paths) which are UTF-8-coded by default but potentially invalid, in which case it ought to show the user what valid subsequences of the path it can. (I consider crashing on allocational failure ill-behavior in a text editor, so i am not using std.)

@alexcrichton
Copy link
Member

Would it be possible to refactor a bit to share an implementation? It's not obvious unfortunately that this is correct as it seems some subtle logic is happening here, but it'd be nice to lean on the existing code.

Also just out of curiosity, but how come str::from_utf8 won't suffice for this use case? Presumably the bytes are held in memory somewhere?

@strake
Copy link
Contributor Author

strake commented Jun 21, 2016

It's not obvious unfortunately that this is correct as it seems some subtle logic is happening here

I actually since found some helper functions in str we can use here. I think it clarifies the DecodeUtf8::next code. Sorry i missed this last time.

how come str::from_utf8 won't suffice for this use case?

If its input has any invalid sequences, str::from_utf8 fails completely. I need to decode the valid parts of a byte string or byte stream which potentially includes invalid sequences.

@bors
Copy link
Contributor

bors commented Jun 24, 2016

☔ The latest upstream changes (presumably #34399) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton
Copy link
Member

Hm yeah it's probably good to use at least some helper functions, but I believe we on the libs team were thinking more of wholesale using an implementation in str or something like that. It looks like there's still plenty of opportunity to go wrong in the DecodeUtf8 iterator as a significant amount of it's still open-coded.

It may not be the case that it's possible to share as str::from_utf8 is so specialized, but it was a passing thought of ours. In general we've avoided util modules, so if it comes to that it's probably not worth sharing if it's still open coded as much as it is still.

@strake
Copy link
Contributor Author

strake commented Jun 25, 2016

It may not be the case that it's possible to share as str::from_utf8 is so specialized, but it was a passing thought of ours.

Yeah, i couldn't see how to so modify run_utf8_validation as it is so specialized to to validation rather than decoding, and next_code_point has a comment saying "Performance is sensitive to the exact formulation here" so i dare not modify it.

@tbu-
Copy link
Contributor

tbu- commented Jun 25, 2016

The code does not decode UTF-8 according to the spec: https://encoding.spec.whatwg.org/#utf-8-decoder. It eats characters after an invalid byte sequence and does not check for overlong encodings. I suggest you to just implement the described algorithm, that also makes it easier to verify it's doing the right thing.

@strake
Copy link
Contributor Author

strake commented Jun 28, 2016

I modified it to not eat characters and to check for too-long sequences. It now returns the reason why a sequence is invalid.

I looked at the algorithm on that page, and it seems more complicated, having more internal state, and is not of the same form as Iterator::next, potentially returning multiple tokens or "continue".

type Item = Result<(u32, usize), ()>;
#[inline]
fn next(&mut self) -> Option<Result<(u32, usize), ()>> {
self.0.next().map(|b| if b & 0x80 == 0 { Ok((b as u32, 1)) } else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically we tend to have multi-line closures always have braces around them, putting this if on the next line.

You could make this similarly intended by doing:

foo.map(|b| {
    if condition {
        return Ok(...)
    }
    ...
})

@alexcrichton
Copy link
Member

@rust-lang/libs thoughts on having another utf-8 decoding loop by hand in libcore? It seems like our previous ones don't quite apply, and if it at least clearly follows a spec it won't exactly change that often so perhaps not the worst!

@tbu-
Copy link
Contributor

tbu- commented Jul 5, 2016

I think it shouldn't return precise errors. This makes it hard to change the underlying algorithm and is probably not useful to the user of the API anyway.

@brson
Copy link
Contributor

brson commented Jul 5, 2016

@alexcrichton I don't really care about having yet another UTF-8 decoder. The one test looks light - @strake does the test cover all the cases?

@strake
Copy link
Contributor Author

strake commented Jul 6, 2016

  • Braced closure
  • No more precise errors
  • More tests, which i think cover all the cases now

("�", &[0xFEu8] as &[u8]),
("�A", &[0xFEu8, 0x41u8] as &[u8]),
("�", &[0xFFu8] as &[u8]),
("�A", &[0xFFu8, 0x41u8] as &[u8])].into_iter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of these test cases checking for overlong sequences?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, [0xC1, 0x81] is an overlong sequence for 'A'.

@aturon
Copy link
Member

aturon commented Jul 13, 2016

@alexcrichton I don't object to having an additional hand-written loop. I expect us to grow this kind of micro-optimization over time in std.

@alexcrichton
Copy link
Member

Ok, @strake could you also squash the commits down? After that I'll r+

@strake
Copy link
Contributor Author

strake commented Jul 14, 2016

@alexcrichton yes, done

@alexcrichton
Copy link
Member

@bors: r+ 837029f

Thanks!

@bors
Copy link
Contributor

bors commented Jul 14, 2016

⌛ Testing commit 837029f with merge 6998018...

bors added a commit that referenced this pull request Jul 14, 2016
add core::char::DecodeUtf8

See [issue](#33906)
@bors bors merged commit 837029f into rust-lang:master Jul 14, 2016
@strake strake deleted the decode_utf8 branch October 29, 2017 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants