Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding module #439

Merged
merged 2 commits into from
Jul 24, 2022
Merged

Encoding module #439

merged 2 commits into from
Jul 24, 2022

Conversation

dralley
Copy link
Collaborator

@dralley dralley commented Jul 24, 2022

Move the core functionality for encoding / decoding to a new module and provide some freestanding utilities for decoding buffers.

@@ -60,7 +60,7 @@ impl Decoder {
///
/// If you instead want to use XML declared encoding, use the `encoding` feature
pub fn decode_with_bom_removal<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> {
let bytes = if bytes.starts_with(b"\xEF\xBB\xBF") {
let bytes = if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aesthetic preference, I feel like this is easier to read and better communicates the non-text-ness.

@codecov-commenter
Copy link

codecov-commenter commented Jul 24, 2022

Codecov Report

Merging #439 (c6fc0ba) into master (c590fdf) will increase coverage by 0.02%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##           master     #439      +/-   ##
==========================================
+ Coverage   51.14%   51.16%   +0.02%     
==========================================
  Files          26       27       +1     
  Lines       13308    13314       +6     
==========================================
+ Hits         6806     6812       +6     
  Misses       6502     6502              
Flag Coverage Δ
unittests 51.16% <80.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/de/escape.rs 21.05% <ø> (ø)
src/de/mod.rs 75.16% <ø> (+0.09%) ⬆️
src/de/seq.rs 91.83% <ø> (ø)
src/de/simple_type.rs 90.63% <ø> (ø)
src/events/mod.rs 68.60% <ø> (ø)
src/lib.rs 12.33% <0.00%> (ø)
src/reader/buffered_reader.rs 65.68% <ø> (ø)
src/reader/mod.rs 90.90% <ø> (+0.42%) ⬆️
src/encoding.rs 83.87% <83.87%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c590fdf...c6fc0ba. Read the comment docs.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, but fix the markdown link

src/encoding.rs Outdated Show resolved Hide resolved
src/encoding.rs Outdated
Comment on lines 156 to 158
/// | Bytes |Detected encoding
/// |-------------|------------------------------------------
/// |`00 00 FE FF`|UCS-4, big-endian machine (1234 order)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because now the function is public, maybe more explicitly mark rows, where autodetection is supported?

Copy link
Collaborator Author

@dralley dralley Jul 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just removed the non-supported rows entirely. It is unlikely that support for the other encodings will ever be available, at least using this library. Plus anyone can just visit the link to see the full table.

}
}

// TODO: add some tests for functions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't mind to add tests before merge?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered it but I figured it will be easier to do a big testing push at the end. We've only got a few sample documents and entering the data manually would be painful.

Copy link
Collaborator Author

@dralley dralley Jul 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, it needs to happen eventually - it would just be helpful to have the full picture of how encoding works together in mind while doing that work.

}

#[cfg(feature = "encoding")]
fn split_at_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> (&'b [u8], &'b [u8]) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems redundant, because first part is not used anywhere. Why just cut off the beginning, as was before, is not enough?

Copy link
Collaborator Author

@dralley dralley Jul 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but I was thinking that A) may want to add some variants that return the BOM the same way that we provide StartText and B) at the very least it would be useful for testing.

On the other hand, I kinda feel like both StartText and returning the BOM have limited utility in practice. But it feels like an open question.

I'll leave it as-is for now but I wouldn't be be upset if we end up removing it later.

@dralley dralley merged commit 6d883b5 into tafia:master Jul 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants