Encoding module #439

dralley · 2022-07-24T18:01:07Z

Move the core functionality for encoding / decoding to a new module and provide some freestanding utilities for decoding buffers.

dralley · 2022-07-24T18:04:34Z

src/encoding.rs

@@ -60,7 +60,7 @@ impl Decoder {
    ///
    /// If you instead want to use XML declared encoding, use the `encoding` feature
    pub fn decode_with_bom_removal<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> {
-        let bytes = if bytes.starts_with(b"\xEF\xBB\xBF") {
+        let bytes = if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {


Aesthetic preference, I feel like this is easier to read and better communicates the non-text-ness.
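To illustrate the point, here is a minimal, standalone sketch of the BOM check using the array-literal form. The helper name `strip_utf8_bom` is hypothetical and is not part of this PR; both spellings of the prefix denote the same three bytes, so the change is purely cosmetic.

```rust
/// Hypothetical helper (not from the PR): returns `bytes` without a leading
/// UTF-8 byte order mark, or `bytes` unchanged if no BOM is present.
fn strip_utf8_bom(bytes: &[u8]) -> &[u8] {
    // Same bytes as b"\xEF\xBB\xBF", written as numbers to signal
    // that this is binary data rather than text.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        &bytes[3..]
    } else {
        bytes
    }
}
```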

codecov-commenter · 2022-07-24T18:13:49Z

Codecov Report

Merging #439 (c6fc0ba) into master (c590fdf) will increase coverage by 0.02%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##           master     #439      +/-   ##
==========================================
+ Coverage   51.14%   51.16%   +0.02%     
==========================================
  Files          26       27       +1     
  Lines       13308    13314       +6     
==========================================
+ Hits         6806     6812       +6     
  Misses       6502     6502

Flag	Coverage Δ
unittests	`51.16% <80.00%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
src/de/escape.rs	`21.05% <ø> (ø)`
src/de/mod.rs	`75.16% <ø> (+0.09%)`	⬆️
src/de/seq.rs	`91.83% <ø> (ø)`
src/de/simple_type.rs	`90.63% <ø> (ø)`
src/events/mod.rs	`68.60% <ø> (ø)`
src/lib.rs	`12.33% <0.00%> (ø)`
src/reader/buffered_reader.rs	`65.68% <ø> (ø)`
src/reader/mod.rs	`90.90% <ø> (+0.42%)`	⬆️
src/encoding.rs	`83.87% <83.87%> (ø)`
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c590fdf...c6fc0ba.

Mingun

Approved, but fix the markdown link

src/encoding.rs

Mingun · 2022-07-24T18:47:13Z

src/encoding.rs

+/// | Bytes       |Detected encoding
+/// |-------------|------------------------------------------
+/// |`00 00 FE FF`|UCS-4, big-endian machine (1234 order)


Since the function is now public, maybe mark more explicitly the rows where autodetection is supported?

I've just removed the non-supported rows entirely. It is unlikely that support for the other encodings will ever be available, at least using this library. Plus anyone can just visit the link to see the full table.
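For readers following along, BOM-based detection of this kind could be sketched roughly as below. This is an assumption-laden illustration, not the code from the PR: the enum `BomEncoding` and the function `detect_from_bom` are hypothetical names, and the thread does not say which rows were kept, so the sketch assumes only the UTF-8 and UTF-16 byte order marks are recognized.

```rust
/// Hypothetical result type; the actual PR works with `encoding_rs` encodings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BomEncoding {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Guess the encoding from a leading byte order mark, if there is one.
/// Assumption: UCS-4 is not supported, so `FF FE 00 00` is not treated
/// specially and is reported as UTF-16 LE by this sketch.
fn detect_from_bom(bytes: &[u8]) -> Option<BomEncoding> {
    match bytes {
        [0xEF, 0xBB, 0xBF, ..] => Some(BomEncoding::Utf8),
        [0xFE, 0xFF, ..] => Some(BomEncoding::Utf16Be),
        [0xFF, 0xFE, ..] => Some(BomEncoding::Utf16Le),
        _ => None,
    }
}
```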

Mingun · 2022-07-24T18:48:07Z

src/encoding.rs

+    }
+}
+
+// TODO: add some tests for functions


Would you mind adding tests before merging?

I considered it, but I figured it would be easier to do a big testing push at the end. We've only got a few sample documents and entering the data manually would be painful.

Of course, it needs to happen eventually - it would just be helpful to have the full picture of how encoding works together in mind while doing that work.

Mingun · 2022-07-24T18:51:50Z

src/encoding.rs

+}
+
+#[cfg(feature = "encoding")]
+fn split_at_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> (&'b [u8], &'b [u8]) {


This seems redundant, because the first part is not used anywhere. Why isn't it enough to just cut off the beginning, as before?

You're right, but I was thinking that A) we may want to add some variants that return the BOM the same way that we provide StartText, and B) at the very least it would be useful for testing.

On the other hand, I kinda feel like both StartText and returning the BOM have limited utility in practice. But it feels like an open question.

I'll leave it as-is for now but I wouldn't be upset if we end up removing it later.
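The trade-off under discussion can be seen in a small sketch of a split-style helper. This is not the PR's `split_at_bom`: to stay dependency-free, the hypothetical `split_at_known_bom` below takes the expected BOM bytes directly instead of an `&'static Encoding`. It returns the BOM and the remainder as two slices, so a caller that only wants the content simply ignores the first half.

```rust
/// Hypothetical variant of the helper under review: splits `bytes` into
/// (BOM, rest). If `bytes` does not start with `bom`, the first slice is empty.
fn split_at_known_bom<'b>(bytes: &'b [u8], bom: &[u8]) -> (&'b [u8], &'b [u8]) {
    if bytes.starts_with(bom) {
        bytes.split_at(bom.len())
    } else {
        // No BOM: empty prefix, everything else is content.
        (&bytes[..0], bytes)
    }
}
```

Cutting off the beginning, as the earlier code did, corresponds to keeping only the second element of the returned pair.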

dralley requested a review from Mingun July 24, 2022 18:01

dralley force-pushed the encoding-module branch from 49c101b to 9f162cd Compare July 24, 2022 18:03

dralley force-pushed the encoding-module branch from 9f162cd to 7b99926 Compare July 24, 2022 18:07

Move everything related to actually decoding text to a new module

bee8ff6

dralley force-pushed the encoding-module branch from 7b99926 to faf13a9 Compare July 24, 2022 18:13

Mingun approved these changes Jul 24, 2022

Provide some utilities for decoding entire buffers

c6fc0ba

dralley force-pushed the encoding-module branch from faf13a9 to c6fc0ba Compare July 24, 2022 20:06

dralley merged commit 6d883b5 into tafia:master Jul 24, 2022
