Error During the Deserialization of `String` Fields from Nested `struct`s #354

dtoniolo · 2024-02-14T13:04:41Z

First of all, thanks for your work on this crate. It has very convenient API and allows me to process large amounts of data quickly.

I'm opening this issue because I might have stubled on a bug in version 1.3.0.

Description

I'm parsing data from the rows of a csv file into custom data structures. The fields of each row are grouped into multiple structs in the Rust code, both because they logically belong to different groups and because the various groups are then used in distinct ways. Each row is then modelled by CsvRow, a struct that combines all of the others. All its fields are marked with #[serde(flatten)].

I'm encountering the problem when I'm attempting to parse strings that contain only digits into fields that:

are defined in one of the sub-structs of CsvRows.
Have String as their type.

When these conditions are met, the deserialization code incorrectly detects the input to be an integer. This then creates a type mismatch between the datum (some kind of integer) and the field (a String), which results in a runtime error.

Reproducible Example

I've created a minimum reproducible example to show the problem. Here's my Cargo.toml:

[package]
name = "csv-bug"
version = "0.1.0"
edition = "2021"

[dependencies]
csv = "=1.3.0"
serde = { version = "=1.0.196", features = ["derive"] }

and the main.rs:

use csv::ReaderBuilder;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Flat {
    first_field: String,
    second_field: String,
}

/// Logically the same as [`Flat`], but it uses nested fields.
#[derive(Debug, Deserialize)]
struct Nested {
    first_field: String,
    #[serde(flatten)]
    inner: Inner,
}

#[derive(Debug, Deserialize)]
struct Inner {
    second_field: String,
}

fn main() {
    let data = "first_field,second_field\n1234,1234";
    let mut csv_reader = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(data.as_bytes());
    // I'm iterating over the records and then deserializing
    // manually in order to be able to deserialize both into
    // `Flat` and into `Nested`.
    let headers = csv_reader.headers().unwrap().clone();
    for s in csv_reader.records() {
        let s = s.unwrap();
        // This works.
        println!("{:?}", s.deserialize::<Flat>(Some(&headers)));
        // This doesn't.
        println!("{:?}", s.deserialize::<Nested>(Some(&headers)));
    }
}

Expected Behaviour

As far as csvs are concerned, Flat and Nested should be equivalent. Specifically, attempting to deserialize the rows of the input files into these structs should result in the same errors and data.
While 1234 could be interpreted as an integer, it's also a valid string. Given that all fields are Strings, it should be parsed as such even when using nested structs.

Actual Behaviour

Building and running this code with cargo run results in

Ok(Flat { first_field: "1234", second_field: "1234" })
Err(Error(Deserialize { pos: Some(Position { byte: 25, line: 2, record: 1 }), err: DeserializeError { field: None, kind: Message("invalid type: integer `1234`, expected a string") } }))

This means that:

first_field is always parsed correctly as a String.
When deserializing Nested the data corresponding to second_field is incorrectly detected as an integer. I guess that the deserializer is performing some type inference on the input datum, even though it has no need to do so.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2024-02-14T13:48:05Z

This is almost certainly a bug, I agree. There are many issues on this tracker related to serde and its flatten directive (and many issues about just serde integration in general). I basically agree that your code snippet should work. The problem is that serde(flatten) is a somewhat leaky directive. It does not behave identically to code where the structs are manually flattened into one another. You should be able to see this by running cargo expand on your program. For example, the derived Deserialize impl for your Nested type deserializes as a map instead of a struct. Note the number of issues in Serde related to flatten. flatten is an amazing feature, but there are definitely some things wrong with it. Even worse, I don't actually know whether the bug you've reported here is something fundamental about serde-and-csv interaction, or if it's just something bogus with how I implemented Serde support. (I'd bet on the latter personally.)

It has been a long time since I wrote the Serde code in this crate, and I no longer remember what the essential challenges are here. IMO, the Serde support in this crate needs to be re-thought from first principles while taking existing issues into account. If something can't be supported because of how Serde (and CSV) works, then we should document it clearly with proposed work-arounds.

dtoniolo · 2024-02-14T14:03:57Z

Thanks for your prompt answer. I'm looking both at the code produced by cargo expand and rust-csv's source code, hope I can find the issue.

dtoniolo · 2024-02-16T10:27:12Z

I think I found the problem: it's a consequence of a long standing bug in serde related to flatten.

Basically, when deserializing Flat serde simply iterates over the input map and

deserializes the fields that correspond to the "first_field" and "second_field" identifiers as Strings.
Ignores everything else.

When deserializing Nested, however, it first searches for "first_field", storing everything else it encounters in a temporary data struture called __collect. It then passes this data structure to Inner::deserialize and its related methods.

The data that is memorized into __collect is not a simple copy of (or a reference to) the raw input bytes, but instances of a private enum called Content, which performs type inference during the deserialization from the input data. This is type inference is what incorrectly detects the input to be an integer. Inner::desrialize then (correctly) produces an error because it recieves as input a map in which "second_field" is associated with an integer and not a String.

I'm closing this issue given that it's not actually specific rust-csv. Thanks again for your support, it really helped me in finding the root cause of the problem.

BurntSushi · 2024-02-16T10:43:57Z

Oof, that's frustrating. I still wonder whether we can do something better here, but I'm not going to have time to work on it any time soon.

BurntSushi added the bug label Feb 14, 2024

dtoniolo closed this as completed Feb 16, 2024

dtoniolo mentioned this issue Feb 21, 2024

Improve Interoperability with Error Reporting Tools #355

Open

BurntSushi mentioned this issue Mar 3, 2024

Flattened maps incorrectly (?) assumed to be maps #188

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error During the Deserialization of `String` Fields from Nested `struct`s #354

Error During the Deserialization of `String` Fields from Nested `struct`s #354

dtoniolo commented Feb 14, 2024

BurntSushi commented Feb 14, 2024

dtoniolo commented Feb 14, 2024

dtoniolo commented Feb 16, 2024

BurntSushi commented Feb 16, 2024

Error During the Deserialization of String Fields from Nested structs #354

Error During the Deserialization of String Fields from Nested structs #354

Comments

dtoniolo commented Feb 14, 2024

Description

Reproducible Example

Expected Behaviour

Actual Behaviour

BurntSushi commented Feb 14, 2024

dtoniolo commented Feb 14, 2024

dtoniolo commented Feb 16, 2024

BurntSushi commented Feb 16, 2024

Error During the Deserialization of `String` Fields from Nested `struct`s #354

Error During the Deserialization of `String` Fields from Nested `struct`s #354