Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error During the Deserialization of String Fields from Nested structs #354

Closed
dtoniolo opened this issue Feb 14, 2024 · 4 comments
Closed
Labels

Comments

@dtoniolo
Copy link

First of all, thanks for your work on this crate. It has very convenient API and allows me to process large amounts of data quickly.

I'm opening this issue because I might have stubled on a bug in version 1.3.0.

Description

I'm parsing data from the rows of a csv file into custom data structures. The fields of each row are grouped into multiple structs in the Rust code, both because they logically belong to different groups and because the various groups are then used in distinct ways. Each row is then modelled by CsvRow, a struct that combines all of the others. All its fields are marked with #[serde(flatten)].

I'm encountering the problem when I'm attempting to parse strings that contain only digits into fields that:

  • are defined in one of the sub-structs of CsvRows.
  • Have String as their type.

When these conditions are met, the deserialization code incorrectly detects the input to be an integer. This then creates a type mismatch between the datum (some kind of integer) and the field (a String), which results in a runtime error.

Reproducible Example

I've created a minimum reproducible example to show the problem. Here's my Cargo.toml:

[package]
name = "csv-bug"
version = "0.1.0"
edition = "2021"

[dependencies]
csv = "=1.3.0"
serde = { version = "=1.0.196", features = ["derive"] }

and the main.rs:

use csv::ReaderBuilder;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Flat {
    first_field: String,
    second_field: String,
}

/// Logically the same as [`Flat`], but it uses nested fields.
#[derive(Debug, Deserialize)]
struct Nested {
    first_field: String,
    #[serde(flatten)]
    inner: Inner,
}

#[derive(Debug, Deserialize)]
struct Inner {
    second_field: String,
}

fn main() {
    let data = "first_field,second_field\n1234,1234";
    let mut csv_reader = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(data.as_bytes());
    // I'm iterating over the records and then deserializing
    // manually in order to be able to deserialize both into
    // `Flat` and into `Nested`.
    let headers = csv_reader.headers().unwrap().clone();
    for s in csv_reader.records() {
        let s = s.unwrap();
        // This works.
        println!("{:?}", s.deserialize::<Flat>(Some(&headers)));
        // This doesn't.
        println!("{:?}", s.deserialize::<Nested>(Some(&headers)));
    }
}

Expected Behaviour

  1. As far as csvs are concerned, Flat and Nested should be equivalent. Specifically, attempting to deserialize the rows of the input files into these structs should result in the same errors and data.
  2. While 1234 could be interpreted as an integer, it's also a valid string. Given that all fields are Strings, it should be parsed as such even when using nested structs.

Actual Behaviour

Building and running this code with cargo run results in

Ok(Flat { first_field: "1234", second_field: "1234" })
Err(Error(Deserialize { pos: Some(Position { byte: 25, line: 2, record: 1 }), err: DeserializeError { field: None, kind: Message("invalid type: integer `1234`, expected a string") } }))

This means that:

  1. first_field is always parsed correctly as a String.
  2. When deserializing Nested the data corresponding to second_field is incorrectly detected as an integer. I guess that the deserializer is performing some type inference on the input datum, even though it has no need to do so.
@BurntSushi
Copy link
Owner

This is almost certainly a bug, I agree. There are many issues on this tracker related to serde and its flatten directive (and many issues about just serde integration in general). I basically agree that your code snippet should work. The problem is that serde(flatten) is a somewhat leaky directive. It does not behave identically to code where the structs are manually flattened into one another. You should be able to see this by running cargo expand on your program. For example, the derived Deserialize impl for your Nested type deserializes as a map instead of a struct. Note the number of issues in Serde related to flatten. flatten is an amazing feature, but there are definitely some things wrong with it. Even worse, I don't actually know whether the bug you've reported here is something fundamental about serde-and-csv interaction, or if it's just something bogus with how I implemented Serde support. (I'd bet on the latter personally.)

It has been a long time since I wrote the Serde code in this crate, and I no longer remember what the essential challenges are here. IMO, the Serde support in this crate needs to be re-thought from first principles while taking existing issues into account. If something can't be supported because of how Serde (and CSV) works, then we should document it clearly with proposed work-arounds.

@BurntSushi BurntSushi added the bug label Feb 14, 2024
@dtoniolo
Copy link
Author

Thanks for your prompt answer. I'm looking both at the code produced by cargo expand and rust-csv's source code, hope I can find the issue.

@dtoniolo
Copy link
Author

I think I found the problem: it's a consequence of a long standing bug in serde related to flatten.

Basically, when deserializing Flat serde simply iterates over the input map and

  • deserializes the fields that correspond to the "first_field" and "second_field" identifiers as Strings.
  • Ignores everything else.

When deserializing Nested, however, it first searches for "first_field", storing everything else it encounters in a temporary data struture called __collect. It then passes this data structure to Inner::deserialize and its related methods.

The data that is memorized into __collect is not a simple copy of (or a reference to) the raw input bytes, but instances of a private enum called Content, which performs type inference during the deserialization from the input data. This is type inference is what incorrectly detects the input to be an integer. Inner::desrialize then (correctly) produces an error because it recieves as input a map in which "second_field" is associated with an integer and not a String.

I'm closing this issue given that it's not actually specific rust-csv. Thanks again for your support, it really helped me in finding the root cause of the problem.

@BurntSushi
Copy link
Owner

Oof, that's frustrating. I still wonder whether we can do something better here, but I'm not going to have time to work on it any time soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants