Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deserialize into a pre-existing struct #855

Closed
dtolnay opened this issue Apr 8, 2017 · 16 comments
Closed

Deserialize into a pre-existing struct #855

dtolnay opened this issue Apr 8, 2017 · 16 comments
Labels

Comments

@dtolnay
Copy link
Member

dtolnay commented Apr 8, 2017

Example from @BurntSushi - when iterating over rows of a CSV file, we would like to be able to reuse the same five String buffers on every row.

#[derive(Debug, Deserialize, PartialEq)]
struct MBTARow {
    trip_id: String,
    arrival_time: String,
    departure_time: String,
    stop_id: String,
    stop_sequence: i32,
    stop_headsign: String,
    pickup_type: i32,
    drop_off_type: i32,
    timepoint: i32,
}

The use case is similar to Clone::clone_from.

@dtolnay dtolnay added the derive label Apr 8, 2017
@dtolnay
Copy link
Member Author

dtolnay commented Apr 8, 2017

The analogous approach would be something like:

pub trait Deserialize<'de>: Sized {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where D: Deserializer<'de>;

    fn deserialize_from<D>(&mut self, deserializer: D) -> Result<(), D::Error>
        where D: Deserializer<'de>
    {
        *self = Deserialize::deserialize(deserializer)?;
        Ok(())
    }
}

@dtolnay
Copy link
Member Author

dtolnay commented Apr 9, 2017

@nox mentioned being interested in working on this.

@Gankra
Copy link
Contributor

Gankra commented Oct 24, 2017

I might look into this to further optimize webrender deserialization.

@Gankra
Copy link
Contributor

Gankra commented Oct 25, 2017

Note the semantics we'd like in webrender is that it blindly clobbers the source struct. On failure it's in a partially overwritten (but memory-safe) state. We want this because serde errors are hard aborts for us (all sources are trusted), and we want to minimize work.

Not sure if that matches what y'all want.

@dtolnay
Copy link
Member Author

dtolnay commented Oct 25, 2017

Yes, partially overwritten but memory-safe state is acceptable.

The important thing is that this is about reusing allocations, not merging one deserialized struct into another. So if you have:

struct S {
    a: Option<T>,
    b: Option<T>,
}

Deserializing {"a": ..., "b": ...} followed by deserializing {"a": ...} over top of the same data needs to result in b=None, not retaining its previous value.

@Gankra
Copy link
Contributor

Gankra commented Oct 30, 2017

Blugh, this is getting to a bit of a clumsy mess. Just to check that I'm understanding this correctly: I basically need to duplicate the entire of the deserialization subsystem with _from variants?

Like every single method on Visitor will need a _from variant? Or I'll need to make a FromVisitor trait...?

@dtolnay
Copy link
Member Author

dtolnay commented Oct 31, 2017

@gankro I hope not... Here is a quick prototype.

@Gankra
Copy link
Contributor

Gankra commented Nov 1, 2017

Status: this will be useless for webrender as-is because enums represent a fundamental barrier to this optimization. Currently pushing on rust-lang/rfcs#2195 to fix this.

@dtolnay
Copy link
Member Author

dtolnay commented Nov 1, 2017

Is the fundamental barrier enums in general, or consecutive messages that contain different variants of the same enum field? If the common case is that consecutive messages contain the same variant, we can certainly optimize that.

@Gankra
Copy link
Contributor

Gankra commented Nov 1, 2017

The item we're deserializing is an enum whose variant is ~random. In theory, yes, we could do something clever if we got the same variant multiple times in a row, but this basically never happens for us. We need a way to build an enum in-place regardless of its old value.

@Gankra
Copy link
Contributor

Gankra commented Nov 1, 2017

Proof of concept here: rust-lang/rust#45688

@Gankra
Copy link
Contributor

Gankra commented Nov 3, 2017

Ok been a bit sick but I've had a chance to work through enough of this to have an idea of how it looks, and what I need to do for my use-case.

So the trick you showed me makes the API work nice and clean, and I should have structs deriving nice and well with that. Enums are more complicated, and I'll probably need to fork serde_derive to provide a webrender-custom implementation to meet our needs, since you won't want my solution.

So my enum of interest has two nice properties: it's Copy (no dtors), and it has a None-like (C-like?) variant. The fact that it's Copy isn't something I'm going to be able to rely on, though. There are some plans to break that. In addition, I'm not super happy with relying on Copyness since that does nothing for UB like bool = 3. Now in principle we can ignore that UB since we plan to abort of failure, but I want our system to be a bit more robust than that.

So I plan to write a derive(LeakyEnumDeserialize) that does the following:

fn deserialize_enum_from(&mut self, &mut deserializer) -> Result<(), ()> {
  drop_in_place(self);           // clear out old stuff
  self.tag = #nonelike_variant;  // mark all contents as invalid, by setting enum to "None"
  
  match read_tag()? {
    A => {
      // deserialize fields
      self.A.field1.deserialize_from(deserializer)?
      self.A.field2.deserialize_from(deserializer)?
      // Now that we're done, we can actually say we're A 
      self.tag = A;
      Ok(())
   ...

The basic idea here is the one we use for leak-amplification in Vec::Drain: we mark the enum as empty before starting our work, and then only at the end do we tell it that it has contents, for minimal overhead. For a Copy enum this is perfect, but for one with destructors this can end up leaking stuff. We could set up code to keep track of where we are in the parse and "unwind" our work, but I expect this will pessimize codegen for something that, in our application, is a show-stopping programming error.

Emitting this implementation will be conditional on detecting that the enum has a repr(int), no other reprs (for forward safety compatibility), and a none-like variant. I've already prototyped out the repr detecting logic, since it was unclear to me that repr() annotations would show up in a derive (it does). Presumably slurping up a None-like variant should be trivial.

@Gankra
Copy link
Contributor

Gankra commented Nov 13, 2017

Very promising early results having only implemented structs, tuples, and primitives (which can be upstreamed to y'all): https://gist.github.com/Gankro/63d1028a0b69e418add58c23fc828243

Need to do some actual profiling though, the assembly is a bit hard to follow since different changes lead to loop funrolling.

@Gankra
Copy link
Contributor

Gankra commented Nov 13, 2017

One annoying thing I'm seeing for a [f32; 16] is that even in my most optimized result we're still generating what appears to be:

copy-float
check-len
copy-float
check-len
copy-float
check-len
...

rather than

check-len-has-16
copy-float
copy-float
copy-float
...

It should be within the contract of deserialize_from to do emit the latter, since we don't guarantee partial deserializations to have any particular form. That said I'm not sure if I can actually coax the three abstractions of serde+bincode+Read to understand this transformation.

@Gankra
Copy link
Contributor

Gankra commented Nov 13, 2017

Possibly fixing https://bugs.llvm.org/show_bug.cgi?id=34911 would be sufficient.

@dtolnay
Copy link
Member Author

dtolnay commented Dec 12, 2017

Implemented in #1094.

@dtolnay dtolnay closed this as completed Dec 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

2 participants