[API Proposal]: Support streaming deserialization of JSON objects #64182

Closed
joelverhagen opened this issue Jan 24, 2022 · 5 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.Json untriaged New issue has not been triaged by the area owner

Comments

@joelverhagen
Member

Background and motivation

My team has a large JSON blob that has the following format:

{
   "package-id-1": ["owner-1", "owner-2"],
   "package-id-2": ["owner-1"],
   ... megabytes and megabytes later ...
   "package-id-9001": ["owner-42"]
}

I thought that perhaps this file could be read in a streaming way via some implementation of IAsyncEnumerable<KeyValuePair<string, List<string>>> provided by System.Text.Json.

Currently, it appears that JsonSerializer.DeserializeAsyncEnumerable<T> only supports documents that are rooted as arrays. This definitely makes sense as the main use case. However, it seems to me that this general concept could also work for streaming across very large objects where the keys are more like data than schema, so the number of properties is unbounded. In the JSON example above, both the keys and the values are "data", so to speak, unlike a more typical JSON document that uses object property names as "schema".

From the blog post I read, it appears that this limitation is expected for now: "It only supports reading from root-level JSON arrays, although that could be relaxed in the future based on feedback."

Currently, if a KVP is provided for T, the following strange error is thrown mentioning a Queue (appears to be an implementation detail). I would have expected an error saying "unexpected JsonTokenType.StartObject, expected JsonTokenType.StartArray" or something.

System.Text.Json.JsonException: The JSON value could not be converted to System.Collections.Generic.Queue`1[System.Collections.Generic.KeyValuePair`2[System.String,System.Collections.Generic.List`1[System.String]]]. Path: $ | LineNumber: 0 | BytePositionInLine: 1.
   at System.Text.Json.ThrowHelper.ThrowJsonException_DeserializeUnableToConvertValue(Type propertyType)
   at System.Text.Json.Serialization.JsonCollectionConverter`2.OnTryRead(Utf8JsonReader& reader, Type typeToConvert, JsonSerializerOptions options, ReadStack& state, TCollection& value)
   at System.Text.Json.Serialization.JsonConverter`1.TryRead(Utf8JsonReader& reader, Type typeToConvert, JsonSerializerOptions options, ReadStack& state, T& value)
   at System.Text.Json.Serialization.JsonConverter`1.ReadCore(Utf8JsonReader& reader, JsonSerializerOptions options, ReadStack& state)
   at System.Text.Json.JsonSerializer.ReadCore[TValue](JsonConverter jsonConverter, Utf8JsonReader& reader, JsonSerializerOptions options, ReadStack& state)
   at System.Text.Json.JsonSerializer.ReadCore[TValue](JsonReaderState& readerState, Boolean isFinalBlock, ReadOnlySpan`1 buffer, JsonSerializerOptions options, ReadStack& state, JsonConverter converterBase)
   at System.Text.Json.JsonSerializer.ContinueDeserialize[TValue](ReadBufferState& bufferState, JsonReaderState& jsonReaderState, ReadStack& readStack, JsonConverter converter, JsonSerializerOptions options)
   at System.Text.Json.JsonSerializer.<DeserializeAsyncEnumerable>g__CreateAsyncEnumerableDeserializer|63_0[TValue](Stream utf8Json, JsonSerializerOptions options, CancellationToken cancellationToken)+MoveNext()
   at System.Text.Json.JsonSerializer.<DeserializeAsyncEnumerable>g__CreateAsyncEnumerableDeserializer|63_0[TValue](Stream utf8Json, JsonSerializerOptions options, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()

I attempted to write my own code to produce an IAsyncEnumerable from a Utf8JsonReader but found it quite challenging. The analogous code with Newtonsoft.Json (using JsonTextReader) is straightforward.
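For comparison, here is a minimal sketch of what the Newtonsoft.Json approach might look like. It is not the code from my project: the class and method names (NewtonsoftObjectStreaming, ReadProperties) are hypothetical, it assumes the Newtonsoft.Json package is referenced, and it is synchronous for brevity.

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class NewtonsoftObjectStreaming
{
    // Hypothetical helper (not part of Newtonsoft.Json): lazily yields each
    // top-level property of a root JSON object as a key/value pair.
    public static IEnumerable<KeyValuePair<string, TValue>> ReadProperties<TValue>(TextReader text)
    {
        using var reader = new JsonTextReader(text);
        var serializer = JsonSerializer.CreateDefault();

        if (!reader.Read() || reader.TokenType != JsonToken.StartObject)
        {
            throw new JsonSerializationException("Expected a JSON object at the root.");
        }

        // Each iteration consumes one "name": value pair; the loop ends at EndObject.
        while (reader.Read() && reader.TokenType == JsonToken.PropertyName)
        {
            var name = (string)reader.Value!;

            // Advance to the first token of the property value.
            if (!reader.Read())
            {
                throw new JsonSerializationException("Unexpected end of JSON after a property name.");
            }

            // Deserialize leaves the reader on the last token of the value.
            var value = serializer.Deserialize<TValue>(reader)!;
            yield return new KeyValuePair<string, TValue>(name, value);
        }
    }
}
```

Only one property value is materialized at a time, which is the behavior I am after. Duplicate property names are passed through unchanged.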

API Proposal

I propose that an overload of JsonSerializer.DeserializeAsyncEnumerable be added to support the parsing of objects:

IAsyncEnumerable<KeyValuePair<TKey, TValue>> DeserializeAsyncEnumerable<TKey, TValue>(
    Stream utf8Json,
    JsonSerializerOptions? options = null,
    CancellationToken cancellationToken = default(CancellationToken));

By default, the method would work best when the property values are homogeneous in type (e.g. List<string> in my example above), but this could be enhanced using a JsonConverter that handles all of the different property types and returns them as TValue. TValue could be left as object, indicating that the value should be returned as a JSON DOM object.

I believe this is superior to passing a KeyValuePair as T for the existing DeserializeAsyncEnumerable<T> since it provides a hint at the call site that the expected document is an object, not an array.

API Usage

JSON:

{
  "a": [ 1, 2 ],
  "b": [ 2, 3 ]
}

Code:

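using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

using var json = File.OpenRead("example.json");

// Proposed overload (does not exist today).
// Returns an IAsyncEnumerable<KeyValuePair<string, List<int>>>
var pairs = JsonSerializer.DeserializeAsyncEnumerable<string, List<int>>(json);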

await foreach (var pair in pairs)
{
    Console.WriteLine($"{pair.Key}: {string.Join(" + ", pair.Value)} = {pair.Value.Sum()}");
}

Output:

a: 1 + 2 = 3
b: 2 + 3 = 5

Alternative Designs

A new type could be introduced to contain both the property name and value. However, I see the symmetry between IAsyncEnumerable<KeyValuePair<TKey, TValue>> and Dictionary<TKey, TValue> implementing IEnumerable<KeyValuePair<TKey, TValue>>.

Alternatively, the existing method with a single type parameter T could be enhanced to have a special case to allow objects when T is a KVP. I think this alternative is a bit more confusing and less discoverable.

An alternative design for the end-user would be to format the JSON in a different (more sane, yet more verbose) way, e.g.

[
   { "id": "package-id-1", "owners": ["owner-1", "owner-2"] },
   { "id": "package-id-2", "owners": ["owner-1"] },
   ... megabytes and megabytes later ...
   { "id": "package-id-9001", "owners": ["owner-42"] }
]

This may not be possible given constraints on the producer of the JSON document.
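If the producer could emit the array shape, the existing JsonSerializer.DeserializeAsyncEnumerable<T> would already cover it. Here is a minimal sketch, assuming .NET 6 or later. The PackageOwners type and the OwnerStreaming.ReadOwnersAsync helper are hypothetical names made up for this illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Text.Json.Serialization;
using System.Threading;

// Hypothetical POCO matching one element of the array-shaped document.
public sealed class PackageOwners
{
    [JsonPropertyName("id")]
    public string Id { get; set; } = "";

    [JsonPropertyName("owners")]
    public List<string> Owners { get; set; } = new();
}

public static class OwnerStreaming
{
    // Streams the root-level array one element at a time and projects each
    // element to the key/value shape used by the object-rooted document.
    public static async IAsyncEnumerable<KeyValuePair<string, List<string>>> ReadOwnersAsync(
        Stream utf8Json,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var entries = JsonSerializer.DeserializeAsyncEnumerable<PackageOwners>(
            utf8Json, cancellationToken: cancellationToken);

        await foreach (PackageOwners? entry in entries)
        {
            if (entry is null)
            {
                continue; // a literal null element in the array
            }

            yield return new KeyValuePair<string, List<string>>(entry.Id, entry.Owners);
        }
    }
}
```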

Risks

It might be entirely unclear that this is how you do streaming object deserialization. The nuance between one and two type parameters is perhaps too subtle.

This suggested feature may be a bit frustrating in that, I wager, most JSON objects do not have homogeneous property values. So perhaps a lot of folks will just use object as the TValue which (from what I can tell) falls through to the DOM API for the returned object values.

It is quite likely that this method would need to allow duplicate property names. Otherwise, the streaming state would need to track property names that have already been seen in order to error out. It would need to be abundantly clear to callers that they need to do duplicate property name checks themselves (if necessary).
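As an illustration of what a caller-side duplicate check might look like, here is a minimal sketch that wraps any stream of pairs. The extension method name ThrowOnDuplicateKeys is hypothetical, and the choice of ordinal comparison and JsonException is an assumption. Note that it has exactly the cost described above: the set of seen names grows with the document.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Threading;

public static class DuplicateKeyExtensions
{
    // Hypothetical opt-in wrapper: passes pairs through unchanged and throws
    // as soon as a property name is seen for the second time.
    public static async IAsyncEnumerable<KeyValuePair<string, TValue>> ThrowOnDuplicateKeys<TValue>(
        this IAsyncEnumerable<KeyValuePair<string, TValue>> source,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Memory use is proportional to the number of distinct property names.
        var seen = new HashSet<string>(StringComparer.Ordinal);

        await foreach (var pair in source.WithCancellation(cancellationToken))
        {
            if (!seen.Add(pair.Key))
            {
                throw new JsonException($"Duplicate property name '{pair.Key}'.");
            }

            yield return pair;
        }
    }
}
```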

@joelverhagen joelverhagen added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jan 24, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Jan 24, 2022
@ghost

ghost commented Jan 24, 2022

Tagging subscribers to this area: @dotnet/area-system-text-json
See info in area-owners.md if you want to be subscribed.

@eiriktsarpalis
Member

"This suggested feature may be a bit frustrating in that, I wager, most JSON objects do not have homogeneous property values."

That's an excellent point, which in my view illustrates that this is a niche application. Rather than exposing such functionality as a dedicated method, we should instead offer extensibility points that let users write extensions that support their bespoke scenarios. I believe it could be addressed by #63795.

@eiriktsarpalis
Member

Assuming #63795 is implemented, it should be possible to write an async converter that reads the root-level object as a Queue<KeyValuePair<string, T>>. Then wrap deserialization into an IAsyncEnumerable using root-level partial read methods as discussed in #29902 (and tracked by #63795).

@joelverhagen
Member Author

Awesome! I'll follow #63795 then and give it a try whenever it lands. Thanks for your time, @eiriktsarpalis!

@steveharter
Member

Supporting async JSON object deserialization with strongly-typed <T> values implies returning a "partial object", which is unsafe (e.g. property getters/setters may depend on other properties not yet deserialized). Thus I understand why Queue<KeyValuePair<string, T>> is suggested instead of the actual object.

"This suggested feature may be a bit frustrating in that, I wager, most JSON objects do not have homogeneous property values. So perhaps a lot of folks will just use object as the TValue which (from what I can tell) falls through to the DOM API for the returned object values."

Assuming property types are not homogeneous, then yes, a DOM type would be easiest, especially if type-specific POCO logic on getters/setters is not present. Deserializing into Queue<KeyValuePair<string, JsonNode>> would be the most straightforward since the JsonNode DOM allows modifications, unlike the read-only JsonElement DOM.

With JsonNode, the client-side IAsyncEnumerable + Queue<KeyValuePair<string, JsonNode>> behavior would return each top-level property, including any nested objects/arrays. JsonNode, and its lower-level implementation, JsonElement, do not support an async pattern for each nested JSON element, so the entire JSON element (including all child properties/arrays) is loaded into memory before processing. This top-level-property async granularity is probably desired anyway to avoid having to deal with multiple IAsyncEnumerable instances and "partial properties", but I suppose some scenarios may want an IAsyncEnumerable for every nested JSON object/array.
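To make the granularity concrete, here is a rough sketch of the per-property JsonNode shape. It is not the async converter discussed above: it is synchronous and assumes the whole document is already in a buffer, so it only illustrates that each top-level value (with all of its children) is materialized as one JsonNode. The TopLevelProperties.Read name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Nodes;

public static class TopLevelProperties
{
    // Hypothetical helper: splits a root JSON object into one JsonNode per
    // top-level property. Nested objects/arrays are parsed in full.
    public static Queue<KeyValuePair<string, JsonNode?>> Read(ReadOnlySpan<byte> utf8Json)
    {
        var reader = new Utf8JsonReader(utf8Json);
        var queue = new Queue<KeyValuePair<string, JsonNode?>>();

        if (!reader.Read() || reader.TokenType != JsonTokenType.StartObject)
        {
            throw new JsonException("Expected a JSON object at the root.");
        }

        // Each iteration consumes one "name": value pair; the loop ends at EndObject.
        while (reader.Read() && reader.TokenType == JsonTokenType.PropertyName)
        {
            string name = reader.GetString()!;
            reader.Read(); // advance to the first token of the property value

            // Parses the complete value and leaves the reader on its last token.
            // A JSON null value is returned as a null reference.
            JsonNode? value = JsonNode.Parse(ref reader);
            queue.Enqueue(new KeyValuePair<string, JsonNode?>(name, value));
        }

        return queue;
    }
}
```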

@ghost ghost locked as resolved and limited conversation to collaborators Feb 27, 2022