ak.from_json with a schema= argument raises exception in ak_from_buffers due to size difference #2709

Closed
douglasdavis opened this issue Sep 13, 2023 · 1 comment · Fixed by #2712
Labels: bug (The problem described is something that must be fixed)

Comments


douglasdavis commented Sep 13, 2023

Version of Awkward Array

2.4.2

Description and code to reproduce

While working on dask-contrib/dask-awkward#94, I've come across an issue in ak.from_json that raises from ak_from_buffers:

TypeError: size of array (360) is less than size of form (7234)

Reproducer:

(The schema is meant to produce an array of records with a top-level field "payload", which contains a subfield "pull_request" that is either null or a record with a subfield "merged_at" that is either a string or null.)

In [15]: schema = {
    ...:     "title": "untitled",
    ...:     "description": "Auto generated by dask-awkward",
    ...:     "type": "object",
    ...:     "properties": {
    ...:         "payload": {
    ...:             "type": "object",
    ...:             "properties": {
    ...:                 "pull_request": {
    ...:                     "type": ["object", "null"],
    ...:                     "properties": {"merged_at": {"type": ["string", "null"]}},
    ...:                 }
    ...:             },
    ...:         }
    ...:     },
    ...: }

In [16]: with fsspec.open(
    ...:     "https://data.gharchive.org/2015-01-01-10.json.gz", compression="infer", mode="rt"
    ...: ) as f:
    ...:     array = ak.from_json(f.read(), line_delimited=True, schema=schema)

More info:

My assumption is that because pull_request, the field sandwiched between payload and merged_at, is of type object-or-null, and only 360 of its entries are not null, something is going wrong when building the array for this nested nullable type. (I double-checked the numbers via:)

# note: array here was created without schema=
In [19]: ak.drop_none(array.payload.pull_request)
Out[19]: <Array [{url: ..., ...}, ..., {url: ..., ...}] type='360 * {url: string, id...'>

More info on how I came across the issue via dask-awkward:

The schema can be created manually with dask-awkward code:

# note: array here was created without schema=
In [21]: schema = dak.layout_to_jsonschema(
    ...:     array[["payload"], ["pull_request"], ["merged_at"]].layout
    ...: )

This is the same schema that gets created automatically by the column optimizer if we use field access on a dask-awkward collection and run compute:

In [28]: dak_array = dak.from_json(["https://data.gharchive.org/2015-01-01-10.json.gz"])

In [29]: dak_array.payload.pull_request.merged_at.compute()
douglasdavis added the bug (unverified) label (The problem described would be a bug, but needs to be triaged) on Sep 13, 2023
douglasdavis changed the title from "ak.from_json with a schema argument raises exception in ak_from_buffers due to size difference" to "ak.from_json with a schema= argument raises exception in ak_from_buffers due to size difference" on Sep 13, 2023
jpivarski added the bug label (The problem described is something that must be fixed) and removed the bug (unverified) label on Sep 15, 2023

jpivarski commented Sep 15, 2023

A key aspect of this bug is that "merged_at" is not a key in all of these JSON objects. The ArrayBuilder (from_json without a schema) chooses to represent a set of JSON objects with different fields at the same level as a single RecordArray, filling the missing fields with null values, rather than as a UnionArray of distinct RecordArrays with and without the missing field.
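
As a minimal sketch of that merging behavior, here is a made-up two-line input (not the GH Archive data; the exact display may differ slightly from an actual session):

>>> import awkward as ak
>>> # two JSON objects with different fields at the same level
>>> tiny = ak.from_json('{"x": 1}\n{"y": 2}', line_delimited=True)
>>> tiny.show(type=True)
type: 2 * {
    x: ?int64,
    y: ?int64
}
[{x: 1, y: None},
 {x: None, y: 2}]

The builder merges both records into one RecordArray with option-type fields, rather than producing a union of a {x} record type and a {y} record type.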

To show that this is the issue, see what happens when we look only at the JSON objects with this field:

>>> import fsspec, awkward as ak
>>> schema = {
...     "title": "untitled",
...     "description": "Auto generated by dask-awkward",
...     "type": "object",
...     "properties": {
...         "payload": {
...             "type": "object",
...             "properties": {
...                 "pull_request": {
...                     "type": ["object", "null"],
...                     "properties": {"merged_at": {"type": ["string", "null"]}},
...                 }
...             },
...         }
...     },
... }
>>> with fsspec.open(
...     "https://data.gharchive.org/2015-01-01-10.json.gz", compression="infer", mode="rt"
... ) as f:
...     subset = "".join([x for x in list(f) if "\"merged_at\":" in x])
... 
>>> array = ak.from_json(subset, line_delimited=True, schema=schema)
>>> array.show(type=True)
type: 360 * {
    payload: {
        pull_request: ?{
            merged_at: ?string
        }
    }
}
[{payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:00:32Z'}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:07Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:08Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:08Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:11Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:23Z'}}},
 ...,
 {payload: {pull_request: {merged_at: '2015-01-01T10:58:31Z'}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:59:00Z'}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:59:44Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:59:55Z'}}}]

(Those "merged_at" fields that are None are distinct from the cases that we dropped, which didn't even have a "merged_at" field.)

This could be solved either by from_json with schema raising an error when a requested field doesn't exist, or by filling in those cases with a null value. If the schema declares the field as option-type, then we could fill it with a null value; if the schema does not declare it as option-type, then we must raise an error (we wouldn't have any valid value to fill it with).
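
A sketch of that proposed rule (hypothetical names, not Awkward's actual parser internals):

def handle_missing_key(field_schema):
    # field_schema is the JSONSchema entry for the missing key,
    # e.g. {"type": ["string", "null"]} for "merged_at"
    declared = field_schema["type"]
    types = declared if isinstance(declared, list) else [declared]
    if "null" in types:
        return None  # schema declares option-type: fill with a null value
    # schema is non-nullable: there is no valid value to fill with
    raise TypeError("JSON object is missing a non-nullable field required by the schema")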
