ak.from_json with a schema= argument raises exception in ak_from_buffers due to size difference #2709

Closed
douglasdavis opened this issue Sep 13, 2023 · 1 comment · Fixed by #2712
Labels: bug (The problem described is something that must be fixed)

Comments


douglasdavis commented Sep 13, 2023

Version of Awkward Array

2.4.2

Description and code to reproduce

While working on dask-contrib/dask-awkward#94, I've come across an issue in ak.from_json that raises from ak_from_buffers:

TypeError: size of array (360) is less than size of form (7234)

Reproducer:

(The schema is meant to produce an array of records with a top-level field "payload", which contains a subfield "pull_request" that is either null or a record with a subfield "merged_at" that is either a string or null.)

In [15]: schema = {
    ...:     "title": "untitled",
    ...:     "description": "Auto generated by dask-awkward",
    ...:     "type": "object",
    ...:     "properties": {
    ...:         "payload": {
    ...:             "type": "object",
    ...:             "properties": {
    ...:                 "pull_request": {
    ...:                     "type": ["object", "null"],
    ...:                     "properties": {"merged_at": {"type": ["string", "null"]}},
    ...:                 }
    ...:             },
    ...:         }
    ...:     },
    ...: }

In [16]: with fsspec.open(
    ...:     "https://data.gharchive.org/2015-01-01-10.json.gz", compression="infer", mode="rt"
    ...: ) as f:
    ...:     array = ak.from_json(f.read(), line_delimited=True, schema=schema)

More info:

My assumption is that because pull_request, the field sandwiched between payload and merged_at, is of type object-or-null, and only 360 of its entries are not null, something is going wrong when building the array for this nested nullable type. (I double-checked the numbers via:)

# note: array here was created without schema=
In [19]: ak.drop_none(array.payload.pull_request)
Out[19]: <Array [{url: ..., ...}, ..., {url: ..., ...}] type='360 * {url: string, id...'>

More info on how I came across the issue via dask-awkward:

The schema can be created manually with dask-awkward code:

# note: array here was created without schema=
In [21]: schema = dak.layout_to_jsonschema(
    ...:     array[["payload"], ["pull_request"], ["merged_at"]].layout
    ...: )

This is the same schema that gets created automatically by the column optimizer if we use field access on a dask-awkward collection and run compute:

In [28]: dak_array = dak.from_json(["https://data.gharchive.org/2015-01-01-10.json.gz"])

In [29]: dak_array.payload.pull_request.merged_at.compute()
douglasdavis added the bug (unverified) label (The problem described would be a bug, but needs to be triaged) on Sep 13, 2023
douglasdavis changed the title from "ak.from_json with a schema argument raises exception in ak_from_buffers due to size difference" to "ak.from_json with a schema= argument raises exception in ak_from_buffers due to size difference" on Sep 13, 2023
jpivarski added the bug label (The problem described is something that must be fixed) and removed the bug (unverified) label on Sep 15, 2023

jpivarski commented Sep 15, 2023

A key aspect of this bug is that "merged_at" is not a key in all of these JSON objects. The ArrayBuilder (from_json without a schema) chooses to represent a set of JSON objects with different fields at the same level as a single RecordArray, filling the missing fields with null values, rather than as a UnionArray of distinct RecordArrays with and without the missing field.
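
As a minimal sketch of that merging behavior, here is a made-up two-line input (not the GH Archive data; the exact display may differ slightly from an actual session):

>>> import awkward as ak
>>> # two JSON objects with different fields at the same level
>>> tiny = ak.from_json('{"x": 1}\n{"y": 2}', line_delimited=True)
>>> tiny.show(type=True)
type: 2 * {
    x: ?int64,
    y: ?int64
}
[{x: 1, y: None},
 {x: None, y: 2}]

The builder merges both records into one RecordArray with option-type fields, rather than producing a union of a {x} record type and a {y} record type.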

To show that this is the issue, see what happens when we look only at the JSON objects with this field:

>>> import fsspec, awkward as ak
>>> schema = {
...     "title": "untitled",
...     "description": "Auto generated by dask-awkward",
...     "type": "object",
...     "properties": {
...         "payload": {
...             "type": "object",
...             "properties": {
...                 "pull_request": {
...                     "type": ["object", "null"],
...                     "properties": {"merged_at": {"type": ["string", "null"]}},
...                 }
...             },
...         }
...     },
... }
>>> with fsspec.open(
...     "https://data.gharchive.org/2015-01-01-10.json.gz", compression="infer", mode="rt"
... ) as f:
...     subset = "".join([x for x in list(f) if "\"merged_at\":" in x])
... 
>>> array = ak.from_json(subset, line_delimited=True, schema=schema)
>>> array.show(type=True)
type: 360 * {
    payload: {
        pull_request: ?{
            merged_at: ?string
        }
    }
}
[{payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:00:32Z'}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:07Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:08Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:08Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:11Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:01:23Z'}}},
 ...,
 {payload: {pull_request: {merged_at: '2015-01-01T10:58:31Z'}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:59:00Z'}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:59:44Z'}}},
 {payload: {pull_request: {merged_at: '2015-01-01T10:59:55Z'}}}]

(Those "merged_at" fields that are None are distinct from the cases that we dropped, which didn't even have a "merged_at" field.)

This could be solved either by from_json with schema raising an error when a requested field doesn't exist, or by filling in those cases with a null value. If the schema declares the field as option-type, then we could fill it with a null value; if the schema does not declare it as option-type, then we must raise an error (we wouldn't have any valid value to fill it with).
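
A sketch of that proposed rule (hypothetical names, not Awkward's actual parser internals):

def handle_missing_key(field_schema):
    # field_schema is the JSONSchema entry for the missing key,
    # e.g. {"type": ["string", "null"]} for "merged_at"
    declared = field_schema["type"]
    types = declared if isinstance(declared, list) else [declared]
    if "null" in types:
        return None  # schema declares option-type: fill with a null value
    # schema is non-nullable: there is no valid value to fill with
    raise TypeError("JSON object is missing a non-nullable field required by the schema")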
