
fix: catch cases in which fields required by a JSON schema are not in the JSON object #2712

Merged
merged 22 commits into main on Sep 26, 2023

Conversation

jpivarski
Member

@jpivarski jpivarski commented Sep 15, 2023

To do this, every object must now keep a table of which of their fields have been filled and ensure that the table is completely filled at the end of an object. This adds more cross-checking, so let's try to not slow it down too much.

This PR adds the following.

  • When setting up the instructions (not performance-critical), a bitmap record_checklist_init_ is set to 1 at the position of each field and 0 everywhere else. (So if a record has $n$ fields, the $n$ least significant bits are 1 and the rest are 0.) Each record has its own checklist.
  • In the RapidJSON-SAX parser's StartObject step (encountered a {), record_checklist_init_ is copied into record_checklist_.
  • Whenever we find an expected field in find_key, the bit corresponding to that field in record_checklist_ is set to 0.
  • In the RapidJSON-SAX parser's EndObject step (encountered a }), we | all of the bits in record_checklist_ and see if it's different from 0.

We use a bitmap because setting up the table and checking the table can be done 64 fields at a time, using the vectorization built into ordinary 64-bit registers. The "copying record_checklist_init_ into record_checklist_" and "checking to see if it's different from zero" steps each involve only a single uint64_t if there are fewer than 64 fields.1

There's some indirection because we have std::vector<std::vector<uint64_t>>, but each of these vectors remains a single allocation throughout the performance-critical code; they should stay in CPU cache.
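The checklist mechanics described above can be sketched in Python. This is an illustrative model, not the actual C++ code: the function names are hypothetical, and the real implementation uses std::vector<uint64_t> members.

```python
# Sketch of the per-record field checklist (bit = 1 means "not yet seen").
# Names here are illustrative stand-ins for the C++ members.

def make_checklist_init(num_fields):
    """Prototype bitmap with the num_fields least significant bits set to 1."""
    words = []
    remaining = num_fields
    while remaining > 0:
        n = min(remaining, 64)
        words.append((1 << n) - 1)  # n low bits set in this 64-bit word
        remaining -= n
    return words or [0]

def start_object(checklist_init):
    """StartObject: copy the prototype, one 64-bit word at a time."""
    return list(checklist_init)

def find_key(checklist, j):
    """Key: clear the bit for field index j (division and modulo by 64)."""
    checklist[j >> 6] &= ~(1 << (j & 0x3F))

def end_object(checklist):
    """EndObject: OR all words together; zero means every field was seen."""
    acc = 0
    for word in checklist:
        acc |= word
    return acc == 0
```

With fewer than 64 fields, the checklist is a single word, so copying and checking are each one integer operation, matching the performance argument above.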

Doing this required me to introduce "record identifiers" (to find the right element of the outer std::vector). Since record identifiers were also the blocker for an unrelated optimization, optimistically guessing that the fields in the data will have the same order as the required fields in the schema, I put that optimization in as a test of the machinery and kept it, since it's worthwhile on its own.

Tasks remaining for this PR:

  • When finding a field in find_key, make sure that its bit is not already set, to raise an error (by returning -1) whenever a required field is set more than once.
  • When we reach the condition "not all required fields were seen," iterate through the expected fields and fill any option-type fields with a missing value (two kinds: FillIndexedOptionArray and FillByteMaskedArray) and raise an error (by returning false) on any missing non-option-type fields. This case is allowed to be slow. The implementation could be complex and may need to be put off for another PR. It's also a policy choice: if a field is required but not present, should we really consider that okay because it's option-type?
  • Lots of tests, including nested records and more than 64 fields in a record.
  • Get rid of the big lookup tables. A bit-shift and inverse isn't going to hurt anybody. And maybe using up precious L1 space makes it worse, anyway.
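The clean-up pass in the second task might look like the following outline. This is a hedged Python sketch: cleanup_missing and the per-field descriptors are hypothetical stand-ins, and the real FillIndexedOptionArray/FillByteMaskedArray instructions live in the C++ instruction machinery.

```python
# Sketch of the clean-up pass for "not all required fields were seen":
# iterate over still-set bits; an option-type field gets a missing value,
# anything else is a schema violation. Field descriptors are hypothetical.

def cleanup_missing(checklist, fields):
    """fields: list of (name, is_option_type, fill_missing) per bit index."""
    for j, (name, is_option, fill_missing) in enumerate(fields):
        bit_set = (checklist[j >> 6] >> (j & 0x3F)) & 1
        if bit_set:  # bit still 1 means this field was never encountered
            if is_option:
                fill_missing()  # e.g. FillIndexedOptionArray / FillByteMaskedArray
            else:
                return False  # missing non-option field: schema error
    return True
```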

Footnotes

  1. No comment on getting 512 fields in one go by invoking AVX-512...

@jpivarski
Member Author

@douglasdavis, what do you think of that second checkbox? If the schema says that option-typed fields x, y, z are required and the JSON doesn't have them, should we raise an error? Raising an error would be the simplest thing to do because we can quickly check the bitmap for uncleared bits.

Does the JSONSchema specification have anything to say about the distinction between fields whose values are null and fields that are simply not there?

@codecov

codecov bot commented Sep 15, 2023

Codecov Report

Merging #2712 (9927f9f) into main (ffa1e47) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files
Files Coverage Δ
src/awkward/operations/ak_from_json.py 93.69% <100.00%> (+1.68%) ⬆️

@jpivarski
Member Author

The one error on MacOS Python 3.11 is in pyarrow: tests/test_2616_use_pyarrow_for_strings.py:930

We'll see if it's consistent.

@agoose77
Collaborator

agoose77 commented Sep 16, 2023

@jpivarski JSON Schema has required to indicate fields that cannot be omitted (anything not listed there may be). It doesn't say whether omitted fields should be treated as null, but for awkward, that might be better than making a union.

@agoose77
Collaborator

The one error on MacOS Python 3.11 is in pyarrow: tests/test_2616_use_pyarrow_for_strings.py:930

We'll see if it's consistent.

It's a flaky error. I'm inclined to disable this test, since we see it in other PRs and we can't easily fix it, IIRC.

@douglasdavis
Contributor

@douglasdavis, what do you think of that second checkbox? If the schema says that option-typed fields x, y, z are required and the JSON doesn't have them, should we raise an error? Raising an error would be the simplest thing to do because we can quickly check the bitmap for uncleared bits.

My first thought is that an error makes sense here; if the JSON Schema has the field (listed with "type": ["string", "null"]), I'd expect that property to always appear in the JSON, with null if necessary, instead of being omitted. If it's omitted, it feels like a mismatch. But it does cross my mind that in real-world messy datasets, people have dropped nulls in their workflows for any number of reasons.

Does the JSONSchema specification have anything to say about the distinction between fields whose values are null and fields that are simply not there?

There is the "required" parameter for property definition: https://json-schema.org/understanding-json-schema/reference/object.html#required-properties
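To make the distinction concrete: required constrains the presence of a key, not whether its value may be null. A minimal stdlib-only illustration (missing_required is a hypothetical helper, not part of awkward or any validator):

```python
# Minimal illustration of JSON Schema's "required" keyword semantics:
# it constrains *presence* of a key, not nullability of its value.
import json

schema = {
    "type": "object",
    "properties": {"merged_at": {"type": ["string", "null"]}},
    "required": ["merged_at"],
}

def missing_required(obj, schema):
    """Required keys absent from obj; an explicit null still counts as present."""
    return [key for key in schema.get("required", []) if key not in obj]

present_null = json.loads('{"merged_at": null}')  # satisfies "required"
absent = json.loads('{}')                         # violates "required"
```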

@douglasdavis
Contributor

I just realized Angus had already pointed out "required" before my comment above :)

@jpivarski jpivarski marked this pull request as draft September 19, 2023 02:22
@jpivarski
Member Author

Don't let this one hold up the next release. I'll get it done soon, though.

@jpivarski
Member Author

All this needs now is a big round of tests. The JSON example that motivated this (which has missing keys where option-type is expected) is now parsable:

>>> import awkward as ak
>>> schema = {
...     "title": "untitled",
...     "description": "Auto generated by dask-awkward",
...     "type": "object",
...     "properties": {
...         "payload": {
...             "type": "object",
...             "properties": {
...                 "pull_request": {
...                     "type": ["object", "null"],
...                     "properties": {"merged_at": {"type": ["string", "null"]}},
...                 }
...             },
...         }
...     },
... }
>>> with open("tests/samples/complex-nested.json", "rb") as file:
...     array = ak.from_json(file, line_delimited=True, schema=schema)
... 
>>> array.show(type=True)
type: 10 * {
    payload: {
        pull_request: ?{
            merged_at: ?string
        }
    }
}
[{payload: {pull_request: None}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}}]

(tests/samples/complex-nested.json is the first 10 lines of https://data.gharchive.org/2015-01-01-10.json.gz)

By the way, you can field-slice through missing data:

>>> array["payload", "pull_request", "merged_at"]
<Array [None, None, None, None, ..., None, None, None] type='10 * ?string'>

We'd do better with a less empty dataset...

>>> array = ak.from_iter([
...     {"payload": None},
...     {"payload": {"pull_request": None}},
...     {"payload": {"pull_request": {"merged_at": None}}},
...     {"payload": {"pull_request": {"merged_at": "1970-01-01"}}},
... ])
>>> array["payload", "pull_request", "merged_at"]
<Array [None, None, None, '1970-01-01'] type='4 * ?string'>

@jpivarski jpivarski marked this pull request as ready for review September 20, 2023 20:54
@jpivarski
Member Author

This is done, though in the future, we may want to add string → date parsing in the C++ layer (avoid making lots of big date strings).

It adds a bitmap called "checklist" for all of the fields in every record (every record has its own checklist), and

  • StartObject (encountering a { token) initializes the checklist, copying 64 bits at a time from a constant prototype.
  • Key (encountering a "...": token)
    • checks to see if the bit has already been checked; duplicate keys after the first are ignored
    • checks off the bit corresponding to the field
  • EndObject (encountering a } token) collapses the checklist with bitwise-or and checks for differences from zero, as a quick way to see if each field has been encountered.
  • If any fields were missed, it enters a slow clean-up phase, in which it iterates over every field and for each unchecked bit,
    • an option-type field gets filled with None (thus ignoring the distinction between option-type field has value null and option-type field is not present).
    • other types of fields raise errors ("JSON data does not match schema").
  • Along the way, I also implemented an optimization: keys are now expected to be encountered in the order they appear in the required list of the schema. If the keys in the data have the same order, they are always the first to be string-compared, minimizing string comparisons. If not, it still works, but it may check more fields.

So this is how it stands: all of the features have been added, though some cases are favored over others for optimization. Having all keys from the required list of the schema, in the order specified by that list, is the fastest case. Missing option-type fields and out-of-order fields are allowed, but slower.
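The ordering optimization can be sketched as follows. This is a simplified Python stand-in for the C++ find_key, assuming a circular scan starting at the next expected schema position; the real code compares strings against RapidJSON buffers.

```python
# Illustrative sketch of the key-lookup optimization: start string
# comparisons at the position where the next schema key is expected,
# so data whose keys arrive in schema order needs one comparison per key.

def make_finder(schema_keys):
    expected = 0  # index of the key we expect to see next

    def find_key(key):
        nonlocal expected
        n = len(schema_keys)
        for step in range(n):
            i = (expected + step) % n  # circular scan from the expected slot
            if schema_keys[i] == key:
                expected = (i + 1) % n  # the following key is expected next
                return i
        return -1  # unknown key

    return find_key
```

In-order data hits the first comparison every time; out-of-order data still resolves, just after more comparisons, matching the "allowed, but slower" behavior above.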

Collaborator

@agoose77 agoose77 left a comment

LGTM!

Comment on lines +180 to +181
chunkmask = (uint64_t)1 << (j & 0x3f);
if ((record_checklist_[argument2()][j >> 6] & chunkmask) == 0) {
Collaborator

Could this be key_already_filled(argument2(), j)?

Member Author

It wouldn't get to reuse the chunkmask. (Not a big deal, but this is a hot loop in an inline function.)

stringi = instructions_.data()[i * 4 + 1];
start = offsets[stringi];
stop = offsets[stringi + 1];
if (strncmp(str, &chars[start], (size_t)(stop - start)) == 0) {
return instructions_.data()[i * 4 + 2];
// ensure that the checklist bit is 1
chunkmask = (uint64_t)1 << (j & 0x3f);
Collaborator

What's the intention of the mask with 0x3f? It ensures that j & 0x3f always satisfies 0 <= (j & 0x3f) < 64, but one could also impose argument2() < 64. I don't know enough about this logic beyond asking this question :)

Member Author

It's because you could have more than 64 fields in a RecordArray, and & 0x3f is % 64. I'd have to remind myself of the logic, but this is part of dealing with batches.

Collaborator

Right, my question was more concerning what should happen when you get j > 64, i.e. why wrap around? In any case, if it passes the tests, that's good enough for me.

Member Author

Ummm. (It's been a while since I wrote it, and I can't easily look at the code now.) There was a reason to have both a division by 64 and a modulo by 64; the bit-shift is the division and the bit-mask is the modulo.

The tests do test this, since they include records with more than 64 fields.

@jpivarski jpivarski enabled auto-merge (squash) September 26, 2023 21:15
@jpivarski
Copy link
Member Author

I set this to auto-merge. I just learned from @henryiii that there's an advantage to releasing a new awkward-cpp before the Python 3.12 release in a week and a half, and this PR is a good excuse to release a new awkward-cpp in the next few days.

@jpivarski jpivarski merged commit d9d020d into main Sep 26, 2023
37 checks passed
@jpivarski jpivarski deleted the jpivarski/fix-json-from-schema-with-missing-field branch September 26, 2023 21:31
@jpivarski
Copy link
Member Author

I was walking and I remembered the reason for the division and modulo by 64. Since records can have more than 64 fields, we can't have one uint64_t integer be the bitmap for them all. That's why it's a std::vector<uint64_t> for each distinct record (and therefore a std::vector<std::vector<uint64_t>>). The bits for each field have to be distributed among the 64-bit integers in the same way as place value arithmetic, but base 64 instead of base 10. If we need to get at a middle digit, we need to divide to get rid of the lower significance part, and modulo to get rid of the higher significance part.

If we never had more than 64 fields, we could use a single uint64_t and only divide, no modulo.
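The base-64 place-value arithmetic described above reduces to two bit operations per field index. A quick Python sketch (locate is an illustrative name, not the C++ code):

```python
# Field index j -> (word, bit) within a base-64 "place value" bitmap:
# the shift is the division by 64, the mask is the modulo by 64.
def locate(j):
    return j >> 6, j & 0x3F  # (which uint64_t word, which bit within it)
```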


Successfully merging this pull request may close these issues.

ak.from_json with a schema= argument raises exception in ak_from_buffers due to size difference
3 participants