
fix: catch cases in which fields required by a JSON schema are not in the JSON object #2712

Merged
merged 22 commits into main on Sep 26, 2023

Conversation

jpivarski
Member

@jpivarski jpivarski commented Sep 15, 2023

To do this, every object must now keep a table of which of their fields have been filled and ensure that the table is completely filled at the end of an object. This adds more cross-checking, so let's try to not slow it down too much.

This PR adds the following.

  • When setting up the instructions (not performance-critical), a bitmap record_checklist_init_ is set to 1 at the position of each field and 0 everywhere else. (So if a record has $n$ fields, the $n$ least significant bits are 1 and the rest are 0.) Each record has its own checklist.
  • In the RapidJSON-SAX parser's StartObject step (encountered a {), record_checklist_init_ is copied into record_checklist_.
  • Whenever we find an expected field in find_key, the bit corresponding to that field in record_checklist_ is set to 0.
  • In the RapidJSON-SAX parser's EndObject step (encountered a }), we | all of the bits in record_checklist_ and see if it's different from 0.

We use a bitmap because setting up the table and checking the table can be done 64 fields at a time, using the vectorization built into ordinary 64-bit registers. The "copying record_checklist_init_ into record_checklist_" and "checking to see if it's different from zero" steps each involve only a single uint64_t if there are fewer than 64 fields.1

There's some indirection because we have std::vector<std::vector<uint64_t>>, but each of these vectors remains a single allocation throughout the performance-critical code; they should stay in CPU cache.
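The checklist mechanics described above can be sketched in Python. This is an illustrative model, not the actual C++ code: the function names are hypothetical, and the real implementation uses std::vector<uint64_t> members.

```python
# Sketch of the per-record field checklist (bit = 1 means "not yet seen").
# Names here are illustrative stand-ins for the C++ members.

def make_checklist_init(num_fields):
    """Prototype bitmap with the num_fields least significant bits set to 1."""
    words = []
    remaining = num_fields
    while remaining > 0:
        n = min(remaining, 64)
        words.append((1 << n) - 1)  # n low bits set in this 64-bit word
        remaining -= n
    return words or [0]

def start_object(checklist_init):
    """StartObject: copy the prototype, one 64-bit word at a time."""
    return list(checklist_init)

def find_key(checklist, j):
    """Key: clear the bit for field index j (division and modulo by 64)."""
    checklist[j >> 6] &= ~(1 << (j & 0x3F))

def end_object(checklist):
    """EndObject: OR all words together; zero means every field was seen."""
    acc = 0
    for word in checklist:
        acc |= word
    return acc == 0
```

With fewer than 64 fields, the checklist is a single word, so copying and checking are each one integer operation, matching the performance argument above.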

Doing this required me to introduce "record identifiers" (to find the right element of the outer std::vector). Since record identifiers were also the blocker for an unrelated optimization, optimistically guessing that the fields in the data will have the same order as the required fields in the schema, I put that optimization in as a test of the machinery and kept it, since it's worthwhile on its own.

Tasks remaining for this PR:

  • When finding a field in find_key, make sure that its bit is not already set, to raise an error (by returning -1) whenever a required field is set more than once.
  • When we reach the condition "not all required fields were seen," iterate through the expected fields and fill any option-type fields with a missing value (two kinds: FillIndexedOptionArray and FillByteMaskedArray) and raise an error (by returning false) on any missing non-option-type fields. This case is allowed to be slow. The implementation could be complex and may need to be put off for another PR. It's also a policy choice: if a field is required but not present, should we really consider that okay because it's option-type?
  • Lots of tests, including nested records and more than 64 fields in a record.
  • Get rid of the big lookup tables. A bit-shift and inverse isn't going to hurt anybody. And maybe using up precious L1 space makes it worse, anyway.
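The clean-up pass in the second task might look like the following outline. This is a hedged Python sketch: cleanup_missing and the per-field descriptors are hypothetical stand-ins, and the real FillIndexedOptionArray/FillByteMaskedArray instructions live in the C++ instruction machinery.

```python
# Sketch of the clean-up pass for "not all required fields were seen":
# iterate over still-set bits; an option-type field gets a missing value,
# anything else is a schema violation. Field descriptors are hypothetical.

def cleanup_missing(checklist, fields):
    """fields: list of (name, is_option_type, fill_missing) per bit index."""
    for j, (name, is_option, fill_missing) in enumerate(fields):
        bit_set = (checklist[j >> 6] >> (j & 0x3F)) & 1
        if bit_set:  # bit still 1 means this field was never encountered
            if is_option:
                fill_missing()  # e.g. FillIndexedOptionArray / FillByteMaskedArray
            else:
                return False  # missing non-option field: schema error
    return True
```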

Footnotes

  1. No comment on getting 512 fields in one go by invoking AVX-512...

@jpivarski
Member Author

@douglasdavis, what do you think of that second checkbox? If the schema says that option-typed fields x, y, z are required and the JSON doesn't have them, should we raise an error? Raising an error would be the simplest thing to do because we can quickly check the bitmap for uncleared bits.

Does the JSONSchema specification have anything to say about the distinction between fields whose values are null and fields that are simply not there?

@codecov

codecov bot commented Sep 15, 2023

Codecov Report

Merging #2712 (9927f9f) into main (ffa1e47) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files
Files Coverage Δ
src/awkward/operations/ak_from_json.py 93.69% <100.00%> (+1.68%) ⬆️

@jpivarski
Member Author

The one error on MacOS Python 3.11 is in pyarrow: tests/test_2616_use_pyarrow_for_strings.py:930

We'll see if it's consistent.

@agoose77
Collaborator

agoose77 commented Sep 16, 2023

@jpivarski JSON Schema has required to indicate fields that cannot be omitted (anything not listed there may be). It doesn't say whether omitted fields should be treated as null, but for awkward, that might be better than making a union.

@agoose77
Collaborator

The one error on MacOS Python 3.11 is in pyarrow: tests/test_2616_use_pyarrow_for_strings.py:930

We'll see if it's consistent.

It's a flaky error. I'm inclined to disable this test, since we see it in other PRs and we can't easily fix it, IIRC.

@douglasdavis
Contributor

@douglasdavis, what do you think of that second checkbox? If the schema says that option-typed fields x, y, z are required and the JSON doesn't have them, should we raise an error? Raising an error would be the simplest thing to do because we can quickly check the bitmap for uncleared bits.

My first thought is that an error makes sense here; if the JSON Schema has the field (listed with "type": ["string", "null"]), I'd expect that property to always appear in the JSON, with null if necessary, instead of being omitted. If it's omitted, it feels like a mismatch. But it does cross my mind that in real-world messy datasets, people have dropped nulls in their workflows for any number of reasons.

Does the JSONSchema specification have anything to say about the distinction between fields whose values are null and fields that are simply not there?

There is the "required" parameter for property definition: https://json-schema.org/understanding-json-schema/reference/object.html#required-properties
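To make the distinction concrete: required constrains the presence of a key, not whether its value may be null. A minimal stdlib-only illustration (missing_required is a hypothetical helper, not part of awkward or any validator):

```python
# Minimal illustration of JSON Schema's "required" keyword semantics:
# it constrains *presence* of a key, not nullability of its value.
import json

schema = {
    "type": "object",
    "properties": {"merged_at": {"type": ["string", "null"]}},
    "required": ["merged_at"],
}

def missing_required(obj, schema):
    """Required keys absent from obj; an explicit null still counts as present."""
    return [key for key in schema.get("required", []) if key not in obj]

present_null = json.loads('{"merged_at": null}')  # satisfies "required"
absent = json.loads('{}')                         # violates "required"
```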

@douglasdavis
Contributor

I just realized Angus had already pointed out "required" before my comment above :)

@jpivarski jpivarski marked this pull request as draft September 19, 2023 02:22
@jpivarski
Member Author

Don't let this one hold up the next release. I'll get it done soon, though.

@jpivarski
Member Author

All this needs now is a big round of tests. The JSON example that motivated this (which has missing keys where option-type is expected) is now parsable:

>>> import awkward as ak
>>> schema = {
...     "title": "untitled",
...     "description": "Auto generated by dask-awkward",
...     "type": "object",
...     "properties": {
...         "payload": {
...             "type": "object",
...             "properties": {
...                 "pull_request": {
...                     "type": ["object", "null"],
...                     "properties": {"merged_at": {"type": ["string", "null"]}},
...                 }
...             },
...         }
...     },
... }
>>> with open("tests/samples/complex-nested.json", "rb") as file:
...     array = ak.from_json(file, line_delimited=True, schema=schema)
... 
>>> array.show(type=True)
type: 10 * {
    payload: {
        pull_request: ?{
            merged_at: ?string
        }
    }
}
[{payload: {pull_request: None}},
 {payload: {pull_request: {merged_at: None}}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}},
 {payload: {pull_request: None}}]

(tests/samples/complex-nested.json is the first 10 lines of https://data.gharchive.org/2015-01-01-10.json.gz)

By the way, you can field-slice through missing data:

>>> array["payload", "pull_request", "merged_at"]
<Array [None, None, None, None, ..., None, None, None] type='10 * ?string'>

We'd do better with a less empty dataset...

>>> array = ak.from_iter([
...     {"payload": None},
...     {"payload": {"pull_request": None}},
...     {"payload": {"pull_request": {"merged_at": None}}},
...     {"payload": {"pull_request": {"merged_at": "1970-01-01"}}},
... ])
>>> array["payload", "pull_request", "merged_at"]
<Array [None, None, None, '1970-01-01'] type='4 * ?string'>

@jpivarski jpivarski marked this pull request as ready for review September 20, 2023 20:54
@jpivarski
Member Author

This is done, though in the future, we may want to add string → date parsing in the C++ layer (avoid making lots of big date strings).

It adds a bitmap called "checklist" for all of the fields in every record (every record has its own checklist), and

  • StartObject (encountering a { token) initializes the checklist, copying 64 bits at a time from a constant prototype.
  • Key (encountering a "...": token)
    • checks to see if the bit has already been checked; duplicate keys after the first are ignored
    • checks off the bit corresponding to the field
  • EndObject (encountering a } token) collapses the checklist with bitwise-or and checks for differences from zero, as a quick way to see if each field has been encountered.
  • If any fields were missed, it enters a slow clean-up phase, in which it iterates over every field and for each unchecked bit,
    • an option-type field gets filled with None (thus ignoring the distinction between option-type field has value null and option-type field is not present).
    • other types of fields raise errors ("JSON data does not match schema").
  • Along the way, I also implemented an optimization: keys are now expected to be encountered in the order they appear in the required list of the schema. If the keys in the data have the same order, they are always the first to be string-compared, minimizing string comparisons. If not, it still works, but it may check more fields.

So this is how it stands: all of the features have been added, though some cases are favored over others for optimization. Having all keys from the required list of the schema, in the order specified by that list, is the fastest case. Missing option-type fields and out-of-order fields are allowed, but slower.
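The ordering optimization can be sketched as follows. This is a simplified Python stand-in for the C++ find_key, assuming a circular scan starting at the next expected schema position; the real code compares strings against RapidJSON buffers.

```python
# Illustrative sketch of the key-lookup optimization: start string
# comparisons at the position where the next schema key is expected,
# so data whose keys arrive in schema order needs one comparison per key.

def make_finder(schema_keys):
    expected = 0  # index of the key we expect to see next

    def find_key(key):
        nonlocal expected
        n = len(schema_keys)
        for step in range(n):
            i = (expected + step) % n  # circular scan from the expected slot
            if schema_keys[i] == key:
                expected = (i + 1) % n  # the following key is expected next
                return i
        return -1  # unknown key

    return find_key
```

In-order data hits the first comparison every time; out-of-order data still resolves, just after more comparisons, matching the "allowed, but slower" behavior above.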

Collaborator

@agoose77 agoose77 left a comment

LGTM!

Comment on lines +180 to +181
chunkmask = (uint64_t)1 << (j & 0x3f);
if ((record_checklist_[argument2()][j >> 6] & chunkmask) == 0) {
Collaborator

Could this be key_already_filled(argument2(), j)?

Member Author

It wouldn't get to reuse the chunkmask. (Not a big deal, but this is a hot loop in an inline function.)

stringi = instructions_.data()[i * 4 + 1];
start = offsets[stringi];
stop = offsets[stringi + 1];
if (strncmp(str, &chars[start], (size_t)(stop - start)) == 0) {
return instructions_.data()[i * 4 + 2];
// ensure that the checklist bit is 1
chunkmask = (uint64_t)1 << (j & 0x3f);
Collaborator

What's the intention of the mask with 0x3f? It ensures that j & 0x3f always satisfies 0 <= (j & 0x3f) < 64, but one could also impose argument2() < 64. I don't know enough about this logic beyond asking this question :)

Member Author

It's because you could have more than 64 fields in a RecordArray, and & 0x3f is % 64. I'd have to remind myself of the logic, but this is part of dealing with batches.

Collaborator

Right, my question was more concerning what should happen when you get j > 64, i.e. why wrap around? In any case, if it passes the tests, that's good enough for me.

Member Author

Ummm. (It's been a while since I wrote it, and I can't easily look at the code now.) There was a reason to have both a division by 64 and a modulo by 64; the bit-shift is the division and the bit-mask is the modulo.

The tests do test this, since they include records with more than 64 fields.

@jpivarski jpivarski enabled auto-merge (squash) September 26, 2023 21:15
@jpivarski
Copy link
Member Author

I set this to auto-merge. I just learned from @henryiii that there's an advantage to releasing a new awkward-cpp before the Python 3.12 release in a week and a half, and this PR is a good excuse to release a new awkward-cpp in the next few days.

@jpivarski jpivarski merged commit d9d020d into main Sep 26, 2023
37 checks passed
@jpivarski jpivarski deleted the jpivarski/fix-json-from-schema-with-missing-field branch September 26, 2023 21:31
@jpivarski
Copy link
Member Author

I was walking and I remembered the reason for the division and modulo by 64. Since records can have more than 64 fields, we can't have one uint64_t integer be the bitmap for them all. That's why it's a std::vector<uint64_t> for each distinct record (and therefore a std::vector<std::vector<uint64_t>>). The bits for each field have to be distributed among the 64-bit integers in the same way as place value arithmetic, but base 64 instead of base 10. If we need to get at a middle digit, we need to divide to get rid of the lower significance part, and modulo to get rid of the higher significance part.

If we never had more than 64 fields, we could use a single uint64_t and only divide, no modulo.
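The base-64 place-value arithmetic described above reduces to two bit operations per field index. A quick Python sketch (locate is an illustrative name, not the C++ code):

```python
# Field index j -> (word, bit) within a base-64 "place value" bitmap:
# the shift is the division by 64, the mask is the modulo by 64.
def locate(j):
    return j >> 6, j & 0x3F  # (which uint64_t word, which bit within it)
```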


Successfully merging this pull request may close these issues.

ak.from_json with a schema= argument raises exception in ak_from_buffers due to size difference
3 participants