-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: catch cases in which fields required by a JSON schema are not in the JSON object #2712
fix: catch cases in which fields required by a JSON schema are not in the JSON object #2712
Conversation
…ich will be used in the checklist-handling.
@douglasdavis, what do you think of that second checkbox? If the schema says that option-typed fields Does the JSONSchema specification have anything to say about the distinction between fields whose values are |
The one error on MacOS Python 3.11 is in pyarrow: tests/test_2616_use_pyarrow_for_strings.py:930 We'll see if it's consistent. |
@jpivarski JSONschema has |
It's a flaky error. I've a mind to disable the test for this as we see it in other PRs, and we can't easily fix it iirc. |
My first thought is an error makes sense here; if the JSONSchema has the field (and listed with
There is the |
I just realized Angus had already pointed out "required" before my comment above :) |
Don't let this one hold up the next release. I'll get it done soon, though. |
All this needs now is a big round of tests. The JSON example that motivated this (which has missing keys where option-type is expected) is now parsable: >>> import awkward as ak
>>> schema = {
... "title": "untitled",
... "description": "Auto generated by dask-awkward",
... "type": "object",
... "properties": {
... "payload": {
... "type": "object",
... "properties": {
... "pull_request": {
... "type": ["object", "null"],
... "properties": {"merged_at": {"type": ["string", "null"]}},
... }
... },
... }
... },
... }
>>> with open("tests/samples/complex-nested.json", "rb") as file:
... array = ak.from_json(file, line_delimited=True, schema=schema)
...
>>> array.show(type=True)
type: 10 * {
payload: {
pull_request: ?{
merged_at: ?string
}
}
}
[{payload: {pull_request: None}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: None}},
{payload: {pull_request: None}},
{payload: {pull_request: None}},
{payload: {pull_request: None}},
{payload: {pull_request: None}},
{payload: {pull_request: None}},
{payload: {pull_request: None}},
{payload: {pull_request: None}}] (tests/samples/complex-nested.json is the first 10 lines of https://data.gharchive.org/2015-01-01-10.json.gz) By the way, you can field-slice through missing data: >>> array["payload", "pull_request", "merged_at"]
<Array [None, None, None, None, ..., None, None, None] type='10 * ?string'> We'd do better with a less empty dataset... >>> array = ak.from_iter([
... {"payload": None},
... {"payload": {"pull_request": None}},
... {"payload": {"pull_request": {"merged_at": None}}},
... {"payload": {"pull_request": {"merged_at": "1970-01-01"}}},
... ])
>>> array["payload", "pull_request", "merged_at"]
<Array [None, None, None, '1970-01-01'] type='4 * ?string'> |
…ttps://github.com/scikit-hep/awkward into jpivarski/fix-json-from-schema-with-missing-field
This is done, though in the future, we may want to add string → date parsing in the C++ layer (avoid making lots of big date strings). It adds a bitmap called "checklist" for all of the fields in every record (every record has its own checklist), and
So this is how it stands: all of the features have been added, though some cases are favored over others for optimization. Having all keys from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
chunkmask = (uint64_t)1 << (j & 0x3f); | ||
if ((record_checklist_[argument2()][j >> 6] & chunkmask) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be key_already_filled(argument2(), j)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wouldn't get to reuse the chunkmask
. (Not a big deal, but this is a hot loop in an inline function.)
stringi = instructions_.data()[i * 4 + 1]; | ||
start = offsets[stringi]; | ||
stop = offsets[stringi + 1]; | ||
if (strncmp(str, &chars[start], (size_t)(stop - start)) == 0) { | ||
return instructions_.data()[i * 4 + 2]; | ||
// ensure that the checklist bit is 1 | ||
chunkmask = (uint64_t)1 << (j & 0x3f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the intention of the mask with 0x3f
? It will ensure that j & 0x3f
is always (0 <= j < 64), but one could also impose argument2() < 64
. I don't know enough about this logic beyond asking this question :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because you could have more than 64 fields in a RecordArray, and & 0x3f
is % 64
. I'd have to remind myself of the logic, but this is part of dealing with batches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, my question was more concerning what should happen when you get j > 64
, i.e. why wrap around? In any case, if it passes the tests, that's good enough for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ummm. (It's been a while since I wrote it, and I can't easily look at the code now.) There was a reason to have both a division by 64 and a modulo by 64; the bit-shift is the division and the bit-mask is the modulo.
The tests do test this, since they include records with more than 64 fields.
I set this to auto-merge. I just learned from @henryiii that there's an advantage to releasing a new awkward-cpp before the Python 3.12 release in a week and a half, and this PR is a good excuse to release a new awkward-cpp in the next few days. |
I was walking and I remembered the reason for the division and modulo by 64. Since records can have more than 64 fields, we can't have one If we never had more than 64 fields, we could use a single |
To do this, every object must now keep a table of which of their fields have been filled and ensure that the table is completely filled at the end of an object. This adds more cross-checking, so let's try to not slow it down too much.
This PR adds the following.
record_checklist_init_
is set to 0 everywhere other than positions for each field. (So if a record hasStartObject
step (encountered a{
),record_checklist_init_
is copied intorecord_checklist_
.find_key
, the bit corresponding to that field inrecord_checklist_
is set to 0.EndObject
step (encountered a}
), we|
all of the bits inrecord_checklist_
and see if it's different from 0.We use a bitmap because setting up the table and checking the table can be done 64 fields at a time, using the vectorization built into ordinary 64-bit registers. The "copying
record_checklist_init_
intorecord_checklist_
" and "checking to see if it's different from zero" steps each only involves a singleuint64_t
if there are fewer than 64 fields.1There's some indirection because we have
std::vector<std::vector<uint64_t>>
but these vectors remain a single allocation in the performance-critical code; they should stay in CPU cache.Doing this required me to introduce "record identifiers" (to find the right element of the outer
std::vector
). Since that was the blocker for an unrelated optimization, optimistically guessing that the fields in data will have the same order as the required fields in the schema, I put that in as a test of the machinery and left it because it's an optimization.Tasks remaining for this PR:
find_key
, make sure that its bit is not already set, to raise an error (by returning-1
) whenever a required field is set more than once.FillIndexedOptionArray
andFillByteMaskedArray
) and raise an error (by returningfalse
) on any missing non-option-type fields. This case is allowed to be slow. The implementation could be complex and may need to be put off for another PR. It's also a policy choice: if a field is required but not present, should we really consider that okay because it's option-type?Footnotes
No comment on getting 512 fields in one go by invoking AVX-512... ↩