-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: printing a typetracer array should not touch it #3019
feat!: printing a typetracer array should not touch it #3019
Conversation
…ching on the Dask head node doesn't allow for errors on the Dask workers
Maybe tests/test_2714_from_buffers_placeholders.py to test all of the PlaceholderArrays? And tests/test_2660_expected_container_keys_from_form.py to test all of the layout types to ensure that |
@jpivarski computing the shape of a buffer requires reading iff. the buffer provider does not factorise the metadata from the array contents. This is true for parquet and partially true for ROOT TTree's; to compute shape |
The shapes that we're talking about are regular dimensions, and I don't know of any formats that fail to factorize out regular dimensions, except for those formats that lack metadata entirely.
I can't think of any situations in which needing to know the (fixed-length) shape of a buffer would require any non-metadata to be read. It occurred to me last night that I should just grep the awkward and dask-awkward codebases for " Footnotes
|
@jpivarski are you talking about Because (here I'm talking about nplike buffer lengths) any time a content is nested inside a list-type, the length of its metadata-buffers depends upon reading the parent list-offsets. We handle this process in the input layer e.g. dask-awkward / uproot. In simple terms, |
correct, parquet is always a column, and there's no regular ND array. The only fixed width type is byte-array, but I don't think anyone stores regular arrays in that.
You mean "item"? An OPTIONAL field can have a REPEATED child in the parquet spec, so you could say it's fully general, just parquet-mr decided that you always need both these levels.
For completeness, I don't recall jsonschema having a list-length specifier, but it could in theory have created. It's a horrible data describing grammar. |
The place where it happens is here: awkward/src/awkward/contents/numpyarray.py Line 194 in b7eb8b9
It's touching the TypeTracerArray's Because of this, asking for the Form of any array qualifies as touching its shape, but this is a bit of a technicality. It would be a little different if it were touching any length attributes, since lengths are not known at type-tracing time the way that a NumpyArray's I just realized why needing to know lengths requires some data-reading: the length of a Content node inside of a variable-length list is sometimes given by metadata (in Parquet) and sometimes not (in ROOT). That's a reason why length-touching matters, but this Maybe this part of the report is just badly named: what we sometimes care about is length-touching, but this is reporting shape-touching. |
That's what I was referring to (although it was a tangential point).
I had forgotten about JSONSchema! Actually, JSONSchema lets you set the minimum and maximum allowed lengths of lists at a given level, and these two numbers can be the same, which specifies a fixed-length list. Now that I think of it, |
Yes, we were careful to distinguish Additionally, although we might be able to compute the length of a deeply nested buffer from Parquet by a single read, we actually need to read the metadata buffers that |
Will circle back to this ASAP. |
@agoose77 Let me know if this PR makes sense. If not, I'll close it. |
@jpivarski thanks for the nudge. My read of this PR is that I note that things like I wonder whether we should have a better string value for placeholders than [
[1, 2, 3],
[4, 5],
[6]
] The known-data repr of [1, 2, 3] so the repr of [
[1, 2, 3],
[4, 5],
[6]
] If the list offsets are placeholders, then the repr of [
??,
??,
??
] I think we might to do something smarter here. What do you think? |
Would something better be
or
That is, acknowledging that there are lists (whose lengths we don't know), and only using "??" for numeric values? Maybe, but
The type is known, and it's something that users can look at. Going to the trouble of producing the right nesting, within the right number of characters and handling ellipses correctly (i.e. not getting |
@jpivarski you know what, let's just merge this as-is. We can iterate if someone isn't happy with it! That saves us all a headache. |
Yeah, we can consider it iterative. I think that the string Okay, I'll merge it as-is. |
I think the rationale for having
print(array)
touch all of the data inarray
(whether it's high-level or a layout) is so that the delayed operation is exactly the same as an eager operation. In Dask, a DAG-builder has to touch everything in aprint(array)
operation so that all of the data will be available to the Dask worker to actually print that string.However,
print
output.print
operation on deeply nested data is probably not sensitive to all of the deeply nested data—it's limited to a line width or 80 characters (configurable), so only a few parts of it are going to get printed. The Awkward 1 VirtualArray was careful to only materialize that which appears in the printed string, for a point of comparison.__str__
or__repr__
string to a variable and does some computations with it, and even if the relevant data are close enough to the beginning or end of the array to appear in that string, I don't think there is (or should be) a strong user expectation that this string will be exactly the same as a string you'd get from an eager array. The string representation of a Python object, intended for interactive feedback, should not be relied upon programmatically. For example, whenever anyone doesto get the values of a NumPy array, they are doing a Bad Thing™.
I think there's a stronger user expectation that observing the string representation of a Python object does not change that object, and that inclusion in the set of "touched" buffers is an important enough change to worry about.
This PR does two things:
__str__
and__repr__
) non-touching, and it only affects the string representation context (i.e. not changing every use ofarray_str
, just those in__str__
and__repr__
).There is now a difference between
__str__
and__repr__
on eager arrays versus in delayed operations: on eager arrays, the string representations contain values, and in delayed operations, they contain values if some other operation touched that data, and??
placeholders otherwise.To demonstrate that high-level and layout string representations no longer touch:
and
They touch the shape but not the data. I'm 90% sure that touching the shape does not invoke data-reading, only metadata, right @agoose77 and @martindurant?
Now to show that the Dask worker will not explode when trying to work with the data that now has PlaceholderArrays anywhere that hasn't been touched by some operation other than the printing:
I have not added tests, and therefore, this PR is not ready to be merged!
@agoose77, which test files should I fill with tests of
str(array)
for arrays containing TypeTracerArrays and PlaceholderArrays to be sure that the TypeTracerArrays don't report as data-touching and that the PlaceholderArrays don't raise TypeErrors? I want to efficiently cover all of the cases. (I could make a new suite of examples to test, but I think there are probably examples in some test file already.)