Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avro reader crash #3188

Open
martindurant opened this issue Jul 26, 2024 · 17 comments
Open

avro reader crash #3188

martindurant opened this issue Jul 26, 2024 · 17 comments
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@martindurant
Copy link
Contributor

martindurant commented Jul 26, 2024

Version of Awkward Array

ak 2.6.5
ak-cpp 34

Description and code to reproduce

I tried to read the avro file at this location https://github.com/Teradata/kylo/blob/master/samples/sample-data/avro/userdata5.avro (downloaded to local) using ak.from_avro_file and got segmentation fault. The file loads fine with fastavro.

macOS 13.6.7, conda install of python 3.10

@martindurant martindurant added the bug (unverified) The problem described would be a bug, but needs to be triaged label Jul 26, 2024
@jpivarski
Copy link
Member

Although there's not much demand (that I know of) for the Avro reader, it's a canary in the coalmine for AwkwardForth and therefore Uproot. If the Avro is not handled correctly, it can fail with a Forth runtime error, but it should not be segfaulting under any circumstances, so this is serious.

I'll try to reproduce it sometime when I have access to my main computer again and see if I can narrow in on the point of failure.

@martindurant
Copy link
Contributor Author

In addition, from_avro_file fails if passed a file-like rather than a str filename. On this line, the attribute should be .outcontents (as in the branch above, L50) not .outarr.

@jpivarski
Copy link
Member

Has PR #3167 solved this issue?

@martindurant
Copy link
Contributor Author

I didn't see that PR. I can test.

@martindurant
Copy link
Contributor Author

martindurant commented Aug 1, 2024

Updating awkward to 55a9a43 (current main) still crashes. Is there a chance of any cached state from a previous run?
awkward-cpp == 36

@jpivarski
Copy link
Member

jpivarski commented Aug 1, 2024

Is there a chance of any cached state from a previous run?

I'm not sure. GitHub Actions recompiles awkward-cpp in each test, so we shouldn't cache from previous runs. (@agoose77, did the test ever have a cache like that?)

Meanwhile, is this an awkward-cpp-only reproducer?

python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'

I'm thinking that we should bisect awkward-cpp itself, removing parts of it until there isn't a segfault and then see which part makes the difference. Awkward relies on many parts of awkward-cpp, but awkward-cpp itself is mostly a bag of functions and classes; it should be easy to remove large parts of it and still be able to rely on the above test.

@martindurant
Copy link
Contributor Author

That line runs fine. The avro reader can read other files (I have a workign test in akimbo for avro), in case that wasn't clear - it only segfaults with the example at the top, a file which reads fine with fastavro.

@martindurant
Copy link
Contributor Author

but I just did

In [1]: from awkward_cpp.lib._ext import ForthMachine32

In [2]: machine = ForthMachine32("1 2 3 4 5")

In [3]: machine = ForthMachine32("1 2 3 4 5")

In [4]: machine.<tab>
segmentation fault

(only happens if you make the machine twice)

@jpivarski
Copy link
Member

The branch jpivarski/test-minimal-awkward-cpp (in do-not-merge PR #3196) removes everything from awkward-cpp except for AwkwardForth. The first commit (0c860e8) removes files and the second commit (2680761) reverts to the modern (non-classic) AppleClang loader.

This should be a sufficient test:

nox -s prepare
pip install -vv ./awkward-cpp/
python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack); vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'

I think the GitHub Actions (which will never pass!) provide artifacts.

This works on my M2 Mac (same for awkward-cpp without removing anything). I have Apple clang version 15.0.0 (clang-1500.3.9.4), so in principle, it shouldn't work. If I understand things, the above doesn't work for you: it segfaults on the second ForthMachine.

@martindurant
Copy link
Contributor Author

I got segfault only if I made two machines without running the first one

@jpivarski
Copy link
Member

Okay, so the test that shows the segfault for you is

python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'

Interestingly, it still doesn't segfault for me. But anyway, we have a much smaller surface area to find the bug. @ianna, are you able to reproduce it? (@henryiii said that it's independent of Intel vs Apple Silicon.)

@martindurant
Copy link
Contributor Author

martindurant commented Aug 1, 2024

No, that doesn't fail either; I got the crash only with <tab> as in my example code (in ipython - not sure exactly what this calls, but it's not dir(), which is fine). I can't find a simple line like yours (yet) which fails, except ak.from_avro_file :| .

@ariostas
Copy link
Collaborator

ariostas commented Aug 8, 2024

I was curious about this, so I decided to look into it a bit. I'm not sure if this was already known, but I'll document what I found here.

I was only able to reproduce the segfault with ak.from_avro_file, but not with the minimal version Jim made. The segfault is caused due to passing a negative number of elements to ForthOutputBufferOf<unsigned char>::write_uint8. Changing this line

to something like

num_items = stack_pop();
if (num_items < 0) num_items = 0;

makes it so that at least it crashes gracefully. As for why negative numbers end up being popped from the stack and interpreted as a length, I'm not sure. I'm not familiar with this code, so I couldn't figure it out. Also, I'm not sure if this is related to the segfault seen from the minimal version.

If someone wants to give me a quick rundown of how the code works, I can look into this further.

@jpivarski
Copy link
Member

@ariostas, you found a language-level bug in AwkwardForth (as opposed to a bug in the Avro-reader, which wouldn't worry me as much).

The reason a value popped from the stack is being interpreted as a number of items to read is because this is the implementation of vectorized read words like #i-> (to read $N$ integers from the input source, where $N$ is a number pulled from the Forth stack) and #d-> (to do the same for $N$ doubles). Not checking to see if the value from the stack is negative is a bug: AwkwardForth is not supposed to have any undefined behavior, only valid states and error states.

Most uses of num_items are in for loops like

for (int64_t count = 0; count < num_items; count++) {

which shouldn't pose a problem: if the num_items is negative, the loop is skipped just as if num_items were zero. One exception is the unusual length integer words, like #2bit->, #3bit->, #4bit->, etc. They set

int64_t items_remaining = num_items;

and count-down items_remaining until it reaches zero:

if (items_remaining != 0) {

and

while (items_remaining != 0) {

These items_remaining != 0 should be changed to items_remaining > 0 to avoid an infinite loop.

However, the Avro reader couldn't be hitting this problem because Avro doesn't have any of these unusual (but fixed) length integers. (Avro has variable length integers, which are implemented with count < num_items.)

The one case that might cause a segfault is in the write-to-stack macros:

TYPE* ptr = reinterpret_cast<TYPE*>( \
current_inputs_[(IndexTypeOf<int64_t>)in_num].get()->read( \
num_items * (int64_t)sizeof(TYPE), current_error_)); \

TYPE* ptr = reinterpret_cast<TYPE*>( \
current_inputs_[(IndexTypeOf<int64_t>)in_num].get()->read( \
num_items * (int64_t)sizeof(TYPE), current_error_)); \

TYPE* ptr = reinterpret_cast<TYPE*>( \
current_inputs_[(IndexTypeOf<int64_t>)in_num].get()->read( \
num_items * (int64_t)sizeof(TYPE), current_error_)); \

A negative value could break that, and it might happen in the Avro reader when a contiguous array of floating-point or boolean values are read. (The other cases are variable-length integers.)

Your fix of ensuring that num_items is nonnegative would fix that, but so would

  • changing items_remaining != 0 to items_remaining > 0 and
  • putting the if (num_items < 0) num_items = 0; inside the else { of

(I'm only saying that because this is highly streamlined code, and doing the check later would slow down fewer cases. I doubt the jump-compare-zero and assignment is much of a speedbump, though.)

So this is great! With this fix, I'd like to know if @martindurant still sees any segfault. In #3188 (comment), he found one with tab-complete, which can't be related to this, but I wonder if that's still reproducible. (It's such an odd case, since dir doesn't do it, but tab-complete does. Maybe there was residual left in memory...?)

@ariostas
Copy link
Collaborator

ariostas commented Aug 9, 2024

Thank you @jpivarski, that sounds great.

Your fix of ensuring that num_items is nonnegative would fix that, but so would

  • changing items_remaining != 0 to items_remaining > 0 and
  • putting the if (num_items < 0) num_items = 0; inside the else { of

You also need the check inside this else if, which is where it segfaults with this file (with the WRITE_DIRECTLY macro).

else if (~bytecode & READ_DIRECT) {

As for speed, since it's a one-liner I would expect the compiler to figure out the best option, but maybe it can be faster to force it to be branchless with

num_items *= (num_items >= 0)

@jpivarski
Copy link
Member

Do you want to write the PR or me?

@ariostas
Copy link
Collaborator

ariostas commented Aug 9, 2024

I can do it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged
Projects
None yet
Development

No branches or pull requests

3 participants