[TEST][FLAKY] test_detection_models #7363

Closed
tqchen opened this issue Jan 28, 2021 · 19 comments · Fixed by #7380

tqchen commented Jan 28, 2021

This seems to be quite frequent in recent PRs; it might be related to #7346.

https://ci.tlcpack.ai/job/tvm/job/main/495/execution/node/449/log/

https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-7360/1/pipeline (this one is a timeout)


tqchen commented Jan 28, 2021

cc @zhiics @masahi

@masahi masahi self-assigned this Jan 28, 2021

masahi commented Jan 28, 2021

So is this a flaky segfault? I'm not sure how I can reproduce it. Are there common characteristics among the nodes that failed (what GPU, which CUDA version, etc.)? If one node fails, does it fail consistently?


tqchen commented Jan 28, 2021

Based on my current read, it could also be a timeout, in which case we need to look into whether the detection model itself runs too long and whether we could build a faster unit test.


masahi commented Jan 28, 2021

It shouldn't take more than a few minutes. It should run much faster than the TF SSD test, which takes something like 20 minutes and is run in a separate thread:

def test_forward_ssd():
    run_thread = threading.Thread(target=_test_ssd_impl, args=())
    old_stack_size = threading.stack_size(100 * 1024 * 1024)  # use a 100 MB stack for the new thread
    run_thread.start()
    run_thread.join()
    threading.stack_size(old_stack_size)  # restore the previous stack size

I disabled one of the rewrites in #7365. The other rewrite added in #7346 should be harmless. We'll see.


zhiics commented Jan 29, 2021

Would this also affect the production test cases? I haven't really seen it there.


zhiics commented Jan 29, 2021

Closed by #7365

@zhiics zhiics closed this as completed Jan 29, 2021

masahi commented Jan 29, 2021

I ran MaskRCNN with that rewrite countless times and didn't see any problem. Weird.


altanh commented Jan 29, 2021

Before we move on too far from this, can someone compare this with our experience fixing this bug? #7010

If there is some hard-to-find libomp conflict popping up again, it will crash with "Aborted", as seen in the first test.


masahi commented Jan 29, 2021

Hmm, I see that the common factors are the use of PyTorch and the way CI dies, but other than that I don't know whether the two flaky issues are related.


tqchen commented Jan 29, 2021

@tqchen tqchen reopened this Jan 29, 2021

masahi commented Jan 29, 2021

Looking at https://ci.tlcpack.ai/job/tvm/job/main/, things had been stable for some time after #7346 was merged (https://ci.tlcpack.ai/job/tvm/job/main/487/ to https://ci.tlcpack.ai/job/tvm/job/main/491/). The crashes began after #7354, which added constant folding on If conditions, was merged (https://ci.tlcpack.ai/job/tvm/job/main/492/).

The other rewrite in #7346 does involve If. Maybe there is something wrong with If and the rewrite.


masahi commented Jan 29, 2021

One thing I'm curious about: the crash on the main CI https://ci.tlcpack.ai/job/tvm/job/main/ is very frequent, while the CI runs for open PRs do not crash (except https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-7360/1/pipeline/, which was a timeout), as far as I can see. I wonder why.


altanh commented Jan 29, 2021

In the meantime, is there any way to disable pytest capturing while we debug the CI runs? #7017

This might give more information and confirm or rule out libomp as the cause.


zhiics commented Jan 30, 2021

I'm not sure I understand the question. Do you want to just run the specific test from CI? If so, we can check out the Docker image and run pytest -v tests/python/frontend/pytorch/test_object_detection.py


altanh commented Jan 30, 2021

Essentially yes, I think checking out the CI image and running pytest -s tests/python/frontend/pytorch will hopefully give more information about what is causing the abort (as long as we can reproduce it).
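
A minimal sketch of the same idea driven from Python, in case that is easier to wire into a debugging script (the test path is the one mentioned above; -s is pytest's flag for disabling output capturing):

import sys
import pytest

# Run the PyTorch object detection test with output capturing disabled (-s)
# so that any libomp "Aborted" message shows up directly in the console log.
ret = pytest.main(["-s", "-v", "tests/python/frontend/pytorch/test_object_detection.py"])
sys.exit(ret)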


masahi commented Jan 31, 2021

@tqchen @zhiics I investigated this issue and I think I have a solution. Running test_object_detection.py indeed results in a segfault at the end, after correct execution. I didn't notice this segfault before because I was running a different script when I was working on optimizing MaskRCNN.

Now, I wondered why running one script results in the segfault while the other doesn't, when the two scripts do essentially the same thing. Then I remembered the ONNX + PyTorch segfault problem, which was caused by PyTorch being imported earlier than ONNX; see onnx/onnx#2394 (comment).

test_object_detection.py indeed imports PyTorch before TVM, while my other script imports TVM first. So I simply swapped the import order in test_object_detection.py, and the segfault is gone. If I import torch first in my other script, the segfault happens, as expected. I can reproduce this import-order segfault 100% of the time.

Conclusion: When using PyTorch 1.7 + CUDA, always import TVM first, before torch. PR #7380
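
For reference, a minimal sketch of the working import order (illustrative only; the actual model conversion code in test_object_detection.py is not reproduced here):

# Workaround for the PyTorch 1.7 + CUDA segfault: import TVM before torch.
import tvm                  # must come first
from tvm import relay

import torch                # imported only after tvm
import torchvision

# The rest of the script (tracing a torchvision detection model and converting
# it with relay.frontend.from_pytorch) stays the same; only the import order
# above matters for avoiding the crash at interpreter shutdown.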


masahi commented Jan 31, 2021

Hmm, still failing. But now it clearly says all tests have passed, and then it died: https://ci.tlcpack.ai/job/tvm/job/main/506/console. So it is an improvement over the previous situation.

I wonder if this is a driver problem on the post-merge CI node, since I don't see this error happening on CIs for the open PRs.


masahi commented Jan 31, 2021

I reopened #7371 as the final attempt. It disables the other rewrite from #7346 and the CUDA test.

I don't think the rewrites are the problem, but disabling the CUDA tests should remove the segfault.


masahi commented Feb 8, 2021

This can be closed now: https://ci.tlcpack.ai/job/tvm/job/main/

@masahi masahi closed this as completed Feb 8, 2021