[TEST][FLAKY] test_detection_models #7363

Closed
tqchen opened this issue Jan 28, 2021 · 19 comments · Fixed by #7380

tqchen commented Jan 28, 2021

This seems to be quite frequent in recent PRs; it might be related to #7346.

https://ci.tlcpack.ai/job/tvm/job/main/495/execution/node/449/log/

https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-7360/1/pipeline (this one is a timeout)


tqchen commented Jan 28, 2021

cc @zhiics @masahi

@masahi masahi self-assigned this Jan 28, 2021

masahi commented Jan 28, 2021

So is this a flaky segfault? I'm not sure how I can reproduce it. Are there common characteristics among the nodes that failed (what GPU, which CUDA version, etc.)? If one node fails, does it fail consistently?


tqchen commented Jan 28, 2021

Based on my current read, it could also be a timeout, in which case we need to look into whether the detection model itself runs too long and whether we could build a faster unit test.


masahi commented Jan 28, 2021

It shouldn't take more than a few minutes. It should run much faster than the TF SSD test, which takes something like 20 minutes and is run in a separate thread:

def test_forward_ssd():
    run_thread = threading.Thread(target=_test_ssd_impl, args=())
    old_stack_size = threading.stack_size(100 * 1024 * 1024)  # use a 100 MB stack for the new thread
    run_thread.start()
    run_thread.join()
    threading.stack_size(old_stack_size)  # restore the previous stack size

I disabled one of the rewrites in #7365. The other rewrite added in #7346 should be harmless. We'll see.


zhiics commented Jan 29, 2021

Would this also affect the production test cases? I haven't really seen it there.


zhiics commented Jan 29, 2021

Closed by #7365

@zhiics zhiics closed this as completed Jan 29, 2021

masahi commented Jan 29, 2021

I ran MaskRCNN with that rewrite countless times and didn't see any problem. Weird.


altanh commented Jan 29, 2021

Before we move on too far from this, can someone compare this with our experience fixing this bug? #7010

If there is some hard-to-find libomp conflict popping up again, it will crash with "Aborted", as seen in the first test.


masahi commented Jan 29, 2021

Hmm, I see that the common factors are the use of PyTorch and the way CI dies, but other than that I don't know whether the two flaky issues are related.


tqchen commented Jan 29, 2021

@tqchen tqchen reopened this Jan 29, 2021

masahi commented Jan 29, 2021

Looking at https://ci.tlcpack.ai/job/tvm/job/main/, things had been stable for some time after #7346 was merged (https://ci.tlcpack.ai/job/tvm/job/main/487/ to https://ci.tlcpack.ai/job/tvm/job/main/491/). The crashes began after #7354, which added constant folding on If conditions, was merged (https://ci.tlcpack.ai/job/tvm/job/main/492/).

The other rewrite in #7346 does involve If. Maybe there is something wrong with If and the rewrite.


masahi commented Jan 29, 2021

One thing I'm curious about: the crash on the main CI https://ci.tlcpack.ai/job/tvm/job/main/ is very frequent, while the CI runs for open PRs do not crash (except https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-7360/1/pipeline/, which was a timeout), as far as I can see. I wonder why.


altanh commented Jan 29, 2021

In the meantime, is there any way to disable pytest capturing while we debug the CI runs? #7017

This might give more information and confirm or rule out libomp as the cause.


zhiics commented Jan 30, 2021

I'm not sure I understand the question. Do you want to just run the specific test from CI? If so, we can check out the Docker image and run pytest -v tests/python/frontend/pytorch/test_object_detection.py


altanh commented Jan 30, 2021

Essentially yes, I think checking out the CI image and running pytest -s tests/python/frontend/pytorch will hopefully give more information about what is causing the abort (as long as we can reproduce it).
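
A minimal sketch of the same idea driven from Python, in case that is easier to wire into a debugging script (the test path is the one mentioned above; -s is pytest's flag for disabling output capturing):

import sys
import pytest

# Run the PyTorch object detection test with output capturing disabled (-s)
# so that any libomp "Aborted" message shows up directly in the console log.
ret = pytest.main(["-s", "-v", "tests/python/frontend/pytorch/test_object_detection.py"])
sys.exit(ret)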


masahi commented Jan 31, 2021

@tqchen @zhiics I investigated this issue and I think I have a solution. Running test_object_detection.py indeed results in a segfault at the end, after correct execution. I didn't notice this segfault before because I was running a different script when I was working on optimizing MaskRCNN.

Now, I wondered why running one script results in the segfault while the other doesn't, when the two scripts do essentially the same thing. Then I remembered the ONNX + PyTorch segfault problem, which was caused by PyTorch being imported earlier than ONNX; see onnx/onnx#2394 (comment).

test_object_detection.py indeed imports PyTorch before TVM, while my other script imports TVM first. So I simply swapped the import order in test_object_detection.py, and the segfault is gone. If I import torch first in my other script, the segfault happens, as expected. I can reproduce this import-order segfault 100% of the time.

Conclusion: When using PyTorch 1.7 + CUDA, always import TVM first, before torch. PR #7380
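
For reference, a minimal sketch of the working import order (illustrative only; the actual model conversion code in test_object_detection.py is not reproduced here):

# Workaround for the PyTorch 1.7 + CUDA segfault: import TVM before torch.
import tvm                  # must come first
from tvm import relay

import torch                # imported only after tvm
import torchvision

# The rest of the script (tracing a torchvision detection model and converting
# it with relay.frontend.from_pytorch) stays the same; only the import order
# above matters for avoiding the crash at interpreter shutdown.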


masahi commented Jan 31, 2021

Hmm, still failing. But now it clearly says all tests have passed, and then it died: https://ci.tlcpack.ai/job/tvm/job/main/506/console. So it is an improvement over the previous situation.

I wonder if this is a driver problem on the post-merge CI node, since I don't see this error happening on CIs for the open PRs.


masahi commented Jan 31, 2021

I reopened #7371 as the final attempt. It disables the other rewrite from #7346 and the CUDA test.

I don't think the rewrites are the problem, but disabling the CUDA tests should remove the segfault.


masahi commented Feb 8, 2021

This can be closed now: https://ci.tlcpack.ai/job/tvm/job/main/

@masahi masahi closed this as completed Feb 8, 2021