
[Torch] Restore class-aware NMS for detection models by graph rewrite #7154

Merged: 9 commits into apache:main on Jan 13, 2021

Conversation

masahi (Member) commented Dec 22, 2020

The NMS used by PyTorch detection models actually performs multiclass NMS in one go, by adding a different offset to the boxes of each class so that two boxes from different classes never overlap. See

https://github.com/pytorch/vision/blob/3d60f498e71ba63b428edb184c9ac38fa3737fa6/torchvision/ops/boxes.py#L80-L89
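The trick looks roughly like this (a paraphrase of the linked code, not the exact torchvision source):

```python
from torchvision.ops import nms

def batched_nms_sketch(boxes, scores, idxs, iou_threshold):
    # Shift the boxes of each class by an offset larger than any box
    # coordinate, so boxes from different classes can never overlap and a
    # single class-agnostic NMS pass becomes effectively class-aware.
    max_coordinate = boxes.max()
    offsets = idxs.to(boxes) * (max_coordinate + 1)
    boxes_for_nms = boxes + offsets[:, None]
    return nms(boxes_for_nms, scores, iou_threshold)
```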

But this means that most of the O(N^2) IOU tests we do in the NMS triangle loop are useless. The goal of this PR is to restore the class indices, which are one of the inputs to the batched_nms function above, and perform class-aware NMS for TVM-compiled detection models.

I did this by pattern matching and rewriting after model import. Specifically, I pattern match against the subgraph corresponding to PyTorch's batched_nms as used by MaskRCNN / FasterRCNN.
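For anyone unfamiliar with the mechanism, a rewrite like this is built on tvm.relay.dataflow_pattern. The sketch below shows the structure with a toy pattern (x + y -> x - (-y)), not the actual batched_nms pattern from this PR:

```python
from tvm import relay
from tvm.relay.dataflow_pattern import DFPatternCallback, is_op, wildcard, rewrite

class AddToSubNeg(DFPatternCallback):
    """Toy callback: rewrites x + y into x - (-y)."""

    def __init__(self):
        super().__init__()
        self.x = wildcard()
        self.y = wildcard()
        self.pattern = is_op("add")(self.x, self.y)

    def callback(self, pre, post, node_map):
        x = node_map[self.x][0]
        y = node_map[self.y][0]
        return relay.subtract(x, relay.negative(y))

x = relay.var("x", shape=(4,))
y = relay.var("y", shape=(4,))
out = rewrite(AddToSubNeg(), x + y)  # subtract(x, negative(y))
```

The real callback here matches the much larger batched_nms subgraph and replaces it with a single class-aware NMS call.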

Unfortunately, this optimization didn't yield the speedup I hoped for: on GPU it is only about 70 ms faster, and on CPU it actually becomes slightly slower (?) for some reason. I haven't looked into why it is not going much faster.

nvprof output from running MaskRCNN on GPU

Before

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name                                                                                            
 GPU activities:   56.57%  711.34ms         2  355.67ms  15.296ms  696.05ms  fused_vision_non_max_suppression_kernel2                                                        
                   17.38%  218.54ms         1  218.54ms  218.54ms  218.54ms  fused_nn_dense_add_nn_relu_kernel0

After

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name                                                                                            
 GPU activities:   54.52%  645.28ms         2  322.64ms  15.498ms  629.78ms  fused_vision_non_max_suppression_kernel2                                                        
                   18.44%  218.27ms         1  218.27ms  218.27ms  218.27ms  fused_nn_dense_add_nn_relu_kernel0     

On CPU, the output from the VM profiler:

Before

#OpName                         #InvokeCount    #Duration(us): Sum/Mean/Min/Max
...
fused_vision_non_max_suppression        2               7902.54/3951.27/339.03/7563.51

After

#OpName                         #InvokeCount    #Duration(us): Sum/Mean/Min/Max
fused_vision_non_max_suppression        2               8129.3/4064.65/304.878/7824.42

So performance-wise this change doesn't matter much, but I hope it also serves as a non-trivial use of pattern matching and rewriting.

cc @kevinthesun @mbrookhart @zhiics @t-vi What do you think?

zhiics (Member) commented Dec 23, 2020

@masahi Thanks for the perf improvement. Could you provide the CPU numbers as well?

masahi (Member, Author) commented Dec 23, 2020

@zhiics Sure, I updated the description. Unfortunately I cannot claim that this is a perf improvement. The regression is only about 200 us on CPU, though, so it may just be measurement noise.

I have no idea why I'm not getting a good speedup. IOU tests, including the memory accesses to boxes, should definitely be reduced. The only additional overhead I can think of is that the input to NMS is one column wider, due to storing class ids.

Performance is not great, but I believe having access to class ids is still not a bad idea...

zhiics (Member) commented Dec 23, 2020

@masahi I think this is plausible as well, particularly since it is only in the parser. @kevinthesun please help take a look as well. Thanks.

masahi (Member, Author) commented Dec 23, 2020

I should mention that this rewrite is not run by default, so there is no perf risk.

mbrookhart (Contributor) commented
This is a bit of a shot in the dark.

I wonder if we're memory-access limited, and that's why you don't see a performance improvement.

When we do the nested loop, we always have to check whether the class id of instance k matches the class id of instance j. Since the input shape is (batch_size, num_anchors, features), and features = 6 here, I wouldn't be surprised if checking the class of k ends up reading all of the features of k into registers, and that memory read is the expensive operation. Once the data is loaded, actually doing the IOU calculation is relatively cheap, so skipping it doesn't help that much.

masahi (Member, Author) commented Dec 23, 2020

> When we do the nested loop, we always have to check whether the class id of instance k matches the class id of instance j. Since the input shape is (batch_size, num_anchors, features), and features = 6 here, I wouldn't be surprised if checking the class of k ends up reading all of the features of k into registers, and that memory read is the expensive operation. Once the data is loaded, actually doing the IOU calculation is relatively cheap, so skipping it doesn't help that much.

That's quite possible. Looking at this if condition in the triangle inner loop:

    tvm.tir.any(
        force_suppress > 0,
        id_index < 0,
        out[base_idx + offset_k + id_index]
        == out[base_idx + offset_j + id_index],
    ),

Previously, force_suppress was always True, so this condition short-circuits and the accesses to out[base_idx + offset_k + id_index] and out[base_idx + offset_j + id_index] just below never happen. But now, to make NMS class-aware, I had to change force_suppress to False, so those two accesses always happen. This may be canceling out the speedup from the reduced IOU tests. Storing the class IDs in a separate 1D tensor may help.
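To make that concrete, here is a pure-NumPy sketch of a class-aware triangle loop with the class ids in their own 1D array (a hypothetical layout, not what the current op does):

```python
import numpy as np

def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def class_aware_nms(boxes, scores, class_ids, iou_threshold=0.5):
    order = np.argsort(-scores)
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep = []
    for i, j in enumerate(order):
        if suppressed[j]:
            continue
        keep.append(j)
        for k in order[i + 1:]:
            # The class check is a cheap read from a contiguous 1D array;
            # when classes differ we skip the IOU test (and the load of the
            # packed 6-float row) entirely.
            if suppressed[k] or class_ids[k] != class_ids[j]:
                continue
            if iou(boxes[j], boxes[k]) >= iou_threshold:
                suppressed[k] = True
    return keep
```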

That brings me to one of my pain points with our NMS API: I believe it needs to be reworked. The current way of packing class ids and scores together with the bbox coordinates is a design mistake that we inherited from MXNet. To store class ids, I have to cast the ids to float32 and update and pass id_index appropriately. Since our NMS API also requires scores to be packed with the bboxes, I had to update score_index too, and every frontend except MXNet needs to do this concatenation. The worst part is that in the NMS IR, the very first thing we do is extract the 1D score tensor from the packed data. So I see no good reason to pack the score tensor with the bbox coordinates.
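For reference, the kind of packing every frontend has to emit looks like this (indices and shapes are illustrative, not the exact ones used in the frontends):

```python
from tvm import relay

boxes = relay.var("boxes", shape=(1, 1000, 4), dtype="float32")
scores = relay.var("scores", shape=(1, 1000, 1), dtype="float32")
ids = relay.var("ids", shape=(1, 1000, 1), dtype="float32")  # class ids cast to float32

# Pack (id, score, x1, y1, x2, y2) into one (1, 1000, 6) tensor just so the
# NMS op can immediately unpack it again: id_index=0, score_index=1, coord_start=2.
data = relay.concatenate([ids, scores, boxes], axis=2)
```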

@mbrookhart
Copy link
Contributor

mbrookhart commented Dec 24, 2020

Sorry for the delay in responding to this, I wanted to look at the frameworks more closely. We currently have 5 importers that leverage NMS:

MXNet does multibox_transform_loc and then NMS on the outputs. multibox_transform_loc converts a 3D array of scores with shape (batch_size, class_num, num_anchors) into a most likely class and score for that class, plus does some coordinate transforms on the boxes.

ONNX takes a 3D tensor of scores with shape (batch_size, class, num_anchors), does slicing/concatenating with the boxes, and then does a per-class get_valid_counts -> non_max_suppression.

PyTorch takes in a 1D tensor of scores and concats it with the boxes before performing get_valid_counts and nms. As @masahi shows in this PR, there is preprocessing outside the op to embed all classes into that 1D tensor.

TF takes a 1D tensor of scores and concats it to the boxes before performing get_valid_counts and nms. I'm not sure if the rest of the TF graph is handling the loop over batch size and classes.

TFLite takes a 3D score tensor of shape (batch_size, num_anchors, class_id), reorders it to (batch_size, class_id, num_anchors), performs multibox_transform_loc -> nms, and, strangely, does get_valid_counts after NMS.

It looks like we're doing pre-processing in every framework to reduce the amount of score information and convert it to the 5- or 6-feature form the NMS API wants. None of the frameworks give us inputs in the packed form the API expects, and we jump through hoops in every importer to convert inputs into that form. Then, in at least TFLite and ONNX, we perform further splitting/slicing/concatenating to restore the separate class ids.

I think I agree with @masahi, we seem to be jumping through a lot of hoops in the importers to support a TVM NMS API that's out of line with the frameworks, and that might be hurting our overall performance.

masahi force-pushed the torch-maskrcnn-rewrite branch 2 times, most recently from d9b9995 to 4d43fdc on December 26, 2020
trevor-m (Contributor) commented
I strongly agree with you guys.

For class-aware NMS, the [batch, num_anchors, 6] format seems very inefficient. It means all anchors need to be checked just to see if the classes match. A [batch, num_classes, num_anchors, 5] format would give us a nicely defined slice of memory where the same-class anchors are located.
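In that layout the per-class candidates fall out as one contiguous slice; a quick NumPy sketch of the indexing (sizes made up):

```python
import numpy as np

batch, num_classes, num_anchors = 1, 80, 1000
data = np.random.rand(batch, num_classes, num_anchors, 5).astype("float32")

# All anchors of class c in image b form one contiguous block: no per-anchor
# class check, and anchors of other classes are never even loaded.
b, c = 0, 17
candidates = data[b, c]                        # (num_anchors, 5)
scores, box_coords = candidates[:, 0], candidates[:, 1:]
```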

> TF takes a 1D tensor of scores and concats it to the boxes before performing get_valid_counts and nms. I'm not sure if the rest of the TF graph is handling the loop over batch size and classes.

That's correct; TF's NMS handles only a single class and a single batch, so the TF graph loops over batches and classes. To do that, they use tf.map_fn, so the execution of each NMS can actually still run in parallel. However, this turns into a mess of control-flow operators and TensorArrays, so Relay isn't able to do the same parallelization. This PR's graph rewrite could actually benefit TF OD models as well, but the pattern is a lot more complicated for TF.
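Roughly what those TF graphs express (an illustrative sketch with made-up shapes and thresholds, not code from a specific model):

```python
import tensorflow as tf

num_classes, num_anchors = 80, 1000
boxes = tf.random.uniform((num_classes, num_anchors, 4))   # per-class boxes
scores = tf.random.uniform((num_classes, num_anchors))     # per-class scores

def nms_one_class(args):
    cls_boxes, cls_scores = args
    # The padded variant returns a fixed-size result per class, so map_fn
    # can stack the outputs.
    idx, _ = tf.image.non_max_suppression_padded(
        cls_boxes, cls_scores, max_output_size=100,
        iou_threshold=0.5, pad_to_max_output_size=True)
    return idx

# tf.map_fn lowers to control-flow ops and TensorArrays, which is exactly
# the structure Relay ends up importing.
kept = tf.map_fn(nms_one_class, (boxes, scores), fn_output_signature=tf.int32)
```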

masahi (Member, Author) commented Jan 12, 2021

@kevinthesun @zhiics @mbrookhart

As shown in my new NMS PR #7257, this rewrite results in a better speedup with the improved memory layout there. Can we merge this? I have new rewrites coming to further optimize PyTorch NMS and MaskRCNN / FasterRCNN.

mbrookhart (Contributor) left a comment

LGTM

zhiics merged commit 86479ba into apache:main on Jan 13, 2021
masahi added a commit to masahi/tvm that referenced this pull request Jan 14, 2021
…apache#7154)

* add a pattern to rewrite nms to batched nms

* update object detection test to add rewrite

* updated tutorial

* add doc

* fixed coord_start

* test fixed by setting force_surpress=False

* revert tutorial change

* add some comment to explain the pattern

* update NMS pattern following frontend change
masahi added a commit to masahi/tvm that referenced this pull request Jan 18, 2021

TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021

electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021