[TOPI] Make cumsum IR reusable, add thrust scan #7303

masahi · 2021-01-18T07:10:13Z

This PR generalizes the cumsum IR developed in #7123 as a reusable, exclusive scan primitive. Also enabled offloading to thrust exclusive_scan (and inclusive_scan). This makes get_valid_counts on CUDA faster, for example.

get_valid_counts performance using different scan implementations (times in milliseconds)

Shape	TIR scan (ms)	thrust scan (ms)
(1, 2500, 6)	0.085925326	0.026152056000000003
(3, 1000, 6)	0.07346336999999999	0.029344976
(16, 500, 5)	0.065479228	0.029170329000000002
(1, 10000, 5)	0.10103608300000003	0.026639481999999992
(16, 10000, 5)	0.20932196100000003	0.100769927
(64, 10000, 5)	0.8142572480000001	0.33169912200000007
(1, 50000, 5)	0.145947307	0.039895927000000005
(16, 50000, 5)	0.854221701	0.37997227899999997
(1, 100000, 5)	0.19091778899999998	0.063521364
(16, 100000, 5)	1.6309583439999997	0.734661902
(1, 1000000, 5)	1.0783681289999998	0.45633614899999997
(16, 1000000, 5)	15.880379433000003	7.164256633000001

Currently, thrust scan is about 10x faster than TIR scan. It is so fast that the other kernels in get_valid_counts become the bottleneck, which is why the end-to-end difference is only about 2x in the (16, 1000000, 5) result, for example.

To show the utility of exclusive_scan, I'll follow up with the following PRs:

* Rewrite the CUDA argwhere added in [TOPI][OP] cuda for argwhere #6868 using exclusive_scan. This will allow removing the atomics and the sort, vastly simplifying the implementation. The performance of argwhere will then be bounded by the performance of exclusive_scan (i.e., it should be much faster than the current implementation).
* Add a cumsum Relay/TOPI op. As expected, this will simply be a wrapper around exclusive_scan.
* Add a unique Relay/TOPI op. This is sort + adjacent difference + exclusive scan.
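To make the "sort + adjacent difference + exclusive scan" recipe for unique concrete, here is a minimal pure-Python sketch. It is only an illustration of the idea, not the planned TOPI implementation; the names `exclusive_scan` and `unique_via_scan` are hypothetical and the loops stand in for what would be parallel kernels on a GPU.

```python
from itertools import accumulate


def exclusive_scan(data):
    """Exclusive prefix sum: out[i] is the sum of data[0..i-1], out[0] is 0."""
    # accumulate(..., initial=0) yields len(data) + 1 running sums; drop the total.
    return list(accumulate(data, initial=0))[:-1]


def unique_via_scan(values):
    """Hypothetical helper: unique elements in sorted order via a scan."""
    s = sorted(values)  # step 1: sort
    # step 2: adjacent difference, flag the first element of every run
    flags = [1 if i == 0 or s[i] != s[i - 1] else 0 for i in range(len(s))]
    # step 3: exclusive scan turns flags into output positions
    offsets = exclusive_scan(flags)
    num_unique = offsets[-1] + flags[-1] if s else 0
    out = [None] * num_unique
    for i, v in enumerate(s):  # iterations are independent of each other
        if flags[i]:
            out[offsets[i]] = v
    return out


print(unique_via_scan([3, 1, 3, 2, 1]))  # [1, 2, 3]
```

Because every flagged element already knows its output slot from the scan, the final scatter needs no atomics.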

For now, only 1D and 2D inputs are supported, and the scan is always done on the innermost axis. The API is preliminary; comments are welcome. On CUDA, thrust scan is always used when available, but it would also be possible to use thrust scan only when the target explicitly asks for it, for example cuda -libs=thrust.

Actually, TVM has had something called "scan" from the very beginning, but it seems to have some limitations and I don't think people care about it, so I went ahead and added topi/cuda/scan.py. The alternative name prefix_sum.py also seems fine to me, if people prefer.

ready for review @mbrookhart @Laurawly @zhiics @trevor-m @anijain2305 @antinucleon

mbrookhart · 2021-01-19T16:45:15Z

Scan is probably the most hand-optimized kernel in thrust, so I'm thrilled to be within 10x for a cross-GPU kernel. Overall I'm happy with this, but I have two thoughts.

Should we add the TIR inclusive scan back in? I have that on a branch from my first implementation of get_valid_counts: mbrookhart@944ee3c

We should probably generalize for rank. I think we can use the same kind of before/after trick used in sort:

tvm/python/tvm/topi/cuda/sort.py

Lines 69 to 85 in f91b51d

    """Initialize the output buffers by copying from inputs"""
    axis_mul_before = 1
    axis_mul_after = 1
    if axis < 0:
        axis = len(shape) + axis
    for i, value in enumerate(shape, 0):
        if i < axis:
            axis_mul_before *= value
        elif i > axis:
            axis_mul_after *= value
    # Set up threading
    max_threads = int(tvm.target.Target.current(allow_none=False).max_num_threads)
    nthread_tx = max_threads
    nthread_bx = ceil_div(shape[axis], max_threads)
    nthread_by = axis_mul_before
    nthread_bz = axis_mul_after
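The before/after trick in the quoted sort.py code can be illustrated on a flat buffer. The sketch below is a serial pure-Python stand-in, assuming a row-major layout; `exclusive_scan_along_axis` is a hypothetical name, and the two outer loops play the role that axis_mul_before and axis_mul_after play as block counts in the quoted code.

```python
from math import prod


def exclusive_scan_along_axis(flat, shape, axis):
    """Exclusive prefix sum along `axis` of a row-major flat buffer (sketch)."""
    if axis < 0:
        axis = len(shape) + axis
    before = prod(shape[:axis])      # product of dims before the scan axis
    after = prod(shape[axis + 1:])   # product of dims after the scan axis
    size = shape[axis]
    out = [0] * len(flat)
    for b in range(before):          # would map to one block dimension
        for a in range(after):       # would map to another block dimension
            running = 0
            for k in range(size):    # serial here; the GPU scan parallelizes this
                idx = (b * size + k) * after + a
                out[idx] = running
                running += flat[idx]
    return out


data = [1, 2, 3, 4, 5, 6]  # a (2, 3) tensor stored row-major
print(exclusive_scan_along_axis(data, (2, 3), axis=1))  # [0, 1, 3, 0, 4, 9]
print(exclusive_scan_along_axis(data, (2, 3), axis=0))  # [0, 0, 0, 1, 2, 3]
```

With this indexing, any rank collapses to a (before, size, after) view, so the kernel itself never needs to know the original rank.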

masahi · 2021-01-19T20:19:38Z

On the first point: right now, inclusive scan can be supported by exclusive_scan(data) + data. I think that is fine for now, given that our scan IR is far from stable and we don't want to maintain two IRs just to remove the additional sum.
On the second point: yes, we can definitely do that. But this PR is already not small, and I want to keep the IR as close to the original as possible in this PR. There are other TODO items for scan (e.g. supporting other binary ops), so I hope we can address this in the future as well.
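As a small illustration of the exclusive_scan(data) + data identity, here is a pure-Python sketch. It is not the TIR or thrust implementation; the function names are chosen only for this example.

```python
from itertools import accumulate


def exclusive_scan(data):
    """out[i] = data[0] + ... + data[i-1], with out[0] = 0."""
    # accumulate(..., initial=0) yields len(data) + 1 running sums; drop the total.
    return list(accumulate(data, initial=0))[:-1]


def inclusive_scan(data):
    """Inclusive scan expressed as exclusive_scan(data) + data, elementwise."""
    return [e + d for e, d in zip(exclusive_scan(data), data)]


data = [3, 1, 4, 1, 5]
print(exclusive_scan(data))  # [0, 3, 4, 8, 9]
print(inclusive_scan(data))  # [3, 4, 8, 9, 14]
```

The extra elementwise add is the "additional sum" mentioned above; it costs one more pass over the data but avoids a second scan IR.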

A related discussion point: do you expect scan on a non-innermost axis to be slower than the innermost case? If so (and I believe it is), I think supporting non-innermost scan and other ranks by

reshape + transpose -> innermost scan -> reshape and transpose back

is a good solution. It is definitely preferable in terms of implementation simplicity, since it lets the scan implementation focus on 1D or 2D inputs and the innermost axis.
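The transpose-based approach can be sketched for the 2D case with plain Python lists. This is only a model of the data flow, with hypothetical helper names; it says nothing about the relative cost of the transposes on a GPU.

```python
from itertools import accumulate


def scan_rows(matrix):
    """Exclusive scan along the innermost axis of a list-of-rows matrix."""
    return [list(accumulate(row, initial=0))[:-1] for row in matrix]


def transpose(matrix):
    """Swap the two axes of a list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


def exclusive_scan_axis0(matrix):
    """Scan along axis 0 as: transpose -> innermost scan -> transpose back."""
    return transpose(scan_rows(transpose(matrix)))


m = [[1, 2, 3],
     [4, 5, 6]]
print(scan_rows(m))             # [[0, 1, 3], [0, 4, 9]]
print(exclusive_scan_axis0(m))  # [[0, 0, 0], [1, 2, 3]]
```

Only `scan_rows` contains scan logic; the non-innermost case reuses it unchanged, which is the simplicity argument made above.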

mbrookhart · 2021-01-20T00:45:17Z

Yeah, scanning on the non-inner axis will have a cache locality performance hit, but I'm honestly not sure if that would be better or worse than the overhead from doing a pair of reshape/transpose ops. Reshape and transpose are heavily limited by memory bandwidth.

mbrookhart

I'm happy with this, we have some feature/performance questions that I think can go into follow up PRs.

anijain2305

Overall LGTM. I am new to this concept and to TVM-CUDA, so I can't fully comment on the details. But the high-level idea and design look solid.

One request is to run this with an empty tensor (a zero in the input shape) to see if any corner case is missed.

masahi · 2021-01-20T01:19:03Z

Hmm, interesting. I've never created a test case with an empty tensor; is that possible?

Note that the IR is copied straight from #7123, so the same guard against empty tensors is here.

tvm/python/tvm/topi/cuda/scan.py

Line 59 in 4e13a3f

with ib.if_scope(scan_axis_size == 0):
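A pure-Python model of this corner case is sketched below. The commit list mentions an API for returning the reduction from the exclusive scan output; what the IR does inside the `scan_axis_size == 0` branch is not shown in this thread, so returning 0 as the reduction of an empty row is an assumption of this sketch, and `exclusive_scan_with_reduction` is a hypothetical name.

```python
from itertools import accumulate


def exclusive_scan_with_reduction(rows):
    """Return (scanned rows, per-row totals) for a list of rows (sketch)."""
    out, totals = [], []
    for row in rows:
        if len(row) == 0:
            # Explicit guard, mirroring the scan_axis_size == 0 check.
            # Assumption: nothing is scanned and the reduction is 0.
            out.append([])
            totals.append(0)
            continue
        sums = list(accumulate(row, initial=0))
        out.append(sums[:-1])    # exclusive prefix sums
        totals.append(sums[-1])  # reduction over the whole row
    return out, totals


print(exclusive_scan_with_reduction([[1, 2, 3], []]))
# ([[0, 1, 3], []], [6, 0])
```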

anijain2305 · 2021-01-20T01:34:06Z

Yes, I eyeballed the changes with respect to empty tensors and they looked good. So, I am happy to approve this PR. Once it is merged, I can try it on my end with TF models as well.

For an empty tensor test case, see https://github.com/apache/tvm/blob/main/tests/python/relay/test_any.py#L879

masahi · 2021-01-20T01:39:18Z

> Once it is merged, I can try on my end with TF models as well.

No perf improvement is expected there, since this PR only improves get_valid_counts slightly if you use thrust scan instead of TIR scan. The purpose of this PR is to enable parallelization of other ops that are difficult to parallelize without it. argwhere is a perfect example, which I'll demonstrate soon after this one.

@anijain2305 The term you want to search for is "gpu stream compaction".
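For readers unfamiliar with the term, stream compaction is the pattern that argwhere reduces to: flag the elements to keep, run an exclusive scan over the flags to get each kept element's output slot, then scatter. The sketch below shows it for the 1D case in pure Python; `argwhere_1d` is a hypothetical name and this is not the CUDA implementation.

```python
from itertools import accumulate


def exclusive_scan(data):
    """Exclusive prefix sum: out[i] is the sum of data[0..i-1]."""
    return list(accumulate(data, initial=0))[:-1]


def argwhere_1d(data):
    """Indices of nonzero elements, computed by stream compaction (sketch)."""
    flags = [1 if x != 0 else 0 for x in data]   # 1 where the element is kept
    offsets = exclusive_scan(flags)              # output slot for each kept element
    count = offsets[-1] + flags[-1] if data else 0
    out = [0] * count
    for i in range(len(data)):                   # each i is independent: no atomics
        if flags[i]:
            out[offsets[i]] = i
    return out


print(argwhere_1d([0, 5, 0, 0, 7, 1]))  # [1, 4, 5]
```

The output is already in index order, so no sort is needed afterwards.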

masahi · 2021-01-20T02:47:53Z

@anijain2305 I added an empty tensor test in a6c7403

OpenCL seems to have a problem with zero-size buffers, but otherwise both TIR scan and thrust scan seem to have no issue. Please take a look.

anijain2305 · 2021-01-20T03:08:04Z

Thanks :) LGTM :)

masahi · 2021-01-20T09:12:23Z

Thanks @mbrookhart @anijain2305

* import changes from scan branch

  commit cf0d4fd  Author: Masahiro Masuda <masahi129@gmail.com>  Date: Fri Dec 25 10:12:01 2020 +0900
      get valid count test working
  commit eb142d3  Author: Masahiro Masuda <masahi129@gmail.com>  Date: Fri Dec 25 07:22:00 2020 +0900
      integrate new cumsum change
  commit f89684d  Author: Masahiro Masuda <masahi129@gmail.com>  Date: Fri Dec 25 06:56:46 2020 +0900
      remove ceil_div from nms
  commit a2ad4de  Author: Masahiro Masuda <masahi129@gmail.com>  Date: Sun Dec 20 20:36:34 2020 +0900
      add api for returning reduction from ex scan output
  commit b7f4ef7  Author: Masahiro Masuda <masahi129@gmail.com>  Date: Sun Dec 20 19:49:07 2020 +0900
      move ceil_div to utils
  commit a9a57e3  Author: Masahiro Masuda <masahi129@gmail.com>  Date: Sun Dec 20 19:38:15 2020 +0900
      rename prefix_scan.py to scan.py
  commit 03ed43f  Author: Masahiro Masuda <masahi129@gmail.com>  Date: Sat Dec 19 06:12:55 2020 +0900
      surpress cpplint
  commit abceac9  Author: masa <masa@pop-os.localdomain>  Date: Fri Dec 18 20:36:24 2020 +0900
      support more data type
  commit 3e7d1f8  Author: masa <masa@pop-os.localdomain>  Date: Fri Dec 18 20:09:51 2020 +0900
      1d thrust scan working
  commit ac13b40  Author: masa <masa@pop-os.localdomain>  Date: Fri Dec 18 19:49:25 2020 +0900
      adding thrust scan support
  commit 65634e8  Author: masa <masa@pop-os.localdomain>  Date: Fri Dec 18 19:01:11 2020 +0900
      add thrust scan python stub
  commit 9876c90  Author: masa <masa@pop-os.localdomain>  Date: Fri Dec 18 20:55:14 2020 +0900
      introduce prefix_scan.py and move scan ir in nms.py
  commit 667bdd3  Author: masa <masa@pop-os.localdomain>  Date: Fri Dec 18 15:06:18 2020 +0900
      make the scan loop exclusive
  commit 480787b  Author: mbrookhart <mbrookhart@octoml.ai>  Date: Thu Dec 17 10:01:11 2020 -0700
      Parallelize cumsum in get_valid_counts

* fix for 1d scan
* rename
* cast to out dtype
* do not run return reduction for inclusive scan
* remove another ceil_div definition
* adding scan test
* add scheduling for scan op, fixed scan 1d test
* pylint fix
* add doc string
* add more thrust scan test
* add dynamic get valid count test, including empty size tensor
* fix hard coded gpu targets for cpu only env
* try retunring early if scan_size is 0
* another change for empty tensor and thrust path

Co-authored-by: masa <masa@pop-os.localdomain>

masahi force-pushed the cuda-ex-scan branch from 0bc3a4d to 01d6121 Compare January 18, 2021 07:25

masahi marked this pull request as ready for review January 18, 2021 11:13

masahi force-pushed the cuda-ex-scan branch 2 times, most recently from 7a0403d to 2d58ef7 Compare January 19, 2021 03:24

masahi force-pushed the cuda-ex-scan branch from 2d58ef7 to 4e13a3f Compare January 19, 2021 19:53

mbrookhart approved these changes Jan 20, 2021

anijain2305 approved these changes Jan 20, 2021

masahi and others added 11 commits January 20, 2021 11:45

fix for 1d scan

a7772a5

rename

7608e10

cast to out dtype

b72840e

do not run return reduction for inclusive scan

f2667e3

remove another ceil_div definition

b1bbedf

adding scan test

9783462

add scheduling for scan op, fixed scan 1d test

55df6d4

pylint fix

f740595

add doc string

b4795ef

add more thrust scan test

6c70ed2

masahi force-pushed the cuda-ex-scan branch from 4e13a3f to 20afc32 Compare January 20, 2021 02:45

add dynamic get valid count test, including empty size tensor

a6c7403

masahi force-pushed the cuda-ex-scan branch from 20afc32 to a6c7403 Compare January 20, 2021 02:50

fix hard coded gpu targets for cpu only env

e2df3c6

masahi added 2 commits January 20, 2021 14:48

try retunring early if scan_size is 0

a88e53c

another change for empty tensor and thrust path

717270b

masahi merged commit 62f251b into apache:main Jan 20, 2021

masahi mentioned this pull request Jan 20, 2021

[TOPI] Rewrite GPU argwhere using exclusive scan #7314

Merged

masahi mentioned this pull request Feb 17, 2021

[Frontend][Tensorflow] Add unique operator #7441

Merged

junrushao mentioned this pull request Nov 1, 2021

Apache TVM v0.8 Release Note Candidate #9416

Closed
