[TOPI] Make cumsum IR reusable, add thrust scan #7303

Merged
merged 15 commits into from
Jan 20, 2021

masahi
Member

@masahi masahi commented Jan 18, 2021

This PR generalizes the cumsum IR developed in #7123 into a reusable exclusive scan primitive. It also enables offloading to thrust exclusive_scan (and inclusive_scan). This makes get_valid_counts on CUDA faster, for example.

get_valid_counts performance using different scan implementations (times in milliseconds)

| Shape | TIR scan | thrust scan |
| --- | --- | --- |
| (1, 2500, 6) | 0.085925326 | 0.026152056 |
| (3, 1000, 6) | 0.07346337 | 0.029344976 |
| (16, 500, 5) | 0.065479228 | 0.029170329 |
| (1, 10000, 5) | 0.101036083 | 0.026639482 |
| (16, 10000, 5) | 0.209321961 | 0.100769927 |
| (64, 10000, 5) | 0.814257248 | 0.331699122 |
| (1, 50000, 5) | 0.145947307 | 0.039895927 |
| (16, 50000, 5) | 0.854221701 | 0.379972279 |
| (1, 100000, 5) | 0.190917789 | 0.063521364 |
| (16, 100000, 5) | 1.630958344 | 0.734661902 |
| (1, 1000000, 5) | 1.078368129 | 0.456336149 |
| (16, 1000000, 5) | 15.880379433 | 7.164256633 |

Currently, thrust scan by itself is about 10x faster than TIR scan. It is fast enough that the other kernels in get_valid_counts become the bottleneck, which is why the end-to-end difference is only about 2x in the (16, 1000000, 5) result, for example.

To show the utility of exclusive_scan, I'll follow up with the following PRs.

  • Rewrite the CUDA argwhere added in #6868 ([TOPI][OP] cuda for argwhere) using exclusive_scan. This will remove the need for atomics and sort, vastly simplifying the implementation. The performance of argwhere will then be bounded by the performance of exclusive_scan (i.e., it should be much faster than the current implementation).
  • Add cumsum Relay/TOPI op. As expected, this will simply be a wrapper around exclusive_scan.
  • Add unique Relay/TOPI op. This is sort + adjacent difference + exclusive scan
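
The "sort + adjacent difference + exclusive scan" recipe for unique in the last bullet can be sketched on the CPU as follows. This is only an illustration in plain Python, not the TOPI implementation; the names `exclusive_scan` and `unique_via_scan` are hypothetical helpers introduced here.

```python
def exclusive_scan(values):
    """out[i] = sum(values[:i]) (sequential reference, hypothetical helper)."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def unique_via_scan(data):
    """Return the sorted unique elements of a 1D list."""
    if not data:
        return []
    # Step 1: sort, so that equal elements become adjacent.
    sorted_data = sorted(data)
    # Step 2: adjacent difference. Flag 1 where an element starts a new run.
    is_new = [1] + [int(a != b) for a, b in zip(sorted_data[1:], sorted_data[:-1])]
    # Step 3: exclusive scan of the flags gives each flagged element its output slot.
    slots = exclusive_scan(is_new)
    num_unique = slots[-1] + is_new[-1]
    out = [None] * num_unique
    # Each iteration writes to a distinct slot, so on a GPU this loop
    # could run with one thread per element.
    for i, flag in enumerate(is_new):
        if flag:
            out[slots[i]] = sorted_data[i]
    return out
```

For example, `unique_via_scan([3, 1, 2, 3, 1])` sorts to `[1, 1, 2, 3, 3]`, flags `[1, 0, 1, 1, 0]`, and scans to slots `[0, 1, 1, 2, 3]`, so the flagged elements land in slots 0, 1 and 2.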

For now, only 1D and 2D inputs are supported, and the scan is always done along the innermost axis. The API is preliminary; comments are welcome. On CUDA, thrust scan is always used when available, but it would also be possible to use it only when the target is cuda -libs=thrust, for example.
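
As a reference for the semantics described above (1D or 2D input, scan along the innermost axis), here is a minimal sequential sketch in plain Python. It is not the TIR or thrust kernel; the function names and the choice to return the per-row reduction alongside the scan are assumptions made for illustration.

```python
def exclusive_scan_1d(row):
    """Return (scan, total) where scan[i] = sum(row[:i])."""
    out, running = [], 0
    for value in row:
        out.append(running)  # sum of everything strictly before this element
        running += value
    return out, running      # running now holds the full reduction


def exclusive_scan(data):
    """Scan a 1D list, or each row of a 2D list, along the innermost axis."""
    if data and isinstance(data[0], (list, tuple)):
        pairs = [exclusive_scan_1d(row) for row in data]
        return [p[0] for p in pairs], [p[1] for p in pairs]
    return exclusive_scan_1d(data)
```

Usage: `exclusive_scan([1, 2, 3, 4])` gives the scan `[0, 1, 3, 6]` and the reduction `10`. An empty input yields an empty scan and a reduction of 0.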

Actually, an operator called "scan" has existed in TVM from the very beginning, but it seems to have some limitations and I don't think people care about it, so I went ahead and added topi/cuda/scan.py. An alternative name, prefix_sum.py, also seems good to me if people prefer.

ready for review @mbrookhart @Laurawly @zhiics @trevor-m @anijain2305 @antinucleon

@masahi masahi marked this pull request as ready for review January 18, 2021 11:13
@masahi masahi force-pushed the cuda-ex-scan branch 2 times, most recently from 7a0403d to 2d58ef7 Compare January 19, 2021 03:24
@mbrookhart
Contributor

Scan is probably the most hand-optimized kernel in thrust, I'm thrilled to be within 10x for a cross-GPU kernel. Overall I'm happy with this, but I have 2 thoughts.

  1. Should we add the TIR inclusive scan back in? I have that on a branch from my first implementation of get_valid_counts: mbrookhart@944ee3c
  2. We should probably generalize for rank, I think maybe we can use the same kind of before/after trick used in sort:
    """Initialize the output buffers by copying from inputs"""
    axis_mul_before = 1
    axis_mul_after = 1
    if axis < 0:
        axis = len(shape) + axis
    for i, value in enumerate(shape, 0):
        if i < axis:
            axis_mul_before *= value
        elif i > axis:
            axis_mul_after *= value
    # Set up threading
    max_threads = int(tvm.target.Target.current(allow_none=False).max_num_threads)
    nthread_tx = max_threads
    nthread_bx = ceil_div(shape[axis], max_threads)
    nthread_by = axis_mul_before
    nthread_bz = axis_mul_after
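
To make the before/after trick concrete, here is a sequential sketch of how it could index a flat buffer so that any axis of any rank can be scanned. It is an illustration only: `exclusive_scan_flat` is a hypothetical name, and the triple loop stands in for the by/bz/bx thread layout in the snippet above.

```python
def exclusive_scan_flat(flat, shape, axis):
    """Exclusive scan along `axis` of a row-major buffer `flat` with the given shape."""
    if axis < 0:
        axis = len(shape) + axis
    axis_mul_before = 1
    axis_mul_after = 1
    for i, value in enumerate(shape):
        if i < axis:
            axis_mul_before *= value
        elif i > axis:
            axis_mul_after *= value
    scan_size = shape[axis]
    out = [0] * len(flat)
    # The buffer is viewed as (axis_mul_before, scan_size, axis_mul_after).
    for b in range(axis_mul_before):
        for a in range(axis_mul_after):
            running = 0
            for k in range(scan_size):
                idx = (b * scan_size + k) * axis_mul_after + a
                out[idx] = running
                running += flat[idx]
    return out
```

For a (2, 3) input `[1, 2, 3, 4, 5, 6]`, scanning axis 0 gives `[0, 0, 0, 1, 2, 3]`, while scanning axis 1 gives `[0, 1, 3, 0, 4, 9]`. Note that for a non-innermost axis the inner loop strides by `axis_mul_after`, which is where the cache locality concern comes from.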

@masahi
Member Author

masahi commented Jan 19, 2021

  1. Right now, inclusive scan can be supported by exclusive_scan(data) + data. I think that is fine for now, given that our scan IR is far from stable and we don't want to maintain two IRs for the sake of removing the additional sum.

  2. Yes, we can definitely do that. But this PR is already not small, and I want to keep the IR as close to the original as possible for this PR. There are other TODO items for scan (e.g. supporting other binary ops), so I hope we can address this problem in the future as well.
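
The identity in point 1, inclusive scan = exclusive_scan(data) + data, can be checked with a few lines of plain Python. The helper names below are hypothetical and the scan is a sequential reference, not the TIR kernel.

```python
def exclusive_scan(data):
    """out[i] = sum(data[:i])."""
    out, running = [], 0
    for value in data:
        out.append(running)
        running += value
    return out


def inclusive_scan(data):
    """out[i] = sum(data[:i + 1]), built from the exclusive scan plus one elementwise add."""
    return [e + d for e, d in zip(exclusive_scan(data), data)]
```

With `data = [1, 2, 3, 4]`, the exclusive scan is `[0, 1, 3, 6]` and adding `data` elementwise gives the inclusive scan `[1, 3, 6, 10]`. The cost is one extra elementwise pass over the data.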

A related discussion point: do you expect scan performance on a non-innermost axis to be slower than the innermost case? If so (and I believe it is), I think supporting non-innermost scan and other ranks by

reshape + transpose -> innermost scan -> reshape and transpose back 

is a good solution. It is definitely preferred in terms of implementation simplicity, allowing scan implementation to focus on 1 or 2D + innermost axis.
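
A minimal sketch of the transpose-based approach, restricted to the 2D case for brevity: scanning along axis 0 is expressed as transpose, innermost scan, transpose back. The function names are hypothetical, and nested Python lists stand in for tensors.

```python
def innermost_exclusive_scan(rows):
    """Exclusive scan of each row of a 2D list (the only case the kernel handles)."""
    out = []
    for row in rows:
        scanned, running = [], 0
        for value in row:
            scanned.append(running)
            running += value
        out.append(scanned)
    return out


def transpose(rows):
    """Swap the two axes of a 2D list."""
    return [list(col) for col in zip(*rows)]


def exclusive_scan_axis0(rows):
    """Scan along axis 0 by moving it to the innermost position and back."""
    return transpose(innermost_exclusive_scan(transpose(rows)))
```

For `[[1, 2, 3], [4, 5, 6]]` this returns `[[0, 0, 0], [1, 2, 3]]`. The scan routine itself never needs to know about the original axis; the price is the two extra data movements.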

@mbrookhart
Contributor

Yeah, scanning on the non-inner axis will have a cache locality performance hit, but I'm honestly not sure if that would be better or worse than the overhead from doing a pair of reshape/transpose ops. Reshape and transpose are heavily limited by memory bandwidth.

Contributor

@mbrookhart mbrookhart left a comment

I'm happy with this, we have some feature/performance questions that I think can go into follow up PRs.

Contributor

@anijain2305 anijain2305 left a comment

Overall LGTM. I am new to this concept and TVM-CUDA, so I can't fully comment on the details. But the high-level idea and design look solid.

One request is to run this with an empty tensor (a zero in the input shape) to see if any corner case is missed.

@masahi
Member Author

masahi commented Jan 20, 2021

hmm interesting, I've never created a test case with empty tensor, is that possible?

Note that the IR is copied straight from #7123, so the same guard against empty tensors is here.

with ib.if_scope(scan_axis_size == 0):

@anijain2305
Contributor

Yes, I eyeballed the changes with respect to empty tensors and they looked good. So, I am happy to approve this PR. Once it is merged, I can try on my end with TF models as well.

For empty tensor test case - https://github.com/apache/tvm/blob/main/tests/python/relay/test_any.py#L879

@masahi
Member Author

masahi commented Jan 20, 2021

> Once it is merged, I can try on my end with TF models as well.

A perf improvement is not expected, since this only improves get_valid_counts slightly if you use thrust scan instead of TIR scan. The purpose of this PR is to enable parallelization of other ops that are difficult to parallelize without it. argwhere is a perfect example that I'll demonstrate soon after this one.

@anijain2305 The term you want to search for is "gpu stream compaction".
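
For readers unfamiliar with stream compaction, here is a small sequential sketch of the idea, applied to a 1D argwhere-style problem. It is not the planned CUDA implementation; `compact_nonzero_indices` and `exclusive_scan` are hypothetical names used only for this illustration.

```python
def exclusive_scan(values):
    """out[i] = sum(values[:i])."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact_nonzero_indices(data):
    """Return the indices of nonzero elements of a 1D list, in order."""
    # 1 where the element should be kept, 0 otherwise.
    flags = [int(v != 0) for v in data]
    # The exclusive scan tells each kept element where to write its result.
    write_pos = exclusive_scan(flags)
    count = (write_pos[-1] + flags[-1]) if flags else 0
    out = [None] * count
    # Every kept element has a unique write position, so this loop needs
    # no atomics and no sort: on a GPU it could be one thread per element.
    for i, flag in enumerate(flags):
        if flag:
            out[write_pos[i]] = i
    return out
```

For `[0, 3, 0, 0, 7, 1]` the flags are `[0, 1, 0, 0, 1, 1]`, the write positions are `[0, 0, 1, 1, 1, 2]`, and the result is `[1, 4, 5]`.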

masahi and others added 11 commits January 20, 2021 11:45
commit cf0d4fd
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Fri Dec 25 10:12:01 2020 +0900

    get valid count test working

commit eb142d3
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Fri Dec 25 07:22:00 2020 +0900

    integrate new cumsum change

commit f89684d
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Fri Dec 25 06:56:46 2020 +0900

    remove ceil_div from nms

commit a2ad4de
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Sun Dec 20 20:36:34 2020 +0900

    add api for returning reduction from ex scan output

commit b7f4ef7
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Sun Dec 20 19:49:07 2020 +0900

    move ceil_div to utils

commit a9a57e3
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Sun Dec 20 19:38:15 2020 +0900

    rename prefix_scan.py to scan.py

commit 03ed43f
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Sat Dec 19 06:12:55 2020 +0900

    surpress cpplint

commit abceac9
Author: masa <masa@pop-os.localdomain>
Date:   Fri Dec 18 20:36:24 2020 +0900

    support more data type

commit 3e7d1f8
Author: masa <masa@pop-os.localdomain>
Date:   Fri Dec 18 20:09:51 2020 +0900

    1d thrust scan working

commit ac13b40
Author: masa <masa@pop-os.localdomain>
Date:   Fri Dec 18 19:49:25 2020 +0900

    adding thrust scan support

commit 65634e8
Author: masa <masa@pop-os.localdomain>
Date:   Fri Dec 18 19:01:11 2020 +0900

    add thrust scan python stub

commit 9876c90
Author: masa <masa@pop-os.localdomain>
Date:   Fri Dec 18 20:55:14 2020 +0900

    introduce prefix_scan.py and move scan ir in nms.py

commit 667bdd3
Author: masa <masa@pop-os.localdomain>
Date:   Fri Dec 18 15:06:18 2020 +0900

    make the scan loop exclusive

commit 480787b
Author: mbrookhart <mbrookhart@octoml.ai>
Date:   Thu Dec 17 10:01:11 2020 -0700

    Parallelize cumsum in get_valid_counts
@masahi
Member Author

masahi commented Jan 20, 2021

@anijain2305 I added an empty tensor test in a6c7403

OpenCL seems to have a problem with 0 size buffer, but otherwise both TIR scan and thrust scan seem to have no issue. Please take a look.

@anijain2305
Contributor

Thanks :) LGTM :)

@masahi masahi merged commit 62f251b into apache:main Jan 20, 2021
@masahi
Member Author

masahi commented Jan 20, 2021

Thanks @mbrookhart @anijain2305

TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
* import changes from scan branch

* fix for 1d scan

* rename

* cast to out dtype

* do not run return reduction for inclusive scan

* remove another ceil_div definition

* adding scan test

* add scheduling for scan op, fixed scan 1d test

* pylint fix

* add doc string

* add more thrust scan test

* add dynamic get valid count test, including empty size tensor

* fix hard coded gpu targets for cpu only env

* try retunring early if scan_size is 0

* another change for empty tensor and thrust path

Co-authored-by: masa <masa@pop-os.localdomain>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021