[CUDA] Support multiple TIR-level dynamic shared memory allocations #8571

masahi · 2021-07-28T04:30:37Z

A follow-up to #8466

This PR adds a new pass that merges multiple TIR-level dynamic shared memory allocations, whose sizes may not be constant. The storage_rewrite pass does not handle this case. Rather than updating storage_rewrite, I added a new pass because the logic is simpler (we MUST merge, and we know which allocations to merge).

Hetero-dtype is supported per discussion #8466 (comment)

@tqchen @vinx13 @yzh119

This reverts commit ce62d9e.

tqchen · 2021-07-28T23:22:47Z

cc @Hzfengsy @vinx13 would be great if you can help to manage this PR

masahi · 2021-07-29T01:51:11Z

For the dyn shmem matmul test, the generated kernel looks like:

extern "C" __global__ void default_function_kernel0(half* __restrict__ A, half* __restrict__ B, float* __restrict__ reduce) {
  extern __shared__ uchar buf_dyn_shmem[];
  ((float*)buf_dyn_shmem)[((((((int)threadIdx.y) * 16) + ((int)threadIdx.x)) + 512))] = 0.000000e+00f;
  for (int i = 0; i < 64; ++i) {
    ((half*)buf_dyn_shmem)[((((((int)threadIdx.y) * 16) + ((int)threadIdx.x)) + 512))] = A[(((((((int)blockIdx.y) * 16384) + (((int)threadIdx.y) * 1024)) + (i * 16)) + ((int)threadIdx.x)))];
    ((half*)buf_dyn_shmem)[(((((int)threadIdx.y) * 16) + ((int)threadIdx.x)))] = B[(((((i * 16384) + (((int)threadIdx.y) * 1024)) + (((int)blockIdx.x) * 16)) + ((int)threadIdx.x)))];
    __syncthreads();
    for (int k = 0; k < 16; ++k) {
      ((float*)buf_dyn_shmem)[((((((int)threadIdx.y) * 16) + ((int)threadIdx.x)) + 512))] = (((float*)buf_dyn_shmem)[((((((int)threadIdx.y) * 16) + ((int)threadIdx.x)) + 512))] + ((float)(((half*)buf_dyn_shmem)[((((((int)threadIdx.y) * 16) + k) + 512))] * ((half*)buf_dyn_shmem)[(((k * 16) + ((int)threadIdx.x)))])));
    }
    __syncthreads();
  }
  reduce[(((((((int)blockIdx.y) * 16384) + (((int)threadIdx.y) * 1024)) + (((int)blockIdx.x) * 16)) + ((int)threadIdx.x)))] = ((float*)buf_dyn_shmem)[((((((int)threadIdx.y) * 16) + ((int)threadIdx.x)) + 512))];
}
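
The `+ 512` offsets in the kernel above are consistent with a layout in which the three tiles sit back to back in one byte buffer, and each typed index is the byte offset divided by the element size. The Python sketch below reproduces those numbers. It rests on assumptions inferred from the kernel rather than stated in this thread: each tile holds 256 elements (from the 16x16 indexing), and every buffer reserves `num_elements * 4` bytes, where 4 is the widest element size among the merged buffers. The buffer names `A_sh`, `B_sh`, `C_sh` and both helper functions are hypothetical.

```python
from typing import Dict, List, Tuple

# Element sizes in bytes for the two dtypes that appear in the kernel.
DTYPE_BYTES = {"float16": 2, "float32": 4}


def plan_merged_offsets(allocs: List[Tuple[str, str, int]]) -> Tuple[Dict[str, int], int]:
    """Assign each (name, dtype, num_elements) a byte offset in one merged buffer.

    Assumption: every buffer is padded to num_elements * (widest element size),
    so each offset is a multiple of every dtype size involved.
    """
    widest = max(DTYPE_BYTES[dtype] for _, dtype, _ in allocs)
    offsets: Dict[str, int] = {}
    total_bytes = 0
    for name, _dtype, num_elements in allocs:
        offsets[name] = total_bytes
        total_bytes += num_elements * widest
    return offsets, total_bytes


def element_offset(byte_offset: int, dtype: str) -> int:
    """Convert a byte offset into an index for a pointer cast to `dtype`."""
    return byte_offset // DTYPE_BYTES[dtype]


# Hypothetical allocation order, chosen to match the indices in the kernel.
allocs = [("B_sh", "float16", 256), ("A_sh", "float16", 256), ("C_sh", "float32", 256)]
offsets, total_bytes = plan_merged_offsets(allocs)
typed_offsets = {name: element_offset(offsets[name], dtype) for name, dtype, _ in allocs}
print(offsets, total_bytes, typed_offsets)
```

Under these assumptions the half-precision tile read through `(half*)buf_dyn_shmem` and the accumulator read through `(float*)buf_dyn_shmem` both end up at element offset 512, even though their byte offsets differ (1024 and 2048).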

Hzfengsy

Can the new pass reuse buffers whose lifetimes have ended? To be specific, please see the following example:

A_shared[i] = A[i]
A_local[i] = A_shared[i]
C_local[i] = A_local[i] + 1
C_shared[i] = C_local[i]

Since A_shared[i] is no longer used by the time we store to C_shared, we can store the data directly into A_shared[i] to reduce memory usage. This is supported in storage_rewrite.
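
The reuse argument can be checked with a small liveness calculation. The Python sketch below is only an illustration of the idea, not how storage_rewrite is implemented: it records the first and last statement that touches each buffer in the four-line example and reports whether two buffers have disjoint lifetimes. The function names and the statement encoding are hypothetical.

```python
from typing import Dict, List, Tuple

# Each statement is (destination buffer, list of source buffers),
# in the same order as the four-line example.
stmts: List[Tuple[str, List[str]]] = [
    ("A_shared", ["A"]),
    ("A_local", ["A_shared"]),
    ("C_local", ["A_local"]),
    ("C_shared", ["C_local"]),
]


def live_ranges(stmts: List[Tuple[str, List[str]]]) -> Dict[str, Tuple[int, int]]:
    """Map each buffer to (first statement index, last statement index) that touches it."""
    ranges: Dict[str, Tuple[int, int]] = {}
    for i, (dst, srcs) in enumerate(stmts):
        for buf in [dst, *srcs]:
            first, _ = ranges.get(buf, (i, i))
            ranges[buf] = (first, i)
    return ranges


def can_share(ranges: Dict[str, Tuple[int, int]], a: str, b: str) -> bool:
    """Two buffers may share storage only if their live ranges do not overlap."""
    (a_first, a_last), (b_first, b_last) = ranges[a], ranges[b]
    return a_last < b_first or b_last < a_first


ranges = live_ranges(stmts)
print(ranges["A_shared"], ranges["C_shared"])
print(can_share(ranges, "A_shared", "C_shared"))
print(can_share(ranges, "A_shared", "A_local"))
```

Here `A_shared` is last touched at statement 1 and `C_shared` is first touched at statement 3, so the two can share storage, while `A_shared` and `A_local` overlap at statement 1 and cannot.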

tests/python/unittest/test_tir_ir_builder.py

src/tir/transforms/merge_dynamic_shared_memory_allocations.cc


masahi · 2021-07-29T06:21:00Z

> Can the new pass reuse buffers whose lifetimes have ended? To be specific, please see the following example:
>
> A_shared[i] = A[i]
> A_local[i] = A_shared[i]
> C_local[i] = A_local[i] + 1
> C_shared[i] = C_local[i]
>
> Since A_shared[i] is no longer used by the time we store to C_shared, we can store the data directly into A_shared[i] to reduce memory usage. This is supported in storage_rewrite.

Thanks, I didn't think about reuse support. To support this, I think it is better to drop the new pass in this PR and merge the new functionality into the storage_rewrite pass. I'll consider both possibilities and try to find the simplest solution.

One difficulty I can imagine is that dynamic shared memory in general has an unknown allocation size, so in the general case I don't think reuse analysis would work the way it does in storage_rewrite. For the special case where a dynamic shared memory allocation happens to have a constant size, it is probably worth supporting reuse. What do you think? @Hzfengsy @vinx13

masahi · 2021-07-30T00:22:42Z

I think we can use storage_rewrite to support buffer reuse on dynamic shared memory with constant sizes, and then use the new pass in this PR to merge remaining buffers. I'll pursue this approach.

vinx13 · 2021-07-30T00:42:27Z

> I think we can use storage_rewrite to support buffer reuse on dynamic shared memory with constant sizes, and then use the new pass in this PR to merge remaining buffers. I'll pursue this approach.

I agree. For constant sizes, storage_rewrite should be able to eliminate buffer allocations, so running MergeDynamicMemoryAlloc after StorageRewrite will work.
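
The two-stage plan discussed here can be pictured with a toy model. The Python sketch below is not TVM code and does not call any TVM API. Stage 1 stands in for the reuse step and only touches buffers with a constant size. Stage 2 stands in for the merge step and lays out whatever allocations remain, including one with a symbolic size. All function names, the buffer `D_shared`, and the size expression `n * 4` are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Size = Union[int, str]  # int: constant byte size, str: symbolic size unknown at compile time
Buffer = Tuple[str, Size, int, int]  # (name, size, first use, last use)


def reuse_constant_buffers(buffers: List[Buffer]) -> Tuple[Dict[str, str], List[Tuple[str, Size]]]:
    """Stage 1: a constant-size buffer takes over an earlier constant-size
    allocation that is large enough and already dead. Symbolic sizes are skipped."""
    alias: Dict[str, str] = {}
    storages: List[list] = []  # [name, size, last use] per surviving allocation
    for name, size, first, last in sorted(buffers, key=lambda b: b[2]):
        donor = None
        if isinstance(size, int):
            for s in storages:
                if isinstance(s[1], int) and s[1] >= size and s[2] < first:
                    donor = s
                    break
        if donor is None:
            storages.append([name, size, last])
            alias[name] = name
        else:
            donor[2] = last  # the donor allocation now lives until the new buffer dies
            alias[name] = donor[0]
    return alias, [(n, s) for n, s, _ in storages]


def merge_allocations(storages: List[Tuple[str, Size]]) -> Tuple[Dict[str, str], str]:
    """Stage 2: place the surviving allocations back to back in one buffer.
    Offsets are kept as expression strings because sizes may be symbolic."""
    offsets: Dict[str, str] = {}
    terms: List[str] = []
    for name, size in storages:
        offsets[name] = " + ".join(terms) or "0"
        terms.append(str(size))
    return offsets, " + ".join(terms)


buffers: List[Buffer] = [
    ("A_shared", 1024, 0, 1),
    ("D_shared", "n * 4", 0, 3),
    ("C_shared", 1024, 3, 3),
]
alias, storages = reuse_constant_buffers(buffers)
offsets, total = merge_allocations(storages)
print(alias, offsets, total)
```

In this toy run `C_shared` aliases `A_shared` after stage 1, and stage 2 merges the two remaining allocations into one buffer of `1024 + n * 4` bytes.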

masahi · 2021-07-30T03:05:44Z

ok @Hzfengsy @vinx13 I added a new test that demonstrates storage_rewrite and the new merge pass working seamlessly.


masa added 15 commits July 27, 2021 23:27

add shared mem matmul test

617d7e0

Add a stub pass

26bfb17

add builtin for reinterprete load/store

9eab8d3

remove buitlin since Load/Store node already support reinterpret

27c881a

Add Load/Store visitor implementation

f1e35fa

allocate merge first cut

482ecd8

Remove all attr::storage_scope usage

ce62d9e

fix allocate location

c95ede5

Revert "Remove all attr::storage_scope usage"

d55ce65

This reverts commit ce62d9e.

handle vector alloc

c2892b6

add vectorized test

e8907f1

drop multi-lane dtype allocation support, vectorized store working

1f5c79c

doc update

6d6f437

dtype fix in the test

f086e37

lint fix, do not run merging when number of alloc is 1

391b2c6

masahi marked this pull request as ready for review July 28, 2021 19:44

masahi requested review from areusch, comaniac, jroesch, junrushao, jwfromm, kparzysz-quic, leandron, merrymercy, tqchen, vinx13, yzhliu, zhiics and ZihengJiang as code owners July 28, 2021 19:44

Hzfengsy requested changes Jul 29, 2021


masahi and others added 5 commits July 29, 2021 14:54

Update src/tir/transforms/merge_dynamic_shared_memory_allocations.cc

6203b0b

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

Update src/tir/transforms/merge_dynamic_shared_memory_allocations.cc

56f8dce

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

move test cases into dedicated file

ec3d0da

use integer division

30bbbbb

verify new transform pass output

b46e1f2

tqchen assigned vinx13 Jul 29, 2021

add test on combined buffer reuse and merge

7a37406

Hzfengsy approved these changes Jul 31, 2021


vinx13 approved these changes Jul 31, 2021


vinx13 merged commit 4b67dac into apache:main Jul 31, 2021

ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021

[CUDA] Support multiple TIR-level dynamic shared memory allocations (a…

7c677ad

…pache#8571)

masahi mentioned this pull request Oct 21, 2021

[CUDA] Support memory reuse for dynamic shared memory #9341

Merged

junrushao mentioned this pull request Nov 1, 2021

Apache TVM v0.8 Release Note Candidate #9416

Closed

ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022

[CUDA] Support multiple TIR-level dynamic shared memory allocations (a…

54a5539

…pache#8571)

LeiWang1999 mentioned this pull request Jan 3, 2024

[CUDA] Simple extend to optimize reuse for static shared memory. #16342

Merged
