[Metaschedule] Add demonstration of selectively tuning relay ops with TIR schedules #10793
Conversation
register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni) | ||
|
||
# TODO(masahi): Weird error from tuning with CheckSubtreeCompactDataflow in for_kind.cc turned on | ||
# manual_tir_common(do_tune=True) |
Currently this results in the following error:
[08:11:02] /home/masa/projects/dev/tvm/src/meta_schedule/task_scheduler/task_scheduler.cc:127: Scheduler picks Task #0: "fused_nn_contrib_dense_pack_add_add"
Traceback (most recent call last):
File "test_meta_schedule_tune_relay.py", line 559, in <module>
test_tune_relay_manual_tir_vnni()
File "test_meta_schedule_tune_relay.py", line 547, in test_tune_relay_manual_tir_vnni
manual_tir_common(do_tune=True)
File "test_meta_schedule_tune_relay.py", line 485, in manual_tir_common
database = tune_extracted_tasks(tune_tasks, target, config, work_dir=work_dir)
File "/home/masa/projects/dev/tvm/python/tvm/meta_schedule/tune.py", line 716, in tune_extracted_tasks
task_scheduler.tune()
File "/home/masa/projects/dev/tvm/python/tvm/meta_schedule/task_scheduler/task_scheduler.py", line 60, in tune
_ffi_api.TaskSchedulerTune(self) # type: ignore # pylint: disable=no-member
File "/home/masa/projects/dev/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
5: TVMFuncCall
4: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::meta_schedule::TaskScheduler)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::meta_schedule::TaskScheduler, tvm::meta_schedule::TaskSchedulerNode, void, , void>(void (tvm::meta_schedule::TaskSchedulerNode::*)())::{lambda(tvm::meta_schedule::TaskScheduler)#1}>(tvm::runtime::Registry::set_body_method<tvm::meta_schedule::TaskScheduler, tvm::meta_schedule::TaskSchedulerNode, void, , void>(void (tvm::meta_schedule::TaskSchedulerNode::*)())::{lambda(tvm::meta_schedule::TaskScheduler)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
3: tvm::meta_schedule::TaskSchedulerNode::Tune()
2: tvm::meta_schedule::ReplayTraceNode::GenerateMeasureCandidates()
1: tvm::meta_schedule::ReplayTraceNode::State::GenerateMeasureCandidates()
0: tvm::support::parallel_for_dynamic(int, int, int, std::function<void (int, int)> const&) [clone .cold]
File "/home/masa/projects/dev/tvm/src/support/parallel_for.cc", line 128
RuntimeError: parallel_for_dynamic error with ScheduleError: (not rendered)
If I remove this check at tvm/src/tir/schedule/primitive/for_kind.cc, line 160 (0ddaaa6):
CheckSubtreeCompactDataflow(self, loop_sref);
I believe @vinx13 and I ran into the same problem of getting this check to pass. The root cause is that the compact dataflow check is too strict for tensorization, and therefore we need to disable it in the mma case.
Similar case here: #10420
It's not about tensorization, but parallelization. Now I have more information about this error. If I turn on detailed logging, I get this from ReplayTraceNode:
for i0_0_i1_0_fused_fused in T.serial(8192):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
for i0_1 in T.serial(8):
for ax0_init in T.vectorized(16):
with T.block("compute_init"):
i = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
j = T.axis.spatial(1024, i0_0_i1_0_fused_fused % 64 * 16 + ax0_init)
T.reads()
T.writes(compute[i, j])
T.block_attr({"schedule_rule":"meta_schedule.dense_vnni", "workload":["dense_vnni.x86", ["TENSOR", [1024, 1024], "uint8"], ["TENSOR", [64, 256, 16, 4], "int8"], None, "int32"]})
compute[i, j] = 0
for ax1_0 in T.serial(256):
# tir.Block#1
with T.block("compute_update_o"):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
j_o = T.axis.spatial(64, i0_0_i1_0_fused_fused % 64)
k_o = T.axis.reduce(256, ax1_0)
T.reads(compute[i, j_o * 16 : j_o * 16 + 16], placeholder[i, k_o * 4 : k_o * 4 + 4], placeholder_1[j_o, k_o, 0 : 16, 0 : 4])
T.writes(compute[i, j_o * 16 : j_o * 16 + 16])
A = T.match_buffer(placeholder[i, k_o * 4 : k_o * 4 + 4], [4], dtype="uint8", offset_factor=1)
B = T.match_buffer(placeholder_1[j_o, k_o, 0 : 16, 0 : 4], [16, 4], dtype="int8", offset_factor=1)
C = T.match_buffer(compute[i, j_o * 16 : j_o * 16 + 16], [16], dtype="int32", offset_factor=1)
A_u8x4: T.uint8x4 = A[T.ramp(0, 1, 4)]
A_i32: T.int32 = T.reinterpret(A_u8x4, dtype="int32")
B_i8x64: T.int8x64 = B[0, T.ramp(0, 1, 64)]
B_i32x16: T.int32x16 = T.reinterpret(B_i8x64, dtype="int32x16")
C[T.ramp(0, 1, 16)] = C[T.ramp(0, 1, 16)] + T.call_llvm_pure_intrin(9785, T.uint32(0), T.broadcast(0, 16), T.broadcast(A_i32, 16), B_i32x16, dtype="int32x16")
for i1_1 in T.vectorized(16):
with T.block("T_add_1"):
ax0 = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
ax1 = T.axis.spatial(1024, i0_0_i1_0_fused_fused % 64 * 16 + i1_1)
T.reads(compute[ax0, ax1], placeholder_2[0, ax1])
T.writes(T_add[ax0, ax1])
T_add[ax0, ax1] = compute[ax0, ax1] + placeholder_2[0, ax1] + 1
Error message: The queried subtree root tir.For#0 in SRef tree does not have compact dataflow, because its child block tir.Block#1 on SRef tree is neither a local complete block nor a local reduction block.
So even though I'm doing parallel before decompose_reduction to work around that strict check, ReplayTrace is trying to apply parallel to a schedule that has clearly already been decompose_reduction-ed. This is very weird...
It is solved: there is a default set of postprocessors for CPU that applies additional parallel on top of my manual schedule, see tvm/python/tvm/meta_schedule/tune.py, line 121 (7ff5c83):
M.RewriteParallelVectorizeUnroll(),
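For reference, below is a minimal sketch of how the manual schedule can avoid that default postprocessor, assuming tune_extracted_tasks in this TVM version accepts a postprocs override (the parameter name and the trimmed default list are assumptions to check against your checkout). The variables tune_tasks, target, config, and work_dir are the ones from manual_tir_common in the traceback above.

from tvm.meta_schedule import postproc as M
from tvm.meta_schedule.tune import tune_extracted_tasks


def custom_postprocs():
    # A trimmed-down CPU postprocessor list that leaves out
    # RewriteParallelVectorizeUnroll, so the tuner does not re-apply
    # parallel/vectorize on top of the manual TIR schedule.
    return [
        M.DisallowDynamicLoop(),
        M.RewriteReductionBlock(),
    ]


# Hypothetical call: pass the override instead of relying on the defaults in tune.py.
database = tune_extracted_tasks(
    tune_tasks, target, config, work_dir=work_dir, postprocs=custom_postprocs
)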
return imm.astype("int32x16", span)
...
@T.prim_func
Maybe we could move all the tensorization intrinsics to the same place. Right now, for wmma/mma on CUDA, they live in meta_schedule.testing.tir_tensor_intrin, but I believe it is not upstreamed yet.
Yeah, we should do that.
Are these generally useful intrinsics, or just needed for the tests in this file? If they are just test intrinsics, it makes sense to keep them here.
For now they are only needed here, but for future efforts like auto-tensorization, we definitely need a centralized place to put intrinsics so that they can be used from anywhere.
@pytest.mark.skip("Requires cascadelake") | ||
def test_tune_relay_manual_tir_vnni(): | ||
tir.TensorIntrin.register(VNNI_INTRIN, dot_product_desc, dot_product_intrin) |
I think this is one of the most important lines in the whole file. I'd love to see a quick comment explaining what it does / how it works for future readers trying to figure out auto-tensorization.
This test case is not a demonstration of TIR tensorization per se, so the mechanics of tir.TensorIntrin.register are not too relevant here. We have dedicated test cases for TIR tensorization and more uses of tir.TensorIntrin.register in tvm/tests/python/unittest/test_tir_schedule_tensorize.py, lines 459 to 462 (ff3a48e):
tir.TensorIntrin.register("test_mma_intrin", mma_desc, mma_intrin)
tir.TensorIntrin.register("test_annotated_mma_intrin", annotated_mma_desc, mma_intrin)
tir.TensorIntrin.register("test_dot_product_intrin", dot_product_desc, dot_product_intrin)
tir.TensorIntrin.register("test_outer_product_intrin", outer_product_desc, outer_product_intrin)
schedule_dense(block, None, True, sch)
return [sch]
...
register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)
Is this overwriting the default behavior for dense on Cascade Lake? If so, a quick comment saying so wouldn't hurt.
It is not "overwriting" in the sense that (1) TE scheduling is not affected and (2) without this registration, meta schedule tuning fails with
File "/home/masa/projects/dev/tvm/src/meta_schedule/space_generator/post_order_apply.cc", line 149
ValueError: Check failed: (f) is false: Custom schedule rule not found: meta_schedule.dense_vnni
This is because, for all TE compute annotated with schedule_rule
like
tvm/python/tvm/topi/x86/dense.py
Line 299 in ce335c3
attrs={"schedule_rule": "meta_schedule.dense_vnni"}, |
Thinking about this now, I wonder if failing is the desired behavior. If we don't find the custom schedule rule registered, we could ignore the schedule_rule annotation and apply automatic scheduling. cc @junrushao1994
Thanks for the discussion! This is definitely a good question.
From my perspective, I would prefer that we only annotate the "schedule_rule" when we are 100% sure that such a rule exists, so that it doesn't just fail silently. On the other hand, it also makes sense to me that enforcing such a check everywhere in TOPI might lead to certain engineering overhead...
@masahi, I would love to hear more of your opinions. What do you think about the pros & cons? Thanks a lot!
I want to freely add schedule_rule annotations to various TE compute to experiment with things, so requiring that all schedule_rule annotations have the corresponding packed func registered sounds like a heavy-weight requirement to me. If we had a TOPI equivalent for manual TIR schedules, such a requirement would be easy to satisfy, but until then I expect that manual TIR scheduling is done in a one-off fashion like this PR.
Also, it is totally reasonable to want to auto-schedule TE compute annotated with schedule_rule. Currently I annotate the TE x86 dense and batch_matmul compute with VNNI-specific schedule rules (like meta_schedule.dense_vnni above) to apply my manual TIR schedule, but that prevents any automatic scheduling from happening on these TE compute. In the future, when auto-tensorization is ready, I want to freely switch between manual and automatic scheduling.
So I want "the need to annotate schedule_relu
" and "whether or not I want to register my custom schedule rule" be decoupled.
"Silent failing" is certainly something we need be mindful of. When we encounter a block with schedule_relu
annotation, and if the schedule rule registration is missing, how about emitting a warning to make sure that a user is aware of the fact?
@masahi Thanks for the comprehensive discussion about API design and user intention!
Looks like we have 3 different proposals:
- P1. Always annotate schedule_rule even if it's not registered; Print a warning if the schedule_rule isn't registered;
- P2. Always make sure schedule_rule exists when annotating, and error out if it's not;
- P3. Always annotate schedule_rule even if it's not registered; No warning if the schedule_rule isn't registered;
Masa and I both agree that P3 may not be ideal, because silently ignoring an abnormality may not be the best user experience.
Regarding "In the future when auto-tensorization is ready, I want to freely switch between manual and automatic scheduling": totally agree about the future possibility of switching around! Definitely it's going to be a lot of fun :-)
Therefore, I would conclude that emitting a warning is probably what both @masahi and I agree on. Additionally, we might enhance the search space generation to selectively check whether the schedule_rule is allowed with a target-specific allowlist, but that's probably not a high priority for now.
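To make P1 concrete, here is a minimal Python sketch of the proposed behavior (illustrative only: the real lookup lives in C++ in post_order_apply.cc, and the helper name below is hypothetical).

import logging

import tvm

logger = logging.getLogger("tvm.meta_schedule")


def lookup_schedule_rule(rule_name):
    # Return the packed function registered under `rule_name`, or None. Returning
    # None (after a warning) lets the caller fall back to the default automatic
    # schedule rules for the block instead of erroring out.
    rule = tvm.get_global_func(rule_name, allow_missing=True)
    if rule is None:
        logger.warning(
            "Block is annotated with schedule_rule=%s, but no packed function with "
            "that name is registered; falling back to automatic scheduling.",
            rule_name,
        )
    return rule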
@jwfromm I added some comments to explain how things work. Indeed, the PR was missing that aspect completely, thanks for pointing that out.
@junrushao1994 @jwfromm good to go?
The PR overall looks good to me! I'm going to merge it in after the interesting discussion here: #10793 (comment). In a future quarter, it might be desirable for us to run this end-to-end nightly to make sure nothing is broken.
(Follow-up, apache#10975) See the discussion in #10793 (comment) for the context. Now that I'm doing auto-tensorization on VNNI, I do need to be able to switch `schedule_rule` on / off freely.
This demonstrates how to selectively extract and tune tasks from a whole relay mod, and apply the tuned schedule during the final relay.build(...).
This flow is entirely different from existing tests in test_meta_schedule_tune_relay.py, where ALL ops are extracted and auto-scheduled by MS. My test extracts only the int8 dense op, applies a manual TIR schedule on it, and leaves int8 batch_matmul to be scheduled by TE.
This also serves as an example of autotvm-style manual template + tensorization. The manual TIR schedule is equivalent to the TE VNNI dense schedule in tvm/python/tvm/topi/x86/dense.py, lines 366 to 375 (ce335c3).
@junrushao1994 @vinx13 @csullivan @comaniac @jwfromm @TejashShah
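For readers who want to reproduce the selective extract-and-tune flow described above, here is a condensed sketch pieced together from the test in this PR. The import paths and the config class are assumptions against the TVM version of this PR (these modules were reorganized several times around this period), and relay_mod / params stand for the user's Relay module and parameters.

import tvm
from tvm import relay
from tvm.meta_schedule.integration import ApplyHistoryBest, extract_task_from_relay
from tvm.meta_schedule.tune import ReplayTraceConfig, tune_extracted_tasks

target = "llvm -mcpu=cascadelake -num-cores 4"

# 1. Extract tuning tasks from the whole Relay module...
extracted_tasks = extract_task_from_relay(relay_mod, target, params)

# 2. ...but keep only the tasks we want to hand-schedule (int8 dense here);
#    every op left out is scheduled by the usual TE schedules at build time.
tune_tasks = [task for task in extracted_tasks if "dense" in task.task_name]

# 3. Tune just those tasks and collect the results in a database.
config = ReplayTraceConfig(num_trials_per_iter=64, num_trials_total=64)
database = tune_extracted_tasks(tune_tasks, target, config, work_dir="./tune_tmp")

# 4. Apply the tuned records during the final relay.build().
with ApplyHistoryBest(database):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_meta_schedule": True}
    ):
        lib = relay.build(relay_mod, target=target, params=params)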