[Metaschedule] Add demonstration of selectively tuning relay ops with TIR schedules #10793
Conversation
register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni) | ||
|
||
# TODO(masahi): Weird error from tuning with CheckSubtreeCompactDataflow in for_kind.cc turned on | ||
# manual_tir_common(do_tune=True) |
Currently this results in the following error:
[08:11:02] /home/masa/projects/dev/tvm/src/meta_schedule/task_scheduler/task_scheduler.cc:127: Scheduler picks Task #0: "fused_nn_contrib_dense_pack_add_add"
Traceback (most recent call last):
File "test_meta_schedule_tune_relay.py", line 559, in <module>
test_tune_relay_manual_tir_vnni()
File "test_meta_schedule_tune_relay.py", line 547, in test_tune_relay_manual_tir_vnni
manual_tir_common(do_tune=True)
File "test_meta_schedule_tune_relay.py", line 485, in manual_tir_common
database = tune_extracted_tasks(tune_tasks, target, config, work_dir=work_dir)
File "/home/masa/projects/dev/tvm/python/tvm/meta_schedule/tune.py", line 716, in tune_extracted_tasks
task_scheduler.tune()
File "/home/masa/projects/dev/tvm/python/tvm/meta_schedule/task_scheduler/task_scheduler.py", line 60, in tune
_ffi_api.TaskSchedulerTune(self) # type: ignore # pylint: disable=no-member
File "/home/masa/projects/dev/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
5: TVMFuncCall
4: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::meta_schedule::TaskScheduler)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::meta_schedule::TaskScheduler, tvm::meta_schedule::TaskSchedulerNode, void, , void>(void (tvm::meta_schedule::TaskSchedulerNode::*)())::{lambda(tvm::meta_schedule::TaskScheduler)#1}>(tvm::runtime::Registry::set_body_method<tvm::meta_schedule::TaskScheduler, tvm::meta_schedule::TaskSchedulerNode, void, , void>(void (tvm::meta_schedule::TaskSchedulerNode::*)())::{lambda(tvm::meta_schedule::TaskScheduler)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
3: tvm::meta_schedule::TaskSchedulerNode::Tune()
2: tvm::meta_schedule::ReplayTraceNode::GenerateMeasureCandidates()
1: tvm::meta_schedule::ReplayTraceNode::State::GenerateMeasureCandidates()
0: tvm::support::parallel_for_dynamic(int, int, int, std::function<void (int, int)> const&) [clone .cold]
File "/home/masa/projects/dev/tvm/src/support/parallel_for.cc", line 128
RuntimeError: parallel_for_dynamic error with ScheduleError: (not rendered)
If I remove this check at tvm/src/tir/schedule/primitive/for_kind.cc, line 160 (0ddaaa6):
CheckSubtreeCompactDataflow(self, loop_sref);
I believe @vinx13 and I ran into the same problem of getting this check to pass. The root cause is that the compact dataflow check is too strict for tensorization, and therefore we need to disable it in the mma case.
Similar case here: #10420
It's not about tensorization, but parallelization. Now I have more information about this error. If I turn on detailed logging, I get this from ReplayTraceNode:
for i0_0_i1_0_fused_fused in T.serial(8192):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
for i0_1 in T.serial(8):
for ax0_init in T.vectorized(16):
with T.block("compute_init"):
i = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
j = T.axis.spatial(1024, i0_0_i1_0_fused_fused % 64 * 16 + ax0_init)
T.reads()
T.writes(compute[i, j])
T.block_attr({"schedule_rule":"meta_schedule.dense_vnni", "workload":["dense_vnni.x86", ["TENSOR", [1024, 1024], "uint8"], ["TENSOR", [64, 256, 16, 4], "int8"], None, "int32"]})
compute[i, j] = 0
for ax1_0 in T.serial(256):
# tir.Block#1
with T.block("compute_update_o"):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
i = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
j_o = T.axis.spatial(64, i0_0_i1_0_fused_fused % 64)
k_o = T.axis.reduce(256, ax1_0)
T.reads(compute[i, j_o * 16 : j_o * 16 + 16], placeholder[i, k_o * 4 : k_o * 4 + 4], placeholder_1[j_o, k_o, 0 : 16, 0 : 4])
T.writes(compute[i, j_o * 16 : j_o * 16 + 16])
A = T.match_buffer(placeholder[i, k_o * 4 : k_o * 4 + 4], [4], dtype="uint8", offset_factor=1)
B = T.match_buffer(placeholder_1[j_o, k_o, 0 : 16, 0 : 4], [16, 4], dtype="int8", offset_factor=1)
C = T.match_buffer(compute[i, j_o * 16 : j_o * 16 + 16], [16], dtype="int32", offset_factor=1)
A_u8x4: T.uint8x4 = A[T.ramp(0, 1, 4)]
A_i32: T.int32 = T.reinterpret(A_u8x4, dtype="int32")
B_i8x64: T.int8x64 = B[0, T.ramp(0, 1, 64)]
B_i32x16: T.int32x16 = T.reinterpret(B_i8x64, dtype="int32x16")
C[T.ramp(0, 1, 16)] = C[T.ramp(0, 1, 16)] + T.call_llvm_pure_intrin(9785, T.uint32(0), T.broadcast(0, 16), T.broadcast(A_i32, 16), B_i32x16, dtype="int32x16")
for i1_1 in T.vectorized(16):
with T.block("T_add_1"):
ax0 = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
ax1 = T.axis.spatial(1024, i0_0_i1_0_fused_fused % 64 * 16 + i1_1)
T.reads(compute[ax0, ax1], placeholder_2[0, ax1])
T.writes(T_add[ax0, ax1])
T_add[ax0, ax1] = compute[ax0, ax1] + placeholder_2[0, ax1] + 1
Error message: The queried subtree root tir.For#0 in SRef tree does not have compact dataflow, because its child block tir.Block#1 on SRef tree is neither a local complete block nor a local reduction block.
So even though I'm doing parallel before decompose_reduction to work around that strict check, ReplayTrace is trying to apply parallel to a schedule that has clearly already been decompose_reduction-ed. This is very weird...
It is solved: there is a default set of postprocessors for CPU that applies additional parallel on top of my manual schedule, see tvm/python/tvm/meta_schedule/tune.py, line 121 (7ff5c83):
M.RewriteParallelVectorizeUnroll(),
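For reference, below is a minimal sketch of how the manual schedule can avoid that default postprocessor, assuming tune_extracted_tasks in this TVM version accepts a postprocs override (the parameter name and the trimmed default list are assumptions to check against your checkout). The variables tune_tasks, target, config, and work_dir are the ones from manual_tir_common in the traceback above.

from tvm.meta_schedule import postproc as M
from tvm.meta_schedule.tune import tune_extracted_tasks


def custom_postprocs():
    # A trimmed-down CPU postprocessor list that leaves out
    # RewriteParallelVectorizeUnroll, so the tuner does not re-apply
    # parallel/vectorize on top of the manual TIR schedule.
    return [
        M.DisallowDynamicLoop(),
        M.RewriteReductionBlock(),
    ]


# Hypothetical call: pass the override instead of relying on the defaults in tune.py.
database = tune_extracted_tasks(
    tune_tasks, target, config, work_dir=work_dir, postprocs=custom_postprocs
)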
return imm.astype("int32x16", span)
...
@T.prim_func
Maybe we could move all the tensorization intrinsics to the same place. Right now, for wmma/mma on CUDA, they live in meta_schedule.testing.tir_tensor_intrin, but I believe it is not upstreamed yet.
Yeah, we should do that.
Are these generally useful intrinsics, or just needed for the tests in this file? If they are just test intrinsics, it makes sense to keep them here.
For now they are only needed here, but for future efforts like auto-tensorization, we definitely need a centralized place to put intrinsics so that they can be used from anywhere.
@pytest.mark.skip("Requires cascadelake") | ||
def test_tune_relay_manual_tir_vnni(): | ||
tir.TensorIntrin.register(VNNI_INTRIN, dot_product_desc, dot_product_intrin) |
I think this is one of the most important lines in the whole file. I'd love to see a quick comment explaining what it does / how it works for future readers trying to figure out auto-tensorization.
This test case is not a demonstration of TIR tensorization per se, so the mechanics of tir.TensorIntrin.register are not too relevant here. We have dedicated test cases for TIR tensorization and more uses of tir.TensorIntrin.register in tvm/tests/python/unittest/test_tir_schedule_tensorize.py, lines 459 to 462 (ff3a48e):
tir.TensorIntrin.register("test_mma_intrin", mma_desc, mma_intrin)
tir.TensorIntrin.register("test_annotated_mma_intrin", annotated_mma_desc, mma_intrin)
tir.TensorIntrin.register("test_dot_product_intrin", dot_product_desc, dot_product_intrin)
tir.TensorIntrin.register("test_outer_product_intrin", outer_product_desc, outer_product_intrin)
schedule_dense(block, None, True, sch)
return [sch]
...
register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)
Is this overwriting the default behavior for dense on Cascade Lake? If so, a quick comment saying so wouldn't hurt.
It is not "overwriting" in the sense that (1) TE scheduling is not affected and (2) without this registration, meta schedule tuning fails with
File "/home/masa/projects/dev/tvm/src/meta_schedule/space_generator/post_order_apply.cc", line 149
ValueError: Check failed: (f) is false: Custom schedule rule not found: meta_schedule.dense_vnni
This is because, for all TE compute annotated with schedule_rule
like
tvm/python/tvm/topi/x86/dense.py
Line 299 in ce335c3
attrs={"schedule_rule": "meta_schedule.dense_vnni"}, |
Thinking about this now, I wonder if failing is the desired behavior. If we don't find the custom schedule rule registered, we could ignore the schedule_rule annotation and apply automatic scheduling. cc @junrushao1994
Thanks for the discussion! This is definitely a good question.
From my perspective, I would prefer that we only annotate the "schedule_rule" when we are 100% sure that such a rule exists, so that it doesn't just fail silently. On the other hand, it also makes sense to me that enforcing such a check everywhere in TOPI might lead to certain engineering overhead...
@masahi, I would love to hear more of your opinions. What do you think about the pros & cons? Thanks a lot!
I want to freely add schedule_rule annotations to various TE compute to experiment with things, so requiring that all schedule_rule annotations have the corresponding packed func registered sounds like a heavy-weight requirement to me. If we had a TOPI equivalent for manual TIR schedules, such a requirement would be easy to satisfy, but until then I expect that manual TIR scheduling is done in a one-off fashion like this PR.
Also, it is totally reasonable to want to auto-schedule TE compute annotated with schedule_rule. Currently I annotate the TE x86 dense and batch_matmul compute with VNNI-specific schedule rules (like meta_schedule.dense_vnni above) to apply my manual TIR schedule, but that prevents any automatic scheduling from happening on these TE compute. In the future, when auto-tensorization is ready, I want to freely switch between manual and automatic scheduling.
So I want "the need to annotate schedule_relu
" and "whether or not I want to register my custom schedule rule" be decoupled.
"Silent failing" is certainly something we need be mindful of. When we encounter a block with schedule_relu
annotation, and if the schedule rule registration is missing, how about emitting a warning to make sure that a user is aware of the fact?
@masahi Thanks for the comprehensive discussion about API design and user intention!
Looks like we have 3 different proposals:
- P1. Always annotate schedule_rule even if it's not registered; Print a warning if the schedule_rule isn't registered;
- P2. Always make sure schedule_rule exists when annotating, and error out if it's not;
- P3. Always annotate schedule_rule even if it's not registered; No warning if the schedule_rule isn't registered;
Masa and I both agree that P3 may not be ideal, because silently ignoring an abnormality may not be the best user experience.
Regarding "In the future when auto-tensorization is ready, I want to freely switch between manual and automatic scheduling": totally agree about the future possibility of switching around! Definitely it's going to be a lot of fun :-)
Therefore, I would conclude that emitting a warning is probably what both @masahi and I agree on. Additionally, we might enhance the search space generation to selectively check whether the schedule_rule is allowed with a target-specific allowlist, but that's probably not a high priority for now.
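To make P1 concrete, here is a minimal Python sketch of the proposed behavior (illustrative only: the real lookup lives in C++ in post_order_apply.cc, and the helper name below is hypothetical).

import logging

import tvm

logger = logging.getLogger("tvm.meta_schedule")


def lookup_schedule_rule(rule_name):
    # Return the packed function registered under `rule_name`, or None. Returning
    # None (after a warning) lets the caller fall back to the default automatic
    # schedule rules for the block instead of erroring out.
    rule = tvm.get_global_func(rule_name, allow_missing=True)
    if rule is None:
        logger.warning(
            "Block is annotated with schedule_rule=%s, but no packed function with "
            "that name is registered; falling back to automatic scheduling.",
            rule_name,
        )
    return rule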
@jwfromm I added some comments to explain how things work. Indeed, the PR was missing that aspect completely, thanks for pointing that out.
@junrushao1994 @jwfromm good to go?
The PR overall looks good to me! I'm going to merge it in after the interesting discussion here: #10793 (comment). In a future quarter, it might be desirable for us to run this end-to-end nightly to make sure nothing is broken.
(Follow-up, apache#10975) See the discussion in #10793 (comment) for the context. Now that I'm doing auto-tensorization on VNNI, I do need to be able to switch `schedule_rule` on / off freely.
This demonstrates how to selectively extract and tune tasks from a whole relay mod, and apply the tuned schedule during the final relay.build(...).
This flow is entirely different from existing tests in test_meta_schedule_tune_relay.py, where ALL ops are extracted and auto-scheduled by MS. My test extracts only the int8 dense op, applies a manual TIR schedule on it, and leaves int8 batch_matmul to be scheduled by TE.
This also serves as an example of autotvm-style manual template + tensorization. The manual TIR schedule is equivalent to the TE VNNI dense schedule in tvm/python/tvm/topi/x86/dense.py, lines 366 to 375 (ce335c3).
@junrushao1994 @vinx13 @csullivan @comaniac @jwfromm @TejashShah
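For readers who want to reproduce the selective extract-and-tune flow described above, here is a condensed sketch pieced together from the test in this PR. The import paths and the config class are assumptions against the TVM version of this PR (these modules were reorganized several times around this period), and relay_mod / params stand for the user's Relay module and parameters.

import tvm
from tvm import relay
from tvm.meta_schedule.integration import ApplyHistoryBest, extract_task_from_relay
from tvm.meta_schedule.tune import ReplayTraceConfig, tune_extracted_tasks

target = "llvm -mcpu=cascadelake -num-cores 4"

# 1. Extract tuning tasks from the whole Relay module...
extracted_tasks = extract_task_from_relay(relay_mod, target, params)

# 2. ...but keep only the tasks we want to hand-schedule (int8 dense here);
#    every op left out is scheduled by the usual TE schedules at build time.
tune_tasks = [task for task in extracted_tasks if "dense" in task.task_name]

# 3. Tune just those tasks and collect the results in a database.
config = ReplayTraceConfig(num_trials_per_iter=64, num_trials_total=64)
database = tune_extracted_tasks(tune_tasks, target, config, work_dir="./tune_tmp")

# 4. Apply the tuned records during the final relay.build().
with ApplyHistoryBest(database):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_meta_schedule": True}
    ):
        lib = relay.build(relay_mod, target=target, params=params)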