
[Metaschedule] Add demonstration of selectively tuning relay ops with TIR schedules #10793

Merged: 10 commits merged into apache:main on Mar 30, 2022

Conversation

@masahi (Member) commented Mar 25, 2022:

This demonstrates how to selectively extract and tune tasks from a whole relay mod, and apply the tuned schedule during the final relay.build(...).

This flow is entirely different from the existing tests in test_meta_schedule_tune_relay.py, where ALL ops are extracted and auto-scheduled by MS. My test extracts only the int8 dense op, applies a manual TIR schedule to it, and leaves the int8 batch_matmul to be scheduled by TE.
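For reference, the overall flow looks roughly like the sketch below. This is not code copied from the test: the import paths and the dense-only filter are assumptions, while tune_extracted_tasks and its arguments mirror the traceback quoted later in this thread.

import tvm
from tvm import relay
# Import paths below are assumed; they may differ from the actual test.
from tvm.meta_schedule.integration import ApplyHistoryBest, extract_task_from_relay
from tvm.meta_schedule.tune import tune_extracted_tasks

# relay_mod, params, target, config and work_dir are set up elsewhere, as in the test.
# Extract every tunable task from the Relay module, then keep only the ones we
# want to tune with the manual TIR schedule (here: the int8 dense op).
extracted_tasks = extract_task_from_relay(relay_mod, target, params)
tune_tasks = [task for task in extracted_tasks if "dense" in task.task_name]

# Tune only the selected tasks and collect the tuning records in a database.
database = tune_extracted_tasks(tune_tasks, target, config, work_dir=work_dir)

# Apply the tuned schedules during the final relay.build(...); ops that were not
# tuned (e.g. int8 batch_matmul) keep their TE schedules.
with ApplyHistoryBest(database):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_meta_schedule": True}):
        lib = relay.build(relay_mod, target=target, params=params)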

This also serves as an example of an AutoTVM-style manual template + tensorization. The manual TIR schedule is equivalent to the TE VNNI dense schedule in python/tvm/topi/x86/dense.py:

def schedule_dense_vnni(cfg, outs):
    """Create a schedule for dense_vnni"""
    s = te.create_schedule([x.op for x in outs])

    def _callback(op):
        if "dense_vnni" in op.tag:
            dense_vnni_schedule(cfg, s, op.output(0), outs[0])

    traverse_inline(s, outs[0].op, _callback)
    return s

@junrushao1994 @vinx13 @csullivan @comaniac @jwfromm @TejashShah

register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)

# TODO(masahi): Weird error from tuning with CheckSubtreeCompactDataflow in for_kind.cc turned on
# manual_tir_common(do_tune=True)
@masahi (Member, Author) commented Mar 25, 2022:

Currently this results in the following error:

[08:11:02] /home/masa/projects/dev/tvm/src/meta_schedule/task_scheduler/task_scheduler.cc:127: Scheduler picks Task #0: "fused_nn_contrib_dense_pack_add_add"
Traceback (most recent call last):
  File "test_meta_schedule_tune_relay.py", line 559, in <module>
    test_tune_relay_manual_tir_vnni()
  File "test_meta_schedule_tune_relay.py", line 547, in test_tune_relay_manual_tir_vnni
    manual_tir_common(do_tune=True)
  File "test_meta_schedule_tune_relay.py", line 485, in manual_tir_common
    database = tune_extracted_tasks(tune_tasks, target, config, work_dir=work_dir)
  File "/home/masa/projects/dev/tvm/python/tvm/meta_schedule/tune.py", line 716, in tune_extracted_tasks
    task_scheduler.tune()
  File "/home/masa/projects/dev/tvm/python/tvm/meta_schedule/task_scheduler/task_scheduler.py", line 60, in tune
    _ffi_api.TaskSchedulerTune(self)  # type: ignore # pylint: disable=no-member
  File "/home/masa/projects/dev/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  5: TVMFuncCall
  4: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::meta_schedule::TaskScheduler)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::meta_schedule::TaskScheduler, tvm::meta_schedule::TaskSchedulerNode, void, , void>(void (tvm::meta_schedule::TaskSchedulerNode::*)())::{lambda(tvm::meta_schedule::TaskScheduler)#1}>(tvm::runtime::Registry::set_body_method<tvm::meta_schedule::TaskScheduler, tvm::meta_schedule::TaskSchedulerNode, void, , void>(void (tvm::meta_schedule::TaskSchedulerNode::*)())::{lambda(tvm::meta_schedule::TaskScheduler)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
  3: tvm::meta_schedule::TaskSchedulerNode::Tune()
  2: tvm::meta_schedule::ReplayTraceNode::GenerateMeasureCandidates()
  1: tvm::meta_schedule::ReplayTraceNode::State::GenerateMeasureCandidates()
  0: tvm::support::parallel_for_dynamic(int, int, int, std::function<void (int, int)> const&) [clone .cold]
  File "/home/masa/projects/dev/tvm/src/support/parallel_for.cc", line 128
RuntimeError: parallel_for_dynamic error with ScheduleError: (not rendered)

If I remove

CheckSubtreeCompactDataflow(self, loop_sref);
it works. I'm going to debug this next week.

Contributor commented:

I believe @vinx13 and I hit the same problem of getting this check to pass. The root cause is that the compact dataflow check is too strict for tensorization, and therefore we need to disable it in the mma case.

Contributor commented:

Similar case here: #10420

@masahi (Member, Author) commented:

It's not about tensorization, but parallelization. Now I have more information about this error. If I turn on detailed logging, I get this from ReplayTraceNode:

 for i0_0_i1_0_fused_fused in T.serial(8192):
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            for i0_1 in T.serial(8):
                for ax0_init in T.vectorized(16):
                    with T.block("compute_init"):
                        i = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
                        j = T.axis.spatial(1024, i0_0_i1_0_fused_fused % 64 * 16 + ax0_init)
                        T.reads()
                        T.writes(compute[i, j])
                        T.block_attr({"schedule_rule":"meta_schedule.dense_vnni", "workload":["dense_vnni.x86", ["TENSOR", [1024, 1024], "uint8"], ["TENSOR", [64, 256, 16, 4], "int8"], None, "int32"]})
                        compute[i, j] = 0
                for ax1_0 in T.serial(256):
                    # tir.Block#1
                    with T.block("compute_update_o"):
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                        i = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
                        j_o = T.axis.spatial(64, i0_0_i1_0_fused_fused % 64)
                        k_o = T.axis.reduce(256, ax1_0)
                        T.reads(compute[i, j_o * 16 : j_o * 16 + 16], placeholder[i, k_o * 4 : k_o * 4 + 4], placeholder_1[j_o, k_o, 0 : 16, 0 : 4])
                        T.writes(compute[i, j_o * 16 : j_o * 16 + 16])
                        A = T.match_buffer(placeholder[i, k_o * 4 : k_o * 4 + 4], [4], dtype="uint8", offset_factor=1)
                        B = T.match_buffer(placeholder_1[j_o, k_o, 0 : 16, 0 : 4], [16, 4], dtype="int8", offset_factor=1)
                        C = T.match_buffer(compute[i, j_o * 16 : j_o * 16 + 16], [16], dtype="int32", offset_factor=1)
                        A_u8x4: T.uint8x4 = A[T.ramp(0, 1, 4)]
                        A_i32: T.int32 = T.reinterpret(A_u8x4, dtype="int32")
                        B_i8x64: T.int8x64 = B[0, T.ramp(0, 1, 64)]
                        B_i32x16: T.int32x16 = T.reinterpret(B_i8x64, dtype="int32x16")
                        C[T.ramp(0, 1, 16)] = C[T.ramp(0, 1, 16)] + T.call_llvm_pure_intrin(9785, T.uint32(0), T.broadcast(0, 16), T.broadcast(A_i32, 16), B_i32x16, dtype="int32x16")
                for i1_1 in T.vectorized(16):
                    with T.block("T_add_1"):
                        ax0 = T.axis.spatial(1024, i0_0_i1_0_fused_fused // 64 * 8 + i0_1)
                        ax1 = T.axis.spatial(1024, i0_0_i1_0_fused_fused % 64 * 16 + i1_1)
                        T.reads(compute[ax0, ax1], placeholder_2[0, ax1])
                        T.writes(T_add[ax0, ax1])
                        T_add[ax0, ax1] = compute[ax0, ax1] + placeholder_2[0, ax1] + 1
    
Error message: The queried subtree root tir.For#0 in SRef tree does not have compact dataflow, because its child block tir.Block#1 on SRef tree is neither a local complete block nor a local reduction block.

So even though I'm doing parallel before decompose_reduction to work around that strict check, ReplayTrace is trying to apply parallel to a schedule that has clearly already had decompose_reduction applied. This is very weird...

@masahi (Member, Author) commented:

It is solved: the default set of postprocessors for CPU applies additional parallelization on top of my manual schedule, via

M.RewriteParallelVectorizeUnroll(),

Since I'm using a manual schedule, I don't want any postprocs. Disabling them fixed this issue. Thanks @vinx13 @junrushao1994
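For concreteness, disabling the postprocessors amounts to something like the sketch below; the postprocs keyword and the empty-list form are assumptions about the tune_extracted_tasks signature, not copied from the test.

from tvm.meta_schedule.tune import tune_extracted_tasks

database = tune_extracted_tasks(
    tune_tasks,
    target,
    config,
    work_dir=work_dir,
    # The manual TIR schedule already decides its own parallelization, so the
    # default CPU postprocessors (e.g. RewriteParallelVectorizeUnroll) must not
    # rewrite it; an empty postprocessor set disables them.
    postprocs=lambda: [],
)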

return imm.astype("int32x16", span)


@T.prim_func
Contributor commented:

Maybe we could move all the tensorization intrinsics to the same place. Right now, for wmma/mma on CUDA, they live in meta_schedule.testing.tir_tensor_intrin, but I believe that is not upstreamed yet.

@masahi (Member, Author) commented:

Yeah, we should do that.

Contributor commented:

Are these generally useful intrinsics or just needed for the tests in this file? If they're just test intrinsics, it makes sense to keep them here.

@masahi (Member, Author) commented:

For now they are only needed here, but for future efforts like auto-tensorization, we definitely need a centralized place to put intrinsics so that they can be used from anywhere.

masahi marked this pull request as ready for review March 26, 2022 00:41
masahi force-pushed the tir-manual-sch-integration branch from 25a99ab to 87fad1c on March 27, 2022 22:34

@pytest.mark.skip("Requires cascadelake")
def test_tune_relay_manual_tir_vnni():
    tir.TensorIntrin.register(VNNI_INTRIN, dot_product_desc, dot_product_intrin)
Contributor commented:

I think this is one of the most important lines in the whole file. I'd love to see a quick comment explaining what it does / how it works for future readers trying to figure out auto-tensorization.

@masahi (Member, Author) commented:

This test case is not a demonstration of TIR tensorization per se, so the mechanics of tir.TensorIntrin.register are not too relevant here. We have dedicated test cases for TIR tensorization and more uses of tir.TensorIntrin.register in

tir.TensorIntrin.register("test_mma_intrin", mma_desc, mma_intrin)
tir.TensorIntrin.register("test_annotated_mma_intrin", annotated_mma_desc, mma_intrin)
tir.TensorIntrin.register("test_dot_product_intrin", dot_product_desc, dot_product_intrin)
tir.TensorIntrin.register("test_outer_product_intrin", outer_product_desc, outer_product_intrin)

schedule_dense(block, None, True, sch)
return [sch]

register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)
Contributor commented:

Is this overwriting the default behavior for dense on cascadelake? If so, a quick comment saying so wouldn't hurt.

@masahi (Member, Author) commented:

It is not "overwriting" in the sense that (1) TE scheduling is not affected and (2) without this registration, meta schedule tuning fails with

  File "/home/masa/projects/dev/tvm/src/meta_schedule/space_generator/post_order_apply.cc", line 149
ValueError: Check failed: (f) is false: Custom schedule rule not found: meta_schedule.dense_vnni

This is because, for all TE compute annotated with schedule_rule like

attrs={"schedule_rule": "meta_schedule.dense_vnni"},

we currently require the corresponding schedule rule to be registered (which this line does).
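To make the connection concrete, here is a minimal sketch of the two sides. The placeholder shapes and the simplified compute are illustrative only; schedule_dense, register_func, and the rule name are the ones appearing in this PR.

from tvm import te, register_func

M, N, K = 1024, 1024, 1024
A = te.placeholder((M, K), dtype="uint8", name="A")
B = te.placeholder((N, K), dtype="int8", name="B")
k = te.reduce_axis((0, K), name="k")

# TE side (e.g. in TOPI): tag the compute op with the name of a custom rule.
C = te.compute(
    (M, N),
    lambda i, j: te.sum(A[i, k].astype("int32") * B[j, k].astype("int32"), axis=k),
    name="compute",
    attrs={"schedule_rule": "meta_schedule.dense_vnni"},
)

# Meta schedule side: register a packed func under the same name. During design
# space generation it receives the traced schedule and the annotated block, and
# returns the candidate schedules (here just one, produced by the manual VNNI
# dense schedule from this PR).
def schedule_rule_dense_vnni(sch, block):
    schedule_dense(block, None, True, sch)
    return [sch]

register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)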

Thinking about this now, I wonder if failing is the desired behavior. If we don't find the custom schedule registered, we could ignore the schedule_rule annotation and apply automatic scheduling instead. cc @junrushao1994

Member commented:

Thanks for the discussion! This is definitely a good question.

From my perspective, I would prefer that we only annotate the "schedule_rule" when we are 100% sure that such a rule exists, so that it doesn't just fail silently. On the other hand, I can also see that enforcing such a check everywhere in TOPI might lead to a certain amount of engineering overhead...

@masahi, I would love to hear more of your thoughts. What do you think the pros & cons are? Thanks a lot!

@masahi (Member, Author) commented Mar 29, 2022:

I want to freely add schedule_rule annotations to various TE compute to experiment with things, so requiring all schedule_rule annotations to have the corresponding packed func registered sounds like a heavy-weight requirement to me. If we had a TOPI equivalent for manual TIR schedules, such a requirement would be easy to satisfy, but until then I expect manual TIR scheduling to be done in a one-off fashion like this PR.

Also, it is totally reasonable to want to auto-schedule TE compute annotated with schedule_rule. Currently I annotated TE x86 dense and batch_matmul compute with VNNI-specific schedule rules (like meta_schedule.dense_vnni above) to apply my manual TIR schedule, but that prevents any automatic scheduling from happening on these TE compute. In the future when auto-tensorization is ready, I want to freely switch between manual and automatic scheduling.

So I want "the need to annotate schedule_relu" and "whether or not I want to register my custom schedule rule" be decoupled.

"Silent failing" is certainly something we need be mindful of. When we encounter a block with schedule_relu annotation, and if the schedule rule registration is missing, how about emitting a warning to make sure that a user is aware of the fact?

Member commented:

@masahi Thanks for the comprehensive discussion about API design and user intention!

Looks like we have 3 different proposals:

  • P1. Always annotate schedule_rule even if it's not registered; Print a warning if the schedule_rule isn't registered;
  • P2. Always make sure schedule_rule exists when annotating, and error out if it's not;
  • P3. Always annotate schedule_rule even if it's not registered; No warning if the schedule_rule isn't registered;

Masa and I both agree that P3 may not be ideal, because silently ignoring abnormality may not be the best user experience.

In the future when auto-tensorization is ready, I want to freely switch between manual and automatic scheduling.

Totally agree about future possibility of switching around! Definitely it's going to be a lot of fun :-)

Therefore, I would conclude that emitting a warning is probably what both @masahi and I agree on. Additionally, we might enhance the search space generation to check the schedule_rule against a target-specific allowlist, but that's probably not high priority for now.
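For what it's worth, a purely illustrative Python-level sketch of P1 follows; the actual check lives in post_order_apply.cc, so this only shows the intended behavior, and the helper name is hypothetical.

import logging
import tvm

def get_custom_schedule_rule(name: str):
    # P1 behavior: look up the registered packed func; if it is missing, warn
    # and return None so the caller can fall back to automatic scheduling
    # instead of erroring out.
    func = tvm.get_global_func(name, allow_missing=True)
    if func is None:
        logging.warning(
            "Custom schedule rule not found: %s; falling back to the default design space", name
        )
    return func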

masahi force-pushed the tir-manual-sch-integration branch 2 times, most recently from b06c10d to eeb4a6d on March 28, 2022 21:43
@masahi (Member, Author) commented Mar 28, 2022:

@jwfromm I added some comments to explain how things work. Indeed, the PR was missing that aspect completely, thanks for pointing that out.

masahi force-pushed the tir-manual-sch-integration branch from eeb4a6d to 9ddb469 on March 29, 2022 00:47
@masahi (Member, Author) commented Mar 29, 2022:

@junrushao1994 @jwfromm good to go?

@junrushao (Member) commented:

The PR overall looks good to me! I'm going to merge it in after the interesting discussion here: #10793 (comment).

In the coming quarter, it might be desirable for us to run this end-to-end nightly to make sure nothing is broken.

@junrushao (Member) commented:

Thanks @masahi @jwfromm for the discussion, and @shingjan @vinx13 for review! I'm going to merge it in

junrushao merged commit 642fc57 into apache:main on Mar 30, 2022
junrushao pushed a commit to junrushao/tvm that referenced this pull request on Mar 31, 2022

pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request on Apr 11, 2022

mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request on Apr 11, 2022
junrushao pushed a commit that referenced this pull request Apr 13, 2022
See the discussion in #10793 (comment) for the context.

Now that I'm doing auto-tensorization on VNNI, I do need to be able to switch `schedule_rule` on / off freely.
Lucien0 pushed a commit to Lucien0/tvm that referenced this pull request on Apr 19, 2022

altanh pushed a commit to altanh/tvm that referenced this pull request on Apr 28, 2022