[RFC][AUTOTVM] Auto-Schedule from Compute Declaration #2954
Comments
Thank you for opening this RFC! I have a question regarding the user API. Is the hardware information needed by autotvm.AutoSchedulerOptions(**kwargs) pre-defined for different hardware architectures? If so, how much more information does a user need to provide to differentiate between minor variants of the same device target, such as Intel Xeon Platinum vs. Xeon Haswell, or Nvidia K80 vs. V100? Today we have a single template covering minor device types. Will the auto-scheduler provide different templates?
@merrymercy how much work is there per backend? Looking over the code now; will follow up with more questions later.
@merrymercy Could you elaborate a bit on the four types (simple reduction, complex reduction, direct compute, and location-tunable compute)? Also, it would be helpful if you could give an example of what the DAG looks like.
Thanks @merrymercy, this is really awesome work. I second Jared's comment on the work involved in adding a backend. I'd be happy to chat some more about how one would add automated compilation for different hardware accelerators, including VTA.
@kevinthesun The hardware parameters for the auto-scheduler are very coarse-grained. These parameters are mostly used in static scheduling, so it won't even distinguish between an ARM CPU and an Intel CPU. If you want to fit a specific target device, we still need to do auto-tuning on real devices.
@jroesch Currently, it is about 500 lines of code per backend. I am working on improvements, so it may increase.
@tmoreau89 This is doable. The problem with accelerators is that if we want the auto-scheduler to take a hardware-independent description as input, then we need a special pack pass to transform the layout.
@merrymercy I'm less interested in LOC and more in how much conceptual burden there is. My question is more: what are the key pieces that make up a backend description? I looked over the code, but I was at SysML and have two deadlines this week, so I haven't had a chance to really look it over. Looking forward to landing this stuff. One idea I've been thinking about is a combined TVM + Relay language where we can auto-extract chunks that can be lowered to the compute language, auto-schedule them, then auto-tune for end-to-end performance.
@merrymercy The auto-scheduler will create another search space consisting of schedule templates. For a given set of hardware parameters, it will try various schedule templates, and for each template do some auto-tuning on a real device. This means for each minor device type, we need to do all these steps. Do I understand it correctly?
@merrymercy Do you think this analysis design can easily be extended to work on the TVM tensor AST (HalideIR) instead of ScheduleStage? Not urgent, but I think eventually we will make schedule primitives work on HalideIR, so that we can unify the underlying data structures of the schedule and the other passes.
Good discussion. I think in general we can move to summarize the common patterns and make things work for specific hardware backends. As for the point brought up by @yzhliu (unifying schedule with passes): eventually ScheduleStage itself (or another IR structure) can be viewed as a dialect of the IR, and we can do so after we push for such unification.
@jroesch There is no easy description for a backend. Currently these meta-templates are mainly based on a summary of the existing human schedule code in TOPI, so adding a new backend is still hard. What can be reused is the classification of compute types.
@kevinthesun There is only one template for one specific op. The auto-scheduler first creates this template. Then, for static usage, it fills the knobs in the template according to hardware parameters; the API example shown above falls into this category. For tuning usage, the auto-scheduler won't use hardware parameters. Instead, it relies on real tuning. In this case, you need to explicitly create a tuning task.
@yzhliu The
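For context, the tuning path would presumably plug the generated template into the standard autotvm machinery. A sketch under that assumption, with "auto_matmul" as a purely hypothetical template name:

```python
from tvm import autotvm

# Hypothetical: assume the auto-scheduler has registered a generated
# template under the name "auto_matmul" (illustrative only).
task = autotvm.task.create("auto_matmul",
                           args=(1024, 1024, 1024, "float32"),
                           target="llvm")

# Standard autotvm measurement and tuning loop.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5),
)
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=200,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("auto_matmul.log")])
```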
Minor question: do we consider "injective" a special case of "simple reduction"?
@merrymercy Do you think this is a good time to also make schedules serializable and package them with autotvm-style configs? In the past we have had issues where we did not want to merge changes to schedules because they would break compatibility with TopHub, and now it seems that the variety of schedules may also change quickly as the auto-scheduler evolves. Instead of forcing schedules to stay frozen, we can maybe side-step this by packaging schedules together with autotvm configs.
@eqy "injective" is considered "direct compute". Typically they will be inlined. Serializable Template + Serializable Config seems to be a good direction to go. |
@merrymercy Thanks for the nice proposal. May I know the latest progress of the auto-scheduling work? Regards
@merrymercy
Hi @yangjunpro @hello-hzb,
@merrymercy would you mind summarizing a bit what the drawbacks of the original implementation are, so we can learn from them?
Sure. I think Zhao has already contacted you and also involved two of my colleagues, Minmin and Chenfan. Looking forward to further collaborations.
Closing as per the Ansor update.
Update (Dec. 25, 2020): This RFC is deprecated. We started another project, "Ansor", to bring an auto-scheduler to TVM. Ansor is integrated as the tvm.auto_scheduler package in the current code base. You can see the new RFC and tutorials.

Auto-Scheduler
TVM decouples kernel implementation into compute and schedule. The compute part is a friendly DSL that can describe algorithms intuitively. However, the schedule part still requires strong expert knowledge and time-consuming tuning to provide decent performance. The tuning process is partially automated by the existing autotvm package, but a human-engineered template is still required.
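For readers unfamiliar with this split, here is a minimal sketch of the two parts for a matrix multiplication (standard tvm 0.x-era API; the schedule half is the part this RFC wants to automate):

```python
import tvm

# Compute: declare *what* to calculate (a 1024x1024 matmul).
N = 1024
A = tvm.placeholder((N, N), name="A")
B = tvm.placeholder((N, N), name="B")
k = tvm.reduce_axis((0, N), name="k")
C = tvm.compute((N, N), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k), name="C")

# Schedule: decide *how* to calculate it. Choosing tile sizes and the loop
# structure below is exactly the expert-knowledge part to be automated.
s = tvm.create_schedule(C.op)
i, j = s[C].op.axis
io, ii = s[C].split(i, factor=32)
jo, ji = s[C].split(j, factor=32)
s[C].reorder(io, jo, ii, ji)
s[C].parallel(io)
s[C].vectorize(ji)
```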
This RFC proposes a "real" autotvm, which we can call the auto-scheduler. It aims at removing all human effort from the schedule part.
Proposed Design
The auto-scheduler is built on the existing autotvm package. It generates a template from the compute declaration. This template can then either be tuned by autotvm on real devices or filled in statically according to hardware parameters.
The auto-scheduler takes a computation graph described by the tvm DSL as input, then classifies the read/write patterns and the type of computation of each node. It dispatches the nodes in the DAG to different "meta templates". The meta templates generate autotvm templates from the compute declaration. There are four types of meta templates: simple reduction, complex reduction, direct compute, and location-tunable compute. The auto-scheduler will do parallelization, vectorization, tiling, and operator fusion.
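For illustration, here is roughly how compute declarations might map onto the four categories (a sketch based on the description above, not output from the code):

```python
import tvm

n = 1024
A = tvm.placeholder((n, n), name="A")
W = tvm.placeholder((n, n), name="W")

# Direct compute (injective): no reduction; typically inlined into consumers.
B = tvm.compute((n, n), lambda i, j: A[i, j] * 2.0 + 1.0, name="elemwise")

# Simple reduction: one reduction axis, little tiling needed (e.g. a row sum).
k = tvm.reduce_axis((0, n), name="k")
C = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="rowsum")

# Complex reduction: matmul/conv2d-style, where both the spatial and the
# reduction axes must be tiled for locality.
k2 = tvm.reduce_axis((0, n), name="k2")
D = tvm.compute((n, n), lambda i, j: tvm.sum(A[i, k2] * W[k2, j], axis=k2),
                name="matmul")

# Location-tunable compute: e.g. a padding stage feeding a convolution, where
# the profitable compute_at location depends on the consumer.
P = tvm.compute((n + 2, n + 2),
                lambda i, j: tvm.if_then_else(
                    tvm.all(i >= 1, i < n + 1, j >= 1, j < n + 1),
                    A[i - 1, j - 1], tvm.const(0.0, "float32")),
                name="pad")
```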
The code is available on my branch. The current implementation is in pure Python, because autotvm is mainly written in Python, but moving the whole autotvm package to C++ is in the long-term plan. The code is organized as follows.
API
There are only two user-oriented API calls:
autotvm.AutoSchedulerOptions(**kwargs)
This is used to configure the auto-scheduler. The arguments include hardware configurations (vector lanes, number of threads, size of shared memory, etc.) and tuning configurations (how many tuning knobs to generate).
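For illustration, a call might look like the following; the keyword names here are hypothetical, chosen to mirror the description above rather than the branch's exact signature:

```python
from tvm import autotvm

# Keyword names are illustrative only.
autotvm.AutoSchedulerOptions(
    num_threads=16,                # parallelism available on the target
    vec_size=8,                    # vector lanes
    shared_memory_size=48 * 1024,  # bytes of shared memory (GPU targets)
    tuning_level=2,                # how many tuning knobs to generate
)
```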
autotvm.create_schedule(tensors)
This is similar to tvm.create_schedule, but returns an already optimized schedule.

Examples
This is a tutorial on how to statically use the auto-scheduler or auto-tune it.
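A sketch of what the static path could look like end to end, combining the two calls above (keyword names again hypothetical):

```python
import tvm
from tvm import autotvm

# Compute declaration for a simple reduction.
A = tvm.placeholder((1024, 1024), name="A")
k = tvm.reduce_axis((0, 1024), name="k")
B = tvm.compute((1024,), lambda i: tvm.sum(A[i, k], axis=k), name="B")

# Static usage: set coarse hardware parameters (hypothetical keywords), then
# ask for an already optimized schedule instead of a bare one.
autotvm.AutoSchedulerOptions(num_threads=16, vec_size=8)
s = autotvm.create_schedule([B])

func = tvm.build(s, [A, B], target="llvm")
```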
This example is adapted from [TVM] Automatic differentiation for tensor expressions #2498. It is a LeNet-like convolutional neural network written purely in tvm (without the graph IR). The auto-scheduler also provides basic operator fusion for it. Right now we can only run the forward pass; I am working on fixing the backward pass.
Performance
One reachable performance goal is to replace more than 90% of the schedule code in the existing TOPI with this auto-scheduler. I haven't done the experiments, but I believe the generated templates can cover the existing search space for most operators (including conv2d, reduction, ...).
Another part of the goal is to provide reasonable static performance. In the "Schedule a whole network" example, for the batched forward pass, the current performance is 1.2x slower than out-of-the-box TF + Keras, and 10x faster than a naive schedule (fuse and parallelize outer loops) on an Intel i7-8750H. For static usage, the inputs of the auto-scheduler are parameters for heuristic rules and hardware configurations. We will gather all inputs into a global config, so users can still do some quick "tuning".
Todo List
Improve the heuristic rules to provide better static performance, and add tests to make sure we cover the search space of the existing templates.
The current implementation does the analysis and generates the template on the fly, which is expensive and redundant during batched tuning. We should decouple template generation from template tuning, and explicitly cache the template.