
[RFC][AUTOTVM] Auto-Schedule from Compute Declaration #2954

Closed
4 tasks
merrymercy opened this issue Apr 2, 2019 · 20 comments

Comments

@merrymercy
Member

merrymercy commented Apr 2, 2019

Update (Dec. 25, 2020): This RFC is deprecated. We started another project, "Ansor", to bring an auto-scheduler to TVM. Ansor is integrated as the tvm.auto_scheduler package in the current code base. You can see the new RFC and tutorials.

Auto-Scheduler

TVM decouples kernel implementation into compute and schedule. The compute part is a friendly DSL that can describe algorithms intuitively. However, the schedule part still requires strong expert knowledge and time-consuming tuning to provide decent performance. The tuning process is partially automated by the existing autotvm package, but a human-engineered template is still required.

This RFC proposes a "real" autotvm, which we can call an auto-scheduler. It aims at removing all human effort on the schedule part.

Proposed Design

The auto-scheduler is built on the existing autotvm package. It will generate a template from the compute declaration. Then this template can either be

  • Statically filled by heuristic rules and cost functions to provide reasonable performance, or
  • Dynamically tuned by autotvm to provide better performance with some time budget

The auto-scheduler takes a computation graph described by the tvm DSL as input, then classifies the read/write patterns and the type of computation of each node. It dispatches the nodes in the DAG to different "meta templates". The "meta templates" generate autotvm templates from the compute declaration. There are four types of meta templates: simple reduction, complex reduction, direct compute, and location-tunable compute. The auto-scheduler will do parallelization, vectorization, tiling, and operator fusion.
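For illustration, here is a tiny compute DAG written in the tvm DSL, with a guess (not taken from the branch) at which meta template each node would be dispatched to:

import tvm

A = tvm.placeholder((128, 64), name='A')
B = tvm.placeholder((64, 128), name='B')
k = tvm.reduce_axis((0, 64), name='k')

# matmul: reduction with data reuse -> likely the "complex reduction" meta template
C = tvm.compute((128, 128),
                lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k), name='C')

# relu: elemwise -> "direct compute" (or "location-tunable compute" if it is moved for locality)
D = tvm.compute((128, 128),
                lambda i, j: tvm.max(C[i, j], tvm.const(0, 'float32')), name='D')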

The code is available on my branch. The current implementation is in pure Python because autotvm is mainly written in Python, but moving the whole autotvm package to C++ is in the long-term plan. The code is organized as follows.

API

There are only two user-oriented API calls:

  • autotvm.AutoSchedulerOptions(**kwargs)
    This is used to configure the auto-scheduler. The arguments include hardware configurations (vector lanes, number of threads, size of shared memory, etc.) and tuning configurations (how many tuning knobs to generate).
  • autotvm.create_schedule(tensors)
    This is similar to tvm.create_schedule, but returns an already optimized schedule.
A = tvm.placeholder((128,), name='A')
B = tvm.placeholder((128,), name='B')
C = tvm.compute((128,),  lambda i: A[i] + B[i] * 2)

with tvm.target.create('llvm'):
    with autotvm.AutoSchedulerOptions(vec_size=8, num_threads=16):
        s, bufs = autotvm.create_schedule([A, B, C])

# NO SCHEDULE REQUIRED

func = tvm.build(s, bufs)

Examples

  1. Tutorial
    This is a tutorial on how to statically use the auto-scheduler or auto-tune it.
  2. Schedule a whole network
    This example is adapted from [TVM] Automatic differentiation for tensor expressions #2498. It is a LeNet-like convolutional neural network written purely in tvm (without the graph IR). The auto-scheduler also provides basic operator fusion for it. Right now we can only run the forward pass. I am working on fixing the backward pass.

Performance

One reachable performance goal is to replace more than 90% of the schedule code in the existing TOPI with this auto-scheduler. I haven't done the experiments, but I believe the generated templates can cover the existing search space for most operators (including conv2d, reduction, ...).

Another part of the goal is to provide reasonable static performance. In the "Schedule a whole network" example, for a batched forward pass, the current performance is 1.2x slower than out-of-the-box TF + Keras, and 10x faster than a naive schedule (fuse and parallelize outer loops) on an Intel i7-8750H. For static usage, the inputs of the auto-scheduler are parameters for heuristic rules and hardware configurations. We will gather all inputs into a global config, so users can still do some quick "tuning".
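As a rough sketch of that quick "tuning" (assuming only the vec_size and num_threads options shown in the API example above; the other knobs in the global config may differ), one could sweep a coarse parameter and keep the fastest statically scheduled variant:

import numpy as np
import tvm
from tvm import autotvm

# A, B, C are the placeholders/compute from the API example above
ctx = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(128).astype('float32'), ctx)
b = tvm.nd.array(np.random.rand(128).astype('float32'), ctx)
c = tvm.nd.array(np.zeros(128, dtype='float32'), ctx)

best_func, best_time = None, float('inf')
for vec_size in (4, 8, 16):
    with tvm.target.create('llvm'):
        with autotvm.AutoSchedulerOptions(vec_size=vec_size, num_threads=16):
            s, bufs = autotvm.create_schedule([A, B, C])
    func = tvm.build(s, bufs)
    t = func.time_evaluator(func.entry_name, ctx, number=100)(a, b, c).mean
    if t < best_time:
        best_func, best_time = func, t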

Todo List

  • Performance test and improvement to cover more than 90% of the schedule code in TOPI
    Improve the heuristic rules to provide better static performance, and run tests to make sure we cover the search space of the existing templates.
  • Improve tuning speed
    The current implementation does analysis and generates the template on the fly, which is expensive and redundant during batched tuning. We should decouple the template generation and template tuning, and explicitly cache the template.
  • (long-term) Move all autotvm related code to c++
  • Improve loop partition to better handle partial tiles and vectorization.
@merrymercy merrymercy changed the title from [RFC][AUTOTVM] Auto-Scheduler from Compute Decleration to [RFC][AUTOTVM] Auto-Schedule from Compute Decleration on Apr 2, 2019
@merrymercy merrymercy changed the title from [RFC][AUTOTVM] Auto-Schedule from Compute Decleration to [RFC][AUTOTVM] Auto-Schedule from Compute Declaration on Apr 2, 2019
@kevinthesun
Contributor

Thank you for opening this RFC! I have a question regarding the user API. Is the hardware information needed for autotvm.AutoSchedulerOptions(**kwargs) pre-defined for different hardware architectures? If so, how much more information does a user need to provide to differentiate between minor types of the same device target, such as Intel Xeon Platinum vs Xeon Haswell, or Nvidia K80 vs V100? Today we have a single template for minor device types. Will the auto-scheduler provide different templates?

@jroesch
Member

jroesch commented Apr 3, 2019

@merrymercy how much work is there per backend? Looking over the code now; I will follow up with more questions later.

@yzhliu
Member

yzhliu commented Apr 3, 2019

@merrymercy Could you elaborate a bit on the 4 types (simple reduction, complex reduction, direct compute, and location-tunable compute)? Also, it would be helpful if you could give an example of what the DAG looks like.

@tmoreau89
Contributor

Thanks @merrymercy, this is really awesome work. I second Jared's comment on work involved in adding a backend. I'd be happy to chat some more about how one would add automated compilation to different hardware accelerators including VTA.

@merrymercy
Member Author

merrymercy commented Apr 3, 2019

@kevinthesun The hardware parameters for the auto-scheduler are very coarse-grained. These parameters are mostly used in static scheduling, so it won't even distinguish between an ARM CPU and an Intel CPU. If you want to fit a specific target device, we still need to do auto-tuning on real devices.

@jroesch Currently, it is about 500 loc per backend. I am working on improvements so it may increase.

@yzhliu

  • simple reduction: reduction ops that do not have reuse opportunity (e.g. softmax, argmin)
  • complex reduction: reduction ops that have reuse opportunity (e.g. matmul, conv2d); see the sketch after this list
  • direct compute: broadcast, elemwise, stencil computation (e.g. relu, add)
  • location-tunable compute: the same as above. The difference is that direct compute computes at root, while location-tunable compute can compute at other nodes to increase locality.
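To make the reuse distinction concrete, here are two made-up compute declarations (not taken from the branch):

import tvm

X = tvm.placeholder((128, 64), name='X')
W = tvm.placeholder((64, 32), name='W')

# simple reduction: each X[i, k] feeds exactly one output element, so there is no reuse
k = tvm.reduce_axis((0, 64), name='k')
row_sum = tvm.compute((128,), lambda i: tvm.sum(X[i, k], axis=k), name='row_sum')

# complex reduction: X[i, k2] is reused for every output column j, so tiling/caching pays off
k2 = tvm.reduce_axis((0, 64), name='k2')
Y = tvm.compute((128, 32),
                lambda i, j: tvm.sum(X[i, k2] * W[k2, j], axis=k2), name='Y')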

@tmoreau89 This is doable. The problem with accelerators is that if we want the auto-scheduler to take in a hardware-independent description, then we need a special pack pass to transform the layout.

@jroesch
Member

jroesch commented Apr 3, 2019

@merrymercy I'm less interested in LOC and more in how much conceptual burden there is. My question is more: what are the key pieces that make up a backend description?

I looked over the code but was at SysML and have two deadlines this week so I haven't had a chance to really look it over. Look forward to landing this stuff.

One idea I've been thinking about is a combined TVM + Relay language where we can auto-extract chunks that can be lowered to the compute language, auto-schedule, then auto-tune for end-to-end perf.

@kevinthesun
Contributor

@merrymercy The auto-scheduler will create another search space consisting of schedule templates. For a given set of hardware parameters, it will try various schedule templates and, for each template, do some auto-tuning on a real device. This means for each minor device type, we need to do all these steps. Do I understand it correctly?

@yzhliu
Member

yzhliu commented Apr 4, 2019

@merrymercy Do you think this analysis design can be easily extended to work on the TVM Tensor AST (HalideIR) instead of ScheduleStage? Not urgent, but I think eventually we will make schedule primitives work on HalideIR, so that we can unify the underlying data structure of schedules and other passes.

@tqchen
Member

tqchen commented Apr 4, 2019

Good discussions. I think in general we can move to summarize the common patterns and make things work for specific hardware backends. As for the point brought up by @yzhliu (unifying schedule with passes), eventually ScheduleStage itself (or another IR structure) can be viewed as a dialect of the IR, and we can do so after we push for such unification.

@merrymercy
Member Author

merrymercy commented Apr 4, 2019

@jroesch There is no easy description for a backend. Currently these meta-templates are mainly based on a summary of the existing human-written schedule code in TOPI, so adding a new backend is still hard. What can be reused is the classification of compute types.

@kevinthesun There is only one template for one specific op. The auto-scheduler first creates this template. Then, for static usage, it will fill the knobs in the template according to the hardware parameters. The API example shown above falls into this category. For tuning usage, the auto-scheduler won't use hardware parameters; instead, it relies on real tuning. In this case, you need to explicitly create autotvm.Task and autotvm.Tuner as we do currently. An example is shown in the tutorial.
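For reference, a rough sketch of the standard autotvm tuning loop that the generated template plugs into (the Task creation for an auto-generated template is elided; see the tutorial on the branch for the exact entry point):

from tvm import autotvm

# task: an autotvm.Task built from the auto-generated template (creation elided)
task = ...

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, repeat=1))

tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=200,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file('auto_schedule.log')])

# later, apply the best config found during tuning when building
with autotvm.apply_history_best('auto_schedule.log'):
    pass  # tvm.build(...) as usual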

@yzhliu The tvm.compute dsl is much easier to analyze than general Halide IR, because of its clean dependency relations and simple loop structures.

@eqy
Contributor

eqy commented Apr 5, 2019

Minor question: do we consider "injective" as a special case of "simple reduction?"

@eqy
Contributor

eqy commented Apr 5, 2019

@merrymercy Do you think that this is a good time to also make schedules serializable and package them with autotvm-style configs? In the past we have had issues where we did not want to merge changes to schedules because they would break compatibility with tophub, and now it seems that the variety of schedules may also change quickly as the auto-scheduler evolves. Instead of forcing schedules to stay fixed, we can maybe side-step this by packaging schedules together with autotvm configs.

@merrymercy
Member Author

@eqy "injective" is considered "direct compute". Typically they will be inlined.

Serializable Template + Serializable Config seems to be a good direction to go.

@yangjunpro

@merrymercy
Hi Lianmin,

Thanks for the nice proposal. May I know the latest progress of the auto-scheduling work?
It looks like there hasn't been a status update for a long time.

Regards
Jun

@hello-hzb

hello-hzb commented Nov 11, 2019

@merrymercy
Hi Zheng,
I have been paying attention to your auto-scheduler work for a few days, but there has been no update for a few months. How is it going these days?
Why don't you merge the auto-scheduler into the master branch of TVM?

@merrymercy
Member Author

merrymercy commented Nov 13, 2019

Hi @yangjunpro @hello-hzb ,
This project has been suspended for several months. I won't continue my work on the original branch.
However, the push for an auto-scheduler is still interesting to a lot of people. I might work on auto-scheduler again with some Berkeley students. We'd like to try different approaches, so we won't start from my old branch.

@yzhliu
Member

yzhliu commented Nov 13, 2019

@merrymercy Would you mind summarizing a bit what the drawbacks of the original implementation were, so we can learn from it?

@yangjunpro

Hi @yangjunpro @hello-hzb ,
This project has been suspended for several months. I won't continue my work on the original branch.
However, the push for an auto-scheduler is still interesting to a lot of people. I might work on auto-scheduler again with some Berkeley students. We'd like to try different approaches, so we won't start from my old branch.

Sure, I think Zhao has already contacted you and has also involved two of my colleagues, Minmin and Chenfan. Looking forward to further collaboration.

@tqchen
Member

tqchen commented Aug 20, 2020

close as per ansor update

@tqchen tqchen closed this as completed Aug 20, 2020