[RFC][AUTOTVM] Auto-Schedule from Compute Declaration #2954
Comments
Thank you for opening this RFC! I have a question regarding the user API. Is the hardware information needed by autotvm.AutoSchedulerOptions(**kwargs) pre-defined for different hardware architectures? If so, how much more information does a user need to provide to differentiate between minor variants of the same device target, such as Intel Xeon Platinum vs. Xeon Haswell, or Nvidia K80 vs. V100? Today we have a single template covering minor device types. Will the auto-scheduler provide different templates?
@merrymercy how much work is there per backend? Looking over the code now; will follow up with more questions later.
@merrymercy Could you elaborate a bit on the four types (simple reduction, complex reduction, direct compute, and location-tunable compute)? Also, it would be helpful if you could give an example of what the DAG looks like.
Thanks @merrymercy, this is really awesome work. I second Jared's comment on the work involved in adding a backend. I'd be happy to chat some more about how one would add automated compilation for different hardware accelerators, including VTA.
@kevinthesun The hardware parameters for the auto-scheduler are very coarse-grained. These parameters are mostly used in static scheduling, so it won't even distinguish between an ARM CPU and an Intel CPU. If you want to fit a specific target device, we still need to do auto-tuning on real devices.
@jroesch Currently, it is about 500 lines of code per backend. I am working on improvements, so it may increase.
@tmoreau89 This is doable. The problem with accelerators is that if we want the auto-scheduler to take a hardware-independent description as input, then we need a special pack pass to transform the layout.
@merrymercy I'm less interested in LOC and more in how much conceptual burden there is. My question is more: what are the key pieces that make up a backend description? I looked over the code, but I was at SysML and have two deadlines this week, so I haven't had a chance to really look it over. Looking forward to landing this stuff. One idea I've been thinking about is a combined TVM + Relay language where we can auto-extract chunks that can be lowered to the compute language, auto-schedule them, then auto-tune for end-to-end performance.
@merrymercy The auto-scheduler will create another search space consisting of schedule templates. For a given set of hardware parameters, it will try various schedule templates, and for each template do some auto-tuning on a real device. This means for each minor device type, we need to do all these steps. Do I understand it correctly?
@merrymercy Do you think this analysis design can easily be extended to work on the TVM tensor AST (HalideIR) instead of ScheduleStage? Not urgent, but I think eventually we will make schedule primitives work on HalideIR, so that we can unify the underlying data structures of the schedule and the other passes.
Good discussion. I think in general we can move to summarize the common patterns and make things work for specific hardware backends. As for the point brought up by @yzhliu (unifying schedule with passes): eventually ScheduleStage itself (or another IR structure) can be viewed as a dialect of the IR, and we can do so after we push for such unification.
@jroesch There is no easy description for a backend. Currently these meta-templates are mainly based on a summary of the existing human schedule code in TOPI, so adding a new backend is still hard. What can be reused is the classification of compute types.
@kevinthesun There is only one template for one specific op. The auto-scheduler first creates this template. Then, for static usage, it fills the knobs in the template according to hardware parameters; the API example shown above falls into this category. For tuning usage, the auto-scheduler won't use hardware parameters. Instead, it relies on real tuning. In this case, you need to explicitly create a tuning task.
@yzhliu The
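For context, the tuning path would presumably plug the generated template into the standard autotvm machinery. A sketch under that assumption, with "auto_matmul" as a purely hypothetical template name:

```python
from tvm import autotvm

# Hypothetical: assume the auto-scheduler has registered a generated
# template under the name "auto_matmul" (illustrative only).
task = autotvm.task.create("auto_matmul",
                           args=(1024, 1024, 1024, "float32"),
                           target="llvm")

# Standard autotvm measurement and tuning loop.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5),
)
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=200,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("auto_matmul.log")])
```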
Minor question: do we consider "injective" a special case of "simple reduction"?
@merrymercy Do you think this is a good time to also make schedules serializable and package them with autotvm-style configs? In the past we have had issues where we did not want to merge changes to schedules because they would break compatibility with TopHub, and now it seems that the variety of schedules may also change quickly as the auto-scheduler evolves. Instead of forcing schedules to stay frozen, we can maybe side-step this by packaging schedules together with autotvm configs.
@eqy "injective" is considered "direct compute". Typically they will be inlined. Serializable Template + Serializable Config seems to be a good direction to go. |
@merrymercy Thanks for the nice proposal. May I know the latest progress of the auto-scheduling work? Regards
@merrymercy
Hi @yangjunpro @hello-hzb,
@merrymercy would you mind summarizing a bit what the drawbacks of the original implementation are, so we can learn from them?
Sure. I think Zhao has already contacted you and also involved two of my colleagues, Minmin and Chenfan. Looking forward to further collaborations.
Closing as per the Ansor update.
Update (Dec. 25, 2020): This RFC is deprecated. We started another project, "Ansor", to bring an auto-scheduler to TVM. Ansor is integrated as the tvm.auto_scheduler package in the current code base. You can see the new RFC and tutorials.

Auto-Scheduler
TVM decouples kernel implementation into compute and schedule. The compute part is a friendly DSL that can describe algorithms intuitively. However, the schedule part still requires strong expert knowledge and time-consuming tuning to provide decent performance. The tuning process is partially automated by the existing autotvm package, but a human-engineered template is still required.
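For readers unfamiliar with this split, here is a minimal sketch of the two parts for a matrix multiplication (standard tvm 0.x-era API; the schedule half is the part this RFC wants to automate):

```python
import tvm

# Compute: declare *what* to calculate (a 1024x1024 matmul).
N = 1024
A = tvm.placeholder((N, N), name="A")
B = tvm.placeholder((N, N), name="B")
k = tvm.reduce_axis((0, N), name="k")
C = tvm.compute((N, N), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k), name="C")

# Schedule: decide *how* to calculate it. Choosing tile sizes and the loop
# structure below is exactly the expert-knowledge part to be automated.
s = tvm.create_schedule(C.op)
i, j = s[C].op.axis
io, ii = s[C].split(i, factor=32)
jo, ji = s[C].split(j, factor=32)
s[C].reorder(io, jo, ii, ji)
s[C].parallel(io)
s[C].vectorize(ji)
```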
This RFC proposes a "real" autotvm, which we can call the auto-scheduler. It aims at removing all human effort from the schedule part.
Proposed Design
The auto-scheduler is built on the existing autotvm package. It generates a template from the compute declaration. This template can then either be tuned by autotvm on real devices or filled in statically according to hardware parameters.
The auto-scheduler takes a computation graph described by the tvm DSL as input, then classifies the read/write patterns and the type of computation of each node. It dispatches the nodes in the DAG to different "meta templates". The meta templates generate autotvm templates from the compute declaration. There are four types of meta templates: simple reduction, complex reduction, direct compute, and location-tunable compute. The auto-scheduler will do parallelization, vectorization, tiling, and operator fusion.
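For illustration, here is roughly how compute declarations might map onto the four categories (a sketch based on the description above, not output from the code):

```python
import tvm

n = 1024
A = tvm.placeholder((n, n), name="A")
W = tvm.placeholder((n, n), name="W")

# Direct compute (injective): no reduction; typically inlined into consumers.
B = tvm.compute((n, n), lambda i, j: A[i, j] * 2.0 + 1.0, name="elemwise")

# Simple reduction: one reduction axis, little tiling needed (e.g. a row sum).
k = tvm.reduce_axis((0, n), name="k")
C = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="rowsum")

# Complex reduction: matmul/conv2d-style, where both the spatial and the
# reduction axes must be tiled for locality.
k2 = tvm.reduce_axis((0, n), name="k2")
D = tvm.compute((n, n), lambda i, j: tvm.sum(A[i, k2] * W[k2, j], axis=k2),
                name="matmul")

# Location-tunable compute: e.g. a padding stage feeding a convolution, where
# the profitable compute_at location depends on the consumer.
P = tvm.compute((n + 2, n + 2),
                lambda i, j: tvm.if_then_else(
                    tvm.all(i >= 1, i < n + 1, j >= 1, j < n + 1),
                    A[i - 1, j - 1], tvm.const(0.0, "float32")),
                name="pad")
```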
The code is available on my branch. The current implementation is in pure Python, because autotvm is mainly written in Python, but moving the whole autotvm package to C++ is in the long-term plan. The code is organized as follows.
API
There are only two user-oriented API calls:
autotvm.AutoSchedulerOptions(**kwargs)
This is used to configure the auto-scheduler. The arguments include hardware configurations (vector lanes, number of threads, size of shared memory, etc.) and tuning configurations (how many tuning knobs to generate).
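For illustration, a call might look like the following; the keyword names here are hypothetical, chosen to mirror the description above rather than the branch's exact signature:

```python
from tvm import autotvm

# Keyword names are illustrative only.
autotvm.AutoSchedulerOptions(
    num_threads=16,                # parallelism available on the target
    vec_size=8,                    # vector lanes
    shared_memory_size=48 * 1024,  # bytes of shared memory (GPU targets)
    tuning_level=2,                # how many tuning knobs to generate
)
```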
autotvm.create_schedule(tensors)
This is similar to tvm.create_schedule, but returns an already optimized schedule.

Examples
This is a tutorial on how to statically use the auto-scheduler or auto-tune it.
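A sketch of what the static path could look like end to end, combining the two calls above (keyword names again hypothetical):

```python
import tvm
from tvm import autotvm

# Compute declaration for a simple reduction.
A = tvm.placeholder((1024, 1024), name="A")
k = tvm.reduce_axis((0, 1024), name="k")
B = tvm.compute((1024,), lambda i: tvm.sum(A[i, k], axis=k), name="B")

# Static usage: set coarse hardware parameters (hypothetical keywords), then
# ask for an already optimized schedule instead of a bare one.
autotvm.AutoSchedulerOptions(num_threads=16, vec_size=8)
s = autotvm.create_schedule([B])

func = tvm.build(s, [A, B], target="llvm")
```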
This example is adapted from [TVM] Automatic differentiation for tensor expressions #2498. It is a LeNet-like convolutional neural network written purely in tvm (without the graph IR). The auto-scheduler also provides basic operator fusion for it. Right now we can only run the forward pass; I am working on fixing the backward pass.
Performance
One reachable performance goal is to replace more than 90% of the schedule code in the existing TOPI with this auto-scheduler. I haven't done the experiments, but I believe the generated templates can cover the existing search space for most operators (including conv2d, reduction, ...).
Another part of the goal is to provide reasonable static performance. In the "Schedule a whole network" example, for the batched forward pass, the current performance is 1.2x slower than out-of-the-box TF + Keras, and 10x faster than a naive schedule (fuse and parallelize outer loops) on an Intel i7-8750H. For static usage, the inputs of the auto-scheduler are parameters for heuristic rules and hardware configurations. We will gather all inputs into a global config, so users can still do some quick "tuning".
Todo List
Improve the heuristic rules to provide better static performance, and add tests to make sure we cover the search space of the existing templates.
The current implementation does the analysis and generates the template on the fly, which is expensive and redundant during batched tuning. We should decouple template generation from template tuning, and explicitly cache the template.