
[WIP][Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search #5883

Closed (wants to merge 45 commits)

@jcf94 (Contributor) commented Jun 22, 2020

Hi all,

In [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0), we've introduced the auto-scheduler Ansor. In the RFC, we reached an agreement that we should replace AutoTVM with Ansor.
For most existing templates, the current Ansor can replace them directly, with better performance and less tuning time.
For other special templates (low-precision, sparse), the plan is to introduce search space customization and gradually rewrite them with Ansor's new API.

This is the first PR according to the integration plan mentioned in the RFC.
This PR contains the infrastructure for search (the definition of state and actions) and small modifications outside the Ansor folder.

Infrastructure for search: A lightweight IR

Automatic scheduling is a search problem. For a search problem, we need to define the states and actions.
The state of schedule search is the loop structure defined by the schedule (i.e., the TVM IR created by tvm.lower). The actions are schedule primitives to manipulate the loop structures (e.g., split, reorder, fuse).

To enable flexible manipulation of the loop structures, we implemented a lightweight loop structure IR (Intermediate Representation) specifically for search. We also implemented all schedule primitives for this IR. Basically, it is a simplified TVM IR. We don't use the existing TVM IR because:

  1. We want fast incremental change to the loop structures
  2. We want serializable transform history for replay, backtracking, and mutation
  3. We may create some macro schedule primitives

After the search is done, we will lower this IR to TVM IR with TVM schedule primitives.
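As a hedged illustration of this formulation (all names here, LoopState, Step, and apply, are hypothetical and not Ansor's actual API), a state can be modeled as a loop-structure preview plus its transform history, with each action producing a new state incrementally:

```python
# Hypothetical sketch of the state/action search formulation described
# above; LoopState, Step, and apply are illustrative names, not Ansor's API.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One action: a schedule primitive and its arguments."""
    name: str    # e.g. "split", "reorder", "fuse"
    args: Tuple


@dataclass
class LoopState:
    """One state: a loop-nest preview plus the steps that produced it."""
    loops: List[str]
    history: List[Step] = field(default_factory=list)

    def apply(self, step: Step) -> "LoopState":
        # Fast incremental change: copy the small loop preview, transform
        # it, and record the step so the state can be replayed or mutated.
        new_loops = list(self.loops)
        if step.name == "fuse":
            i, j = step.args
            new_loops[i:j + 1] = [".".join(new_loops[i:j + 1])]
        return LoopState(new_loops, self.history + [step])


s0 = LoopState(["i", "j", "k"])
s1 = s0.apply(Step("fuse", (0, 1)))
print(s1.loops)    # ['i.j', 'k']; s0 is unchanged, so backtracking is free
```

Because `history` is a plain list of steps, serializing it and replaying it on a fresh state reproduces `s1` exactly, which is what points 1 and 2 above (fast incremental change and serializable transform history) buy us.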

Key data structures

  • ComputeDAG: Compute declaration graph and its related analysis tools
    Related files: src/ansor/compute_dag.*, python/tvm/ansor/compute_dag.py
    This is the entry data structure of Ansor. Ansor takes a compute declaration described by tvm.compute as input and converts it into this data structure for analysis.
  • TransformStep: This defines the "action" for the search problem, i.e., the schedule primitives for our IR.
    Related files: src/ansor/transform_step.*, python/tvm/ansor/loop_state.py
    Each step has its corresponding tvm.te schedule primitive. We record all TransformSteps for every state as its transform history. After the search is done, these transform steps will be lowered with their corresponding TVM schedule primitives.
  • State: This defines the "state" for the search problem, i.e., the current loop structure and the history of transform steps used to reach this state.
    Related files: src/ansor/loop_state.*, python/tvm/ansor/loop_state.py
    A state consists of a current loop structure and the transform history to reach its current loop structure.
    The loop structure keeps a preview of how the final schedule will look after lowering (how many iterators there are, the extent of each iterator, the position of iterators that have been moved by compute_at, ...), which helps the search policy make decisions during the search.
    The history is a sequence of TransformStep which will finally be mapped to schedule primitives.
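To illustrate how a recorded history could be mapped back onto schedule primitives, in the spirit of print_python_code_from_state, here is a minimal sketch; the tuple-based step encoding and the helper name are assumptions for illustration, not Ansor's real representation:

```python
# Minimal sketch of lowering recorded transform steps to TVM Python-API
# calls, in the spirit of print_python_code_from_state. The tuple-based
# step encoding here is an assumption, not Ansor's real one.

def python_code_from_steps(steps):
    """Emit one schedule-primitive call per recorded step."""
    lines = []
    for step in steps:
        if step[0] == "split":
            _, axis, factor = step
            lines.append(f"s[C].split({axis}, factor={factor})")
        elif step[0] == "parallel":
            lines.append(f"s[C].parallel({step[1]})")
        elif step[0] == "vectorize":
            lines.append(f"s[C].vectorize({step[1]})")
    return "\n".join(lines)


history = [("split", "i", 8), ("parallel", "i_o"), ("vectorize", "i_i")]
print(python_code_from_steps(history))
# s[C].split(i, factor=8)
# s[C].parallel(i_o)
# s[C].vectorize(i_i)
```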

Example Walkthrough

While the search policy is implemented in C++, we also provide a Python API for the new IR.
It is intended for search-space customization by users. The primitives look very similar to the existing schedule primitives, as shown in python/tvm/ansor/loop_state.py. The API design is ongoing and may be updated later.

Taking tests/python/unittest/test_ansor_loop_state.py:test_split_fuse_reorder_annotation() as an example, we can print the test state s1 as:

>>> print(s1)

Placeholder: A, B
gpu.blockIdx.x i.0 (0,32)
  vthread i.1 (0,8)
    gpu.threadIdx.y i.2 (0,2)
      parallel j.0 (0,32)
        unroll j.1 (0,8)
          vectorize j.2 (0,2)
            for k (0,512)
              C = ...

The state stores all the history transform steps required to reach it. We can print these transform steps as equivalent calls to TVM's Python API:

>>> print(dag.print_python_code_from_state(s1))

i, j, k = tuple(C.op.axis) + tuple(C.op.reduce_axis)
i_o_i, i_i = s[C].split(i, factor=2)
i_o_o, i_o_i = s[C].split(i_o_i, factor=8)
j_o, j_i_o = s[C].split(j, nparts=32)
j_i_o, j_i_i = s[C].split(j_i_o, nparts=8)
s[C].parallel(j_o)
s[C].unroll(j_i_o)
s[C].vectorize(j_i_i)
s[C].bind(i_o_o, tvm.thread_axis("blockIdx.x"))
s[C].bind(i_o_i, tvm.thread_axis("vthread"))
s[C].bind(i_i, tvm.thread_axis("threadIdx.y"))

Or we can replay these steps to get a schedule for tvm.lower and tvm.build:

s, args = dag.apply_steps_from_state(s1)
print(tvm.lower(s, args, simple_mode=True))

The steps of this state can be serialized into the log file as:

[["SP", 2, 0, 512, [8, 2], 1], ["SP", 2, 3, 512, [32, 8], 0], ["AN", 2, 3, 3], ["AN", 2, 4, 1], ["AN", 2, 5, 2], ["AN", 2, 0, 5], ["AN", 2, 1, 4], ["AN", 2, 2, 8]]

Ansor serializes all transform steps to the log file. This is different from AutoTVM, which only serializes parameters.
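The record above is plain JSON: a list of steps, each led by a short opcode. Judging from the example, "SP" is likely a split step and "AN" an annotation step; the exact field meanings are inferred, not documented here. A sketch of reading one such log line:

```python
import json

# One serialized state from the log above; the "SP"/"AN" opcode meanings
# (likely split / annotation) are inferred from the example, not documented.
record = ('[["SP", 2, 0, 512, [8, 2], 1], ["SP", 2, 3, 512, [32, 8], 0], '
          '["AN", 2, 3, 3]]')

steps = json.loads(record)
for opcode, *args in steps:
    print(opcode, args)
# SP [2, 0, 512, [8, 2], 1]
# SP [2, 3, 512, [32, 8], 0]
# AN [2, 3, 3]
```

Because whole transform steps are logged rather than only parameters, the complete schedule can be reconstructed from the log alone.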


In the next few PRs, we'll introduce the search policy and tutorials for single-op/subgraph schedule search, Relay integration and tutorials for end-to-end network schedule search, and custom rules to support customized search spaces. Once Ansor fully supports AutoTVM's features, we can gradually deprecate AutoTVM.

This is a joint work by @merrymercy @jcf94 @minminsun @FrozenGene @comaniac @yangjunpro @yidawang .

Changes to original TVM code outside the Ansor folders (will later be split into separate PRs)

jcf94 and others added 30 commits June 20, 2020 09:01
* Init commit: Code migration Start

* Add loop_state.cc/h

* Add ComputeDAG basic test
* Split transform_step out

* Update GetProducers & GetConsumers

* Update UTs

* Add UT for CacheReadWrite & Some bug fix
* Add FollowSplit & FollowFusedSplit tests

* Update dag.InferBound & its UT

* Add search_task, measure and serialization

* Update Serialization UT
* Add feature

* Add cost_model, meta_tile_rewrite_policy

* Add MetaTileRewritePolicy basic UT
* Add Basic Python API for State

* Add UTs for State
* Update the return value of state operation

* Add task

* Copy measure.py & utils.py

* Fix LocalBuilder

* Fix LocalRunner
…che#8)

* Add basic Python support for ansor.auto_schedule

* Update AutoSchedule API

* Bug fix for get the attach point of a fused iter

* Update UT after infer bug fix
* Delete C++ UT hack since Python is ready

* Add ndarray.non_empty

* Update Serialization python API
* Update c++ code style and unit test

* Update python State wrapper and test cases
* Add RPCRunner & OpenCL search test

* Add CUDA search test

* Add RPCRunner test
* Add XGBModel & RPCRunnerWarpper

* Revert "Add Parallel Granularity Mutation"
* add workload registry

* update

* update
* add tune_test.py (the old tune_wkl.py)

* update

* fix measure

* fix for gpu
* Bug fix for tutorials

* Add PreLoadMeasuredStates

* Add search_callback support for task tuner

* Code refine for tune_test.py

* Update

* Update

* Update

* Update

* Bug fix
* Add custom sketch rule

* Bug fix
)

* Add single op tune scripts

* Add tune subgraph support

* Merge all op & all subgraph to one file

* Rename file
* Add vectorized cooperative_fetching test

* Update math simplify for vectorized CF

* File rename

* Update tune_network

* API update
* Add a threading wrapper to fix the test bug

* Set default TVM_USE_AUTO_SCHEDULER to false

* Update PreLoadMeasuredStates callback
* Start to update api

* Add compute_dag to state

* API update
* kernel layout rewrite

* remove some hacks

* add defuse_ops pass and move kernel_layout_rewrite pass after fuse_ops pass

* set TVM_RELAY_DISABLE_BUILD_CACHE for task extraction and prepare_layout_rewrite
Quoted code:

  /* It consists of the current loop structure and the history steps to reach this state. */
  class StateNode : public Object {
   public:
    std::vector<Stage> stages;  // Current stages and loop structures

Review comment (Member): vector<Stage> -> Array<Stage>

Quoted code:

  ObjectRef aux_info);

  // Schedule primitives
  void reorder(int stage_id, const std::vector<Iterator>& order);

Review comment (Member): Let us move the schedule primitives to the StateNode instead.

@tqchen (Member) commented Jun 22, 2020

This PR is good for getting a global context. Some further comments: to make sure we get a thorough review and a smooth upstream, let us break this PR down further into several PRs; this also helps us logically organize and think about the overall design architecture:

  • Changes outside Ansor: consider using one PR for each group of changes (e.g., the rewrite-simplify change can be its own PR).
  • A single PR for the state data structures, loop_state.h, loop_state.py
    • Clearly document all the fields and key functions
    • Discuss the implications of the design by briefly commenting on the search process in loop_state.h
  • A PR containing measurement-related code for Ansor
  • Specific implementations of each Ansor primitive (TransformStep), each as its own PR

@tqchen (Member) commented Jun 22, 2020

@merrymercy merrymercy changed the title [Ansor][AutoTVM v2.0] PR0: Key Structures for Auto Schedule Search [Ansor][AutoTVM v2.0] Part 0: Key Data Structures for Auto Schedule Search Jun 22, 2020
@jwfromm (Contributor) commented Jun 22, 2020

I agree with @tqchen that it would be helpful to break this up a little into separate PRs. Maybe it can be divided into the three partitions discussed in the paper: task scheduling, program sampling, and performance tuning. That would make it much clearer what we're looking at.

@merrymercy (Member) commented Jun 22, 2020

@jwfromm The partition of the implementation is different from the organization of the paper. We cannot upstream code according to the paper. We listed the integration steps in our RFC, and we will follow those steps.
For people who read the paper, this PR corresponds to nothing in the paper. This PR only provides the infrastructure for search (the definition of states and actions in the search). The search policy (i.e., program sampling and performance fine-tuning) and the task scheduler will come in the next PRs.

@merrymercy merrymercy changed the title [Ansor][AutoTVM v2.0] Part 0: Key Data Structures for Auto Schedule Search [Ansor][AutoTVM v2.0] Part 0: Infrastructures for automatic schedule search Jun 22, 2020
@merrymercy merrymercy changed the title [Ansor][AutoTVM v2.0] Part 0: Infrastructures for automatic schedule search [Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search Jun 22, 2020
@tqchen (Member) commented Jun 23, 2020

I have outlined a proposal for a possible breakdown of this PR in the post above; please see if that makes sense.

@comaniac (Contributor) commented

Since this PR is incomplete, it might not be trivial for people to review. For example, feature.cc includes all the utilities we use to extract features, but the use case (the XGBoost cost model) is not included yet. Another example: loop states and transform steps are included in this PR, but the search policy isn't. This might make it hard for people to imagine how those data structures will be used.

If we are going to break down this PR, I would suggest putting everything we have into this PR and using it as the reference when sending the other, smaller PRs.

…he#39)

* lint fix

* clang-format-fix

* pylint fix

* Update

* Recover the double constructor of tvm::PrimExpr

* Fix pylint

* pylint fix

* pylint fix
@jcf94 jcf94 changed the title [Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search [WIP][Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search Jun 24, 2020
@merrymercy (Member) commented

After some discussion, we changed our upstream plan.

We will distill a minimal version of Ansor and send it as the first PR.
This minimal version will contain a small but complete framework, so people can get a better understanding of the whole structure of Ansor.
This differs from the old upstream plan, in which we would have sent separate components one by one; there the framework was never complete during upstreaming, making reviewers lose the context.

7 participants