
[WIP][Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search #5883

Closed (wants to merge 45 commits)

@jcf94 (Contributor) commented Jun 22, 2020

Hi all,

In [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0), we've introduced the auto-scheduler Ansor. In the RFC, we reached an agreement that we should replace AutoTVM with Ansor.
For most existing templates, the current Ansor can replace them directly, with better performance and less tuning time.
For other special templates (low-precision, sparse), the plan is to introduce search space customization and gradually rewrite them with Ansor's new API.

This is the first PR according to the integration plan mentioned in the RFC.
This PR contains the infrastructure for search (the definition of state and actions) and small modifications outside the Ansor folder.

Infrastructure for search: A lightweight IR

Automatic scheduling is a search problem. For a search problem, we need to define the states and actions.
The state of schedule search is the loop structure defined by the schedule (i.e., the TVM IR created by tvm.lower). The actions are schedule primitives to manipulate the loop structures (e.g., split, reorder, fuse).

To enable flexible manipulation of the loop structures, we implemented a lightweight loop structure IR (Intermediate Representation) specifically for search. We also implemented all schedule primitives for this IR. Basically, it is a simplified TVM IR. We don't use the existing TVM IR because:

  1. We want fast incremental change to the loop structures
  2. We want serializable transform history for replay, backtracking, and mutation
  3. We may create some macro schedule primitives

After the search is done, we will lower this IR to TVM IR with TVM schedule primitives.
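As a hedged illustration of this formulation (all names here, LoopState, Step, and apply, are hypothetical and not Ansor's actual API), a state can be modeled as a loop-structure preview plus its transform history, with each action producing a new state incrementally:

```python
# Hypothetical sketch of the state/action search formulation described
# above; LoopState, Step, and apply are illustrative names, not Ansor's API.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One action: a schedule primitive and its arguments."""
    name: str    # e.g. "split", "reorder", "fuse"
    args: Tuple


@dataclass
class LoopState:
    """One state: a loop-nest preview plus the steps that produced it."""
    loops: List[str]
    history: List[Step] = field(default_factory=list)

    def apply(self, step: Step) -> "LoopState":
        # Fast incremental change: copy the small loop preview, transform
        # it, and record the step so the state can be replayed or mutated.
        new_loops = list(self.loops)
        if step.name == "fuse":
            i, j = step.args
            new_loops[i:j + 1] = [".".join(new_loops[i:j + 1])]
        return LoopState(new_loops, self.history + [step])


s0 = LoopState(["i", "j", "k"])
s1 = s0.apply(Step("fuse", (0, 1)))
print(s1.loops)    # ['i.j', 'k']; s0 is unchanged, so backtracking is free
```

Because `history` is a plain list of steps, serializing it and replaying it on a fresh state reproduces `s1` exactly, which is what points 1 and 2 above (fast incremental change and serializable transform history) buy us.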

Key data structures

  • ComputeDAG: Compute declaration graph and its related analysis tools
    Related files: src/ansor/compute_dag.*, python/tvm/ansor/compute_dag.py
    This is the entry data structure of Ansor. Ansor takes a compute declaration described by tvm.compute as input and converts it into this data structure for analysis.
  • TransformStep: This defines the "action" for the search problem, i.e., the schedule primitives for our IR.
    Related files: src/ansor/transform_step.*, python/tvm/ansor/loop_state.py
    Each step has its corresponding tvm.te schedule primitive. We record all TransformSteps for every state as its transform history. After the search is done, these transform steps will be lowered with their corresponding TVM schedule primitives.
  • State: This defines the "state" for the search problem, i.e., the current loop structure and the history of transform steps used to reach this state.
    Related files: src/ansor/loop_state.*, python/tvm/ansor/loop_state.py
    A state consists of a current loop structure and the transform history to reach its current loop structure.
    The loop structure keeps a preview of how the final schedule will look after lowering (how many iterators there are, the extent of each iterator, the position of iterators that have been moved by compute_at, ...), which helps the search policy make decisions during the search.
    The history is a sequence of TransformStep which will finally be mapped to schedule primitives.
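To illustrate how a recorded history could be mapped back onto schedule primitives, in the spirit of print_python_code_from_state, here is a minimal sketch; the tuple-based step encoding and the helper name are assumptions for illustration, not Ansor's real representation:

```python
# Minimal sketch of lowering recorded transform steps to TVM Python-API
# calls, in the spirit of print_python_code_from_state. The tuple-based
# step encoding here is an assumption, not Ansor's real one.

def python_code_from_steps(steps):
    """Emit one schedule-primitive call per recorded step."""
    lines = []
    for step in steps:
        if step[0] == "split":
            _, axis, factor = step
            lines.append(f"s[C].split({axis}, factor={factor})")
        elif step[0] == "parallel":
            lines.append(f"s[C].parallel({step[1]})")
        elif step[0] == "vectorize":
            lines.append(f"s[C].vectorize({step[1]})")
    return "\n".join(lines)


history = [("split", "i", 8), ("parallel", "i_o"), ("vectorize", "i_i")]
print(python_code_from_steps(history))
# s[C].split(i, factor=8)
# s[C].parallel(i_o)
# s[C].vectorize(i_i)
```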

Example Walkthrough

While the search policy is implemented in C++, we also provide a Python API for the new IR.
It is intended for search-space customization by users. The primitives look very similar to the existing schedule primitives, as shown in python/tvm/ansor/loop_state.py. The API design is ongoing and may be updated later.

Taking tests/python/unittest/test_ansor_loop_state.py:test_split_fuse_reorder_annotation() as an example, we can print the test state s1 as:

>>> print(s1)

Placeholder: A, B
gpu.blockIdx.x i.0 (0,32)
  vthread i.1 (0,8)
    gpu.threadIdx.y i.2 (0,2)
      parallel j.0 (0,32)
        unroll j.1 (0,8)
          vectorize j.2 (0,2)
            for k (0,512)
              C = ...

The state stores all the history transform steps required to reach it. We can print these transform steps as equivalent calls to TVM's Python API:

>>> print(dag.print_python_code_from_state(s1))

i, j, k = tuple(C.op.axis) + tuple(C.op.reduce_axis)
i_o_i, i_i = s[C].split(i, factor=2)
i_o_o, i_o_i = s[C].split(i_o_i, factor=8)
j_o, j_i_o = s[C].split(j, nparts=32)
j_i_o, j_i_i = s[C].split(j_i_o, nparts=8)
s[C].parallel(j_o)
s[C].unroll(j_i_o)
s[C].vectorize(j_i_i)
s[C].bind(i_o_o, tvm.thread_axis("blockIdx.x"))
s[C].bind(i_o_i, tvm.thread_axis("vthread"))
s[C].bind(i_i, tvm.thread_axis("threadIdx.y"))

Or we can replay these steps to get a schedule for tvm.lower and tvm.build:

s, args = dag.apply_steps_from_state(s1)
print(tvm.lower(s, args, simple_mode=True))

The steps of this state can be serialized into the log file as:

[["SP", 2, 0, 512, [8, 2], 1], ["SP", 2, 3, 512, [32, 8], 0], ["AN", 2, 3, 3], ["AN", 2, 4, 1], ["AN", 2, 5, 2], ["AN", 2, 0, 5], ["AN", 2, 1, 4], ["AN", 2, 2, 8]]

Ansor serializes all transform steps to the log file. This is different from AutoTVM, which only serializes parameters.
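The record above is plain JSON: a list of steps, each led by a short opcode. Judging from the example, "SP" is likely a split step and "AN" an annotation step; the exact field meanings are inferred, not documented here. A sketch of reading one such log line:

```python
import json

# One serialized state from the log above; the "SP"/"AN" opcode meanings
# (likely split / annotation) are inferred from the example, not documented.
record = ('[["SP", 2, 0, 512, [8, 2], 1], ["SP", 2, 3, 512, [32, 8], 0], '
          '["AN", 2, 3, 3]]')

steps = json.loads(record)
for opcode, *args in steps:
    print(opcode, args)
# SP [2, 0, 512, [8, 2], 1]
# SP [2, 3, 512, [32, 8], 0]
# AN [2, 3, 3]
```

Because whole transform steps are logged rather than only parameters, the complete schedule can be reconstructed from the log alone.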


In the next few PRs, we'll introduce the search policy and tutorials for single-op/subgraph schedule search, Relay integration and tutorials for end-to-end network schedule search, and custom rules to support customized search spaces. Once Ansor fully supports AutoTVM's features, we can gradually deprecate AutoTVM.

This is a joint work by @merrymercy @jcf94 @minminsun @FrozenGene @comaniac @yangjunpro @yidawang .

Changes to original TVM code outside the Ansor folders (will later be split into separate PRs)

jcf94 and others added 30 commits June 20, 2020 09:01
* Init commit: Code migration Start

* Add loop_state.cc/h

* Add ComputeDAG basic test
* Split transform_step out

* Update GetProducers & GetConsumers

* Update UTs

* Add UT for CacheReadWrite & Some bug fix
* Add FollowSplit & FollowFusedSplit tests

* Update dag.InferBound & its UT

* Add search_task, measure and serialization

* Update Serialization UT
* Add feature

* Add cost_model, meta_tile_rewrite_policy

* Add MetaTileRewritePolicy basic UT
* Add Basic Python API for State

* Add UTs for State
* Update the return value of state operation

* Add task

* Copy measure.py & utils.py

* Fix LocalBuilder

* Fix LocalRunner
…che#8)

* Add basic Python support for ansor.auto_schedule

* Update AutoSchedule API

* Bug fix for get the attach point of a fused iter

* Update UT after infer bug fix
* Delete C++ UT hack since Python is ready

* Add ndarray.non_empty

* Update Serialization python API
* Update c++ code style and unit test

* Update python State wrapper and test cases
* Add RPCRunner & OpenCL search test

* Add CUDA search test

* Add RPCRunner test
* Add XGBModel & RPCRunnerWarpper

* Revert "Add Parallel Granularity Mutation"
* add workload registry

* update

* update
* add tune_test.py (the old tune_wkl.py)

* update

* fix measure

* fix for gpu
* Bug fix for tutorials

* Add PreLoadMeasuredStates

* Add search_callback support for task tuner

* Code refine for tune_test.py

* Update

* Update

* Update

* Update

* Bug fix
* Add custom sketch rule

* Bug fix
)

* Add single op tune scripts

* Add tune subgraph support

* Merge all op & all subgraph to one file

* Rename file
* Add vectorized cooperative_fetching test

* Update math simplify for vectorized CF

* File rename

* Update tune_network

* API update
* Add a threading wrapper to fix the test bug

* Set default TVM_USE_AUTO_SCHEDULER to false

* Update PreLoadMeasuredStates callback
* Start to update api

* Add compute_dag to state

* API update
* kernel layout rewrite

* remove some hacks

* add defuse_ops pass and move kernel_layout_rewrite pass after fuse_ops pass

* set TVM_RELAY_DISABLE_BUILD_CACHE for task extraction and prepare_layout_rewrite
Quoted code:

  /* It consists of the current loop structure and the history steps to reach this state. */
  class StateNode : public Object {
   public:
    std::vector<Stage> stages;  // Current stages and loop structures

Review comment (Member): vector<Stage> -> Array<Stage>

Quoted code:

  ObjectRef aux_info);

  // Schedule primitives
  void reorder(int stage_id, const std::vector<Iterator>& order);

Review comment (Member): Let us move the schedule primitives to the StateNode instead.

@tqchen (Member) commented Jun 22, 2020

This PR is good for getting a global context. Some further comments: to make sure we get a thorough review and a smooth upstream, let us break this PR down further into several PRs; this also helps us logically organize and think about the overall design architecture:

  • Changes outside Ansor: consider using one PR for each group of changes (e.g., the rewrite-simplify change can be its own PR).
  • A single PR for the state data structures, loop_state.h, loop_state.py
    • Clearly document all the fields and key functions
    • Discuss the implications of the design by briefly commenting on the search process in loop_state.h
  • A PR containing measurement-related code for Ansor
  • Specific implementations of each Ansor primitive (TransformStep), each as its own PR

@tqchen (Member) commented Jun 22, 2020

@merrymercy merrymercy changed the title [Ansor][AutoTVM v2.0] PR0: Key Structures for Auto Schedule Search [Ansor][AutoTVM v2.0] Part 0: Key Data Structures for Auto Schedule Search Jun 22, 2020
@jwfromm (Contributor) commented Jun 22, 2020

I agree with @tqchen that it would be helpful to break this up a little into separate PRs. Maybe it can be divided into the three partitions discussed in the paper: task scheduling, program sampling, and performance tuning. That would make it much clearer what we're looking at.

@merrymercy (Member) commented Jun 22, 2020

@jwfromm The partition of the implementation is different from the organization of the paper. We cannot upstream code according to the paper. We listed the integration steps in our RFC, and we will follow those steps.
For people who read the paper, this PR corresponds to nothing in the paper. This PR only provides the infrastructure for search (the definition of states and actions in the search). The search policy (i.e., program sampling and performance fine-tuning) and the task scheduler will come in the next PRs.

@merrymercy merrymercy changed the title [Ansor][AutoTVM v2.0] Part 0: Key Data Structures for Auto Schedule Search [Ansor][AutoTVM v2.0] Part 0: Infrastructures for automatic schedule search Jun 22, 2020
@merrymercy merrymercy changed the title [Ansor][AutoTVM v2.0] Part 0: Infrastructures for automatic schedule search [Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search Jun 22, 2020
@tqchen (Member) commented Jun 23, 2020

I have outlined a proposal for a possible breakdown of this PR in the post above; please see if that makes sense.

@comaniac (Contributor) commented

Since this PR is incomplete, it might not be trivial for people to review. For example, feature.cc includes all the utilities we use to extract features, but the use case (the XGBoost cost model) is not included yet. Another example: loop states and transform steps are included in this PR, but the search policy isn't. This might make it hard for people to imagine how those data structures will be used.

If we are going to break down this PR, I would suggest putting everything we have into this PR and using it as the reference when sending the other, smaller PRs.

…he#39)

* lint fix

* clang-format-fix

* pylint fix

* Update

* Recover the double constructor of tvm::PrimExpr

* Fix pylint

* pylint fix

* pylint fix
@jcf94 jcf94 changed the title [Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search [WIP][Ansor][AutoTVM v2.0] Part 0: Infrastructures for Automatic Schedule Search Jun 24, 2020
@merrymercy (Member) commented

After some discussion, we changed our upstream plan.

We will distill a minimal version of Ansor and send it as the first PR.
This minimal version will contain a small but complete framework, so people can get a better understanding of the whole structure of Ansor.
This differs from the old upstream plan, in which we would have sent separate components one by one; there the framework was never complete during upstreaming, making reviewers lose the context.

7 participants