
[WIP][TVM] Bring Your Own Codegen to TVM #4258

Closed · wanted to merge 34 commits

Conversation

@zhiics (Member) commented Nov 5, 2019

This is a WIP that enables different backends and/or hardware vendors to bring their own codegen tools to TVM. This is the collaboration between @comaniac and me. @jroesch also provided lots of suggestions in the initial design. The RFC can be found here: https://discuss.tvm.ai/t/bring-your-own-codegen-to-tvm/4501/27

Some high-level design and APIs involve the following parts:

  • Graph coloring/annotation
    Providing HW vendors an infra to customize where they want to execute an op.
    Two possible ways are allowed to annotate a graph:

    • Custom pass: users can write a Relay pass to decide how they want to partition the graph using subgraph_begin and subgraph_end annotations. For example, a more sophisticated algorithm could be implemented to annotate groups of operators.
    • A high-level API is used to help user/vendors to enable a convenient integration
      @reg.register_extern_op("nn.conv2d")
      def conv2d(attrs, args, comp):
          return get_extern_op(comp, "conv2d")(attrs, args)
      Each codegen only needs to provide its supported operators, and it can invoke a separate build pipeline, e.g. build_extern, to perform partitioning. On completion of this pipeline, the operators that will be offloaded are wrapped with subgraph_start and subgraph_end annotations. The annotated program is then sent to the normal build pipeline for code and artifact generation.
  • Graph partitioning
    It is a Relay pass that partitions a program into segments that could be executed on various hardware platforms based on the annotations. The current implementation does not yet fuse consecutive subgraphs belonging to the same backend; that will be handled by follow-up PRs.

  • Code generation
    Generate code for each segment of a partitioned Relay program. Each codegen tool is wrapped into a runtime module so that we can leverage the current TVM infra for serialization and runtime invocation.
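For readers skimming the thread, the annotate-then-partition flow described above can be sketched with a dependency-free mock (no TVM required; `register_extern_op`, the backend name, and the op list are stand-ins for the real APIs discussed in this PR):

```python
# Hypothetical sketch of the annotation step: a backend registers the ops
# it supports, and maximal runs of supported ops get wrapped between
# subgraph_begin / subgraph_end markers. Names mirror the discussion but
# are NOT the real TVM API.

SUPPORTED = {}  # backend name -> set of op names the external codegen handles

def register_extern_op(backend, op_name):
    SUPPORTED.setdefault(backend, set()).add(op_name)

def annotate(ops, backend):
    """Wrap maximal runs of supported ops with subgraph_begin/subgraph_end."""
    out, inside = [], False
    for op in ops:
        supported = op in SUPPORTED.get(backend, set())
        if supported and not inside:
            out.append("subgraph_begin")
            inside = True
        if not supported and inside:
            out.append("subgraph_end")
            inside = False
        out.append(op)
    if inside:
        out.append("subgraph_end")
    return out

register_extern_op("dnnl", "nn.conv2d")
register_extern_op("dnnl", "nn.relu")
print(annotate(["nn.conv2d", "nn.relu", "nn.softmax"], "dnnl"))
# ['subgraph_begin', 'nn.conv2d', 'nn.relu', 'subgraph_end', 'nn.softmax']
```

A real pass works on the Relay AST rather than a flat op list, but the grouping logic is analogous.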

FYI, we currently use GCC as an external codegen tool for easy prototyping and verification. It should be removed before we land this.

cc @tqchen @yzhliu @wweic @broune @soiferj @junrushao1994 @icemelon9

@tqchen (Member) commented Nov 5, 2019

Thanks for the PR.

It would be great if you could propose a separate PR for the runtime/contrib support, given that the runtime itself can be quite interesting and we want a clear guide for it.

The main problem I see in the current PR is that the serialization is not implemented.

You can run things through relay.execute because the runtime is created on the fly. However, you cannot save the module, because SaveBinary and the load method are not implemented for DNNLModule.

Moreover, if we are only generating C code, I think a better way would be to reuse DSOModule, e.g. generate wrapper functions in C that directly adopt the PackedFunc calling convention of the DSOModule; then the code can be compiled together with the original source and still expose these functions. Having specific shell-code modules (DNNL and GCC) adds duplication that we do not need.

A better example I have in mind would be something like a subgraph sequence in a serialized format (which can be used in SaveBinary), where a PackedFunc interprets the graph and runs things. This likely applies to settings like TensorRT and TF, although you can always generate a sequence of C code into low-level libs. The subgraph or device-specific blob serialization would be more relevant to accelerators.

@tqchen tqchen self-assigned this Nov 5, 2019
@zhiics (Member, Author) commented Nov 5, 2019

@tqchen Thanks for the comment :) We actually tried something similar to the DSOModule you mentioned here. @comaniac, can you share a bit more about it? Anyway, let me take a look and see if we missed something.

@comaniac (Contributor) commented Nov 5, 2019

I agree that our base module has many similarities to the DSOModule. Maybe we can consider basing it directly on the DSOModule and keeping the functions overridable in case users want to use other forms of serialization.

@tqchen (Member) commented Nov 5, 2019

To be specific, if we want to make use of the shared library, what I would do is generate the redirection code (like the current CodeGenC):

extern "C" int my_function (TVMArgs* args, TVMTypeCode* tcode, int len) {
    void* handle = args[0].v_handle
    // call into the real my_function
}

Then we can compile the file and link it together with the other parts.

For modules that need their own serialization (e.g. a customized graph format), we don't have to subclass from the DSO module; we can subclass Module directly and have the PackedFunc walk through the data structures to do the function calls. I think we need an example of this kind, because it is closer to what people need when connecting to a customized NN runtime.
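The kind of module described here — a serialized subgraph that a packed-function-like callable interprets — can be mocked in a few lines. This is a hedged, dependency-free sketch: `MiniGraphModule`, its op set, and the `save_binary`/`load_binary` names are stand-ins for the real `Module`/`SaveToBinary`/`LoadFromBinary` machinery:

```python
import json

# Mock of a runtime module that keeps its subgraph in a serialized format
# (the analog of SaveToBinary/LoadFromBinary) and interprets it at call
# time. All names are illustrative stand-ins, not the TVM API.

OPS = {"add": lambda a, b: [x + y for x, y in zip(a, b)],
       "mul": lambda a, b: [x * y for x, y in zip(a, b)]}

class MiniGraphModule:
    def __init__(self, graph):
        self.graph = graph  # a flat list of op names applied in sequence

    def save_binary(self):            # analog of SaveToBinary
        return json.dumps(self.graph).encode()

    @classmethod
    def load_binary(cls, blob):       # analog of LoadFromBinary
        return cls(json.loads(blob.decode()))

    def __call__(self, a, b):         # analog of the returned PackedFunc
        out = a
        for op in self.graph:
            out = OPS[op](out, b)    # "walk the data structure" and dispatch
        return out

# Round-trip through serialization, then run the interpreted subgraph.
mod = MiniGraphModule.load_binary(MiniGraphModule(["add", "mul"]).save_binary())
print(mod([1, 2], [3, 4]))  # [12, 24]
```

The point of the sketch is that save/load become trivial once the subgraph lives in a serializable blob, which is exactly the gap called out earlier for DNNLModule.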

@zhiics (Member, Author) commented Nov 5, 2019

@tqchen Yes, we are doing something similar. We generate the C APIs directly and compile them into a .so file so that the runtime module can load it. We didn't generate a wrapper like the one you show above. Instead, we generate the real function call for a subgraph, i.e.

void foo(float* a, int N, float* b, int M, float* c) {
   bar(...);
   foobar(...);
}

This foo API is generated through the Relay ExprVisitor, and it is invoked using GetPackedFunction. The NDArray-to-float* conversion is currently done in GetPackedFunction instead of in foo.

Review thread on include/tvm/runtime/vm.h (outdated, resolved)
# Available externs:
# gcc
# dnnl
set(USE_EXTERN none)
Contributor:

Given the number of external compilers and runtimes we may have, I think it's better for each one to have its own field in config.cmake. For example, USE_GCC or USE_DNNL. With this, we can be more flexible with our options for finding the compiler / runtime. For example, USE_GCC=ON vs USE_GCC=<path to gcc>

Contributor:

I think they should be prefixed as USE_EXTERNAL_ to make it clear this is part of the external integration if we go down that path.

Review thread on include/tvm/relay/attrs/annotation.h (resolved)
"""Check if the external codegen should be used.
FIXME: Turn off due to not support of multiple outputs.
"""
return False
Contributor:
Is there a purpose to returning false? Could this function just be removed?

Review threads on python/tvm/relay/op/op.py (outdated, resolved) and src/relay/pass/extern_op.cc (resolved)
@tqchen (Member) commented Nov 6, 2019

OK, please try to send another PR with a mini customized runtime that loads something like a graph or a sequence of dnnl calls and implements save/load binary. This will help resolve the confusion @soiferj has on this PR. It would be great if the runtime PR were compiler independent, something like the graph runtime test, where we manually construct ("compile") the necessary data structures.

For the C dll library, please generate the shell code that directly interfaces with the DSO module, so we don't have to define another dll loader.

@zhiics (Member, Author) commented Nov 6, 2019

Sure, we'll give it a try and send the runtime part first.

*
* \return The pass.
*/
TVM_DLL Pass PartitionGraph();
Member:
We should move away from graph terminology; we should start to emphasize that we have more than a data-flow graph. This terminology has led people to avoid scoping, effects, etc.

namespace relay {
namespace contrib {

class ExternCodegenBase {
Member:
Should the external interface sit in contrib?

@jroesch (Member) left a comment:

Left a bunch of comments, thanks for taking the prototype I wrote to the finish line :)

@zhiics (Member, Author) commented Nov 11, 2019

@jroesch @soiferj Thanks for the comments. We will come back and fix them once the runtime part is done.

@u99127 (Contributor) left a comment:

I've only had time to review the tutorial tonight, but before anything else: this is a very interesting PR and will need more reviews and iterations with some prototyping. I've had a quick read through; there are some obvious changes that I think should be fixed up, and some questions about the integration.

Ramana

Review threads on tutorials/dev/custom_relay_backend.py (outdated, resolved)
Comment on lines +258 to +264
# if(_gcc_idx GREATER -1)
# file(GLOB GCC_RELAY_CONTRIB_SRC src/relay/backend/contrib/gcc/codegen.cc)
# list(APPEND COMPILER_SRCS ${GCC_RELAY_CONTRIB_SRC})
# file(GLOB GCC_CONTRIB_SRC src/runtime/contrib/gcc/*.cc)
# list(APPEND RUNTIME_SRCS ${GCC_CONTRIB_SRC})
# message(STATUS "Use extern library: GCC")
# endif()
Contributor:

Can more than one such contrib codegen path exist in the source base? I presume that as long as the parameter to FIND_USE_EXTERN is unique, that's OK?

Contributor:

Yes, we can have more codegen paths in the source base. The current prototype allows you to specify a list in FIND_USE_EXTERN to enable more than one external backend, but we are still discussing the best interface for users.

# Finally, we include the implemented codegen to the cmake config so that
# it will be built along with the TVM. In cmake/modules/contrib/Extern.cmake:
#
# list(FIND USE_EXTERN "gcc" _gcc_idx)
Contributor:

What is the meaning of FIND_USE_EXTERN here? The name isn't obvious, nor can I find any comment around it.

Contributor:

This is a CMake workaround to check whether "gcc" is in the USE_EXTERN list. In later versions of CMake we could simply use the in keyword like Python, but it is not allowed in older CMake. Again, since we haven't confirmed the best way of enabling external backends, this is just a prototype.
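For concreteness, the list(FIND) membership test being described might look like the following minimal fragment (the variable and backend names mirror the snippets quoted elsewhere in this thread; this is a sketch of the workaround, not the final config):

```cmake
# Old CMake has no `in` operator, so membership in the USE_EXTERN list is
# tested by searching for the item and checking the returned index.
set(USE_EXTERN "gcc;dnnl")

list(FIND USE_EXTERN "gcc" _gcc_idx)
if(_gcc_idx GREATER -1)
  # "gcc" was found in the list; enable that backend's sources here.
  message(STATUS "Use extern library: GCC")
endif()
```

Newer CMake (3.3+) can replace the two-step FIND with `if("gcc" IN_LIST USE_EXTERN)`.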

Further review threads on tutorials/dev/custom_relay_backend.py (outdated, resolved)
######################################################################
# Define The Supported Operators
# ------------------------------
# The first step is to define which operators are supported by your backend.
@u99127 (Contributor) commented Nov 12, 2019

Can we support any dialect operators as well ? Should we make that explicit ?

Contributor:

A custom op will not have a Relay mapping, so it is not recognizable here. It seems to me that if an external backend can support ops that Relay/TVM cannot, we should first support them in Relay/TVM. But we can discuss this more in the RFC.

Contributor:

Well, Relay can additionally define dialects; e.g. the qnn ops are a dialect on top of Relay. Can I match qnn.conv2d in these rules, for instance?

@comaniac (Contributor) commented:

@u99127 thanks for the comments.

As we are making another PR (#4280) for the runtime and will refine this PR accordingly after #4280 has been merged, I would suggest reviewing #4280 first for the implementation details of the runtime support.

@@ -589,6 +597,25 @@ std::string AsText(const NodeRef& node,
bool show_meta_data = true,
runtime::TypedPackedFunc<std::string(Expr)> annotate = nullptr);

/*! \brief namespace of the attributes that are attached to a function. */
namespace attr {
/*! \brief Mark the function as a primitive function. */
Contributor:

What do we mean by a "primitive" function?


@u99127 (Contributor) commented Nov 13, 2019

@comaniac - I'll have a look at #4280 , thanks.

@comaniac (Contributor) commented:

Hi @tqchen,

Based on the runtime PR, we are now back to the approach of building an external runtime module and integrating it into the DSO module. Specifically, when users invoke build and their external codegen generates a C source module, we should build an external runtime module here. The design options are:

  1. Build an external.so file using system("g++ ...") as we've done before, load it back, and import it into the DSO module. The drawback is that we will leave that external.so file on disk.

  2. Invoke Clang frontend APIs to first compile the generated external code to LLVM IR, and then use llvm::parseIR, like the TVM LLVM backend, to get an executable module. The uncertainty with this option is that we have no idea how to accept compile flags from users.

Please let us know which option you prefer, or whether you have a better solution.

Thanks.

@zhiics (Member, Author) commented Nov 25, 2019

The problem we have right now is letting the DSOModule (in the graph runtime) find a symbol from the imported CSourceModule and execute it. It looks like we need to generate code for the CSourceModule first, which was done in the example through export_library in Python. But if we want in-memory execution, it looks like we have to be able to invoke the CSourceModule functions directly. Are we missing something?

@masahi (Member) commented Nov 26, 2019

Hi @zhiics @comaniac, I'm trying this PR. I've verified that both the GCC example and the DNNL example in the slides work. Can you fix the build after the rebase?

@zhiics (Member, Author) commented Nov 26, 2019

@masahi Thanks for your interest. We changed the runtime a bit. We tried to generate a DSO module directly and load it there, but now we make the external module either a CSourceModule or a simple JSONModule. We had a bit of confusion about how to integrate the CSourceModule into the DSO module. We will clean up the PR and address all comments once this is figured out.

@masahi (Member) commented Nov 26, 2019

@zhiics OK, until then I can play around with my working build (before the rebase). You also need to fix build errors due to tqchen's recent "Unified object protocol" initiative.

@masahi (Member) commented Nov 26, 2019

@zhiics sorry, my earlier comment on the build error was due to my corrupted environment. The build is working now.

std::vector<std::pair<std::string, int>> out_;
};

class GccCodegen : public ExternCodegenBase {
Member:

GccCodegen does not necessarily make sense as a name, since the codegen is not gcc specific (it works for any C compiler).

Member (Author):

Yes, you are right. The name doesn't make sense. How about CSourceCodegen?

#include "dnnl.hpp"

namespace tvm {
namespace runtime {
Member:

The header should be an internal header (move it to src/), as we don't want to expose it to users of tvm.

Member (Author):

Yeah, I intentionally moved it from src to the headers, because the wrapper can then directly include it, and export_library will take care of finding the path under include/tvm here:

https://github.com/apache/incubator-tvm/blob/279a8ebae6d507f02d904397672dc44982719645/python/tvm/_ffi/libinfo.py#L183

Otherwise, we may expect users to pass -I$PATH_TO_TVM/src/runtime/contrib.

@@ -639,6 +668,35 @@ class GraphRuntimeCodegenModule : public runtime::ModuleNode {
return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {
*rv = this->output_.lowered_funcs;
});
} else if (name == "get_external_funcs") {
Member:

I don't really like the current monolithic approach to handling the external compilations. Perhaps we can think a bit deeper.

@tqchen (Member) commented Dec 4, 2019

After looking more closely at the PR again, there are a few high-level things.

We have done an iteration that removes the special handling code in the runtime and just makes use of the runtime DSO module mechanism, which is great.

There are, however, still quite a lot of special codepaths for code generation (GraphRuntimeCodegen), which is not a good thing, because we have multiple runtime variants (e.g. the VM).

Codegen Logic

Ideally, we want something like IRModule -> Codegen -> RuntimeModule, or a collection of them, where the IRModule can contain functions with an explicit compiler annotation so that a specific compiler is invoked. I can imagine us handling this in the compile_engine, so that the caller does not have to worry about extern vs non-extern.

This is something that we might be able to separate out as another PR

Graph Partition and Annotation

The graph partition and annotation should be passes that take IRModule -> IRModule and then make use of the annotation data.

@zhiics (Member, Author) commented Dec 4, 2019

Codegen Logic

Ideally, we want something like IRModule -> Codegen -> RuntimeModule, or a collection of them, where the IRModule can contain functions with an explicit compiler annotation so that a specific compiler is invoked. I can imagine us handling this in the compile_engine, so that the caller does not have to worry about extern vs non-extern.

This is something that we might be able to separate out as another PR

Thanks for pointing this out. This is also something we were trying to achieve, and the external codegen (xxCodegen.cc) looks exactly like that. I agree that putting the codegen logic in GraphRuntimeCodegen is not clean, but it seems that compile_engine does not really need to do much (or even anything) for external functions. I was thinking that we could have a packed function, CompileExternalFuncs (possibly in compile_engine), and pass all collected external functions to it to generate runtime modules. We would only need to collect these functions from GraphRuntimeCodegen and the VMCompiler when traversing the AST. Does this sound good to you?

Graph Partition and Annotation

The graph partition and annotation should be passes that take IRModule -> IRModule and then make use of the annotation data.

Yes, they are IRModule->IRModule.
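The proposed CompileExternalFuncs hook can be mocked in a few lines: functions carrying a compiler annotation are grouped and each group is handed to its registered codegen, while unannotated functions go through the normal pipeline. This is a hedged sketch; the "Compiler" attribute key, the codegen registry, and all names below are illustrative stand-ins for the design being discussed:

```python
# Mock of the CompileExternalFuncs idea: collect functions annotated with
# a "Compiler" attribute and dispatch each group to the matching external
# codegen to produce a runtime-module-like artifact.

CODEGENS = {"dnnl": lambda fns: "dnnl-module(%d fns)" % len(fns),
            "gcc":  lambda fns: "c-source-module(%d fns)" % len(fns)}

def compile_external_funcs(funcs):
    """funcs: list of (name, attrs) pairs collected during AST traversal."""
    groups = {}
    for name, attrs in funcs:
        compiler = attrs.get("Compiler")
        if compiler:                      # external function -> group by codegen
            groups.setdefault(compiler, []).append(name)
        # unannotated functions stay in the normal build pipeline
    return {c: CODEGENS[c](fns) for c, fns in groups.items()}

funcs = [("sub0", {"Compiler": "dnnl"}),
         ("sub1", {"Compiler": "dnnl"}),
         ("main", {})]
print(compile_external_funcs(funcs))
# {'dnnl': 'dnnl-module(2 fns)'}
```

Both GraphRuntimeCodegen and the VMCompiler could feed their collected external functions through a single hook like this, which is the caller-agnostic property the comment above asks for.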

auto conv2d_src_md = memory::desc({conv2d_src_tz}, dt::f32, tag::any);
auto conv2d_bias_md = memory::desc({conv2d_bias_tz}, dt::f32, tag::any);
auto conv2d_weights_md = memory::desc({conv2d_weights_tz}, dt::f32, tag::any);
auto conv2d_dst_md = memory::desc({conv2d_dst_tz}, dt::f32, tag::nchw);
Member:

Why hard code the dst format to nchw?

auto conv2d_dst_memory = memory(conv2d_prim_desc.dst_desc(), eng);

auto conv = convolution_forward(conv2d_prim_desc);
conv.execute(s, {{DNNL_ARG_SRC, conv2d_src_memory},
Member:

tag::any is used to create the primitive, so you need to query the optimal format and reorder from nchw to that format. But since the dst format is hard coded to nchw, the optimal format here is most likely nchw anyway. The implementation here looks a little strange to me.

memory::dims dst_tz = {p_B_, p_O_};

auto data_md = memory::desc{{data_tz}, dt::f32, tag::nc};
auto weight_md = memory::desc({{weight_tz}, dt::f32, tag::nc});
Member:

Better to be tag::io or tag::oi.

@zhiics (Member, Author) commented Jan 18, 2020

Let's close this since most of the work has been merged. The annotation template will be considered separately.

@zhiics zhiics closed this Jan 18, 2020
@zhiics zhiics deleted the partitioning branch May 13, 2020 00:54
8 participants