
Enable QNN primitives for DNNL runtime #11642

Merged — 3 commits merged into apache:main on Jun 17, 2022
Conversation

@apeskov (Contributor) commented Jun 9, 2022

Support for QNN primitives and their fused versions in the DNNL runtime.

DNNL supports int8 convolution/dense/matmul primitives with a set of post operations, which usually looks like this:

%0 = conv(src<int8>, wgh<int8>)       -> <int32>
%1 = %0 + bias                        -> <int32>
%2 = %1 * o_scl                       -> <float>
%3 = act(%2) * act_scl                -> <float> // some activation like clip/relu/gelu and others
%4 = %3 + src2 * sum_scl              -> <float>
%5 = cast<int32>(%4) + dst_zp         -> <int32>
%6 = cast<int8>(%5)                   -> <int8> 

The equivalent graph at the Relay level looks like this:

op = qnn.conv2d(data, weight, src_scl, src_zp, wgh_scl, wgh_zp)
op = add(op, bias)
op = qnn.requantize(op, rq_in_scl, rq_in_zp, rq_out_scl, rq_out_zp)
op = act(op, "clip")   // also identity, gelu, relu
op = cast(op, "int8")
op = qnn.add(op, sum_data, add_lhs_scl, add_lhs_zp, add_rhs_scl, add_rhs_zp, add_out_scl, add_out_zp)
op = clip(op)
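
For reference, here is a minimal sketch (not taken from this PR) of how such a graph can be built with the public Relay QNN Python API. The shapes, scales and zero points are placeholder assumptions, and the qnn.add branch for the sum post-op is omitted for brevity:

import tvm
from tvm import relay

data = relay.var("data", shape=(1, 32, 28, 28), dtype="int8")
weight = relay.var("weight", shape=(16, 32, 3, 3), dtype="int8")
bias = relay.var("bias", shape=(16,), dtype="int32")

# qnn.conv2d accumulates into int32 (its default out_dtype)
op = relay.qnn.op.conv2d(
    data, weight,
    input_zero_point=relay.const(0), kernel_zero_point=relay.const(0),
    input_scale=relay.const(0.05), kernel_scale=relay.const(0.02),
    kernel_size=(3, 3), channels=16, padding=(1, 1))
op = relay.nn.bias_add(op, bias)                       # add(op, bias)
# requantize with placeholder scales/zero points, staying in int32 for the clip below
op = relay.qnn.op.requantize(
    op,
    input_scale=relay.const(0.001), input_zero_point=relay.const(0),
    output_scale=relay.const(0.1), output_zero_point=relay.const(0),
    out_dtype="int32")
op = relay.clip(op, a_min=-128.0, a_max=127.0)         # act(op, "clip")
op = relay.cast(op, "int8")
func = relay.Function(relay.analysis.free_vars(op), op)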

To reach this mapping I had to introduce one more legalisation pass, legalize_qnn_for_dnnl, which reformulates the QNN Relay pattern into a DNNL-compatible form. Subsequent DNNL primitive handling follows the regular flow (pattern matching, transformation to JSON, DNNL runtime support for the new QNN nodes).
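
As a rough sketch of where this sits in the BYOC flow (assuming the partition_for_dnnl helper in tvm.relay.op.contrib.dnnl; the exact wiring of legalize_qnn_for_dnnl inside it follows from this PR's description, not from released docs):

import tvm
from tvm import relay
from tvm.relay.op.contrib import dnnl

# `func` is a Relay function containing the QNN pattern, e.g. the one built in
# the sketch above. partition_for_dnnl applies the DNNL-specific rewrites
# (including the QNN legalisation) and groups the matched patterns into DNNL
# composite functions; building the result needs a TVM build with DNNL enabled.
mod = tvm.IRModule.from_expr(func)
mod = dnnl.partition_for_dnnl(mod)
lib = relay.build(mod, target="llvm")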

Additional things

  1. There is an issue with the analysis of complex patterns after partitioning. Some code uses a linear search like GetRootCall(fn->body.as<CallNode>(), 2, {"conv", "add"}), which is not fully correct and is inconvenient. In this PR I use DFPatternMatcher for this purpose (see the pattern sketch after this list).
  2. There is a potential issue with constant naming and ordering, because the current JSONRuntimeBase handling assumes you visit the original graph as-is, without dropping some paths or recomputing constants. To avoid this issue, this PR introduces a DNNLConstantUpdater utility which guarantees correct constant handling.
  3. Positional arguments are not a convenient approach when you have a lot of arguments and some of them are optional. To simplify the handling of optional arguments I use node attributes that hold the index of the corresponding input, like "bias_idx=2", "o_scl=5", and so on.
  4. Added support for "quasi in-place" primitives. Currently this is a stub, and in-place behaviour is simulated via a src->dst copy. After "MemoryPlan" is updated with in-place support, this runtime simulation will be switched off.
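
On point 1, here is a minimal sketch of the pattern-matching idea using the Python tvm.relay.dataflow_pattern API (the PR itself does this in C++ via DFPatternMatcher; the helper name below is illustrative):

from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

# Describe qnn.conv2d -> add(bias) -> qnn.requantize structurally, instead of
# walking the call chain with GetRootCall(..., {"conv", "add"}).
conv = is_op("qnn.conv2d")(wildcard(), wildcard(), wildcard(), wildcard(),
                           wildcard(), wildcard())
biased = is_op("add")(conv, wildcard())
pattern = is_op("qnn.requantize")(biased, wildcard(), wildcard(),
                                  wildcard(), wildcard())

def has_qnn_conv_bias_requantize(fn: relay.Function) -> bool:
    # `fn` would be a partitioned composite function; match() checks whether
    # the pattern is present regardless of how the arguments are ordered.
    return pattern.match(fn.body)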

@apeskov force-pushed the ap/dnnl-qnn branch 4 times, most recently from 6be73ae to 8e6b25e on June 16, 2022 at 00:07
@yangulei (Contributor) commented:

Hi apeskov,

I'm working on zero-copy in DNNL which may relate to the in-place primitives you mentioned:

  1. Added support of "quasi in-place" primitives. Currently that is stub and in-place behavior is simulated via src->dst copy. After update of "MemoryPlan" with in-place support this simulation in runtime will be switched off.

I tried to enable zero-copy when tensors are read or written by DNNL primitives, by assigning the handle of the DNNL memory to the TVM buffer before the execution of the primitives. It works for most of the CV models I have tested, but produces wrong results in cases where:

  • TVM add is converted to post-op sum in DNNL, and
  • one of the inputs has a non in-place layout transform ahead of add.

In my understanding, post-op sum is more like an accumulation, which requires the two inputs and the output to be the same buffer, and the non in-place op before add breaks this requirement. I tried to replace post-op sum with post-op binary add; the results are then correct, but DNNL may run into ref:any with terrible performance.
I couldn't find a solution that ensures both correctness and optimal performance. Do you have any ideas about this? Is a memory copy inevitable in this scenario?

@apeskov (Contributor, Author) commented Jun 16, 2022

Hi @yangulei,

As I remember, zero-copy input/output of tensors should already work in TVM main. If you define Relay layouts that match DNNL expectations, it will use the external buffer as-is, without any copying or processing. That was one of the goals of PR #11345. Could you please be more specific about the scenarios you would like to optimise?

Regarding post-op sum: you are absolutely right, a non in-place op before add breaks correctness, and the memory copy is inevitable. In the case of post-op sum, the input data has to be put into the DST tensor of the DNNL primitive, and execution of the primitive rewrites that data. In contrast, post-op binary add reads its data from a separate input tensor. Currently binary add has limited support across primitives, which leads to ref:any; it also has slightly worse performance because it requires one more memory access pass.

In case of missing layouts, the DNNL BYOC runtime automatically injects the required reorder primitives, and it looks like this:

                     bias --
                            \
in1 -> REORDER_1 -> tmp_1 -> CONV -> tmp_3 -> REORDER_3 -> out
                                    /
                 in2 -> REORDER_2 --

The problem is the tensor tmp_3: there are two primitives which produce the data of tmp_3, and that breaks the data-flow-graph concept. REORDER_2 has to be executed strictly before the CONV primitive. If you take a look at the in-place simulation code in this patch, you will see exactly that: essentially it just copies the input data into the DST tensor right before the convolution primitive.

Post-op sum is very tricky and has a lot of requirements. It works only if proper layouts for conv and add were selected. It also requires validating that rewriting the input tensor's memory is allowed (for ResNet50 this is correct, but in the arbitrary case it should be checked).

@yangulei (Contributor) commented:

Could you please be more specific about scenarios you would like to optimize?

I tried zero-copy before the merging of #11345; it's not necessary now since it's already covered by your work.

Thanks for your explanation about post-op sum. As residual structures are widely used in CV models, the performance gain from post-op sum is very important, so we want to use post-op sum whenever possible. I'll try to add a reorder to copy SRC to DST in DNNL; it would be nice if DNNL did nothing when SRC and DST are identical.

Regarding the automatic injection of required reorder primitives, I think it's useful for ensuring an optimal DNNL layout at runtime. But I think it's better to do the layout transform in AlterOpLayout before the graph is consumed by DNNL, so that the layout transform happens at compile/build time instead of runtime.

@apeskov (Contributor, Author) commented Jun 16, 2022

But I think it's better to do the layout transform in AlterOpLayout.

Absolutely agree with you! Automatic reordering is just a fallback mechanism which prevents DNNL from using "ref:any" implementations. The "TensorRequisite" mechanism was designed specifically to do nothing when layouts match, and to smooth out the performance degradation when the layout was not specified properly.

The current mechanism with "get_optimal_layout_for_XXX" has a lot of limitations and may work incorrectly in some cases, so auto reordering is still needed.

In the context of enabling the sum post-op: absolutely agree with you again! This is an important feature, especially for latency optimisation on systems with a large core count. As you can see, I'm working on that. This patch contains "stub" in-place support, because a lot of changes are required at the level of the TVM GraphExecutor/VM memory plan manager. In-place memory modification should be added to TVM as a common feature; otherwise it would look like a hack/workaround only for the DNNL runtime.

@apeskov (Contributor, Author) commented Jun 16, 2022

@masahi Could you please take a look? This is the second part of the "huge" PR #9618 you were reviewing previously.

@apeskov (Contributor, Author) commented Jun 16, 2022

@comaniac @crazydemo you may also be interested in the topic of this change. Feel free to comment.

} else {
LOG(FATAL) << "Unrecognized DNNL pattern: " << name;
}

if (args.empty() && args.empty()) {
Contributor:

typo?

Member:

yeah let's quickly fix this

@apeskov (Contributor, Author):

definitely a typo. Fixed.

@masahi self-assigned this Jun 17, 2022
@masahi (Member) left a comment:

Very nice! We have quantized bert-base working in TVM using VNNI tensorization. I wanted to compare against DNNL performance on the same model, and I can try that now.

Also, the use of pattern matching in place of the GetRootCall approach is interesting.


@masahi merged commit 5aabeb7 into apache:main on Jun 17, 2022
@billishyahao (Contributor) commented:

Hi @yangulei, thanks for the comments. I am aware of the problem in #11513.
Perhaps I can propose a new PR soon to bring back my previous change about altering the dense layout.

liaopeiyuan added a commit to zk-ml/tachikoma that referenced this pull request Oct 1, 2022