
Enable QNN primitives for DNNL runtime #11642

Merged — 3 commits merged into apache:main on Jun 17, 2022
Conversation

@apeskov (Contributor) commented Jun 9, 2022

Support for QNN primitives and their fused versions in the DNNL runtime.

DNNL supports int8 convolution/dense/matmul primitives with a set of post operations, which usually looks like this:

%0 = conv(src<int8>, wgh<int8>)       -> <int32>
%1 = %0 + bias                        -> <int32>
%2 = %1 * o_scl                       -> <float>
%3 = act(%2) * act_scl                -> <float> // some activation like clip/relu/gelu and others
%4 = %3 + src2 * sum_scl              -> <float>
%5 = cast<int32>(%4) + dst_zp         -> <int32>
%6 = cast<int8>(%5)                   -> <int8> 

The equivalent graph at the Relay level looks like this:

op = qnn.conv2d(data, weight, src_scl, src_zp, wgh_scl, wgh_zp)
op = add(op, bias)
op = qnn.requantize(op, rq_in_scl, rq_in_zp, rq_out_scl, rq_out_zp)
op = act(op, "clip")   // also identity, gelu, relu
op = cast(op, "int8")
op = qnn.add(op, sum_data, add_lhs_scl, add_lhs_zp, add_rhs_scl, add_rhs_zp, add_out_scl, add_out_zp)
op = clip(op)
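
For reference, here is a minimal sketch (not taken from this PR) of how such a graph can be built with the public Relay QNN Python API. The shapes, scales and zero points are placeholder assumptions, and the qnn.add branch for the sum post-op is omitted for brevity:

import tvm
from tvm import relay

data = relay.var("data", shape=(1, 32, 28, 28), dtype="int8")
weight = relay.var("weight", shape=(16, 32, 3, 3), dtype="int8")
bias = relay.var("bias", shape=(16,), dtype="int32")

# qnn.conv2d accumulates into int32 (its default out_dtype)
op = relay.qnn.op.conv2d(
    data, weight,
    input_zero_point=relay.const(0), kernel_zero_point=relay.const(0),
    input_scale=relay.const(0.05), kernel_scale=relay.const(0.02),
    kernel_size=(3, 3), channels=16, padding=(1, 1))
op = relay.nn.bias_add(op, bias)                       # add(op, bias)
# requantize with placeholder scales/zero points, staying in int32 for the clip below
op = relay.qnn.op.requantize(
    op,
    input_scale=relay.const(0.001), input_zero_point=relay.const(0),
    output_scale=relay.const(0.1), output_zero_point=relay.const(0),
    out_dtype="int32")
op = relay.clip(op, a_min=-128.0, a_max=127.0)         # act(op, "clip")
op = relay.cast(op, "int8")
func = relay.Function(relay.analysis.free_vars(op), op)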

To reach this mapping I had to introduce one more legalisation pass, legalize_qnn_for_dnnl, which reformulates the QNN Relay pattern into a DNNL-compatible form. Subsequent DNNL primitive handling follows the regular flow (pattern matching, transformation to JSON, DNNL runtime support for the new QNN nodes).
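
As a rough sketch of where this sits in the BYOC flow (assuming the partition_for_dnnl helper in tvm.relay.op.contrib.dnnl; the exact wiring of legalize_qnn_for_dnnl inside it follows from this PR's description, not from released docs):

import tvm
from tvm import relay
from tvm.relay.op.contrib import dnnl

# `func` is a Relay function containing the QNN pattern, e.g. the one built in
# the sketch above. partition_for_dnnl applies the DNNL-specific rewrites
# (including the QNN legalisation) and groups the matched patterns into DNNL
# composite functions; building the result needs a TVM build with DNNL enabled.
mod = tvm.IRModule.from_expr(func)
mod = dnnl.partition_for_dnnl(mod)
lib = relay.build(mod, target="llvm")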

Additional things

  1. There is an issue with the analysis of complex patterns after partitioning. Some code uses a linear search like GetRootCall(fn->body.as<CallNode>(), 2, {"conv", "add"}), which is not fully correct and is inconvenient. In this PR I use DFPatternMatcher for this purpose (see the pattern sketch after this list).
  2. There is a potential issue with constant naming and ordering, because the current JSONRuntimeBase handling assumes you visit the original graph as-is, without dropping some paths or recomputing constants. To avoid this issue, this PR introduces a DNNLConstantUpdater utility which guarantees correct constant handling.
  3. Positional arguments are not a convenient approach when you have a lot of arguments and some of them are optional. To simplify the handling of optional arguments I use node attributes that hold the index of the corresponding input, like "bias_idx=2", "o_scl=5", and so on.
  4. Added support for "quasi in-place" primitives. Currently this is a stub, and in-place behaviour is simulated via a src->dst copy. After "MemoryPlan" is updated with in-place support, this runtime simulation will be switched off.
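
On point 1, here is a minimal sketch of the pattern-matching idea using the Python tvm.relay.dataflow_pattern API (the PR itself does this in C++ via DFPatternMatcher; the helper name below is illustrative):

from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

# Describe qnn.conv2d -> add(bias) -> qnn.requantize structurally, instead of
# walking the call chain with GetRootCall(..., {"conv", "add"}).
conv = is_op("qnn.conv2d")(wildcard(), wildcard(), wildcard(), wildcard(),
                           wildcard(), wildcard())
biased = is_op("add")(conv, wildcard())
pattern = is_op("qnn.requantize")(biased, wildcard(), wildcard(),
                                  wildcard(), wildcard())

def has_qnn_conv_bias_requantize(fn: relay.Function) -> bool:
    # `fn` would be a partitioned composite function; match() checks whether
    # the pattern is present regardless of how the arguments are ordered.
    return pattern.match(fn.body)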

@apeskov force-pushed the ap/dnnl-qnn branch 4 times, most recently from 6be73ae to 8e6b25e on June 16, 2022 at 00:07
@yangulei (Contributor) commented:

Hi apeskov,

I'm working on zero-copy in DNNL which may relate to the in-place primitives you mentioned:

  1. Added support of "quasi in-place" primitives. Currently that is stub and in-place behavior is simulated via src->dst copy. After update of "MemoryPlan" with in-place support this simulation in runtime will be switched off.

I tried to enable zero-copy when tensors are read or written by DNNL primitives, by assigning the handle of the DNNL memory to the TVM buffer before the execution of the primitives. It works for most of the CV models I have tested, but produces wrong results in cases where:

  • TVM add is converted to post-op sum in DNNL, and
  • one of the inputs has a non in-place layout transform ahead of add.

In my understanding, post-op sum is more like an accumulation, which requires the two inputs and the output to be the same buffer, and the non in-place op before add breaks this requirement. I tried to replace post-op sum with post-op binary add; the results are then correct, but DNNL may run into ref:any with terrible performance.
I couldn't find a solution that ensures both correctness and optimal performance. Do you have any ideas about this? Is a memory copy inevitable in this scenario?

@apeskov (Contributor, Author) commented Jun 16, 2022

Hi @yangulei,

As I remember, zero-copy input/output of tensors should already work in TVM main. If you define Relay layouts that match DNNL expectations, it will use the external buffer as-is, without any copying or processing. That was one of the goals of PR #11345. Could you please be more specific about the scenarios you would like to optimise?

Regarding post-op sum: you are absolutely right, a non in-place op before add breaks correctness, and the memory copy is inevitable. In the case of post-op sum, the input data has to be put into the DST tensor of the DNNL primitive, and execution of the primitive rewrites that data. In contrast, post-op binary add reads its data from a separate input tensor. Currently binary add has limited support across primitives, which leads to ref:any; it also has slightly worse performance because it requires one more memory access pass.

In case of missing layouts, the DNNL BYOC runtime automatically injects the required reorder primitives, and it looks like this:

                     bias --
                            \
in1 -> REORDER_1 -> tmp_1 -> CONV -> tmp_3 -> REORDER_3 -> out
                                    /
                 in2 -> REORDER_2 --

The problem is the tensor tmp_3: there are two primitives which produce the data of tmp_3, and that breaks the data-flow-graph concept. REORDER_2 has to be executed strictly before the CONV primitive. If you take a look at the in-place simulation code in this patch, you will see exactly that: essentially it just copies the input data into the DST tensor right before the convolution primitive.

Post-op sum is very tricky and has a lot of requirements. It works only if proper layouts for conv and add were selected. It also requires validating that rewriting the input tensor's memory is allowed (for ResNet50 this is correct, but in the arbitrary case it should be checked).

@yangulei (Contributor) commented:

Could you please be more specific about scenarios you would like to optimize?

I tried zero-copy before the merging of #11345; it's not necessary now since it's already covered by your work.

Thanks for your explanation about post-op sum. As residual structures are widely used in CV models, the performance gain from post-op sum is very important, so we want to use post-op sum whenever possible. I'll try to add a reorder to copy SRC to DST in DNNL; it would be nice if DNNL did nothing when SRC and DST are identical.

Regarding the automatic injection of required reorder primitives, I think it's useful for ensuring an optimal DNNL layout at runtime. But I think it's better to do the layout transform in AlterOpLayout before the graph is consumed by DNNL, so that the layout transform happens at compile/build time instead of runtime.

@apeskov (Contributor, Author) commented Jun 16, 2022

But I think it's better to do the layout transform in AlterOpLayout.

Absolutely agree with you! Automatic reordering is just a fallback mechanism which prevents DNNL from using "ref:any" implementations. The "TensorRequisite" mechanism was designed specifically to do nothing when layouts match, and to smooth out the performance degradation when the layout was not specified properly.

The current mechanism with "get_optimal_layout_for_XXX" has a lot of limitations and may work incorrectly in some cases, so auto reordering is still needed.

In the context of enabling the sum post-op: absolutely agree with you again! This is an important feature, especially for latency optimisation on systems with a large core count. As you can see, I'm working on that. This patch contains "stub" in-place support, because a lot of changes are required at the level of the TVM GraphExecutor/VM memory plan manager. In-place memory modification should be added to TVM as a common feature; otherwise it would look like a hack/workaround only for the DNNL runtime.

@apeskov (Contributor, Author) commented Jun 16, 2022

@masahi Could you please take a look? This is the second part of the "huge" PR #9618 you were reviewing previously.

@apeskov (Contributor, Author) commented Jun 16, 2022

@comaniac @crazydemo you may also be interested in the topic of this change. Feel free to comment.

} else {
LOG(FATAL) << "Unrecognized DNNL pattern: " << name;
}

if (args.empty() && args.empty()) {
Contributor:

typo?

Member:

yeah let's quickly fix this

@apeskov (Contributor, Author):

definitely a typo. Fixed.

@masahi self-assigned this Jun 17, 2022
@masahi (Member) left a comment:

Very nice! We have quantized bert-base working in TVM using VNNI tensorization. I wanted to compare against DNNL performance on the same model, and I can try that now.

Also, the use of pattern matching in place of the GetRootCall approach is interesting.


@masahi merged commit 5aabeb7 into apache:main on Jun 17, 2022
@billishyahao (Contributor) commented:

Hi @yangulei, thanks for the comments. I am aware of the problem in #11513.
Perhaps I can propose a new PR soon to bring back my previous change about altering the dense layout.

liaopeiyuan added a commit to zk-ml/tachikoma that referenced this pull request Oct 1, 2022