
[PROFILER] Theoretical roofline models #11066

Merged: 1 commit into apache:main on May 3, 2022

Conversation

tkonolige
Contributor

`tvm.analysis.roofline_analysis` adds estimated roofline performance to a profiling report. The roofline model measures how close an operator gets to the best possible memory bandwidth or FLOP/s, depending on whether it is memory or compute bound. This computation uses the runtime of the operator along with two numbers extracted from the TIR code: bytes of memory touched and number of floating point operations. Because these numbers are extracted from TIR, they may not be 100% accurate. The best possible memory bandwidth and FLOP/s are measured by running small programs that are memory and compute bound respectively.

For now, this function only works with llvm cpu targets, but it should be possible to extend to GPU targets.
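To make that computation concrete, here is a small worked sketch of the roofline classification described above (an illustration with made-up numbers, not code from this PR):

def roofline_percent(runtime_s, flops, bytes_touched, peak_flops, peak_bandwidth):
    # Arithmetic intensity (FLOP per byte) decides whether the kernel sits on
    # the memory-bound or compute-bound side of the roofline.
    intensity = flops / bytes_touched
    if intensity * peak_bandwidth < peak_flops:
        bound, attainable = "memory", peak_bandwidth * intensity
    else:
        bound, attainable = "compute", peak_flops
    achieved = flops / runtime_s  # measured FLOP/s from the profiled runtime
    return bound, achieved / attainable

# A made-up 512x512x512 matmul on a machine with 100 GFLOP/s and 50 GB/s peaks:
bound, pct = roofline_percent(
    runtime_s=5e-3,
    flops=2 * 512**3,
    bytes_touched=3 * 512 * 512 * 4,
    peak_flops=100e9,
    peak_bandwidth=50e9,
)
print(bound, f"{pct:.0%}")  # compute 54%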

@AndrewZhaoLuo @mbrookhart @masahi

@jwfromm
Contributor

jwfromm commented Apr 20, 2022

I agree that the analysis namespace is confusing due to overloading. I personally would prefer something like either python/tvm/utils or python/tvm/tools for this.

@tqchen
Member

tqchen commented Apr 20, 2022

python/tvm/utils might indeed be a better choice. In that spirit, it might be useful to introduce python/tvm/utils/roofline, assuming the collection of utils grows in the future.

I would also recommend opening a discuss thread for posterity when introducing a top-level namespace, if we think that is the better path.

Contributor

@AndrewZhaoLuo left a comment

Didn't read the actual beefy stuff, but will later today. For now, some nits

Contributor

@AndrewZhaoLuo left a comment

Hmm, overall looks fine to me. But I'd suggest someone who knows this roofline/TIR stuff better than I do also take a look.

@AndrewZhaoLuo
Contributor

@csullivan ?

Contributor

@csullivan left a comment

Cool stuff @tkonolige


def _vec_width_registers(target, vec_width, num_vector_registers):
    if vec_width is None:
        if target.device_name == "":  # indicates x86
Contributor

not necessarily, could be arm, hexagon, etc., right?

Contributor Author

My understanding is that the llvm target (which is where device_name=="") implies x86. If we look at all llvm strategies in topi (called cpu there), they assume x86.
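For illustration (my own snippet, not from the PR), the distinction looks like this:

import tvm

# A plain llvm target leaves device_name empty, while targets built for a
# specific CPU set -device explicitly (as the builtin helpers below do).
print(repr(tvm.target.Target("llvm").device_name))                  # ''
print(repr(tvm.target.Target("llvm -device=arm_cpu").device_name))  # 'arm_cpu'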

Contributor

Yeah, there is some oddness going on (see #11066 (comment)); perhaps this assumption is not entirely accurate, or maybe llvm acts as a nice compatibility layer.

Contributor

@csullivan Apr 26, 2022

Yeah, having an empty device name is more just a special case for llvm.

target = tvm.target.Target("
llvm -keys=hexagon,cpu
-link-params=0 -mattr=+hvxv68,
+hvx-length128b, +hvx-qfloat, -hvx-ieee-fp
-mcpu=hexagonv68 -mtriple=hexagon
")

will also have an empty device name similar to Andrew's comment. I think we need a different way to dispatch here.

In general I'd like to see this information queried as attrs from the target. When an attr is not provided, it could be filled in via an attr preprocessor used to update the attributes for a target kind, which queries the local or remote DeviceAPI for the necessary information.

Contributor Author

I don't think the hexagon target string you have provided there is correct. If I look at every "builtin" target provided by tvm, they all set the -device flag to something (with the exception of hexagon).

In [2]: tvm.target.arm_cpu()
Out[2]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -model=unknown

In [3]: tvm.target.hexagon()
Out[3]: hexagon -keys=hexagon -link-params=0 -mattr=+hvxv66,+hvx-length128b -mcpu=hexagonv66 -mtriple=hexagon

In [4]: tvm.target.cuda()
Out[4]: cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -model=unknown -thread_warp_size=32

In [5]: tvm.target.bifrost()
Out[5]: opencl -keys=bifrost,opencl,gpu -device=bifrost -max_num_threads=256 -model=unknown -thread_warp_size=1

In [6]: tvm.target.intel_graphics()
Out[6]: opencl -keys=intel_graphics,opencl,gpu -device=intel_graphics -max_num_threads=256 -model=unknown -thread_warp_size=16

In [7]: tvm.target.mali()
Out[7]: opencl -keys=mali,opencl,gpu -device=mali -max_num_threads=256 -model=unknown -thread_warp_size=1

In [8]: tvm.target.rasp()
Out[8]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -mattr=+neon -model=bcm2837 -mtriple=armv7l-linux-gnueabihf

In [9]: tvm.target.riscv_cpu()
Out[9]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -mabi=lp64d -mcpu=sifive-u54 -model=sifive-u54 -mtriple=riscv64-unknown-linux-gnu

In [10]: tvm.target.rocm()
Out[10]: rocm -keys=rocm,gpu -max_num_threads=256 -mcpu=gfx900 -model=unknown -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=64

(I have been unable to find any documentation on what device actually means/does).

If you have a better way to detect x86, I'd be happy to use it. The only way I've seen this detection done in the codebase is by looking at mcpu.

I'd love to have this information available from the targets, but all the x86 code in topi appears to do it the way I do it here.

Contributor

-device is used by TOPI to provide a layer of specialization when the target is generic. So with mali, for example, we can have separate topi impls for opencl. And I agree that the way TOPI does hardware-specific attribute lookup is how you are doing it here; I just don't like it :).

It's essentially a side channel for information to flow into the compiler from outside of the target, one that assumes the hardware we are compiling for is locally accessible, which only works in a limited set of cases. In my view, any device-specific attribute that the compiler needs should be retrievable from the target directly, and we now have a path to do this with the attribute preprocessor I mentioned above that directly queries the DeviceAPI for this information.
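As a small illustration of that dispatch (my own example, not from the thread): two targets can share the opencl backend while -device is the only hint TOPI keys on to pick the mali-specialized schedules.

import tvm

generic = tvm.target.Target("opencl")
mali = tvm.target.Target("opencl -device=mali")
# Same backend, but the device name (and the derived target keys) differ,
# which is what TOPI uses to select specialized implementations.
print(repr(generic.device_name), repr(mali.device_name))  # '' 'mali'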

Contributor Author

> In my view, any device-specific attribute that the compiler needs should be retrievable from the target directly, and we now have a path to do this with the attribute preprocessor I mentioned above that directly queries the DeviceAPI for this information.

I agree on this point, but I think doing this change is out of scope for this PR. I'd have to change a bunch of stuff within topi.

Contributor

I agree there are lots of places where this is assumed in TOPI, and I don't want you to be burdened with that in this PR, but for the cases in which target.device_name is not set and the machine is not x86, the profiler will silently do the wrong thing. This applies to hexagon, cuda, opencl, rocm, and others.

At minimum I guess we need to check that the target is llvm and the device name is empty.

Can we also make it more strict, such that get_simd_32bit_lanes does not return a default if the mcpu name is not found in its knowledge base, or is that default behavior being relied on? If it is being relied on, maybe we simply make get_simd_32bit_lanes take a parameter that defaults to True indicating whether we should guess the number of lanes, and when it is False an error is thrown. Then here we can set that param to False so that, for the case of profiling, we don't make target-specific assumptions.
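A rough sketch of the stricter behavior being proposed is below; the parameter name and the lookup table are hypothetical, not the real topi implementation.

# Hypothetical sketch only; real code would reuse topi's existing -mcpu table.
_KNOWN_SIMD_LANES = {"skylake-avx512": 16, "core-avx2": 8}  # illustrative entries

def get_simd_32bit_lanes(mcpu, guess_if_unknown=True):
    lanes = _KNOWN_SIMD_LANES.get(mcpu)
    if lanes is None:
        if guess_if_unknown:
            return 4  # the current default fallback
        raise RuntimeError(f"unknown -mcpu '{mcpu}', refusing to guess SIMD width")
    return lanes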

Contributor Author

I've changed this check to make sure the target is llvm and that there aren't any keys besides "cpu". I think that should cover our bases for now.

get_simd_32bit_lanes defaults to 4 if it can't figure out what the target is. Not sure this is a reasonable default, but I think we should leave changes to get_simd_32bit_lanes to a separate PR.
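The check described here might look roughly like the following sketch (names assumed, not the exact code from the PR):

import tvm

def _is_plain_llvm_cpu(target):
    # Proceed only for a plain llvm CPU target: llvm kind with no keys besides
    # "cpu" (a hexagon-flavored llvm target, for example, carries extra keys
    # and is rejected).
    return target.kind.name == "llvm" and all(key == "cpu" for key in target.keys)

assert _is_plain_llvm_cpu(tvm.target.Target("llvm"))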

# don't have an operator that uses a specific number of flops.
assert call["Percent of Theoretical Optimal"].ratio >= 0


Contributor

A couple of classification tests would be rather impressive, e.g. asserting that dense (as above) of sufficient size is compute bound and that some injective op is memory bound.
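A sketch of what such a test could look like; the "Bound" column name and the name matching are hypothetical here, only "Percent of Theoretical Optimal" appears in the actual tests.

def assert_dense_compute_bound(report):
    # Hypothetical check: scan the roofline-annotated report for dense calls
    # and assert they are classified as compute bound ("Bound" is an assumed
    # column name, not necessarily what the report provides).
    for call in report.calls:
        if "dense" in call["Name"]:
            assert call["Bound"] == "compute"
            assert call["Percent of Theoretical Optimal"].ratio >= 0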

Contributor Author

The default (untuned) schedules for dense seem to always be memory bound instead of compute bound, so I can't really test for that without tuning. I'll try some other operators and see if it is true for them too.

Contributor

@csullivan left a comment

LGTM, thanks for the great feature and discussions. I look forward to using this

@tqchen
Member

tqchen commented Apr 29, 2022

Thank you @tkonolige for the contribution and @csullivan @AndrewZhaoLuo for reviewing.

In the current proposal we introduced a new top-level namespace (utils). While I personally would tend to agree with the namespace choice, as stated in an earlier comment, please open a discuss thread early for posterity, since a new top-level namespace is a major architectural choice that warrants community involvement and awareness.

After a new namespace is introduced, it is less common to have implementations sit directly in __init__.py, as a new namespace usually entails further growth of similar tools in the same namespace. As a result, it would be helpful to move the implementation to a separate file (or files, e.g. roofline.py or another better name) under the same namespace and then import it.
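Concretely, that suggestion amounts to something like the following layout (a sketch of the convention, not the PR's exact contents):

# python/tvm/utils/roofline.py  -- implementation lives here
def roofline_analysis(mod, params, target, dev):
    ...

# python/tvm/utils/__init__.py  -- thin re-export so tvm.utils.roofline_analysis keeps working
from .roofline import roofline_analysis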

@tkonolige
Contributor Author

@tqchen I've moved the implementation into tvm/utils/roofline.py as I think you requested. In the future, could you please explicitly approve PRs or request changes? See https://tvm.apache.org/docs/contribute/code_review.html#approve-and-request-changes-explicitly.

I've also posted to discuss as requested: https://discuss.tvm.apache.org/t/new-top-level-python-namespace-tvm-utils/12683.
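With the implementation under tvm/utils/roofline.py, usage would look roughly like the sketch below; the argument order of roofline_analysis is my assumption, not quoted from the PR.

import numpy as np
import tvm
from tvm import relay, utils

# A small dense workload analyzed on a local llvm CPU target (the only kind
# this PR supports). The shapes and the dense op are just an example.
data = relay.var("data", shape=(1, 512), dtype="float32")
weight = relay.var("weight", shape=(512, 512), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.dense(data, weight))
params = {"weight": np.random.rand(512, 512).astype("float32")}

report = utils.roofline_analysis(mod, params, tvm.target.Target("llvm"), tvm.cpu())
print(report)  # includes the per-call "Percent of Theoretical Optimal" used in the tests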

@tqchen
Member

tqchen commented Apr 29, 2022

Thanks @tkonolige. I have not yet approved the change explicitly because of the suggestions (in the comment part of this review). Given that the suggestions are not specific to a particular line of code but are architectural, I did not use the review feature and instead chose to comment explicitly in the comment thread. Hopefully that gives enough context to align with the guideline without being too rigid.

`tvm.analysis.roofline_analysis` adds estimated roofline performance to a
profiling report. The roofline model measures how close an operator gets
to best possible memory bandwidth or FLOP/s depending on whether it is
memory or compute bound. This computation uses the runtime of the
operator along with two numbers extracted from the TIR code: bytes of
memory touched and number of floating point operations. Because these
numbers are extracted from TIR, they may not be 100% accurate. The best
possible memory bandwidth and FLOP/s are measured by running small
programs that are memory and compute bound respectively.

For now, this function only works with llvm cpu targets, but it should
be possible to extend to GPU targets.
@AndrewZhaoLuo
Contributor

I'm going to merge; it looks like there is little controversy in the discussion thread.

@AndrewZhaoLuo merged commit 90084ab into apache:main May 3, 2022
shtinsa pushed a commit to Deelvin/tvm that referenced this pull request May 17, 2022
SebastianBoblest pushed a commit to SebastianBoblest/tvm that referenced this pull request May 27, 2022
juda pushed a commit to juda/tvm that referenced this pull request Jun 21, 2022