
[PROFILER] Theoretical roofline models #11066

Merged: 1 commit into apache:main on May 3, 2022

Conversation

tkonolige
Contributor

`tvm.analysis.roofline_analysis` adds estimated roofline performance to a profiling report. The roofline model measures how close an operator gets to the best possible memory bandwidth or FLOP/s, depending on whether it is memory or compute bound. This computation uses the runtime of the operator along with two numbers extracted from the TIR code: bytes of memory touched and number of floating point operations. Because these numbers are extracted from TIR, they may not be 100% accurate. The best possible memory bandwidth and FLOP/s are measured by running small programs that are memory and compute bound respectively.

For now, this function only works with llvm cpu targets, but it should be possible to extend to GPU targets.
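To make that computation concrete, here is a small worked sketch of the roofline classification described above (an illustration with made-up numbers, not code from this PR):

def roofline_percent(runtime_s, flops, bytes_touched, peak_flops, peak_bandwidth):
    # Arithmetic intensity (FLOP per byte) decides whether the kernel sits on
    # the memory-bound or compute-bound side of the roofline.
    intensity = flops / bytes_touched
    if intensity * peak_bandwidth < peak_flops:
        bound, attainable = "memory", peak_bandwidth * intensity
    else:
        bound, attainable = "compute", peak_flops
    achieved = flops / runtime_s  # measured FLOP/s from the profiled runtime
    return bound, achieved / attainable

# A made-up 512x512x512 matmul on a machine with 100 GFLOP/s and 50 GB/s peaks:
bound, pct = roofline_percent(
    runtime_s=5e-3,
    flops=2 * 512**3,
    bytes_touched=3 * 512 * 512 * 4,
    peak_flops=100e9,
    peak_bandwidth=50e9,
)
print(bound, f"{pct:.0%}")  # compute 54%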

@AndrewZhaoLuo @mbrookhart @masahi

@jwfromm
Contributor

jwfromm commented Apr 20, 2022

I agree that the analysis namespace is confusing due to overloading. I personally would prefer something like either python/tvm/utils or python/tvm/tools for this.

@tqchen
Member

tqchen commented Apr 20, 2022

python/tvm/utils might indeed be a better choice. In that spirit, it might be useful to introduce python/tvm/utils/roofline, assuming the collection of utils grows in the future.

I would also recommend opening a discuss thread for posterity when introducing a top-level namespace, if we think that is the better path.

Contributor

@AndrewZhaoLuo left a comment

Didn't read the actual beefy stuff, but will later today. For now, some nits

Contributor

@AndrewZhaoLuo left a comment

Hmm, overall looks fine to me. But I'd suggest someone who knows this roofline/TIR stuff better than I do also take a look.

@AndrewZhaoLuo
Contributor

@csullivan ?

Contributor

@csullivan left a comment

Cool stuff @tkonolige


def _vec_width_registers(target, vec_width, num_vector_registers):
    if vec_width is None:
        if target.device_name == "":  # indicates x86
Contributor

not necessarily, could be arm, hexagon, etc., right?

Contributor Author

My understanding is that the llvm target (which is where device_name=="") implies x86. If we look at all llvm strategies in topi (called cpu there), they assume x86.
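For illustration (my own snippet, not from the PR), the distinction looks like this:

import tvm

# A plain llvm target leaves device_name empty, while targets built for a
# specific CPU set -device explicitly (as the builtin helpers below do).
print(repr(tvm.target.Target("llvm").device_name))                  # ''
print(repr(tvm.target.Target("llvm -device=arm_cpu").device_name))  # 'arm_cpu'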

Contributor

Yeah, there is some oddness going on (see #11066 (comment)); perhaps this assumption is not entirely accurate, or maybe llvm acts as a nice compatibility layer.

Contributor

@csullivan Apr 26, 2022

Yeah, having an empty device name is more just a special case for llvm.

target = tvm.target.Target("
llvm -keys=hexagon,cpu
-link-params=0 -mattr=+hvxv68,
+hvx-length128b, +hvx-qfloat, -hvx-ieee-fp
-mcpu=hexagonv68 -mtriple=hexagon
")

will also have an empty device name similar to Andrew's comment. I think we need a different way to dispatch here.

In general I'd like to see this information queried as attrs from the target. When an attr is not provided, it could be filled in via an attr preprocessor used to update the attributes for a target kind, which queries the local or remote DeviceAPI for the necessary information.

Contributor Author

I don't think the hexagon target string you have provided there is correct. If I look at every "builtin" target provided by tvm, they all set the -device flag to something (with the exception of hexagon).

In [2]: tvm.target.arm_cpu()
Out[2]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -model=unknown

In [3]: tvm.target.hexagon()
Out[3]: hexagon -keys=hexagon -link-params=0 -mattr=+hvxv66,+hvx-length128b -mcpu=hexagonv66 -mtriple=hexagon

In [4]: tvm.target.cuda()
Out[4]: cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -model=unknown -thread_warp_size=32

In [5]: tvm.target.bifrost()
Out[5]: opencl -keys=bifrost,opencl,gpu -device=bifrost -max_num_threads=256 -model=unknown -thread_warp_size=1

In [6]: tvm.target.intel_graphics()
Out[6]: opencl -keys=intel_graphics,opencl,gpu -device=intel_graphics -max_num_threads=256 -model=unknown -thread_warp_size=16

In [7]: tvm.target.mali()
Out[7]: opencl -keys=mali,opencl,gpu -device=mali -max_num_threads=256 -model=unknown -thread_warp_size=1

In [8]: tvm.target.rasp()
Out[8]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -mattr=+neon -model=bcm2837 -mtriple=armv7l-linux-gnueabihf

In [9]: tvm.target.riscv_cpu()
Out[9]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -mabi=lp64d -mcpu=sifive-u54 -model=sifive-u54 -mtriple=riscv64-unknown-linux-gnu

In [10]: tvm.target.rocm()
Out[10]: rocm -keys=rocm,gpu -max_num_threads=256 -mcpu=gfx900 -model=unknown -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=64

(I have been unable to find any documentation on what device actually means/does).

If you have a better way to detect x86, I'd be happy to use it. The only way I've seen this detection done in the codebase is by looking at mcpu.

I'd love to have this information available from the targets, but all the x86 code in topi appears to do it the way I do it here.

Contributor

-device is used by TOPI to provide a layer of specialization when the target is generic. So with mali, for example, we can have separate topi impls for opencl. And I agree that the way TOPI does hardware-specific attribute lookup is how you are doing it here; I just don't like it :).

It's essentially a side channel for information to flow into the compiler from outside of the target, one that assumes the hardware we are compiling for is locally accessible, which only works in a limited set of cases. In my view, any device-specific attribute that the compiler needs should be retrievable from the target directly, and we now have a path to do this with the attribute preprocessor I mentioned above that directly queries the DeviceAPI for this information.
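As a small illustration of that dispatch (my own example, not from the thread): two targets can share the opencl backend while -device is the only hint TOPI keys on to pick the mali-specialized schedules.

import tvm

generic = tvm.target.Target("opencl")
mali = tvm.target.Target("opencl -device=mali")
# Same backend, but the device name (and the derived target keys) differ,
# which is what TOPI uses to select specialized implementations.
print(repr(generic.device_name), repr(mali.device_name))  # '' 'mali'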

Contributor Author

> In my view, any device-specific attribute that the compiler needs should be retrievable from the target directly, and we now have a path to do this with the attribute preprocessor I mentioned above that directly queries the DeviceAPI for this information.

I agree on this point, but I think doing this change is out of scope for this PR. I'd have to change a bunch of stuff within topi.

Contributor

I agree there are lots of places where this is assumed in TOPI, and I don't want you to be burdened with that in this PR, but for the cases in which target.device_name is not set and the machine is not x86, the profiler will silently do the wrong thing. This applies to hexagon, cuda, opencl, rocm, and others.

At minimum I guess we need to check that the target is llvm and the device name is empty.

Can we also make it more strict, such that get_simd_32bit_lanes does not return a default if the mcpu name is not found in its knowledge base, or is that default behavior being relied on? If it is being relied on, maybe we simply make get_simd_32bit_lanes take a parameter that defaults to True indicating whether we should guess the number of lanes, and when it is False an error is thrown. Then here we can set that param to False so that, for the case of profiling, we don't make target-specific assumptions.
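A rough sketch of the stricter behavior being proposed is below; the parameter name and the lookup table are hypothetical, not the real topi implementation.

# Hypothetical sketch only; real code would reuse topi's existing -mcpu table.
_KNOWN_SIMD_LANES = {"skylake-avx512": 16, "core-avx2": 8}  # illustrative entries

def get_simd_32bit_lanes(mcpu, guess_if_unknown=True):
    lanes = _KNOWN_SIMD_LANES.get(mcpu)
    if lanes is None:
        if guess_if_unknown:
            return 4  # the current default fallback
        raise RuntimeError(f"unknown -mcpu '{mcpu}', refusing to guess SIMD width")
    return lanes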

Contributor Author

I've changed this check to make sure the target is llvm and that there aren't any keys besides "cpu". I think that should cover our bases for now.

get_simd_32bit_lanes defaults to 4 if it can't figure out what the target is. Not sure this is a reasonable default, but I think we should leave changes to get_simd_32bit_lanes to a separate PR.
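The check described here might look roughly like the following sketch (names assumed, not the exact code from the PR):

import tvm

def _is_plain_llvm_cpu(target):
    # Proceed only for a plain llvm CPU target: llvm kind with no keys besides
    # "cpu" (a hexagon-flavored llvm target, for example, carries extra keys
    # and is rejected).
    return target.kind.name == "llvm" and all(key == "cpu" for key in target.keys)

assert _is_plain_llvm_cpu(tvm.target.Target("llvm"))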

# don't have an operator that uses a specific number of flops.
assert call["Percent of Theoretical Optimal"].ratio >= 0


Contributor

A couple of classification tests would be rather impressive, e.g. asserting that dense (as above) of sufficient size is compute bound and that some injective op is memory bound.
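A sketch of what such a test could look like; the "Bound" column name and the name matching are hypothetical here, only "Percent of Theoretical Optimal" appears in the actual tests.

def assert_dense_compute_bound(report):
    # Hypothetical check: scan the roofline-annotated report for dense calls
    # and assert they are classified as compute bound ("Bound" is an assumed
    # column name, not necessarily what the report provides).
    for call in report.calls:
        if "dense" in call["Name"]:
            assert call["Bound"] == "compute"
            assert call["Percent of Theoretical Optimal"].ratio >= 0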

Contributor Author

The default (untuned) schedules for dense seem to always be memory bound instead of compute bound, so I can't really test for that without tuning. I'll try some other operators and see if it is true for them too.

Contributor

@csullivan left a comment

LGTM, thanks for the great feature and discussions. I look forward to using this

@tqchen
Member

tqchen commented Apr 29, 2022

Thank you @tkonolige for the contribution and @csullivan @AndrewZhaoLuo for reviewing.

In the current proposal we introduced a new top-level namespace (utils). While I personally would tend to agree with the namespace choice, as stated in an earlier comment, please open a discuss thread early for posterity, since a new top-level namespace is a major architectural choice that warrants community involvement and awareness.

After a new namespace is introduced, it is less common to have implementations sit directly in __init__.py, as a new namespace usually entails further growth of similar tools in the same namespace. As a result, it would be helpful to move the implementation to a separate file (or files, e.g. roofline.py or another better name) under the same namespace and then import it.
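Concretely, that suggestion amounts to something like the following layout (a sketch of the convention, not the PR's exact contents):

# python/tvm/utils/roofline.py  -- implementation lives here
def roofline_analysis(mod, params, target, dev):
    ...

# python/tvm/utils/__init__.py  -- thin re-export so tvm.utils.roofline_analysis keeps working
from .roofline import roofline_analysis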

@tkonolige
Contributor Author

@tqchen I've moved the implementation into tvm/utils/roofline.py as I think you requested. In the future, could you please explicitly approve PRs or request changes? See https://tvm.apache.org/docs/contribute/code_review.html#approve-and-request-changes-explicitly.

I've also posted to discuss as requested: https://discuss.tvm.apache.org/t/new-top-level-python-namespace-tvm-utils/12683.
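With the implementation under tvm/utils/roofline.py, usage would look roughly like the sketch below; the argument order of roofline_analysis is my assumption, not quoted from the PR.

import numpy as np
import tvm
from tvm import relay, utils

# A small dense workload analyzed on a local llvm CPU target (the only kind
# this PR supports). The shapes and the dense op are just an example.
data = relay.var("data", shape=(1, 512), dtype="float32")
weight = relay.var("weight", shape=(512, 512), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.dense(data, weight))
params = {"weight": np.random.rand(512, 512).astype("float32")}

report = utils.roofline_analysis(mod, params, tvm.target.Target("llvm"), tvm.cpu())
print(report)  # includes the per-call "Percent of Theoretical Optimal" used in the tests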

@tqchen
Member

tqchen commented Apr 29, 2022

Thanks @tkonolige. I have not yet approved the change explicitly because of the suggestions (in the comment part of this review). Given that the suggestions are not specific to a particular line of code but are architectural, I did not use the review feature and instead chose to comment explicitly in the comment thread. Hopefully that gives enough context to align with the guideline without being too rigid.

`tvm.analysis.roofline_analysis` adds estimated roofline performance to a
profiling report. The roofline model measures how close an operator gets
to best possible memory bandwidth or FLOP/s depending on whether it is
memory or compute bound. This computation uses the runtime of the
operator along with two numbers extracted from the TIR code: bytes of
memory touched and number of floating point operations. Because these
numbers are extracted from TIR, they may not be 100% accurate. The best
possible memory bandwidth and FLOP/s are measured by running small
programs that are memory and compute bound respectively.

For now, this function only works with llvm cpu targets, but it should
be possible to extend to GPU targets.
@AndrewZhaoLuo
Contributor

I'm going to merge; it looks like there is little controversy in the discussion thread.

@AndrewZhaoLuo merged commit 90084ab into apache:main May 3, 2022
shtinsa pushed a commit to Deelvin/tvm that referenced this pull request May 17, 2022
SebastianBoblest pushed a commit to SebastianBoblest/tvm that referenced this pull request May 27, 2022
juda pushed a commit to juda/tvm that referenced this pull request Jun 21, 2022