[flang][OpenMP] Implement more robust loop-nest detection logic #127

ergawy · 2024-07-30T04:31:59Z

The previous loop-nest detection algorithm failed, in some cases, to detect whether a pair of do concurrent loops is perfectly nested. This is a re-implementation that uses forward and backward slice extraction algorithms to compare the set of ops required to set up the inner loop bounds against the set of ops nested in the outer loop other than the nested loop itself.
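To make the comparison described above concrete, here is a minimal Python sketch on a toy IR. It is not the code in DoConcurrentConversion.cpp: the `Op` class, `backward_slice`, and `is_perfectly_nested` are hypothetical names, and only the backward-slice half of the idea is modeled (how the pass combines it with forward slices is not shown in this thread).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Toy stand-in for an MLIR operation (hypothetical, not the FIR API)."""
    name: str             # unique name of the value this op defines
    operands: tuple = ()  # names of the values this op consumes


def backward_slice(roots, body):
    """Names of ops in `body` that the values in `roots` transitively depend on."""
    defs = {op.name: op for op in body}
    seen, worklist = set(), list(roots)
    while worklist:
        value = worklist.pop()
        if value in seen or value not in defs:
            continue  # already visited, or defined outside the outer loop
        seen.add(value)
        worklist.extend(defs[value].operands)
    return seen


def is_perfectly_nested(outer_body, inner_loop):
    """True if every op next to `inner_loop` only feeds the inner loop bounds."""
    siblings = [op for op in outer_body if op is not inner_loop]
    needed = backward_slice(inner_loop.operands, siblings)
    return needed == {op.name for op in siblings}


# The inner loop's upper bound comes from one conversion op in the outer body.
conv = Op("%ub", ("%n",))
inner = Op("inner_loop", ("%lb", "%ub"))
print(is_perfectly_nested([conv, inner], inner))         # True

# An extra op that does not feed the bounds breaks the perfect nest.
extra = Op("%cleanup", ("%tmp",))
print(is_perfectly_nested([conv, extra, inner], inner))  # False
```

In this model, values defined outside the outer loop (such as `%n` and `%lb`) are simply ignored, since only ops inside the outer loop body can prevent perfect nesting.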

TIFitis · 2024-07-30T10:15:37Z

flang/test/Transforms/DoConcurrent/loop_nest_test.f90

Can you please alter the test so that it does not rely on debug statements to determine failure/success?

It is not ideal indeed. Do you have any suggestions on how to do this differently?

I thought about doing this as a unit test, but I think setting up the test and later reading it would be more complicated than a lit test that clearly shows what the loops look like and what the outcome should be.

We can add a special flag to print loop info to llvm::outs(). But I am not sure this is worth it tbh.

I don't like relying on debug output either. It randomly interleaves with stdout and only works with assertion-enabled builds. Additionally, it makes the test dependent on how often and in which order the isPerfectlyNested function is called internally, making it sensitive even to NFC patches. But I have seen this in LLVM and Clang often enough to consider it an established pattern there, although not for MLIR/Flang.

If you do this, you still must do the following:

Do not redirect/pipe stdout and stderr at the same time. Either guarantee that only one of them is used, or use 2> instead of &>.

Add REQUIRES: asserts, or the test will fail in release builds.

In MLIR I indeed usually see an internal option that enables additional printing or prints debug counters, e.g. -test-print-shape-mapping, or a pass that just prints the analysis result, e.g. -pass-pipeline=...test-print-dominance, or a test of the diagnostic from -Rpass output.

Since it is a transformation pass, one would typically (also) test whether the output is what was expected.

Thanks for the info @Meinersbur, really useful.

I used 2> and REQUIRES: asserts.

Since it is a transformation pass, one would typically (also) test whether the output is what was expected.

All other do-concurrent-conversion tests verify the output. I wanted this one to test only one thing, which is loop-nest detection. My reasoning is that this test isolates this particular part of the pass so that we can debug issues in nest detection more easily.

ergawy · 2024-08-01T05:29:10Z

Ping! Please 🙏 take a look when you have time.

Meinersbur

The algorithm checks whether only "expected" instructions are present in the in-between code, but I think it is more relevant whether they have side-effects. That's because:

Any operation that does not have side-effects can just be sunk into the inner loop or hoisted outside the outer loop, no matter whether it is used to compute the inner loop bounds or not. If it is invariant, an optimization can hoist it out again. I think this is the relevant property: "able to move all the code away".
Ops might be needed to compute the inner loop's bounds but still have side-effects, e.g. a function call that accesses a global variable.
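The criterion proposed above ("able to move all the code away") can be sketched as a toy Python check. This is only an illustration: `InBetweenOp` and `can_move_all_code_away` are hypothetical names, and the `has_side_effects` flag is an assumed input; in MLIR such information would normally come from the side-effect interfaces rather than a hand-set boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InBetweenOp:
    """Toy op sitting between the outer and inner loop (hypothetical model)."""
    name: str
    has_side_effects: bool  # assumed to be known; set by hand in this sketch


def can_move_all_code_away(in_between_ops):
    """True if every in-between op could be sunk or hoisted out of the way."""
    return not any(op.has_side_effects for op in in_between_ops)


# A pure conversion can be sunk into the inner loop or hoisted out.
pure_only = [InBetweenOp("fir.convert", has_side_effects=False)]
print(can_move_all_code_away(pure_only))  # True

# A call to an arbitrary procedure is assumed to have side effects here,
# even though its result feeds the inner loop's upper bound.
with_call = pure_only + [InBetweenOp("fir.call @bar", has_side_effects=True)]
print(can_move_all_code_away(with_call))  # False
```

Note that this check deliberately ignores whether an op contributes to the inner loop bounds, which is the difference from the slice-based approach.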

I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?

Meinersbur · 2024-08-01T12:51:22Z

flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp

@@ -36,7 +36,8 @@ namespace fir {
 #include "flang/Optimizer/Transforms/Passes.h.inc"
 } // namespace fir

-#define DEBUG_TYPE "fopenmp-do-concurrent-conversion"
+#define DEBUG_TYPE "do-concurrent-conversion"
+#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE << "]: ")


The canonical form is LLVM_DEBUG(dbgs() << "text"). I don't think introducing new patterns for a single file is a good idea. DBGS may easily conflict with something else, as was the case with DEBUG, which was eventually renamed to LLVM_DEBUG. Getting only a specific debug type can be done via the command line: -mllvm -debug-only=do-concurrent-conversion

This same pattern is used in a lot of places in MLIR (38 existing uses). For example: Dialect/Transform/IR/TransformOps.cpp, Dialect/Linalg/Transforms/Vectorization.cpp, and Dialect/Linalg/Transforms/Transforms.cpp, ...

Mmmh, OK. I don't think it is a good idea but apparently some MLIR folks do.

Meinersbur · 2024-08-01T13:11:02Z

flang/test/Transforms/DoConcurrent/loop_nest_test.f90

I don't like relying on debug output either. It randomly interleaves with stdout and only works with assertion-enabled builds. Additionally, it makes the test dependent on how often and in which order the isPerfectlyNested function is called internally, making it sensitive even to NFC patches. But I have seen this in LLVM and Clang often enough to consider it an established pattern there, although not for MLIR/Flang.

If you do this, you still must do the following:

Do not redirect/pipe stdout and stderr at the same time. Either guarantee that only one of them is used, or use 2> instead of &>.

Add REQUIRES: asserts, or the test will fail in release builds.

In MLIR I indeed usually see an internal option that enables additional printing or prints debug counters, e.g. -test-print-shape-mapping, or a pass that just prints the analysis result, e.g. -pass-pipeline=...test-print-dominance, or a test of the diagnostic from -Rpass output.

Since it is a transformation pass, one would typically (also) test whether the output is what was expected.

Meinersbur · 2024-08-01T13:15:45Z

flang/test/Transforms/DoConcurrent/loop_nest_test.f90

@@ -0,0 +1,77 @@
+! Tests loop-nest detection algorithm for do-concurrent mapping.
+
+! RUN: %flang_fc1 -emit-hlfir  -fopenmp -fdo-concurrent-parallel=host \


do-concurrent-conversion is an MLIR-to-MLIR pass. Tests for such passes usually contain only the input MLIR of the pass, so that we don't test more than necessary.

I did this to show the loops at the Fortran source level. It makes it easy to see which loop nests we detect as perfectly nested.

If you don't think this is a good enough argument to have the test at the Fortran level, I will replace it with MLIR instead.

flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp

ergawy · 2024-08-02T07:33:00Z

I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?

This happens in cases like the following (see complete sample here):

  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = n
  end do

If you look at the IR, you will see:

    fir.do_loop %arg1 = %42 to %44 step %c1 unordered {
      ...
      %53:3 = hlfir.associate %49 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
      %54:3 = hlfir.associate %52 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
      hlfir.end_associate %53#1, %53#2 : !fir.ref<i32>, i1
      hlfir.end_associate %54#1, %54#2 : !fir.ref<i32>, i1
      %56 = fir.convert %55 : (i32) -> index
      ...
      fir.do_loop %arg2 = %46 to %56 step %c1_4 unordered {
        ...
      }
    }

The problem here is the hlfir.end_associate ops. Even though the "effectively 2" loops are perfectly nested, these hlfir.end_associate ops are not part of the slice responsible for computing the upper bound (in this case) of the inner loop, even though in practice they exist only for the purpose of that computation.

ergawy · 2024-08-02T07:36:16Z

The algorithm checks whether only "expected" instructions are present in the in-between code, but I think it is more relevant whether they have side-effects. That's because: ...

These are very good points that I did not consider to be honest. But my conclusion from what you mentioned about side-effects is that flang potentially emits wrong IR!

If you take again the same sample mentioned in the previous comment, shouldn't flang have emitted:

      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32

before the outermost loop (the i loop)?

mjklemm · 2024-08-02T08:15:19Z

The algorithm checks whether only "expected" instructions are present in the in-between code, but I think it is more relevant whether they have side-effects. That's because: ...

These are very good points that I did not consider to be honest. But my conclusion from what you mentioned about side-effects is that flang potentially emits wrong IR!

How would that happen? From a Fortran perspective there should not be any side effects in the code inside the DO CONCURRENT.

ergawy · 2024-08-02T08:22:48Z

How would that happen? From a Fortran perspective there should not be any side effects in the code inside the DO CONCURRENT.

flang is potentially emitting wrong IR atm (as I mentioned but for different reasons in my previous reply). Nothing is preventing bar from mutating global state (see Fortran code and corresponding HLFIR IR in this comment).

ergawy · 2024-08-02T08:26:24Z

Just as a follow up, if you modify the loop I posted earlier to be:

  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = bar(n,m)
  end do

You get:

/tmp/test.f90:15:5: error: Impure procedure 'bar' may not be referenced in DO CONCURRENT
      a(i) = bar(n,m)
      ^^^^^^^^^^^^^^^

So, side-effects are checked only for the body of the loop and not for the bounds calculations (which are still emitted inside the loop body).

ergawy · 2024-08-02T13:18:25Z

From the spec:

C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.

which is expected, but what I cannot put my finger on is what constitutes "within a DO CONCURRENT". Does it include the concurrent-header?

Meinersbur

I do not understand the argument about mem alloc. When does this happen? Shouldn't mem2reg have removed such allocations?

The problem here is the hlfir.end_associate ops. Even though the "effectively 2" loops are perfectly nested, these hlfir.end_associate ops are not part of the slice responsible for computing the upper bound (in this case) of the inner loop, even though in practice they exist only for the purpose of that computation.

Thanks for the explanation. I wouldn't consider the loops perfectly nested though. Calling bar can have arbitrary side-effects, like accessing and incrementing global variables.

For such cases, OpenMP leaves it unspecified when and how often the side-effects of upper/lower bound expressions are evaluated, but this does not apply here, so we cannot optimize based on that. Even if it did, the produced HLFIR seems indistinguishable from

  do concurrent(i=1:n)
    ub = bar(n*m, n/m)
    do concurrent(j=1:ub)
      a(i) = n
    end do
  end do

which (obviously?) is not perfectly nested. I think the frontend should be changed here. The call to bar should be emitted before the outer loop; it cannot depend on i and needs to be evaluated only once. A special case for when n is zero could be added so that bar is not called at all, if that is an issue.
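The observable difference between the two placements of the call can be illustrated with a small Python model. This is only a sketch of the argument, not of what flang emits: `make_bar`, `bound_inside_outer_loop`, and `bound_hoisted` are hypothetical names, and `bar` is given an artificial side effect (a call counter) and an arbitrary return value.

```python
def make_bar():
    """Build an impure stand-in for `bar` that records how often it is called."""
    calls = {"count": 0}

    def bar(x, y):
        calls["count"] += 1  # side effect: mutates state outside the loops
        return 3             # arbitrary upper bound for the inner loop

    return bar, calls


def bound_inside_outer_loop(n, m, bar):
    """Models the current emission: the bound is recomputed per outer iteration."""
    a = [0] * n
    for i in range(n):
        ub = bar(n * m, n // m)
        for j in range(ub):
            a[i] = n
    return a


def bound_hoisted(n, m, bar):
    """Models the suggested emission: the bound is evaluated once, up front."""
    a = [0] * n
    ub = bar(n * m, n // m) if n > 0 else 0  # no call for a zero-trip outer loop
    for i in range(n):
        for j in range(ub):
            a[i] = n
    return a


bar_in, calls_in = make_bar()
bar_out, calls_out = make_bar()
same = bound_inside_outer_loop(4, 2, bar_in) == bound_hoisted(4, 2, bar_out)
print(same)                # True: the array contents agree
print(calls_in["count"])   # 4: one call per outer iteration
print(calls_out["count"])  # 1: a single call before the outer loop
```

Both versions fill the array identically, but the number of calls to the impure procedure differs, which is exactly the kind of side effect that makes the in-loop placement distinguishable.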

In the meantime, I think it is OK to just not support function calls as lb/ub expressions.

If you take again the same sample mentioned in the previous comment, shouldn't flang have emitted:
      %55 = fir.call @_QFPbar(%53#1, %54#1) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i32>) -> i32
before the outermost loop (the i loop)?

Wrote all of the above before I continued reading the following discussion... 😒

Meinersbur · 2024-08-06T11:36:50Z

flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp

@@ -36,7 +36,8 @@ namespace fir {
 #include "flang/Optimizer/Transforms/Passes.h.inc"
 } // namespace fir

-#define DEBUG_TYPE "fopenmp-do-concurrent-conversion"
+#define DEBUG_TYPE "do-concurrent-conversion"
+#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE << "]: ")


Mmmh, OK. I don't think it is a good idea but apparently some MLIR folks do.

flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp

Meinersbur · 2024-08-06T12:10:40Z

From the spec:
C1143 A reference to an impure procedure shall not appear within a DO CONCURRENT construct.
which is expected, but what I cannot put my finger on is what constitutes "within a DO CONCURRENT". Does it include the concurrent-header?

I think they mean only the body. The header expressions can be evaluated once before entering any concurrent execution, so that should be fine. Unfortunately, flang doesn't seem to do so atm. Note that this makes the way flang emits the loop violate the constraint:

  do concurrent(i=1:n)
    ub = bar(n*m, n/m)
    do concurrent(j=1:ub)
      a(i) = n
    end do
  end do

since the non-pure bar is in the outer loop's body.

ergawy · 2024-08-16T08:47:13Z

Update: I had a call with Kiran and Harish, and they will be working on making sure we emit code that is more consistent with the spec.

ergawy · 2024-08-22T05:33:07Z

@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue of do concurrent (as I mentioned above Kiran and Harish are looking into it), I want to move this forward for the non-controversial parts. Therefore, I removed the parts related to collecting mem free ops. This results in being a bit more conservative in detecting loop-nests (see the test file loop_nest_test.f90 for cases where we currently fail to detect perfect nests).

But at the same time, we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.

ergawy · 2024-09-05T03:40:43Z

@Meinersbur @mjklemm Even though we still need to resolve the code-gen issue of do concurrent (as I mentioned above Kiran and Harish are looking into it), I want to move this forward for the non-controversial parts. Therefore, I removed the parts related to collecting mem free ops. This results in being a bit more conservative in detecting loop-nests (see the test file loop_nest_test.f90 for cases where we currently fail to detect perfect nests).

But at the same time, we now have better detection logic than before. And the algorithm to do so is cleaner and easier to understand.

Ping Ping! 🔔 Please take a look when you have time 🙏

ergawy force-pushed the more_robust_loop_nest_detection branch from be5a0a0 to 83776bf Compare July 30, 2024 04:33

ergawy requested review from skatrak, jsjodin, abidh, mjklemm, raghavendhra, agozillon, DominikAdamski, kparzysz, TIFitis and pbhandar-amd July 30, 2024 04:35

TIFitis reviewed Jul 30, 2024

Meinersbur reviewed Aug 1, 2024

ergawy force-pushed the more_robust_loop_nest_detection branch from 83776bf to 4c3d9f1 Compare August 2, 2024 05:05

Meinersbur reviewed Aug 6, 2024

ergawy force-pushed the more_robust_loop_nest_detection branch 2 times, most recently from fbbf705 to 8df3bac Compare August 22, 2024 05:30

ergawy force-pushed the more_robust_loop_nest_detection branch 2 times, most recently from 71ece67 to c41c26a Compare August 22, 2024 05:43

mjklemm reviewed Aug 22, 2024

ergawy force-pushed the more_robust_loop_nest_detection branch 2 times, most recently from a4e6df3 to ef20b85 Compare August 27, 2024 15:53

ergawy force-pushed the more_robust_loop_nest_detection branch from ef20b85 to 2447f0a Compare August 31, 2024 13:32
