
Cost-based optimizer #1616

Merged: andygrove merged 15 commits into NVIDIA:branch-0.5 on Mar 3, 2021

Conversation

@andygrove (Contributor) commented Jan 28, 2021

Signed-off-by: Andy Grove andygrove@nvidia.com

This PR implements a simple cost-based optimizer that will attempt to avoid transitioning from CPU to GPU when there is no benefit to running on GPU.

There are two rules implemented:

  1. Do not move from CPU to GPU for an operator that has no benefit on GPU, such as a projection that merely reorders attributes.
  2. Review sections of the plan that are on GPU and move them back to the CPU if the overall cost no longer justifies running on the GPU.

There are two aspects to the work here - a cost model, and a mechanism for applying the cost model. We may want to experiment with different cost models or even consider allowing users to provide their own cost models tuned to their workloads.
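To make the pluggable-cost-model idea concrete, here is a minimal sketch; the trait, method names, and constants are hypothetical illustrations, not the plugin's actual API:

```scala
// Hypothetical sketch of a pluggable cost model; all names and constants
// are illustrative, not the actual spark-rapids API.
trait CostModel {
  // Relative cost of running this operator on GPU (1.0 = same as CPU).
  def operatorCost(operatorName: String): Double
  // Cost of moving data across the CPU/GPU boundary once.
  def transitionCost: Double
}

// A user-supplied model could favor the GPU for joins but give no credit
// to cheap operators like projections that only reorder columns.
class SimpleCostModel extends CostModel {
  def operatorCost(operatorName: String): Double = operatorName match {
    case "SortMergeJoin" => 0.5  // assume joins benefit from the GPU
    case "Project"       => 1.0  // no benefit for a column reorder
    case _               => 0.8
  }
  def transitionCost: Double = 0.1
}
```

A workload-specific model would simply be another implementation of the same trait, which is what allowing user-provided cost models would amount to.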

Explain output

There is a new indicator character, $, to show which operators/expressions have been forced onto the CPU by the cost-based optimizer:

$Exec <SortExec> cannot run on GPU because Removed by cost-based optimizer
  $Expression <SortOrder> more_strings_2#58 ASC NULLS FIRST cannot run on GPU because Removed by cost-based optimizer
    $Expression <AttributeReference> more_strings_2#58 cannot run on GPU because Removed by cost-based optimizer
  !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient

This is followed by a summary of the optimizations that were applied:

It is not worth moving to the GPU just for operator: Exchange hashpartitioning(more_strings_2#58, 200), true, [id=#89]
It is not worth moving to the GPU just for operator: Project [more_strings_2#58, more_strings_1#44]
It is no longer worth keeping this section on GPU; gpuCost=5.1000000000000005, cpuCost=5.0:
SortMergeJoin [strings#40], [strings#54], Inner, (cast(more_strings_2#58 as timestamp) < cast(more_strings_1#44 as timestamp))
  Sort [strings#54 ASC NULLS FIRST], false, 0

This is still WIP but getting closer.

@andygrove andygrove added this to the Jan 18 - Jan 29 milestone Jan 28, 2021
@andygrove andygrove self-assigned this Jan 28, 2021
@andygrove andygrove added feature request New feature or request performance A performance related task/issue labels Jan 28, 2021
if (numTransitions > 0) {
  if (plan.canThisBeReplaced) {
    // transition from CPU to GPU
    val transitionCost = numTransitions * costModel.transitionToGpuCost(plan)
Collaborator:
I think we need to be clearer on what transitionToGpuCost is actually doing. Right now it is mostly a place holder, but in the future it could look at the schema to determine the cost of transitioning. We might also want to guess at the size of the data that will transfer. To me it feels like we would want to do something like.

plan.childPlans.filter(_.canThisBeReplaced != plan.canThisBeReplaced).map(costModel.transitionToGpuCost(_)).sum

That way the transition cost looks at the output of each child plan, so it knows what data it needs to transition, instead of some vague notion of what the input might be. Or we could not pass in the plan at all, if we just want it to be a single value, and change things when the models get better.
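The suggestion above can be sketched like this, with `Plan` as a simplified stand-in for the plugin's plan wrapper and a placeholder per-child cost:

```scala
// Simplified stand-in for the plugin's plan wrapper; not the real type.
case class Plan(canThisBeReplaced: Boolean, childPlans: Seq[Plan])

// Placeholder: a real model might inspect the child's output schema and
// estimated row count here to price the actual data transfer.
def transitionToGpuCost(child: Plan): Double = 1.0

// Charge one transition cost per child whose placement differs from the
// parent's, so the cost reflects what actually crosses the boundary.
def transitionCost(plan: Plan): Double =
  plan.childPlans
    .filter(_.canThisBeReplaced != plan.canThisBeReplaced)
    .map(transitionToGpuCost)
    .sum
```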

if (plan.canThisBeReplaced) {
  plan.willNotWorkOnGpu(reason)
}
plan.childPlans.foreach(plan => forceOnCpuRecursively(plan, reason))
Collaborator:

We need this to stop before it hits the end of the plan; otherwise, if the last stage should not be on the GPU, the entire query will end up on the CPU.

Contributor Author:

I'm not sure if I have completely addressed your concern here, but I changed it to only replace sections of the plan if the final operator could have been replaced.

Contributor Author:

I see the limitations of this approach now that I am testing with TPC-DS, and I am rewriting this part. The current approach was too naive: we want to selectively force sections of the plan back onto the CPU, not everything below a certain point.

Collaborator:

Right now we put everything back onto the CPU until we hit something that is already on the CPU. I think that works fine for most of the stupid things we do now (like put something on the GPU just to rearrange columns). But I don't think it will handle cases where something is more expensive on the GPU than the CPU and we may want to exclude just a part of the plan (instead of all of it). But I think we can tackle that problem when we see it in practice.
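A selective variant of that fallback could compare costs per subtree instead of moving everything below a point back to the CPU. A hypothetical sketch, where `Plan` and its `gpuCost`/`cpuCost` fields are assumed helpers rather than the plugin's real types:

```scala
// Simplified stand-in for a plan node with precomputed per-subtree costs.
case class Plan(name: String, var onGpu: Boolean, children: Seq[Plan],
                gpuCost: Double, cpuCost: Double)

// Move a node back to CPU only when its own GPU cost exceeds its CPU
// cost; keep walking the children either way, so one expensive subtree
// can be excluded without dragging the rest of the plan off the GPU.
def selectivelyForceOnCpu(plan: Plan): Unit = {
  if (plan.onGpu && plan.gpuCost > plan.cpuCost) {
    plan.onGpu = false
  }
  plan.children.foreach(selectivelyForceOnCpu)
}
```

As the comment above notes, this finer-grained handling can wait until such cases show up in practice.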

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove changed the title Experimental cost-based optimizer [WIP / Proof-of-concept] Cost-based optimizer Feb 11, 2021
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@sameerz sameerz linked an issue Feb 16, 2021 that may be closed by this pull request
@andygrove andygrove changed the base branch from branch-0.4 to branch-0.5 February 16, 2021 15:10
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove marked this pull request as ready for review February 18, 2021 22:04
…as part of EXPLAIN output

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove changed the title Cost-based optimizer WIP: Cost-based optimizer Feb 19, 2021
@andygrove (Contributor Author) commented:

@revans2 @jlowe This PR is still WIP but I have updated the description to show the direction I am going in with the EXPLAIN output and it would be good to get some feedback on this before I go too far.

@jlowe (Member) commented Feb 19, 2021:

I like the new explain output, but I'm wondering if it could quickly get very verbose on large plans. We may want the ability to show the fact that the CBO removed an operation from the GPU but suppress the gory details that come out later. On a large plan, there could easily be a ton of "after" output by the CBO that will be a chore to line up with the original. Seems like it would be easy to extend the existing spark.rapids.sql.explain config to have another setting value. We already do ALL, NOT_ON_GPU, maybe some CBO-specific levels?

…o GpuOverrides

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@revans2 (Collaborator) commented Feb 24, 2021:

* There is a new setting `ALL_CBO` for `spark.rapids.sql.explain` that enables the CBO explain output. I'm not crazy about the name and `ALL` no longer means everything, which is slightly confusing.

Could we rename things then: have ALL be the same as ALL_CBO is now, and introduce a new one that removes CBO details from ALL, perhaps ALL_NO_CBO. If we want CBO details with other levels of explain, then we should have a separate config for printing CBO details.
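The proposed rename could look roughly like this, sketched as an enumeration plus a predicate for whether CBO details print; the names are hypothetical and the plugin's actual config values may differ:

```scala
// Hypothetical explain levels after the proposed rename: ALL includes
// CBO details, while ALL_NO_CBO suppresses them.
object ExplainLevel extends Enumeration {
  val NONE, NOT_ON_GPU, ALL_NO_CBO, ALL = Value
}

// Only the full ALL level would print the cost-based optimizer details.
def showCboDetails(level: ExplainLevel.Value): Boolean =
  level == ExplainLevel.ALL
```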

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@jlowe (Member) previously approved these changes Mar 1, 2021, leaving a comment:

I think this is a decent starting point worth checkpointing, but it would be good to hear from @revans2 before merging.

@jlowe jlowe modified the milestones: Feb 16 - Feb 26, Mar 1 - Mar 12 Mar 1, 2021
@revans2 (Collaborator) previously approved these changes Mar 2, 2021, leaving a comment:

I agree that it is time to checkpoint this. I think the next steps really need to be towards getting it to the point that we can turn it on by default so we stop doing stupid things (and document it). After that we can really dig into what it would take to have it make smart choices.

@andygrove (Contributor Author) commented:

build

@andygrove andygrove dismissed stale reviews from revans2 and jlowe via d3fbe56 March 3, 2021 00:43
@andygrove (Contributor Author) commented:

build

@andygrove (Contributor Author) commented:

The tests in this PR depend on nullableStringsDf, which was recently removed, causing the previous build failure. I have added it back.

@andygrove (Contributor Author) commented:

@jlowe @revans2 I had to merge latest from branch-0.5 and re-instate a method in SparkQueryCompareTestSuite to fix CI.

@andygrove andygrove merged commit 32213fa into NVIDIA:branch-0.5 Mar 3, 2021
@andygrove andygrove deleted the cost-based-optimizer-poc branch March 3, 2021 14:33
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Successfully merging this pull request may close these issues.

[FEA] Implement a cost-based optimizer
5 participants