
[BUG] koalas.sql fails with java.lang.ArrayIndexOutOfBoundsException #2079

Closed
viadea opened this issue Apr 5, 2021 · 4 comments · Fixed by #2097
Labels: bug (Something isn't working) · P0 (Must have for release)

Comments
viadea (Collaborator) commented Apr 5, 2021

Describe the bug

koalas.sql fails with java.lang.ArrayIndexOutOfBoundsException.
A sample stack trace:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
	at ai.rapids.cudf.Table.<init>(Table.java:62)
	at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:498)
	at com.nvidia.spark.rapids.GpuWindowExpression.$anonfun$evaluateRowBasedWindowExpression$1(GpuWindowExpression.scala:188)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuWindowExpression.withResource(GpuWindowExpression.scala:132)
	at com.nvidia.spark.rapids.GpuWindowExpression.evaluateRowBasedWindowExpression(GpuWindowExpression.scala:187)
	at com.nvidia.spark.rapids.GpuWindowExpression.columnarEval(GpuWindowExpression.scala:167)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
	at com.nvidia.spark.rapids.GpuAlias.columnarEval(namedExpressions.scala:93)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:50)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:161)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:158)
	at scala.collection.immutable.Stream.foreach(Stream.scala:533)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:158)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:193)
	at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:49)
	at com.nvidia.spark.rapids.GpuProjectExec$.projectAndClose(basicPhysicalOperators.scala:40)
	at com.nvidia.spark.rapids.GpuWindowExec.$anonfun$doExecuteColumnar$1(GpuWindowExec.scala:149)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.next(limit.scala:67)
	at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.next(limit.scala:61)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:206)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:222)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more

Steps/Code to reproduce bug

Minimal reproduction using ks.sql:

import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3],
                    'b': [4, 5, 6]})
ks.sql("SELECT * FROM {kdf} WHERE a > 1")
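For context on why a window exec appears for such a plain SELECT (visible in the plans posted later in this thread): Koalas attaches a default sequential index (`__index_level_0__`) to the result, computed as `row_number() - 1` over a nondeterministic ordering column, and that window expression is what reaches the GPU path. A minimal pure-Python sketch of the indexing semantics only (illustrative; `attach_default_index` is a hypothetical helper, not koalas code):

```python
def attach_default_index(rows):
    """Pair each surviving row with a 0-based position, mimicking the
    row_number() - 1 window expression koalas generates for its index."""
    return [(i, row) for i, row in enumerate(rows)]

# Filter a > 1 on rows of (a, b), then attach the default index,
# as the SELECT ... WHERE a > 1 query above conceptually does.
rows = [(1, 4), (2, 5), (3, 6)]
filtered = [r for r in rows if r[0] > 1]
indexed = attach_default_index(filtered)
# indexed == [(0, (2, 5)), (1, (3, 6))]
```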

Some other Koalas functions fail with a similar stack trace. A second example:

import numpy as np
import databricks.koalas as ks

kdf = ks.DataFrame({'A': np.random.rand(5),
                    'B': np.random.rand(5)})

def square(x) -> ks.Series[np.float64]:
    return x ** 2

kdf.apply(square)

Expected behavior

Both of the examples above should run successfully, as they do in CPU mode.

Environment details (please complete the following information)

  • Environment location: Standalone (Spark 3.1.1)
  • RAPIDS Accelerator 0.4.1
  • Koalas 1.7


viadea added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Apr 5, 2021
sameerz added the P0 (Must have for release) label and removed ? - Needs Triage on Apr 6, 2021
GaryShen2008 (Collaborator) commented Apr 7, 2021

The explain output for the GPU run is below.

== Parsed Logical Plan ==
Project [__index_level_0__#228L, a#209L, b#210L]
+- Project [__index_level_0__#228L, a#209L, b#210L, monotonically_increasing_id() AS __natural_order__#233L]
   +- Window [(cast(row_number() windowspecdefinition(_nondeterministic#229L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - cast(1 as bigint)) AS __index_level_0__#228L, a#209L, b#210L], [_nondeterministic#229L ASC NULLS FIRST]
      +- Project [a#209L, b#210L, monotonically_increasing_id() AS _nondeterministic#229L]
         +- Project [a#209L, b#210L]
            +- Project [a#209L, b#210L]
               +- Filter (a#209L > cast(1 as bigint))
                  +- SubqueryAlias koalas_140297832854608
                     +- Project [a#209L, b#210L]
                        +- Project [__index_level_0__#208L, a#209L, b#210L, monotonically_increasing_id() AS __natural_order__#214L]
                           +- LogicalRDD [__index_level_0__#208L, a#209L, b#210L], false

== Analyzed Logical Plan ==
__index_level_0__: bigint, a: bigint, b: bigint
Project [__index_level_0__#228L, a#209L, b#210L]
+- Project [__index_level_0__#228L, a#209L, b#210L, monotonically_increasing_id() AS __natural_order__#233L]
   +- Window [(cast(row_number() windowspecdefinition(_nondeterministic#229L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - cast(1 as bigint)) AS __index_level_0__#228L, a#209L, b#210L], [_nondeterministic#229L ASC NULLS FIRST]
      +- Project [a#209L, b#210L, monotonically_increasing_id() AS _nondeterministic#229L]
         +- Project [a#209L, b#210L]
            +- Project [a#209L, b#210L]
               +- Filter (a#209L > cast(1 as bigint))
                  +- SubqueryAlias koalas_140297832854608
                     +- Project [a#209L, b#210L]
                        +- Project [__index_level_0__#208L, a#209L, b#210L, monotonically_increasing_id() AS __natural_order__#214L]
                           +- LogicalRDD [__index_level_0__#208L, a#209L, b#210L], false

== Optimized Logical Plan ==
Window [(cast(row_number() windowspecdefinition(_nondeterministic#229L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#228L, a#209L, b#210L], [_nondeterministic#229L ASC NULLS FIRST]
+- Project [a#209L, b#210L, monotonically_increasing_id() AS _nondeterministic#229L]
   +- Filter (a#209L > 1)
      +- LogicalRDD [__index_level_0__#208L, a#209L, b#210L], false

== Physical Plan ==
*(2) RunningWindowFunction [(cast(row_number() windowspecdefinition(_nondeterministic#229L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#228L, a#209L, b#210L], [_nondeterministic#229L ASC NULLS FIRST]
+- GpuColumnarToRow false
   +- GpuSort [_nondeterministic#229L ASC NULLS FIRST], false, RequireSingleBatch, 0
      +- GpuCoalesceBatches RequireSingleBatch
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpusinglepartitioning$(), false, [id=#772]
               +- GpuProject [a#209L, b#210L, monotonically_increasing_id() AS _nondeterministic#229L]
                  +- GpuCoalesceBatches TargetSize(2147483647)
                     +- GpuFilter (a#209L > 1)
                        +- GpuRowToColumnar TargetSize(2147483647)
                           +- *(1) Scan ExistingRDD arrow[__index_level_0__#208L,a#209L,b#210L]

GaryShen2008 (Collaborator) commented Apr 7, 2021

The explain output for the CPU run:

== Parsed Logical Plan ==
Project [__index_level_0__#22L, a#3L, b#4L]
+- Project [__index_level_0__#22L, a#3L, b#4L, monotonically_increasing_id() AS __natural_order__#27L]
   +- Window [(cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - cast(1 as bigint)) AS __index_level_0__#22L, a#3L, b#4L], [_nondeterministic#23L ASC NULLS FIRST]
      +- Project [a#3L, b#4L, monotonically_increasing_id() AS _nondeterministic#23L]
         +- Project [a#3L, b#4L]
            +- Project [a#3L, b#4L]
               +- Filter (a#3L > cast(1 as bigint))
                  +- SubqueryAlias koalas_139858738474320
                     +- Project [a#3L, b#4L]
                        +- Project [__index_level_0__#2L, a#3L, b#4L, monotonically_increasing_id() AS __natural_order__#8L]
                           +- LogicalRDD [__index_level_0__#2L, a#3L, b#4L], false

== Analyzed Logical Plan ==
__index_level_0__: bigint, a: bigint, b: bigint
Project [__index_level_0__#22L, a#3L, b#4L]
+- Project [__index_level_0__#22L, a#3L, b#4L, monotonically_increasing_id() AS __natural_order__#27L]
   +- Window [(cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - cast(1 as bigint)) AS __index_level_0__#22L, a#3L, b#4L], [_nondeterministic#23L ASC NULLS FIRST]
      +- Project [a#3L, b#4L, monotonically_increasing_id() AS _nondeterministic#23L]
         +- Project [a#3L, b#4L]
            +- Project [a#3L, b#4L]
               +- Filter (a#3L > cast(1 as bigint))
                  +- SubqueryAlias koalas_139858738474320
                     +- Project [a#3L, b#4L]
                        +- Project [__index_level_0__#2L, a#3L, b#4L, monotonically_increasing_id() AS __natural_order__#8L]
                           +- LogicalRDD [__index_level_0__#2L, a#3L, b#4L], false

== Optimized Logical Plan ==
Window [(cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#22L, a#3L, b#4L], [_nondeterministic#23L ASC NULLS FIRST]
+- Project [a#3L, b#4L, monotonically_increasing_id() AS _nondeterministic#23L]
   +- Filter (a#3L > 1)
      +- LogicalRDD [__index_level_0__#2L, a#3L, b#4L], false

== Physical Plan ==
*(2) RunningWindowFunction [(cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#22L, a#3L, b#4L], [_nondeterministic#23L ASC NULLS FIRST]
+- *(2) Sort [_nondeterministic#23L ASC NULLS FIRST], false, 0
   +- Exchange SinglePartition, true, [id=#56]
      +- *(1) Project [a#3L, b#4L, monotonically_increasing_id() AS _nondeterministic#23L]
         +- *(1) Filter (a#3L > 1)
            +- *(1) Scan ExistingRDD arrow[__index_level_0__#2L,a#3L,b#4L]

GaryShen2008 (Collaborator) commented Apr 7, 2021

The RAPIDS explain log when only calling .explain():

!NOT_FOUND <RunningWindowFunctionExec> cannot run on GPU because no GPU enabled version of operator class com.databricks.sql.execution.window.RunningWindowFunctionExec could be found
  @Expression <Alias> (cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#22L could run on GPU
    @Expression <Subtract> (cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) could run on GPU
      @Expression <Cast> cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) could run on GPU
        @Expression <WindowExpression> row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) could run on GPU
          @Expression <RowNumber> row_number() could run on GPU
          @Expression <WindowSpecDefinition> windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) could run on GPU
            @Expression <SortOrder> _nondeterministic#23L ASC NULLS FIRST could run on GPU
              @Expression <AttributeReference> _nondeterministic#23L could run on GPU
            @Expression <SpecifiedWindowFrame> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()) could run on GPU
              @Expression <UnboundedPreceding$> unboundedpreceding$() could run on GPU
              @Expression <CurrentRow$> currentrow$() could run on GPU
      @Expression <Literal> 1 could run on GPU
  @Expression <AttributeReference> a#3L could run on GPU
  @Expression <AttributeReference> b#4L could run on GPU
  @Expression <SortOrder> _nondeterministic#23L ASC NULLS FIRST could run on GPU
    @Expression <AttributeReference> _nondeterministic#23L could run on GPU
  *Exec <SortExec> will run on GPU
    *Expression <SortOrder> _nondeterministic#23L ASC NULLS FIRST will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <SinglePartition$> will run on GPU
      *Exec <ProjectExec> will run on GPU
        *Expression <Alias> monotonically_increasing_id() AS _nondeterministic#23L will run on GPU
          *Expression <MonotonicallyIncreasingID> monotonically_increasing_id() will run on GPU
        *Exec <FilterExec> will run on GPU
          *Expression <GreaterThan> (a#3L > 1) will run on GPU
          !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
            @Expression <AttributeReference> __index_level_0__#2L could run on GPU
            @Expression <AttributeReference> a#3L could run on GPU
            @Expression <AttributeReference> b#4L could run on GPU

The explain log when running the actual query (without .explain()):

21/04/07 08:18:14 WARN GpuOverrides: 
*Exec <GlobalLimitExec> will run on GPU
  *Exec <LocalLimitExec> will run on GPU
    *Exec <WindowExec> will run on GPU
      *Expression <Alias> (cast(row_number() windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS col_0#602L will run on GPU
        *Expression <Subtract> (cast(row_number() windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) will run on GPU
          *Expression <Cast> cast(row_number() windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) will run on GPU
            *Expression <WindowExpression> row_number() windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) will run on GPU
              *Expression <RowNumber> row_number() will run on GPU
              *Expression <WindowSpecDefinition> windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) will run on GPU
                *Expression <SortOrder> _nondeterministic#572L ASC NULLS FIRST will run on GPU
                *Expression <SpecifiedWindowFrame> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()) will run on GPU
                  *Expression <UnboundedPreceding$> unboundedpreceding$() will run on GPU
                  *Expression <CurrentRow$> currentrow$() will run on GPU
      *Expression <Alias> a#552L AS col_1#603L will run on GPU
      *Expression <Alias> b#553L AS col_2#604L will run on GPU
      *Expression <SortOrder> _nondeterministic#572L ASC NULLS FIRST will run on GPU
      *Exec <SortExec> will run on GPU
        *Expression <SortOrder> _nondeterministic#572L ASC NULLS FIRST will run on GPU
        *Exec <ShuffleExchangeExec> will run on GPU
          *Partitioning <SinglePartition$> will run on GPU
          *Exec <ProjectExec> will run on GPU
            *Expression <Alias> monotonically_increasing_id() AS _nondeterministic#572L will run on GPU
              *Expression <MonotonicallyIncreasingID> monotonically_increasing_id() will run on GPU
            *Exec <FilterExec> will run on GPU
              *Expression <GreaterThan> (a#552L > 1) will run on GPU
              !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
                @Expression <AttributeReference> __index_level_0__#551L could run on GPU
                @Expression <AttributeReference> a#552L could run on GPU
                @Expression <AttributeReference> b#553L could run on GPU

The SQL details page looks as below:
[Screenshot: Databricks Shell - Details for Query 1]

wbo4958 (Collaborator) commented Apr 7, 2021

It seems boundRowProjectList didn't include the orderBy column, which results in an empty Table from the projection.
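This diagnosis can be illustrated with a simplified Python sketch (the names `project` and `make_table` are hypothetical stand-ins that only mimic the shapes of the Scala code paths in the stack trace, not the actual GpuWindowExpression implementation): if the bound row-projection list omits the orderBy column, the projection yields zero columns, and the table constructor then indexes into an empty column array.

```python
def project(batch, bound_projection_list):
    """Keep only the columns named in the bound projection list."""
    return {name: batch[name] for name in bound_projection_list if name in batch}

def make_table(columns):
    """Mimics a table constructor that touches columns[0]: an empty
    column list raises IndexError, analogous to the
    ArrayIndexOutOfBoundsException from ai.rapids.cudf.Table."""
    cols = list(columns.values())
    _ = cols[0]  # fails here when the projection produced no columns
    return cols

# The batch carries only the orderBy column, but the bound row-projection
# list forgot to include it, so nothing survives the projection:
batch = {"_nondeterministic": [10, 20, 30]}
empty = project(batch, bound_projection_list=[])
try:
    make_table(empty)
except IndexError:
    print("reproduced: empty table from project")
```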
