
[BUG] koalas.sql fails with java.lang.ArrayIndexOutOfBoundsException #2079

Closed
viadea opened this issue Apr 5, 2021 · 4 comments · Fixed by #2097
Labels: bug (Something isn't working) · P0 (Must have for release)

Comments
viadea (Collaborator) commented Apr 5, 2021

Describe the bug

koalas.sql fails with java.lang.ArrayIndexOutOfBoundsException.
A sample stack trace:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
	at ai.rapids.cudf.Table.<init>(Table.java:62)
	at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:498)
	at com.nvidia.spark.rapids.GpuWindowExpression.$anonfun$evaluateRowBasedWindowExpression$1(GpuWindowExpression.scala:188)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuWindowExpression.withResource(GpuWindowExpression.scala:132)
	at com.nvidia.spark.rapids.GpuWindowExpression.evaluateRowBasedWindowExpression(GpuWindowExpression.scala:187)
	at com.nvidia.spark.rapids.GpuWindowExpression.columnarEval(GpuWindowExpression.scala:167)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
	at com.nvidia.spark.rapids.GpuAlias.columnarEval(namedExpressions.scala:93)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:50)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:161)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:158)
	at scala.collection.immutable.Stream.foreach(Stream.scala:533)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:158)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:193)
	at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:49)
	at com.nvidia.spark.rapids.GpuProjectExec$.projectAndClose(basicPhysicalOperators.scala:40)
	at com.nvidia.spark.rapids.GpuWindowExec.$anonfun$doExecuteColumnar$1(GpuWindowExec.scala:149)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.next(limit.scala:67)
	at com.nvidia.spark.rapids.GpuBaseLimitExec$$anon$1.next(limit.scala:61)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:206)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:222)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more

Steps/Code to reproduce bug

Minimal reproduction using ks.sql:

import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3],
                    'b': [4, 5, 6]})
ks.sql("SELECT * FROM {kdf} WHERE a > 1")
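For context on why a window exec appears for such a plain SELECT (visible in the plans posted later in this thread): Koalas attaches a default sequential index (`__index_level_0__`) to the result, computed as `row_number() - 1` over a nondeterministic ordering column, and that window expression is what reaches the GPU path. A minimal pure-Python sketch of the indexing semantics only (illustrative; `attach_default_index` is a hypothetical helper, not koalas code):

```python
def attach_default_index(rows):
    """Pair each surviving row with a 0-based position, mimicking the
    row_number() - 1 window expression koalas generates for its index."""
    return [(i, row) for i, row in enumerate(rows)]

# Filter a > 1 on rows of (a, b), then attach the default index,
# as the SELECT ... WHERE a > 1 query above conceptually does.
rows = [(1, 4), (2, 5), (3, 6)]
filtered = [r for r in rows if r[0] > 1]
indexed = attach_default_index(filtered)
# indexed == [(0, (2, 5)), (1, (3, 6))]
```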

Some other Koalas functions fail with a similar stack trace. A second example:

import numpy as np
import databricks.koalas as ks

kdf = ks.DataFrame({'A': np.random.rand(5),
                    'B': np.random.rand(5)})

def square(x) -> ks.Series[np.float64]:
    return x ** 2

kdf.apply(square)

Expected behavior

Both of the examples above should run successfully, as they do in CPU mode.

Environment details (please complete the following information)

  • Environment location: Standalone (Spark 3.1.1)
  • RAPIDS Accelerator 0.4.1
  • Koalas 1.7


viadea added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Apr 5, 2021
sameerz added the P0 (Must have for release) label and removed ? - Needs Triage on Apr 6, 2021
GaryShen2008 (Collaborator) commented Apr 7, 2021

The explain output for the GPU run is below.

== Parsed Logical Plan ==
Project [__index_level_0__#228L, a#209L, b#210L]
+- Project [__index_level_0__#228L, a#209L, b#210L, monotonically_increasing_id() AS __natural_order__#233L]
   +- Window [(cast(row_number() windowspecdefinition(_nondeterministic#229L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - cast(1 as bigint)) AS __index_level_0__#228L, a#209L, b#210L], [_nondeterministic#229L ASC NULLS FIRST]
      +- Project [a#209L, b#210L, monotonically_increasing_id() AS _nondeterministic#229L]
         +- Project [a#209L, b#210L]
            +- Project [a#209L, b#210L]
               +- Filter (a#209L > cast(1 as bigint))
                  +- SubqueryAlias koalas_140297832854608
                     +- Project [a#209L, b#210L]
                        +- Project [__index_level_0__#208L, a#209L, b#210L, monotonically_increasing_id() AS __natural_order__#214L]
                           +- LogicalRDD [__index_level_0__#208L, a#209L, b#210L], false

== Analyzed Logical Plan ==
__index_level_0__: bigint, a: bigint, b: bigint
Project [__index_level_0__#228L, a#209L, b#210L]
+- Project [__index_level_0__#228L, a#209L, b#210L, monotonically_increasing_id() AS __natural_order__#233L]
   +- Window [(cast(row_number() windowspecdefinition(_nondeterministic#229L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - cast(1 as bigint)) AS __index_level_0__#228L, a#209L, b#210L], [_nondeterministic#229L ASC NULLS FIRST]
      +- Project [a#209L, b#210L, monotonically_increasing_id() AS _nondeterministic#229L]
         +- Project [a#209L, b#210L]
            +- Project [a#209L, b#210L]
               +- Filter (a#209L > cast(1 as bigint))
                  +- SubqueryAlias koalas_140297832854608
                     +- Project [a#209L, b#210L]
                        +- Project [__index_level_0__#208L, a#209L, b#210L, monotonically_increasing_id() AS __natural_order__#214L]
                           +- LogicalRDD [__index_level_0__#208L, a#209L, b#210L], false

== Optimized Logical Plan ==
Window [(cast(row_number() windowspecdefinition(_nondeterministic#229L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#228L, a#209L, b#210L], [_nondeterministic#229L ASC NULLS FIRST]
+- Project [a#209L, b#210L, monotonically_increasing_id() AS _nondeterministic#229L]
   +- Filter (a#209L > 1)
      +- LogicalRDD [__index_level_0__#208L, a#209L, b#210L], false

== Physical Plan ==
*(2) RunningWindowFunction [(cast(row_number() windowspecdefinition(_nondeterministic#229L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#228L, a#209L, b#210L], [_nondeterministic#229L ASC NULLS FIRST]
+- GpuColumnarToRow false
   +- GpuSort [_nondeterministic#229L ASC NULLS FIRST], false, RequireSingleBatch, 0
      +- GpuCoalesceBatches RequireSingleBatch
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpusinglepartitioning$(), false, [id=#772]
               +- GpuProject [a#209L, b#210L, monotonically_increasing_id() AS _nondeterministic#229L]
                  +- GpuCoalesceBatches TargetSize(2147483647)
                     +- GpuFilter (a#209L > 1)
                        +- GpuRowToColumnar TargetSize(2147483647)
                           +- *(1) Scan ExistingRDD arrow[__index_level_0__#208L,a#209L,b#210L]

GaryShen2008 (Collaborator) commented Apr 7, 2021

The explain output for the CPU run:

== Parsed Logical Plan ==
Project [__index_level_0__#22L, a#3L, b#4L]
+- Project [__index_level_0__#22L, a#3L, b#4L, monotonically_increasing_id() AS __natural_order__#27L]
   +- Window [(cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - cast(1 as bigint)) AS __index_level_0__#22L, a#3L, b#4L], [_nondeterministic#23L ASC NULLS FIRST]
      +- Project [a#3L, b#4L, monotonically_increasing_id() AS _nondeterministic#23L]
         +- Project [a#3L, b#4L]
            +- Project [a#3L, b#4L]
               +- Filter (a#3L > cast(1 as bigint))
                  +- SubqueryAlias koalas_139858738474320
                     +- Project [a#3L, b#4L]
                        +- Project [__index_level_0__#2L, a#3L, b#4L, monotonically_increasing_id() AS __natural_order__#8L]
                           +- LogicalRDD [__index_level_0__#2L, a#3L, b#4L], false

== Analyzed Logical Plan ==
__index_level_0__: bigint, a: bigint, b: bigint
Project [__index_level_0__#22L, a#3L, b#4L]
+- Project [__index_level_0__#22L, a#3L, b#4L, monotonically_increasing_id() AS __natural_order__#27L]
   +- Window [(cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - cast(1 as bigint)) AS __index_level_0__#22L, a#3L, b#4L], [_nondeterministic#23L ASC NULLS FIRST]
      +- Project [a#3L, b#4L, monotonically_increasing_id() AS _nondeterministic#23L]
         +- Project [a#3L, b#4L]
            +- Project [a#3L, b#4L]
               +- Filter (a#3L > cast(1 as bigint))
                  +- SubqueryAlias koalas_139858738474320
                     +- Project [a#3L, b#4L]
                        +- Project [__index_level_0__#2L, a#3L, b#4L, monotonically_increasing_id() AS __natural_order__#8L]
                           +- LogicalRDD [__index_level_0__#2L, a#3L, b#4L], false

== Optimized Logical Plan ==
Window [(cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#22L, a#3L, b#4L], [_nondeterministic#23L ASC NULLS FIRST]
+- Project [a#3L, b#4L, monotonically_increasing_id() AS _nondeterministic#23L]
   +- Filter (a#3L > 1)
      +- LogicalRDD [__index_level_0__#2L, a#3L, b#4L], false

== Physical Plan ==
*(2) RunningWindowFunction [(cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#22L, a#3L, b#4L], [_nondeterministic#23L ASC NULLS FIRST]
+- *(2) Sort [_nondeterministic#23L ASC NULLS FIRST], false, 0
   +- Exchange SinglePartition, true, [id=#56]
      +- *(1) Project [a#3L, b#4L, monotonically_increasing_id() AS _nondeterministic#23L]
         +- *(1) Filter (a#3L > 1)
            +- *(1) Scan ExistingRDD arrow[__index_level_0__#2L,a#3L,b#4L]

GaryShen2008 (Collaborator) commented Apr 7, 2021

The RAPIDS explain log when only calling .explain():

!NOT_FOUND <RunningWindowFunctionExec> cannot run on GPU because no GPU enabled version of operator class com.databricks.sql.execution.window.RunningWindowFunctionExec could be found
  @Expression <Alias> (cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS __index_level_0__#22L could run on GPU
    @Expression <Subtract> (cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) could run on GPU
      @Expression <Cast> cast(row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) could run on GPU
        @Expression <WindowExpression> row_number() windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) could run on GPU
          @Expression <RowNumber> row_number() could run on GPU
          @Expression <WindowSpecDefinition> windowspecdefinition(_nondeterministic#23L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) could run on GPU
            @Expression <SortOrder> _nondeterministic#23L ASC NULLS FIRST could run on GPU
              @Expression <AttributeReference> _nondeterministic#23L could run on GPU
            @Expression <SpecifiedWindowFrame> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()) could run on GPU
              @Expression <UnboundedPreceding$> unboundedpreceding$() could run on GPU
              @Expression <CurrentRow$> currentrow$() could run on GPU
      @Expression <Literal> 1 could run on GPU
  @Expression <AttributeReference> a#3L could run on GPU
  @Expression <AttributeReference> b#4L could run on GPU
  @Expression <SortOrder> _nondeterministic#23L ASC NULLS FIRST could run on GPU
    @Expression <AttributeReference> _nondeterministic#23L could run on GPU
  *Exec <SortExec> will run on GPU
    *Expression <SortOrder> _nondeterministic#23L ASC NULLS FIRST will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <SinglePartition$> will run on GPU
      *Exec <ProjectExec> will run on GPU
        *Expression <Alias> monotonically_increasing_id() AS _nondeterministic#23L will run on GPU
          *Expression <MonotonicallyIncreasingID> monotonically_increasing_id() will run on GPU
        *Exec <FilterExec> will run on GPU
          *Expression <GreaterThan> (a#3L > 1) will run on GPU
          !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
            @Expression <AttributeReference> __index_level_0__#2L could run on GPU
            @Expression <AttributeReference> a#3L could run on GPU
            @Expression <AttributeReference> b#4L could run on GPU

The explain log when running the actual query (without .explain()):

21/04/07 08:18:14 WARN GpuOverrides: 
*Exec <GlobalLimitExec> will run on GPU
  *Exec <LocalLimitExec> will run on GPU
    *Exec <WindowExec> will run on GPU
      *Expression <Alias> (cast(row_number() windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) AS col_0#602L will run on GPU
        *Expression <Subtract> (cast(row_number() windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) - 1) will run on GPU
          *Expression <Cast> cast(row_number() windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) as bigint) will run on GPU
            *Expression <WindowExpression> row_number() windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) will run on GPU
              *Expression <RowNumber> row_number() will run on GPU
              *Expression <WindowSpecDefinition> windowspecdefinition(_nondeterministic#572L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) will run on GPU
                *Expression <SortOrder> _nondeterministic#572L ASC NULLS FIRST will run on GPU
                *Expression <SpecifiedWindowFrame> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()) will run on GPU
                  *Expression <UnboundedPreceding$> unboundedpreceding$() will run on GPU
                  *Expression <CurrentRow$> currentrow$() will run on GPU
      *Expression <Alias> a#552L AS col_1#603L will run on GPU
      *Expression <Alias> b#553L AS col_2#604L will run on GPU
      *Expression <SortOrder> _nondeterministic#572L ASC NULLS FIRST will run on GPU
      *Exec <SortExec> will run on GPU
        *Expression <SortOrder> _nondeterministic#572L ASC NULLS FIRST will run on GPU
        *Exec <ShuffleExchangeExec> will run on GPU
          *Partitioning <SinglePartition$> will run on GPU
          *Exec <ProjectExec> will run on GPU
            *Expression <Alias> monotonically_increasing_id() AS _nondeterministic#572L will run on GPU
              *Expression <MonotonicallyIncreasingID> monotonically_increasing_id() will run on GPU
            *Exec <FilterExec> will run on GPU
              *Expression <GreaterThan> (a#552L > 1) will run on GPU
              !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
                @Expression <AttributeReference> __index_level_0__#551L could run on GPU
                @Expression <AttributeReference> a#552L could run on GPU
                @Expression <AttributeReference> b#553L could run on GPU

The SQL details page looks as below:
[Screenshot: Databricks Shell - Details for Query 1]

wbo4958 (Collaborator) commented Apr 7, 2021

It seems boundRowProjectList didn't include the orderBy column, which results in an empty Table from the projection.
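This diagnosis can be illustrated with a simplified Python sketch (the names `project` and `make_table` are hypothetical stand-ins that only mimic the shapes of the Scala code paths in the stack trace, not the actual GpuWindowExpression implementation): if the bound row-projection list omits the orderBy column, the projection yields zero columns, and the table constructor then indexes into an empty column array.

```python
def project(batch, bound_projection_list):
    """Keep only the columns named in the bound projection list."""
    return {name: batch[name] for name in bound_projection_list if name in batch}

def make_table(columns):
    """Mimics a table constructor that touches columns[0]: an empty
    column list raises IndexError, analogous to the
    ArrayIndexOutOfBoundsException from ai.rapids.cudf.Table."""
    cols = list(columns.values())
    _ = cols[0]  # fails here when the projection produced no columns
    return cols

# The batch carries only the orderBy column, but the bound row-projection
# list forgot to include it, so nothing survives the projection:
batch = {"_nondeterministic": [10, 20, 30]}
empty = project(batch, bound_projection_list=[])
try:
    make_table(empty)
except IndexError:
    print("reproduced: empty table from project")
```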
