
[BUG][AUDIT] SPARK-45652 - SPJ: Handle empty input partitions after dynamic filtering #9743

Closed
tgravescs opened this issue Nov 16, 2023 · 1 comment · Fixed by #9962
Labels
audit_3.4.2 Audit related tasks for 3.4.2 audit_3.5.1 Audit related tasks for 3.5.1 audit_4.0.0 Audit related tasks for 4.0.0 bug Something isn't working

Comments

@tgravescs (Collaborator)

Describe the bug
https://issues.apache.org/jira/browse/SPARK-45652 was part of the storage partitioned join (SPJ) feature. It changes BatchScanExec, for which we have equivalent code in GpuBatchScanExec, so we should pull this change in.

apache/spark@75aed566018
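
For context, a minimal sketch of the shape of the fix (hypothetical code, not the actual upstream patch; planGroupedPartitions and numGroups are illustrative names): when dynamic filtering removes every input partition of a scan that reports KeyGroupedPartitioning, the scan should still produce one (empty) group per advertised partition, otherwise the reported partitioning and the RDD it actually produces disagree.

import org.apache.spark.sql.connector.read.InputPartition

// Hypothetical sketch: preserve the partition count promised by
// KeyGroupedPartitioning even when dynamic filtering leaves no splits.
def planGroupedPartitions(
    filtered: Seq[Seq[InputPartition]],  // groups left after runtime filtering
    numGroups: Int): Seq[Seq[InputPartition]] =
  if (filtered.isEmpty) Seq.fill(numGroups)(Seq.empty)  // empty groups, not an empty Seq
  else filtered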

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify audit_3.4.2 Audit related tasks for 3.4.2 audit_3.5.1 Audit related tasks for 3.5.1 audit_4.0.0 Audit related tasks for 4.0.0 labels Nov 16, 2023
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 21, 2023
@razajafri (Collaborator)

I tried reproducing this but am unable to see it in local mode. Here are the steps I took:

$> spark-homes/spark-3.4.1-bin-hadoop3/bin/spark-shell \
     --conf spark.sql.autoBroadcastJoinThreshold=-1 \
     --conf spark.sql.adaptive.enabled=false \
     --conf spark.sql.optimizer.dynamicPartitionPruning.enabled=true \
     --conf spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false \
     --conf spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=10
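
For reference, the same settings can also be applied from inside the shell; the comments note why each one matters for this repro (all keys are standard Spark SQL confs):

scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // force non-broadcast joins
scala> spark.conf.set("spark.sql.adaptive.enabled", "false")  // keep the static plan
scala> spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
scala> spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "false")  // allow DPP without broadcast reuse
scala> spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", "10")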

scala> import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf
scala> spark.sql("create table test_purchases (item_id long, price float, time timestamp) using parquet partitioned by (item_id)")
scala> spark.sql("create table test_table (id long, name string, price float, arrive_time timestamp) using parquet partitioned by (id)")
scala> spark.sql("insert into test_purchases values((19.5, cast('2020-02-01' as timestamp), 3))")
scala> spark.sql("insert into test_purchases values((11.0, cast('2020-01-01' as timestamp), 2))")
scala> spark.sql("insert into test_purchases values((45.0, cast('2020-01-15' as timestamp), 1))")
scala> spark.sql("insert into test_purchases values((44.0, cast('2020-01-15' as timestamp), 1))")
scala> spark.sql("insert into test_purchases values((42.0, cast('2020-01-01' as timestamp), 1))")
scala> spark.sql("insert into test_table values(('cc', 15.5, cast('2020-02-01' as timestamp), 3))")
scala> spark.sql("insert into test_table values(('bb', 10.5, cast('2020-01-01' as timestamp), 2))")
scala> spark.sql("insert into test_table values(('bb', 10.0, cast('2020-01-01' as timestamp), 2))")
scala> spark.sql("insert into test_table values(('aa', 41.0, cast('2020-01-15' as timestamp), 1))")
scala> spark.sql("insert into test_table values(('aa', 40.0, cast('2020-01-01' as timestamp), 1))")
scala> Seq(true, false).foreach { pushDownValues =>
     |   Seq(true, false).foreach { partiallyClustered =>
     |     spark.conf.set(SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key, partiallyClustered)
     |     spark.conf.set(SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key, pushDownValues)
     |     spark.sql("select /*+ REPARTITION(3)*/ p.price from test_table i, test_purchases p WHERE i.id = p.item_id AND i.price > 50.0").show
     |   }
     | }
+-----+
|price|
+-----+
+-----+

+-----+
|price|
+-----+
+-----+

+-----+
|price|
+-----+
+-----+

+-----+
|price|
+-----+
+-----+

After discussing this with @tgravescs, the next step he suggested was to build Spark 3.4.1 locally and run the unit tests.
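
Two things worth checking first (assumptions on my part, not verified). First, the i.price > 50.0 predicate excludes every inserted row (the largest price is 41.0), so the empty result sets above are expected regardless of SPJ behavior. Second, SPARK-45652 patches BatchScanExec, the DSv2 scan node, while tables created with using parquet in the session catalog normally plan as the DSv1 FileSourceScanExec, so a plain-parquet repro may never reach the SPJ code path at all, on either the CPU plan or GpuBatchScanExec. The physical plan makes this easy to confirm; if it shows FileSourceScanExec rather than BatchScan, the fix under audit is not being exercised:

scala> spark.sql("select /*+ REPARTITION(3)*/ p.price from test_table i, test_purchases p WHERE i.id = p.item_id AND i.price > 50.0").explain(true)

Upstream's coverage for this change most likely lives in the DSv2 SPJ suites (e.g. KeyGroupedPartitioningSuite), which is consistent with building Spark 3.4.1 locally and running the unit tests as suggested.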
