
Support reading ANSI day time interval type from CSV source #4927

Merged: 9 commits into NVIDIA:branch-22.04 on Mar 18, 2022

Conversation

@res-life (Collaborator) commented Mar 10, 2022:

Contributes to #4146.

Support reading the ANSI day-time interval type from the CSV source.
This PR implements the conversion on the plugin side for now.
If cuDF supports this natively in the future, this should be updated; see rapidsai/cudf#10356.

Signed-off-by: Chong Gao res_life@163.com

Signed-off-by: Chong Gao <res_life@163.com>
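For context, a minimal PySpark sketch (not part of this PR; the file path, column name, and values are illustrative assumptions) of reading an ANSI day-time interval column from CSV. Each CSV cell holds an interval literal string such as INTERVAL '1 02:03:04.123456' DAY TO SECOND:

# Hypothetical example; requires Spark 3.3+ for day-time interval CSV reads.
from pyspark.sql import SparkSession
from pyspark.sql.types import DayTimeIntervalType, StructField, StructType

spark = SparkSession.builder.master("local[*]").getOrCreate()

schema = StructType([
    StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.SECOND))
])

# With the RAPIDS plugin installed and spark.rapids.sql.enabled=true,
# this CSV scan is what the PR moves onto the GPU.
df = spark.read.schema(schema).csv("/tmp/day-time-interval.csv")
df.show(truncate=False)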
@res-life (Collaborator, Author) commented:

build

@res-life changed the title from "Support read ANSI day time interval type from CSV source" to "Support reading ANSI day time interval type from CSV source" on Mar 10, 2022
@res-life (Collaborator, Author) commented:

build

@res-life (Collaborator, Author) commented:

build

@sameerz added the audit_3.3.0 (Audit related tasks for 3.3.0) label on Mar 11, 2022
@sameerz added this to the Feb 28 - Mar 18 milestone on Mar 11, 2022
@res-life (Collaborator, Author) commented:

build

@res-life (Collaborator, Author) commented:

Filed two Spark issues:

- Overflow occurs when reading ANSI day time interval from CSV file: https://issues.apache.org/jira/browse/SPARK-38520
- The second range is not [0, 59] in the day time ANSI interval: https://issues.apache.org/jira/browse/SPARK-38324

Review threads (outdated, resolved) on integration_tests/src/main/python/csv_test.py and integration_tests/src/main/python/data_gen.py.

Review thread on:

def supportCsvRead(dt: DataType): Boolean = false

def csvRead(cv: ColumnVector, dt: DataType): ColumnVector =

Collaborator:

Suggested change:
def csvRead(cv: ColumnVector, dt: DataType): ColumnVector =
def csvRead(cv: cudf.ColumnVector, dt: DataType): cudf.ColumnVector =

Otherwise it will conflict with PR #4926, which imports org.apache.spark.sql.vectorized.ColumnVector.

Collaborator:

Same as above: #4926 has been merged, so there should be a conflict here now.

Review thread on:

@@ -15,6 +15,7 @@
 */
package com.nvidia.spark.rapids.shims

import ai.rapids.cudf.ColumnVector

Collaborator:

Suggested change:
import ai.rapids.cudf.ColumnVector
import ai.rapids.cudf

Collaborator (Author):

done

Collaborator:

I see nothing has changed. The PR I mentioned has been merged, so there should be a conflict here now.

@res-life (Collaborator, Author) commented:

build

@res-life (Collaborator, Author) commented:

Performance test result:

GPU: 5,057 ms    CPU: 30,569 ms    Ratio (CPU/GPU): ~6x

The test file is 1.7 GB with 12,320,000 rows and 10 columns; it was built by duplicating day-time-interval.csv (from the tests module) many times.

The SQL is:
spark.sql("select count(c01) + count(c02) + count(c03) + count(c04) + count(c05) + count(c06) + count(c07) + count(c08) + count(c09) + count(c10) from tbl").show()

Full code:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DayTimeIntervalType, StructField, StructType}

def dayTime(s: Byte, e: Byte): DayTimeIntervalType = DayTimeIntervalType(s, e)

val schema = StructType(Seq(
  StructField("c01", dayTime(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)),
  StructField("c02", dayTime(DayTimeIntervalType.DAY, DayTimeIntervalType.HOUR)),
  StructField("c03", dayTime(DayTimeIntervalType.DAY, DayTimeIntervalType.MINUTE)),
  StructField("c04", dayTime(DayTimeIntervalType.DAY, DayTimeIntervalType.SECOND)),
  StructField("c05", dayTime(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)),
  StructField("c06", dayTime(DayTimeIntervalType.HOUR, DayTimeIntervalType.MINUTE)),
  StructField("c07", dayTime(DayTimeIntervalType.HOUR, DayTimeIntervalType.SECOND)),
  StructField("c08", dayTime(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)),
  StructField("c09", dayTime(DayTimeIntervalType.MINUTE, DayTimeIntervalType.SECOND)),
  StructField("c10", dayTime(DayTimeIntervalType.SECOND, DayTimeIntervalType.SECOND))
))

val path = "/tmp/my-tmp/large.csv"

spark.conf.set("spark.rapids.sql.enabled", false)
var start = System.currentTimeMillis()
spark.read.schema(schema).csv(path).createOrReplaceTempView("tbl")
spark.sql("select count(c01)  + count(c02) + count(c03) + count(c04)  + count(c05) + count(c06) + count(c07) + count(c08)  + count(c09) + count(c10) from tbl").show()
println((System.currentTimeMillis() - start) + "ms")


spark.conf.set("spark.rapids.sql.enabled", true)
start = System.currentTimeMillis()
spark.read.schema(schema).csv(path).createOrReplaceTempView("tbl")
spark.sql("select count(c01)  + count(c02) + count(c03) + count(c04)  + count(c05) + count(c06) + count(c07) + count(c08)  + count(c09) + count(c10) from tbl").show()
println((System.currentTimeMillis() - start) + "ms")

@res-life (Collaborator, Author) commented:

build

Review thread on:

seconds = 0
microseconds = 0

if (start_field, end_field) == ("day", "day"):

Collaborator:

I think it would be better to use an if/elif/else chain, which should perform better:

if ...:
    ...
elif ...:
    ...
elif ...:
    ...
else:
    ...

Collaborator:

Can we do this totally differently? There is just so much copy/paste between each part of the code. We know that the Python code is really only converting the value to microseconds. So can we just generate a random number of microseconds, up to max_micros, and then truncate it to the proper time? Also, can we convert the "string" start/end fields into something simpler to write code for?

# in __init__
DAY = 0
HOUR = 1
MIN = 2
SEC = 3
fields_to_look_at = ["day", "hour", "minute", "second"]
si = fields_to_look_at.index(start_field)
ei = fields_to_look_at.index(end_field)
assert si <= ei

self.hasDays = si <= DAY and ei >= DAY
self.hasHours = si <= HOUR and ei >= HOUR
self.hasMin = si <= MIN and ei >= MIN
self.hasSec = si <= SEC and ei >= SEC
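For reference, a minimal standalone sketch of the "generate microseconds, then truncate" idea suggested above (the function name, bounds, and truncation behavior are assumptions for illustration, not the actual data_gen.py change; it also ignores the start field and the generator's seeding/null handling):

# Hypothetical illustration of generating a random interval as microseconds
# and truncating it to the requested end field.
import random
from datetime import timedelta

MICROS_PER_SECOND = 1000 * 1000
MICROS_PER_MINUTE = 60 * MICROS_PER_SECOND
MICROS_PER_HOUR = 60 * MICROS_PER_MINUTE
MICROS_PER_DAY = 24 * MICROS_PER_HOUR

# Granularity implied by the end field: anything finer gets zeroed out.
END_FIELD_GRANULARITY = {
    "day": MICROS_PER_DAY,
    "hour": MICROS_PER_HOUR,
    "minute": MICROS_PER_MINUTE,
    "second": 1,  # seconds keep their microsecond fraction
}

def gen_day_time_interval(end_field, max_micros=100 * MICROS_PER_DAY):
    micros = random.randint(-max_micros, max_micros)
    granularity = END_FIELD_GRANULARITY[end_field]
    # Truncate toward zero so only fields down to end_field remain.
    truncated = (abs(micros) // granularity) * granularity
    return timedelta(microseconds=truncated if micros >= 0 else -truncated)

# Example: a value that only carries day and hour components.
print(gen_day_time_interval("hour"))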

@res-life (Collaborator, Author) commented:

@revans2 please help review.

Further review threads on integration_tests/src/main/python/data_gen.py (resolved).

@res-life (Collaborator, Author) commented:

build

@res-life (Collaborator, Author) commented:

The last comment: #4927 (comment)

@res-life (Collaborator, Author) commented:

Will resolve the conflict after #4946 is merged, because #4946 also conflicts with this PR.

@res-life (Collaborator, Author) commented Mar 17, 2022:

Filed a Spark issue: "Interval types are not truncated to the expected endField when creating a DataFrame via Duration" (https://issues.apache.org/jira/browse/SPARK-38577)

@res-life (Collaborator, Author) commented:

build

@res-life (Collaborator, Author) commented:

build

Review thread on:

}

/**
 * TODO: Blocked by Spark overflow issue: https://issues.apache.org/jira/browse/SPARK-38520

Collaborator:

Do we do the right thing with the overflow? If so, then we need to document this.

Collaborator (Author):

Yes, the doc should be updated; that will be handled in a follow-up PR: #4981

@res-life merged commit 3dfd03c into NVIDIA:branch-22.04 on Mar 18, 2022
@res-life deleted the support-ansi-intervals-for-csv branch on Apr 2, 2022
@nartal1 mentioned this pull request on Apr 5, 2022
Labels: audit_3.3.0 (Audit related tasks for 3.3.0)