Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decimal Support for writing Parquet #1531

Merged
merged 7 commits into from
Jan 23, 2021

Conversation

razajafri
Copy link
Collaborator

@razajafri razajafri commented Jan 15, 2021

This PR adds support for writing Decimal types to Parquet file.

There is an issue in cudf at the time of writing this PR (rapidsai/cudf#7152) i.e. Decimals with precision < 10 are not able to be read back using Spark-cpu.

This depends on rapidsai/cudf#7153

Signed-off-by: Raza Jafri rjafri@nvidia.com

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@@ -32,16 +33,30 @@
writer_confs={'spark.sql.legacy.parquet.datetimeRebaseModeInWrite': 'CORRECTED',
'spark.sql.legacy.parquet.int96RebaseModeInWrite': 'CORRECTED'}

# https://github.com/rapidsai/cudf/issues/7152
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could write the tests and just xfail them and we should file a followup issue to track it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... ok. I will see if I can do that and still use our data_gens


parquet_ts_write_options = ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS']

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('reader_confs', reader_opt_confs)
@pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
@allow_non_gpu("CoalesceExec")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because CoalesceExec doesn't support Decimals yet

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to rather write a separate test for Decimals so we are at least testing coalesce with other types? we already have other tests testing coalesce though

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I don't want it to be ok if other types have that

@@ -41,6 +41,11 @@ object GpuParquetFileFormat {
spark: SparkSession,
options: Map[String, String],
schema: StructType): Option[GpuParquetFileFormat] = {

if(!schema.forall(field => GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit space after if



if(!schema.forall(field => GpuOverrides.isSupportedType(field.dataType))) {
meta.willNotWorkOnGpu("Not all datatypes are supported")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the very least print the data types here or tell us which one isn't supported

@tgravescs tgravescs added the feature request New feature or request label Jan 15, 2021
@tgravescs tgravescs added this to the Jan 4 - Jan 15 milestone Jan 15, 2021
@jlowe jlowe marked this pull request as draft January 15, 2021 21:34
@jlowe
Copy link
Member

jlowe commented Jan 15, 2021

Converting this to draft since it depends on a cudf change that is still pending.

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to update the static documentation in TypeChecks to update the Input/Output table to specify Parquet supports decimals.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>

parquet_ts_write_options = ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS']

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('reader_confs', reader_opt_confs)
@pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
def test_write_round_trip(spark_tmp_path, parquet_gens, v1_enabled_list, ts_type, reader_confs):
def test_parquet_write_round_trip(spark_tmp_path, parquet_gens, v1_enabled_list, ts_type,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need parquet in the name when its in parquet_write_test? just make it wrap inputs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it. I made the change to be able to execute the test by name otherwise it would run the orc test as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that is fine. you can also specify the test file. src/main/python/parquet_write_test.py -k test_write_round_trip

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's really helpful for future.

@@ -112,6 +118,7 @@ def test_compress_write_round_trip(spark_tmp_path, compress, v1_enabled_list, re

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
@allow_non_gpu("CoalesceExec")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assume this isn't needed if in parquet_write_gens_list?

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri razajafri marked this pull request as ready for review January 22, 2021 19:46
@razajafri
Copy link
Collaborator Author

@jlowe I have converted this from a draft as the cudf issue is merged. Appreciate the review. Can you PTAL?

jlowe
jlowe previously approved these changes Jan 22, 2021
@jlowe
Copy link
Member

jlowe commented Jan 22, 2021

build

1 similar comment
@jlowe
Copy link
Member

jlowe commented Jan 22, 2021

build

@@ -41,6 +41,13 @@ object GpuParquetFileFormat {
spark: SparkSession,
options: Map[String, String],
schema: StructType): Option[GpuParquetFileFormat] = {

val unSupportedTypes =
schema.filter(field => !GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider schema.filterNot for readability


val unSupportedTypes =
schema.filter(field => !GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))
if (!unSupportedTypes.isEmpty) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider if (unSupportedTypes.nonEmpty)

def precisionsList(t: DataType): List[Int] = {
t match {
case d: DecimalType => List(d.precision)
case s: StructType => s.flatMap(f => precisionsList(f.dataType)).toList
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could save the conversion toList if the return type for precisionsList were Seq[Int]

options: Map[String, String],
schema: StructType): Option[GpuOrcFileFormat] = {

val unSupportedTypes = schema.filter(field => !GpuOverrides.isSupportedType(field.dataType))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks the same as in GpuParquetFileFormat.scala, consider a shared utils.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not exactly the same. The predicate being tested is different. If you still think we can benefit from refactoring this I can do it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the difference is parameterizable but it's not that big of a deal given it's just a few lines, up to you

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

@gerashegalov Thanks for the review I have incorporated most of your comments in the PR. The only thing that I haven't done is the utility method.

@razajafri
Copy link
Collaborator Author

build

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@razajafri razajafri merged commit 088cd82 into NVIDIA:branch-0.4 Jan 23, 2021
@razajafri razajafri deleted the parquet_decimal branch January 23, 2021 00:54
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Decimal Support for writing Parquet

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* addressed review comments

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* updated static doc

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* generated supported_ops

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* addressed review comments

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Decimal Support for writing Parquet

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* addressed review comments

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* updated static doc

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* generated supported_ops

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* addressed review comments

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1531)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants