Ensure column names are valid when writing benchmark query results to file #1247

Merged on Dec 3, 2020 (3 commits)

andygrove
Contributor

Ensure column names are valid when writing benchmark query results to file.

This closes #1246

@andygrove andygrove added the benchmark Benchmarking, benchmarking tools label Dec 2, 2020
@andygrove andygrove added this to the Nov 23 - Dec 4 milestone Dec 2, 2020
@andygrove andygrove self-assigned this Dec 2, 2020
@andygrove andygrove changed the title Ensure column names are valid when writing benchmark query results to file WIP: Ensure column names are valid when writing benchmark query results to file Dec 2, 2020
@abellina
Collaborator

abellina commented Dec 2, 2020

Is this an issue for CPU plans as well? Or just the GPU plans? From the exceptions in the issue it does look like Spark is complaining about this.

@andygrove
Contributor Author

Yes, this happens on CPU as well:

org.apache.spark.sql.AnalysisException: Attribute name "round((sun_sales1 / sun_sales2), 2)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
	at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:574)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.$anonfun$setSchema$2(ParquetWriteSupport.scala:472)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.$anonfun$setSchema$2$adapted(ParquetWriteSupport.scala:472)
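
For readers who land here from the linked issue, the failure and one possible fix can be sketched in a few lines. This is a minimal illustration in plain Python, not the code merged in this PR: the character set is copied from the error message above, while the function names and the `c<index>` replacement scheme are hypothetical choices made for the example. It keeps names that are already valid and unique, and renames everything else.

```python
from typing import List, Set

# Characters the Spark error message above lists as invalid in a Parquet field name.
INVALID_CHARS = set(" ,;{}()\n\t=")


def is_valid_column_name(name: str) -> bool:
    """Return True if the name contains none of the rejected characters."""
    return not any(ch in INVALID_CHARS for ch in name)


def sanitize_column_names(names: List[str]) -> List[str]:
    """Keep valid, not-yet-seen names; replace the rest with generated names.

    The 'c<index>' naming scheme is an assumption made for this sketch.
    """
    seen: Set[str] = set()
    result: List[str] = []
    for index, name in enumerate(names):
        if is_valid_column_name(name) and name not in seen:
            candidate = name
        else:
            candidate = f"c{index}"
        # A generated name may collide with a name already taken; extend it until unique.
        while candidate in seen:
            candidate += "_"
        seen.add(candidate)
        result.append(candidate)
    return result
```

With PySpark, the result could presumably be applied before the write with something like `df.toDF(*sanitize_column_names(df.columns))`; aliasing the expressions in the query itself, as the error message suggests, is the other option.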

@andygrove andygrove changed the title WIP: Ensure column names are valid when writing benchmark query results to file Ensure column names are valid when writing benchmark query results to file Dec 2, 2020
@abellina
Collaborator

abellina commented Dec 2, 2020

build

1 similar comment
@sameerz
Collaborator

sameerz commented Dec 3, 2020

build

@abellina
Collaborator

abellina commented Dec 3, 2020

This PR needs to be upmerged to include the latest changes in branch-0.3

@andygrove
Contributor Author

build

@andygrove
Contributor Author

tests timed out - build

@andygrove
Contributor Author

build

@andygrove andygrove merged commit 4679591 into NVIDIA:branch-0.3 Dec 3, 2020
@andygrove andygrove deleted the tpcds-column-names branch December 3, 2020 21:53
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
… file (NVIDIA#1247)

* Enforce unique column names when writing query output to file

* preserve valid names

Signed-off-by: Andy Grove <andygrove@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1247)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels
benchmark Benchmarking, benchmarking tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Many TPC-DS benchmarks fail when writing to Parquet
3 participants