Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize column names in ParquetCachedBatchSerializer before writing to Parquet [databricks] #4258

Merged
merged 3 commits into from
Dec 7, 2021

Conversation

razajafri
Copy link
Collaborator

@razajafri razajafri commented Dec 1, 2021

There is a case where we use the original schema sent by Spark which has names that haven't been sanitized.

This PR sets the names before creating a Hadoop conf.

Signed-off-by: Raza Jafri rjafri@nvidia.com

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri razajafri self-assigned this Dec 1, 2021
@razajafri
Copy link
Collaborator Author

build

@razajafri razajafri changed the base branch from branch-22.02 to branch-21.12 December 2, 2021 00:33
@razajafri
Copy link
Collaborator Author

build

@gerashegalov
Copy link
Collaborator

can you create, reference an issue that is being solved by this PR?

// We want to change the original schema to have the new names as well
private def sanitizeColumnNames(originalSchema: Seq[Attribute],
schemaToCopyNamesFrom: Seq[Attribute]): Seq[Attribute] = {
originalSchema.zip(schemaToCopyNamesFrom).map(t => t._1.withName(t._2.name))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer pattern-match to _1, _2, e.g.:

    originalSchema.zip(schemaToCopyNamesFrom).map { 
      case (origAttr, newAttr) => origAttr.withName(newAttr.name)
    }

@sameerz sameerz added the bug Something isn't working label Dec 3, 2021
@sameerz sameerz added this to the Nov 30 - Dec 10 milestone Dec 3, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

build

@gerashegalov
Copy link
Collaborator

#3879 is an epic. What point of #3879 are you working on? It sounds that you are fixing a bug. Can you point to a github bug issue with the problem description (repro and exception) or create one if there is none yet.

@razajafri
Copy link
Collaborator Author

razajafri commented Dec 6, 2021

#3879 is an epic. What point of #3879 are you working on? It sounds that you are fixing a bug. Can you point to a github bug issue with the problem description (repro and exception) or create one if there is none yet.

Sorry, I pasted the wrong link. This fixes a bug

jlowe
jlowe previously approved these changes Dec 6, 2021
Copy link
Collaborator

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally 👍. Minor nitpick.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

build

@sameerz sameerz merged commit d03e51d into NVIDIA:branch-21.12 Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants