[BUG] java.lang.NullPointerException when exporting delta table #1197
Comments
Thanks @martinstuder for reporting. We haven't seen this before, so I'll investigate.
@martinstuder I'm trying to reproduce this, but haven't been able to yet.
Was this the first write to this path, i.e. the table didn't exist previously? Can you tell me the schema of the partitionBy column - is it just a timestamp? Based on your comment that no jobs failed, this exception happened on the driver side. Can you reproduce this, or was it just a one-time thing?
Looking at the last few lines here, it seems it's just trying to collect the result back to the driver, but the deserializer doesn't like what it got back. DatasetSparkResult is a Databricks-specific class for which we don't have source. Would you perhaps be able to tell me the last operation before the write? I'm wondering if that operation produced results that don't match what Databricks expected.
@tgravescs Reconstructing the code path, it basically comes down to the following (see the sketch below): union two source data frames read from parquet, add an extra discriminator column to both, and then write to delta with a different partitioning scheme. The issue happened on the first write to that path. We've recently changed things to have the source data frames in delta as well, where we read them without specifying a schema explicitly. I'll check whether this has any impact.
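The original snippet didn't survive in this thread; a minimal PySpark sketch of the pipeline described above, with placeholder paths, column names, and partition column, might look like this:

```python
from pyspark.sql import functions as F

# Two source data frames read from parquet (paths are placeholders)
df_a = spark.read.parquet("/data/source_a")
df_b = spark.read.parquet("/data/source_b")

# Add a discriminator column to each side before the union
df_a = df_a.withColumn("source", F.lit("a"))
df_b = df_b.withColumn("source", F.lit("b"))

# Union and write to delta with a different partitioning scheme
(df_a.union(df_b)
     .write
     .format("delta")
     .partitionBy("event_date")  # placeholder partition column
     .mode("overwrite")
     .save("/data/target_delta"))
```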
Just a guess: is it trying to read the metadata (*.json) back in to generate some log/diagnostic message, but failing on the generated metadata?
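One way to sanity-check that guess (a hypothetical inspection, not something either party ran here) is to read the Delta transaction log back directly. Delta stores each commit as a zero-padded JSON file under `_delta_log/`, and its entries carry exactly the `metaData`/`protocol` structs seen in the log output later in this thread:

```python
# Inspect the first Delta log entry for the table (path is a placeholder).
# Each commit file contains metaData, protocol, and add/remove actions.
log_df = spark.read.json("/data/target_delta/_delta_log/00000000000000000000.json")
log_df.printSchema()
log_df.show(truncate=False)
```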
Thanks for the update. I'll give it another try to reproduce using delta file input and see what is going on. I suspect you are right about the metadata, or it's recording something like an input version, since it knows it's delta format.
Disabling parquet input/output acceleration also doesn't help.
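The exact flags used were cut from the comment above; assuming the usual spark-rapids config names for the 0.2/0.3 era, disabling parquet acceleration would look like:

```python
# Disable GPU-accelerated parquet reads and writes so parquet I/O
# falls back to the CPU (spark-rapids format config names assumed).
spark.conf.set("spark.rapids.sql.format.parquet.read.enabled", "false")
spark.conf.set("spark.rapids.sql.format.parquet.write.enabled", "false")
```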
Yeah, it's likely your write wasn't happening on the GPU anyway; at least in my testing, writing to delta uses SaveIntoDataSourceCommand com.databricks.sql.transaction.tahoe.sources.DeltaDataSource, and the GPU doesn't know how to do the delta format. I'm still trying to reproduce. Are any of the column types complex types, like array? Or is it all Long, Int, String types? We are adding some support for complex types in 0.3, and since you said this worked with 0.2, I'm wondering if something there is not complete yet or is causing incompatibilities with Databricks.
Sorry, missed it earlier. I think it's actually in your log, and it looks like you are using structs:
20/12/02 19:35:39 INFO FileSourceStrategy: Output Data Schema: struct<metaData: struct<id: string, name: string, description: string, format: struct<provider: string, options: map<string,string>>, schemaString: string ... 6 more fields>, protocol: struct<minReaderVersion: int, minWriterVersion: int>>
The actual data frames are all just int, long, double and byte. The schema you are referring to seems to be the schema that delta uses for its internal metadata.
@martinstuder unfortunately I still can't reproduce; would you be able to try to get a small reproducible case? I'm not sure if it's the data or perhaps something in the delta configuration or setup. I tried both our latest 0.3.0 and went back to the commit you had, and I've tried a bunch of different things. Perhaps you could try this sample I found on Databricks to see if it reproduces for you? I tried with both saveAsTable and just save/load, so feel free to change those below and update the paths as needed.
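The sample itself wasn't preserved in this thread; a reconstruction along the lines described, covering both the saveAsTable and the save/load variants with placeholder names and paths, might look like:

```python
from pyspark.sql import functions as F

# Small synthetic partitioned data set
df = spark.range(0, 1000).withColumn("part", F.col("id") % 10)

# Variant 1: managed table via saveAsTable (table name is a placeholder)
df.write.format("delta").partitionBy("part").mode("overwrite").saveAsTable("repro_delta")
spark.table("repro_delta").count()

# Variant 2: path-based save/load (path is a placeholder)
df.write.format("delta").partitionBy("part").mode("overwrite").save("/tmp/repro_delta")
spark.read.format("delta").load("/tmp/repro_delta").count()
```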
Also, just to verify: if you use the 0.2.0 build with the exact same setup, you don't see the error?
Also, if it's possible, could you send me the explain output for the failure? You can go to the SQL tab in the UI; the text representation is at the bottom.
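If code is easier than the UI, the same text plan can be pulled programmatically; a minimal example, assuming Spark 3.0+ and that `df` is the frame right before the write:

```python
# Print the physical plan for the frame being written; "formatted" mode
# (Spark 3.0+) gives the text representation shown in the SQL tab.
df.explain(mode="formatted")
```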
@martinstuder Sorry, you can ignore the above questions. I was finally able to reproduce it; will investigate more tomorrow.
I'm working on a workaround for this where we have the delta log queries use the CPU. |
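Until that lands, a blunt user-side workaround (an assumption on my part, not the fix described above) is to turn the plugin off around the Delta write so everything, including the Delta log queries, runs on the CPU:

```python
# Temporarily disable the RAPIDS plugin so the write (and the Delta log
# queries it triggers) run entirely on the CPU, then re-enable it.
spark.conf.set("spark.rapids.sql.enabled", "false")
df.write.format("delta").mode("overwrite").save("/data/target_delta")  # placeholder path
spark.conf.set("spark.rapids.sql.enabled", "true")
```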
@martinstuder I checked in a fix for this; would you be able to try it out? You would have to build with the latest branch.
Hi @tgravescs, I can confirm that I can successfully export delta tables using a current build of the 0.3 branch. Thanks a lot for your efforts!
Describe the bug
When saving a data frame using the delta format, I get the following exception:
Steps/Code to reproduce bug
I haven't been able to reduce the issue to a small reproducible example yet. The Python stacktrace highlights the following location:
Interestingly, the Spark UI does not show any failed jobs or stages, and the actual data write seems to have happened. Judging from the above stacktrace, though, the delta log/metadata may be corrupt, since the issue seems to happen when writing the delta log.
Expected behavior
Data frame export to delta succeeds without exception.
Environment details
Additional context
There seems to be no issue with rapids-4-spark 0.2.0.