Document agg pushdown on ORC file limitation [skip ci] #4957

Merged 2 commits on Mar 16, 2022

Conversation

amahussein
Collaborator

closes #4950

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

  • document the expected failure caused by Spark 3.3.0's assumption that every ORC file has file statistics.

  • document the usage of the new Spark 3.3.0 feature introduced in SPARK-34960

    • the limitations with RAPIDS when this feature is enabled
    • how to avoid the SparkException
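
As a rough illustration of the last bullet, the sketch below builds the `spark-submit` arguments that switch the pushdown off. It assumes that the feature from SPARK-34960 is controlled by the Spark setting `spark.sql.orc.aggregatePushdown`, and that disabling it is the intended way to avoid the SparkException when reading ORC files that lack file statistics. The helper name `orc_pushdown_conf_args` is hypothetical and is not part of Spark or the RAPIDS Accelerator.

```python
from typing import List

# Assumption: this is the Spark 3.3.0 setting added by SPARK-34960.
ORC_AGG_PUSHDOWN_KEY = "spark.sql.orc.aggregatePushdown"


def orc_pushdown_conf_args(enabled: bool) -> List[str]:
    """Return spark-submit --conf arguments for the pushdown setting.

    Hypothetical helper for illustration only.
    """
    value = "true" if enabled else "false"
    return ["--conf", f"{ORC_AGG_PUSHDOWN_KEY}={value}"]


if __name__ == "__main__":
    # Print the arguments to append to a spark-submit command line
    # when the ORC files being read may lack file statistics.
    print(" ".join(orc_pushdown_conf_args(enabled=False)))
```

The same key could presumably be set on an existing session instead of on the command line; the helper only keeps the key name in one place.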

@amahussein amahussein self-assigned this Mar 15, 2022
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

**Limitations With RAPIDS**

[CUDF](https://github.com/rapidsai/cudf) still does not support writing the whole file statistics into ORC files. The status of this feature request
Collaborator

Can we say that the RAPIDS Accelerator does not support this? Technically CUDF does, just not for chunked writes, which is what we are doing. Or perhaps just say that we are working with CUDF to support this feature and you can track it here...

Collaborator Author

I made the changes.

BTW, there is another CUDF issue, 10075, which reports that statistics are missing without being specific to the mode. I commented on both issues to check whether they are duplicates or whether CUDF does not support file statistics at all.

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
@amahussein amahussein changed the title Document agg pushdown on ORC file limitation Document agg pushdown on ORC file limitation [skip ci] Mar 15, 2022
@amahussein amahussein added the documentation Improvements or additions to documentation label Mar 15, 2022
@amahussein
Collaborator Author

build

6 similar comments
@amahussein
Collaborator Author

build

@amahussein
Collaborator Author

build

@amahussein
Collaborator Author

build

@sameerz
Collaborator

sameerz commented Mar 15, 2022

build

@razajafri
Collaborator

build

@amahussein
Collaborator Author

build

@razajafri
Collaborator

The PR has [skip ci] in the title. Why do you expect it to run CI?

@amahussein amahussein merged commit f1c4024 into NVIDIA:branch-22.04 Mar 16, 2022
@amahussein
Collaborator Author

amahussein commented Mar 16, 2022

> The PR has [skip ci] in the title. Why do you expect it to run ci?

I could not build previous PRs a couple of weeks ago.
I wanted to get the authentication fixed. Skipping CI is fine, since I won't consume resources just to test my authentication.

@sameerz sameerz added this to the Feb 28 - Mar 18 milestone Mar 18, 2022
Development

Successfully merging this pull request may close these issues.

[DOC] Compatibility with Spark-330 AggregatePushDown on ORC files
4 participants