Add abfs and abfss to the cloud scheme #4082

Merged (3 commits) on Nov 12, 2021

wbo4958 (Collaborator) commented Nov 11, 2021

Signed-off-by: Bobby Wang wbo4958@gmail.com

wbo4958 (Collaborator, Author) commented Nov 11, 2021

build

"generally would be total separate from the executors and likely have a higher I/O read " +
"cost. Many times the cloud filesystems also get better throughput when you have multiple " +
"readers in parallel. This is used with spark.rapids.sql.format.parquet.reader.type")
"filesystems. Schemes already included: dbfs, s3, s3a, s3n, wasbs, gs, abfs, abfss. Cloud " +

Member commented:

Nit: It would be nice not to have to keep these manually in sync, and also to list them alphabetically. For example, move the default cloud scheme list from GpuMultiFileReader to here with something like:

/** List of schemes that are always considered cloud storage schemes */
val DEFAULT_CLOUD_SCHEMES = Seq("abfs", "abfss", "dbfs", "gs", "s3", "s3a", "s3n", "wasbs")

val CLOUD_SCHEMES = conf("spark.rapids.cloudSchemes")
[...]
    s"filesystems. Schemes already included: ${DEFAULT_CLOUD_SCHEMES.mkString(", ")}. Cloud " +

Then GpuMultiFileReader can create its cloud schemes HashSet from the list in RapidsConf.
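
The suggestion above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the plugin's actual code: `RapidsConfSketch`, `MultiFileReaderSketch`, `allCloudSchemes`, and `isCloudPath` are hypothetical names standing in for RapidsConf and GpuMultiFileReader, and the parsing of the user-supplied comma separated list is an assumption.

```scala
import java.net.URI
import scala.collection.immutable.HashSet

// Hypothetical stand-in for RapidsConf: the only place the defaults are listed.
object RapidsConfSketch {
  /** List of schemes that are always considered cloud storage schemes */
  val DEFAULT_CLOUD_SCHEMES: Seq[String] =
    Seq("abfs", "abfss", "dbfs", "gs", "s3", "s3a", "s3n", "wasbs")

  // The doc text is built from the same Seq, so it cannot drift from the defaults.
  val cloudSchemesDoc: String =
    "Comma separated list of additional URI schemes that are to be considered cloud based " +
      s"filesystems. Schemes already included: ${DEFAULT_CLOUD_SCHEMES.mkString(", ")}."
}

// Hypothetical stand-in for the reader side (GpuMultiFileReader in the PR).
object MultiFileReaderSketch {
  // Built from the list in the conf object instead of a second hard-coded list.
  private val defaultCloudSchemes: HashSet[String] =
    HashSet(RapidsConfSketch.DEFAULT_CLOUD_SCHEMES: _*)

  /** Defaults plus any user-supplied schemes (assumed comma separated). */
  def allCloudSchemes(userSchemes: Option[String]): Set[String] = {
    val extra = userSchemes.toSeq
      .flatMap(_.split(","))
      .map(_.trim.toLowerCase)
      .filter(_.nonEmpty)
    defaultCloudSchemes ++ extra
  }

  /** True when the path's URI scheme is in the given set; paths without a scheme are not cloud. */
  def isCloudPath(path: String, cloudSchemes: Set[String]): Boolean =
    Option(new URI(path).getScheme).map(_.toLowerCase).exists(cloudSchemes.contains)
}
```

With this shape, adding a scheme means editing one `Seq`, and both the documentation string and the lookup set pick it up. Note that `new URI(...)` throws on strings that are not valid URIs (for example, paths containing spaces), so a real implementation would likely rely on the path type it already has.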

Collaborator Author replied:

Thx, Done

jlowe (Member) previously approved these changes Nov 11, 2021 and left a comment:

Looks OK to me. I'd like to see the nit addressed, but it can be a followup.

tgravescs (Collaborator) commented:

Did we run any perf tests on this, just to make sure it performs better?

wbo4958 (Collaborator, Author) commented Nov 12, 2021

build

wbo4958 (Collaborator, Author) commented Nov 12, 2021

> Did we run any perf tests on this, just to make sure it performs better?

@tgravescs

I just ran a perf test locally comparing COALESCING and cloud reading on 5000 ORC files (1.3G in total) residing in Azure.

| Reader | Total time |
| --- | --- |
| Cloud reading | 506.653s |
| Coalescing reading | 696.223s |
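
To put the two timings above in relative terms, the arithmetic can be sketched as below. The numbers are copied from the comment; the object and value names are illustrative only. From this single local run, cloud reading comes out at roughly 1.37x faster, or about 27% less wall-clock time, than coalescing reading.

```scala
object PerfComparison {
  // Total times reported in the comment, in seconds (one local run each).
  val cloudSeconds: Double = 506.653
  val coalescingSeconds: Double = 696.223

  // How many times longer the coalescing run took than the cloud run.
  val speedup: Double = coalescingSeconds / cloudSeconds

  // Fraction of the coalescing run's wall-clock time saved by cloud reading.
  val timeSaved: Double = (coalescingSeconds - cloudSeconds) / coalescingSeconds

  def main(args: Array[String]): Unit =
    println(f"speedup: $speedup%.2fx, time saved: ${timeSaved * 100}%.1f%%")
}
```

Because this is one measurement on one dataset, the ratio is best read as an indication rather than a benchmark result.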

jlowe (Member) commented Nov 12, 2021

build

wbo4958 merged commit 602e754 into NVIDIA:branch-21.12 on Nov 12, 2021
sameerz added the "task" label (Work required that improves the product but is not user facing) on Nov 16, 2021
wbo4958 deleted the abfs-scheme branch on February 17, 2022 at 00:33