Add abfs and abfss to the cloud scheme #4082
Conversation
Signed-off-by: Bobby Wang <wbo4958@gmail.com>
build
```scala
"generally would be total separate from the executors and likely have a higher I/O read " +
"cost. Many times the cloud filesystems also get better throughput when you have multiple " +
"readers in parallel. This is used with spark.rapids.sql.format.parquet.reader.type")
"filesystems. Schemes already included: dbfs, s3, s3a, s3n, wasbs, gs, abfs, abfss. Cloud " +
```
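As a hedged sketch of what this config controls (the names `defaultSchemes`, `userSchemes`, and `isCloudPath` are illustrative, not the plugin's actual API): a reader can extract the scheme from a path URI and check it against the built-in schemes plus any extras the user supplies via `spark.rapids.cloudSchemes`.

```scala
import java.net.URI

// Built-in cloud schemes, as listed in the config doc string above.
val defaultSchemes = Set("dbfs", "s3", "s3a", "s3n", "wasbs", "gs", "abfs", "abfss")
// Illustrative extra scheme a user might add via spark.rapids.cloudSchemes.
val userSchemes = Set("oss")

// Treat a path as "cloud" when its URI scheme is in the combined set.
def isCloudPath(path: String): Boolean = {
  val scheme = Option(new URI(path).getScheme).getOrElse("")
  (defaultSchemes ++ userSchemes).contains(scheme.toLowerCase)
}
```

Paths without a scheme (plain local paths) fall through to `false`, so only explicitly cloud-addressed files get the multi-threaded cloud reading path.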
Nit: It would be nice not to have to keep these manually in sync, and also to list them alphabetically. For example, move the default cloud scheme list from GpuMultiFileReader to here with something like:

```scala
/** List of schemes that are always considered cloud storage schemes */
val DEFAULT_CLOUD_SCHEMES = Seq("abfs", "abfss", "dbfs", "gs", "s3", "s3a", "s3n", "wasbs")
val CLOUD_SCHEMES = conf("spark.rapids.cloudSchemes")
[...]
s"filesystems. Schemes already included: ${DEFAULT_CLOUD_SCHEMES.mkString(", ")}. Cloud " +
```

Then GpuMultiFileReader can create its cloud-schemes HashSet from the list in RapidsConf.
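The wiring the nit describes can be sketched as follows (the object names `RapidsConfSketch` and `GpuMultiFileReaderSketch` are stand-ins for the real classes, used here only to show the single-source-of-truth pattern):

```scala
// Single source of truth for the default cloud schemes, kept alphabetical.
object RapidsConfSketch {
  /** List of schemes that are always considered cloud storage schemes */
  val DEFAULT_CLOUD_SCHEMES: Seq[String] =
    Seq("abfs", "abfss", "dbfs", "gs", "s3", "s3a", "s3n", "wasbs")
}

object GpuMultiFileReaderSketch {
  // The reader builds its lookup set from the list in RapidsConfSketch,
  // so adding a scheme in one place updates both the doc string and the check.
  val cloudSchemes: Set[String] = RapidsConfSketch.DEFAULT_CLOUD_SCHEMES.toSet

  def isCloudScheme(scheme: String): Boolean = cloudSchemes.contains(scheme)
}
```

With this shape, the config's doc string can interpolate `DEFAULT_CLOUD_SCHEMES.mkString(", ")` and the reader's set can never drift out of sync with it.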
Thx, Done
Looks OK to me. I'd like to see the nit addressed, but it can be a follow-up.

Did we run any perf tests on this, just to make sure it performed better?
build
I just ran the perf test locally for COALESCING and cloud reading, over a total of 1.3 GB across 5000 ORC files residing in Azure.
build