Seeing performance differences of multi-threaded/coalesce/perfile Parquet reader type for a single file #1366
Comments
When I run this on YARN and just look at the Parquet read stats on our YARN cluster, I see the opposite: the multi-threaded reader has a total scan time larger than AUTO = COALESCING. But when I run the entire TPC-DS query 9 on the YARN cluster, using partitioned data, which based on the comments I assume is what was done in raplab, the multi-threaded reader is faster. The issue here is the partitioning. If I pull in the PR to fix the coalescing reader with partitioning (#1200), then AUTO = COALESCING is much faster. This was running the query with 10G TPC-DS data that is partitioned.
I ran q88 with similar results as well. Going by the path, it certainly looks like partitioned data. I'm guessing this is what is happening in raplab as well, but we need to verify.
One thing we could do for the 0.3 release is default it to the multi-threaded reader if we are worried too many people use partitioned data.
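The defaulting idea above could be sketched as follows. This is purely illustrative, not the plugin's actual selection logic; the function name, parameters, and returned strings are assumptions (only the reader-type values MULTITHREADED and COALESCING come from the config discussed in this issue):

```python
# Hypothetical sketch: prefer the multi-threaded reader when the input is
# partitioned and the coalescing-reader partitioning fix (#1200) is absent.
# pick_reader_type and its parameters are invented for illustration.

def pick_reader_type(is_partitioned: bool, has_coalesce_partition_fix: bool) -> str:
    if is_partitioned and not has_coalesce_partition_fix:
        return "MULTITHREADED"
    return "COALESCING"

print(pick_reader_type(True, False))   # -> MULTITHREADED
print(pick_reader_type(True, True))    # -> COALESCING
```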
Looking at the logs of one application that was worse on raplab, we see that it is reading many partitioned files:
These end up being split into separate batches, and thus transferring them is not efficient:
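The transfer inefficiency can be illustrated with a toy cost model. This is a sketch under an assumed per-batch fixed overhead (the overhead value and function names are invented, not measurements from the plugin):

```python
# Toy model: total transfer cost = bytes moved + a fixed overhead per batch.
# Many small per-file batches pay the overhead many times; one coalesced
# batch pays it once. The overhead constant (10) is illustrative only.

def transfer_cost(batches, per_batch_overhead=10):
    return sum(len(b) for b in batches) + per_batch_overhead * len(batches)

def coalesce(batches):
    """Concatenate many small batches into a single batch."""
    return [b"".join(batches)]

small = [b"x" * 100 for _ in range(50)]   # 50 small partition batches
print(transfer_cost(small))               # 5000 bytes + 50 overheads = 5500
print(transfer_cost(coalesce(small)))     # 5000 bytes +  1 overhead  = 5010
```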
So I think this is a dup of #1200.
We ran into a case where a single Parquet file loaded an order of magnitude slower (800ms vs 80ms) comparing the 0.3 snapshot vs 0.2 of the plugin.
But when we set:
spark.rapids.sql.format.parquet.reader.type=MULTITHREADED
or
spark.rapids.sql.format.parquet.reader.type=COALESCING
we see consistent or better results (<= 80ms) in 0.3. This seems interesting because it's a single 128MB Parquet file, and there's a single batch that comes out of it. So it looks like the reader has some overhead if left at the default AUTO. Note that the file was being read from hdfs://.
The file in question was one of the Parquet parts of the web_sales table for TPC-DS at 1TB, and the query where we saw a significant effect was q88. The file has decimals in it, and the test projected out the integers s.t. the GPU Parquet scan would be enabled:
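For reference, pinning the reader type instead of relying on AUTO might look like the following PySpark sketch. It assumes the RAPIDS plugin jar is on the classpath; the app name and file path are illustrative, while `spark.plugins` / `com.nvidia.spark.SQLPlugin` and the reader-type config key are the plugin's documented settings:

```python
# Config-only sketch: force the MULTITHREADED Parquet reader rather than AUTO.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-reader-type-check")                      # illustrative
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.format.parquet.reader.type",
                 "MULTITHREADED")                                   # or COALESCING
         .getOrCreate())

# Path is a placeholder, not the actual file from this issue.
df = spark.read.parquet("hdfs:///path/to/web_sales/part-00000.parquet")
df.count()
```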