
[BUG] limit function produces inconsistent results when the type is Byte, Long, Boolean, or Timestamp #1008

Closed
razajafri opened this issue Oct 22, 2020 · 4 comments
Labels: bug (Something isn't working), P1 (Nice to have for release)

razajafri (Collaborator) commented Oct 22, 2020

Describe the bug
Calling limit inside a function that writes and then reads a parquet file produces inconsistent results.

Steps/Code to reproduce bug

# Assumed imports, following the spark-rapids integration test conventions
# (spark_tmp_path is a fixture provided by the repo's conftest):
import pytest
import pyspark.sql.functions as f
from asserts import assert_gpu_and_cpu_are_equal_collect
from data_gen import *
from marks import allow_non_gpu, ignore_order

@pytest.mark.parametrize('data_gen', [ByteGen(), LongGen(), BooleanGen(), TimestampGen()], ids=idfn)
@pytest.mark.parametrize('ts_write', ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS'])
@pytest.mark.parametrize('enableVectorized', ['true', 'false'], ids=idfn)
@allow_non_gpu('CollectLimitExec', 'DataWritingCommandExec')
@ignore_order
def test_cache_columnar(spark_tmp_path, data_gen, enableVectorized, ts_write):
    data_path_gpu = spark_tmp_path + '/PARQUET_DATA'
    def read_parquet_cached(data_path):
        def write_read_parquet_cached(spark):
            # Generate random data, write it to parquet, read it back,
            # then take the first 50 rows with no ordering applied.
            df = unary_op_df(spark, data_gen).select(f.col("a"))
            df.write.mode('overwrite').parquet(data_path)
            cached = spark.read.parquet(data_path)
            cached.count()
            return debug_df(cached.limit(50))
        return write_read_parquet_cached
    # rapids-spark doesn't support LEGACY read for parquet
    conf={'spark.sql.legacy.parquet.datetimeRebaseModeInWrite': 'CORRECTED',
          'spark.sql.legacy.parquet.datetimeRebaseModeInRead' : 'CORRECTED',
          'spark.sql.inMemoryColumnarStorage.enableVectorizedReader' : enableVectorized,
          'spark.sql.parquet.outputTimestampType': ts_write}
    assert_gpu_and_cpu_are_equal_collect(read_parquet_cached(data_path_gpu), conf)

Expected behavior
The above test should pass against Spark 3.0.0 and Spark 3.1.0.
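
For context, assert_gpu_and_cpu_are_equal_collect executes the supplied function twice, once on a CPU-only Spark session and once with the RAPIDS plugin enabled, and compares the collected results. A simplified sketch of that behavior, assuming the harness's with_cpu_session and with_gpu_session helpers (not the actual implementation):

    def assert_gpu_and_cpu_are_equal_collect_sketch(func, conf={}):
        # func is re-executed per backend, so any side effects inside it
        # (such as the parquet write in the test above) happen twice.
        from_cpu = with_cpu_session(lambda spark: func(spark).collect(), conf=conf)
        from_gpu = with_gpu_session(lambda spark: func(spark).collect(), conf=conf)
        assert from_cpu == from_gpu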

@razajafri razajafri added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 22, 2020
@sameerz sameerz added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Oct 27, 2020
@sameerz sameerz added this to the Oct 26 - Nov 6 milestone Oct 27, 2020
tgravescs (Collaborator) commented

The ones I see failing are:
FAILED src/main/python/parquet_test.py::test_cache_columnar[true-INT96-Timestamp][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED src/main/python/parquet_test.py::test_cache_columnar[false-INT96-Timestamp][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]

Both of those are INT96 cases, so is this a duplicate of #1007, or were you seeing more failures?

razajafri (Collaborator, Author) commented

@tgravescs this was run against Spark 3.1.0. I will update the description. I see a lot more failures:

FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-INT96-Byte][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-INT96-Long][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-INT96-Boolean][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-INT96-Timestamp][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-TIMESTAMP_MICROS-Byte][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-TIMESTAMP_MICROS-Long][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-TIMESTAMP_MICROS-Boolean][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-TIMESTAMP_MILLIS-Byte][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-TIMESTAMP_MILLIS-Long][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[true-TIMESTAMP_MILLIS-Boolean][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-INT96-Byte][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-INT96-Long][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-INT96-Boolean][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-INT96-Timestamp][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-TIMESTAMP_MICROS-Byte][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-TIMESTAMP_MICROS-Long][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-TIMESTAMP_MICROS-Boolean][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-TIMESTAMP_MILLIS-Byte][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-TIMESTAMP_MILLIS-Long][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]
FAILED ../../src/main/python/cache_test.py::test_cache_columnar_123[false-TIMESTAMP_MILLIS-Boolean][IGNORE_ORDER, ALLOW_NON_GPU(CollectLimitExec,DataWritingCommandExec)]

tgravescs (Collaborator) commented

OK, it looks like reproducing this requires more than one executor, or the default --master local[*].
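
Without an ordering, Spark's limit makes no guarantee about which rows are returned, and with multiple partitions the selection can vary with the plan. A hypothetical standalone illustration (plain PySpark, outside the test harness):

    from pyspark.sql import SparkSession

    # local[*] uses one task slot per core, so the range below is split
    # across multiple partitions, similar to a multi-executor cluster.
    spark = SparkSession.builder.master('local[*]').getOrCreate()
    df = spark.range(0, 1000, numPartitions=8)
    # limit(50) with no ORDER BY only promises *some* 50 rows; which
    # partitions contribute them is plan- and scheduling-dependent.
    print(df.limit(50).collect())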

tgravescs (Collaborator) commented Oct 28, 2020

This is because you are writing different data on each iteration (CPU and GPU):

    def read_parquet_cached(data_path):
        def write_read_parquet_cached(spark):
            df = unary_op_df(spark, data_gen).select(f.col("a"))
            df.write.mode('overwrite').parquet(data_path)
            cached = spark.read.parquet(data_path)
            cached.count()
            return debug_df(cached.limit(50))
        return write_read_parquet_cached

So the data is going to be different because different data is written each time the function runs.
You need to write a single file in advance and read it on both the CPU and the GPU.
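
A minimal sketch of that restructuring (hypothetical test name; assumes the harness's with_cpu_session helper; a sketch of the suggestion, not the final fix):

    def test_cache_columnar_fixed(spark_tmp_path, data_gen, enableVectorized, ts_write):
        data_path = spark_tmp_path + '/PARQUET_DATA'
        conf = {'spark.sql.legacy.parquet.datetimeRebaseModeInWrite': 'CORRECTED',
                'spark.sql.legacy.parquet.datetimeRebaseModeInRead': 'CORRECTED',
                'spark.sql.inMemoryColumnarStorage.enableVectorizedReader': enableVectorized,
                'spark.sql.parquet.outputTimestampType': ts_write}
        # Write the input exactly once, on the CPU, so the CPU and GPU runs
        # of the assertion below read identical data.
        with_cpu_session(
            lambda spark: unary_op_df(spark, data_gen).select(f.col('a'))
                .write.mode('overwrite').parquet(data_path),
            conf=conf)
        # Both runs now only read. @ignore_order is still needed because
        # limit(50) makes no ordering guarantee.
        assert_gpu_and_cpu_are_equal_collect(
            lambda spark: spark.read.parquet(data_path).limit(50), conf)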
