[BUG] test_sample_produce_empty_batch failed in dataproc #3864

Closed
pxLi opened this issue Oct 20, 2021 · 2 comments · Fixed by #3899
Labels
bug Something isn't working P0 Must have for release

Comments

@pxLi
Collaborator

pxLi commented Oct 20, 2021

Describe the bug
The test was newly introduced in #3789.

So far this failure has appeared only on Dataproc; it has not been seen in other environments.

cpu = [Row(a=None), Row(a='\x82âæb÷(´û'), Row(a='Ôq=`ȧ\x05:'), Row(a='ãç\x05\x82Â\x08L\xad')]
gpu = [Row(a='\x82âæb÷(´û'), Row(a='Ôq=`ȧ\x05:'), Row(a='ãç\x05\x82Â\x08L\xad')]

Assertion error:

[2021-10-19T11:21:18.998Z] =================================== FAILURES ===================================
[2021-10-19T11:21:18.998Z] ___________________ test_sample_produce_empty_batch[String] ____________________
[2021-10-19T11:21:18.998Z] 
[2021-10-19T11:21:18.998Z] data_gen = String
[2021-10-19T11:21:18.998Z] 
[2021-10-19T11:21:18.998Z]     @ignore_order
[2021-10-19T11:21:18.998Z]     @pytest.mark.parametrize('data_gen', [string_gen], ids=idfn)
[2021-10-19T11:21:18.998Z]     def test_sample_produce_empty_batch(data_gen):
[2021-10-19T11:21:18.998Z] >       assert_gpu_and_cpu_are_equal_collect(
[2021-10-19T11:21:18.998Z]             # length = 4 will generate empty batch after sample
[2021-10-19T11:21:18.998Z]             lambda spark: unary_op_df(spark, data_gen, length= 4).sample(fraction = 0.9, seed = 1)
[2021-10-19T11:21:18.998Z]         )
[2021-10-19T11:21:18.998Z] 
[2021-10-19T11:21:18.998Z] integration_tests/src/main/python/sample_test.py:29: 
[2021-10-19T11:21:18.998Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-10-19T11:21:18.998Z] integration_tests/src/main/python/asserts.py:501: in assert_gpu_and_cpu_are_equal_collect
[2021-10-19T11:21:18.998Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2021-10-19T11:21:18.998Z] integration_tests/src/main/python/asserts.py:432: in _assert_gpu_and_cpu_are_equal
[2021-10-19T11:21:18.998Z]     assert_equal(from_cpu, from_gpu)
[2021-10-19T11:21:18.998Z] integration_tests/src/main/python/asserts.py:101: in assert_equal
[2021-10-19T11:21:18.998Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2021-10-19T11:21:18.998Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-10-19T11:21:18.998Z] 
[2021-10-19T11:21:18.998Z] cpu = [Row(a=None), Row(a='\x82âæb÷(´û'), Row(a='Ôq=`ȧ\x05:'), Row(a='ãç\x05\x82Â\x08L\xad')]
[2021-10-19T11:21:18.998Z] gpu = [Row(a='\x82âæb÷(´û'), Row(a='Ôq=`ȧ\x05:'), Row(a='ãç\x05\x82Â\x08L\xad')]
[2021-10-19T11:21:18.998Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f12219a4040>
[2021-10-19T11:21:18.998Z] path = []
[2021-10-19T11:21:18.998Z] 
[2021-10-19T11:21:18.998Z]     def _assert_equal(cpu, gpu, float_check, path):
[2021-10-19T11:21:18.998Z]         t = type(cpu)
[2021-10-19T11:21:18.998Z]         if (t is Row):
[2021-10-19T11:21:18.998Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2021-10-19T11:21:18.998Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2021-10-19T11:21:18.998Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2021-10-19T11:21:18.998Z]                 for field in cpu.__fields__:
[2021-10-19T11:21:18.998Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2021-10-19T11:21:18.998Z]             else:
[2021-10-19T11:21:18.998Z]                 for index in range(len(cpu)):
[2021-10-19T11:21:18.998Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2021-10-19T11:21:18.998Z]         elif (t is list):
[2021-10-19T11:21:18.998Z] >           assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2021-10-19T11:21:18.998Z] E           AssertionError: CPU and GPU list have different lengths at [] CPU: 4 GPU: 3
@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 20, 2021
@pxLi pxLi added test Only impacts tests and removed test Only impacts tests labels Oct 20, 2021
@tgravescs tgravescs added the P0 Must have for release label Oct 21, 2021
@tgravescs
Collaborator

@res-life if this is going to take a while (days) to fix, please xfail the tests temporarily

@abellina
Collaborator

Note: I also see other issues in our UCX job, perhaps because it isn't a local-mode job: #3892.

res-life pushed a commit that referenced this issue Oct 22, 2021
Set "num_slices" when create "DataFrame" to pass the sample test cases.
This fixes #3864
This fixes #3892

Signed-off-by: Chong Gao <res_life@163.com>
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 25, 2021
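
For readers wondering why setting "num_slices" could matter here, the following is a minimal sketch of one plausible explanation. It is not confirmed anywhere in this thread. It assumes that sampling is done per slice (partition), with a random generator seeded from the seed plus the slice index, so the same `fraction` and `seed` can keep different rows when the data is split differently. The helpers `split_into_slices` and `sample_rows` are hypothetical and written in plain Python; they are not Spark or spark-rapids APIs.

```python
import random


def split_into_slices(rows, num_slices):
    """Split rows into num_slices contiguous chunks (a stand-in for partitions)."""
    n = len(rows)
    return [rows[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def sample_rows(rows, num_slices, fraction, seed):
    """Bernoulli-sample each slice with its own generator.

    Assumption for illustration: each slice is seeded with seed + slice index,
    so the rows that survive depend on how the data was sliced.
    """
    kept = []
    for index, chunk in enumerate(split_into_slices(rows, num_slices)):
        rng = random.Random(seed + index)
        kept.extend(row for row in chunk if rng.random() < fraction)
    return kept


# Four rows, as in the failing test (length = 4, fraction = 0.9, seed = 1).
rows = [None, "a", "b", "c"]
for num_slices in (1, 2, 4):
    print(num_slices, sample_rows(rows, num_slices, fraction=0.9, seed=1))
```

Under this assumption, fixing the number of slices makes the sampled rows reproducible from one environment to the next, which would be consistent with the failure showing up only where the default slicing differs (Dataproc, the UCX job) and not in local mode.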

5 participants