
[REVIEW] Fix ORC reader issue with decimal type #6466

Merged
9 commits merged into rapidsai:branch-0.16 on Oct 10, 2020

Conversation

rgsl888prabhu
Contributor

@rgsl888prabhu rgsl888prabhu commented Oct 7, 2020

The number of values being read wasn't accounting for the last update before breaking out of the loop, which caused subsequent calls to behave oddly, leading to improper access of memory segments.

closes #5892

@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@rgsl888prabhu
Contributor Author

@harrism should we push this to 0.16, as this is priority-0 and the change is just one line?

@rgsl888prabhu rgsl888prabhu changed the base branch from branch-0.17 to branch-0.16 October 7, 2020 23:54
@rgsl888prabhu rgsl888prabhu self-assigned this Oct 7, 2020
@kkraus14
Collaborator

kkraus14 commented Oct 8, 2020

@rgsl888prabhu is there a small test file we could add to cover the error case here?

@rgsl888prabhu
Contributor Author

> @rgsl888prabhu is there a small test file we could add to cover the error case here?

It is a 64 KB file with which we can reproduce the issue. Would it be fine to add it?

@vuule
Contributor

vuule commented Oct 8, 2020

> It is a 64 KB file with which we can reproduce the issue. Would it be fine to add it?

Is it possible to make a smaller repro?
Edit: 64 KB is not that large, but it looks like this issue should be reproducible with a much smaller file.

@rgsl888prabhu
Contributor Author

> It is a 64 KB file with which we can reproduce the issue. Would it be fine to add it?

> Is it possible to make a smaller repro?
> Edit: 64 KB is not that large, but it looks like this issue should be reproducible with a much smaller file.

I am trying to create a smaller file to test it.
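
A minimal sketch of how such a small repro file could be generated — the filename, row count, and pyarrow usage here are illustrative assumptions, not the files eventually checked in; it assumes pyarrow >= 4.0, whose ORC writer supports decimal columns:

# Hypothetical repro-file generator (not the script used in this PR).
from decimal import Decimal

import pyarrow as pa
from pyarrow import orc

# More than 1024 decimal values, so the reader's 1024-value batch
# boundary (discussed later in this thread) is crossed.
values = [Decimal(i) / 100 for i in range(2000)]
table = pa.table({"dec_col": pa.array(values, type=pa.decimal128(18, 2))})
orc.write_table(table, "decimal_repro.orc")

# cudf.read_orc("decimal_repro.orc") should then match the pyarrow values.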

@codecov

codecov bot commented Oct 8, 2020

Codecov Report

Merging #6466 into branch-0.16 will decrease coverage by 1.45%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.16    #6466      +/-   ##
===============================================
- Coverage        84.45%   83.00%   -1.46%     
===============================================
  Files               82       95      +13     
  Lines            13846    14901    +1055     
===============================================
+ Hits             11694    12368     +674     
- Misses            2152     2533     +381     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/io/orc.py 90.90% <0.00%> (-1.28%) ⬇️
python/dask_cudf/dask_cudf/io/parquet.py 91.35% <0.00%> (-0.85%) ⬇️
python/cudf/cudf/utils/cudautils.py 48.55% <0.00%> (-0.74%) ⬇️
python/cudf/cudf/_version.py 44.80% <0.00%> (-0.72%) ⬇️
python/cudf/cudf/core/frame.py 89.86% <0.00%> (-0.53%) ⬇️
python/cudf/cudf/io/parquet.py 91.73% <0.00%> (-0.50%) ⬇️
python/dask_cudf/dask_cudf/core.py 75.64% <0.00%> (-0.26%) ⬇️
python/cudf/cudf/core/column/categorical.py 93.11% <0.00%> (-0.26%) ⬇️
python/cudf/cudf/core/series.py 91.10% <0.00%> (-0.23%) ⬇️
python/cudf/cudf/core/multiindex.py 80.52% <0.00%> (-0.09%) ⬇️
... and 40 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffa2fba...6fdbf38.

@OlivierNV
Contributor

OlivierNV commented Oct 8, 2020

Hmm. That doesn't look right: n is the number of valid runs. It breaks out of the loop without incrementing it if the payload for this run would exceed the amount of data available in the buffer. What are the values in the failure case?
[Edit] I think we might need to limit the max # of scale values in RLE stream1 such that it can't overflow when decoding the secondary base128 stream, e.g. a 512 limit for max run = NTHREADS/2 at line 1551 (it might exceed the max internal buffer size, and I don't think that's a case where overflow can be properly handled, because of the cross-stream dependency).
Long-term, this kernel should really be rewritten with 512 threads and separate queues for the primary & secondary streams, as that would simplify things a lot and would likely be better for perf when an index is available (it's a big change though).
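
(A sketch of the cap being described, with hypothetical names; line 1551 refers to the kernel source file, not this page:)

// Hypothetical illustration, not the merged change: clamp the number of
// runs taken from the primary RLE stream so that decoding the secondary
// base128 stream cannot overflow the internal byte buffer.
numvals = min(numvals, NTHREADS / 2);  // e.g. 512 when NTHREADS == 1024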

@rgsl888prabhu
Contributor Author

rgsl888prabhu commented Oct 8, 2020

> Hmm. That doesn't look right: n is the number of valid runs. It breaks out of the loop without incrementing it if the payload for this run would exceed the amount of data available in the buffer. What are the values in the failure case?
> [Edit] I think we might need to limit the max # of scale values in RLE stream1 such that it can't overflow when decoding the secondary base128 stream, e.g. a 512 limit for max run = NTHREADS/2 at line 1551 (it might exceed the max internal buffer size, and I don't think that's a case where overflow can be properly handled, because of the cross-stream dependency).
> Long-term, this kernel should really be rewritten with 512 threads and separate queues for the primary & secondary streams, as that would simplify things a lot and would likely be better for perf when an index is available (it's a big change though).

@OlivierNV

if (t == 0) {
    // Clamp the scan to the bytes currently available in the byte stream.
    uint32_t maxpos  = min(bs->len, bs->pos + (BYTESTREAM_BFRSZ - 8u));
    uint32_t lastpos = bs->pos;
    uint32_t n;
    for (n = 0; n < numvals; n++) {
      uint32_t pos                  = lastpos;
      // The start offset of run n is stored before the bounds check below.
      *(volatile int32_t *)&vals[n] = lastpos;
      pos += varint_length<uint4>(bs, pos);
      // Run n does not fit in the buffer: break without incrementing n,
      // even though vals[n] has already been overwritten above.
      if (pos > maxpos) break;
      lastpos = pos;
    }
    scratch->num_vals = n;
    bytestream_flush_bytes(bs, lastpos - bs->pos);
  }

This issue occurs when numvals=1024: at n=1023, pos > maxpos, so the loop breaks. But at n=1023 we have already updated vals[n] with lastpos, and since this uses zero-based indexing, the number of runs would always be n+1, if I am not wrong.
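
A worked trace of that failure case (values illustrative):

// numvals = 1024
// n = 1023: vals[1023] = lastpos executes first, clobbering whatever
//           that slot held (initially the scale, per the comment below)
// pos = lastpos + varint_length(...) > maxpos  ->  break, n stays 1023
// scratch->num_vals = 1023, although vals[1023] was already overwritten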

@rgsl888prabhu
Contributor Author

> Hmm. That doesn't look right: n is the number of valid runs. It breaks out of the loop without incrementing it if the payload for this run would exceed the amount of data available in the buffer. What are the values in the failure case?
> [Edit] I think we might need to limit the max # of scale values in RLE stream1 such that it can't overflow when decoding the secondary base128 stream, e.g. a 512 limit for max run = NTHREADS/2 at line 1551 (it might exceed the max internal buffer size, and I don't think that's a case where overflow can be properly handled, because of the cross-stream dependency).
> Long-term, this kernel should really be rewritten with 512 threads and separate queues for the primary & secondary streams, as that would simplify things a lot and would likely be better for perf when an index is available (it's a big change though).

@OlivierNV And as you suggested, changing the max run = NTHREADS/2 also fixes the issue, but that creates issue in other scenarios as you have already mentioned.

@OlivierNV
Contributor

> This issue occurs when numvals=1024: at n=1023, pos > maxpos, so the loop breaks. But at n=1023 we have already updated vals[n] with lastpos, and since this uses zero-based indexing, the number of runs would always be n+1, if I am not wrong.

It's storing the last offset but not using it (which is likely a problem, since that same location is initially used to store the scale).
What about moving the line *(volatile int32_t *)&vals[n] = lastpos; to after if (pos > maxpos) break;?
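
A sketch of that reordering against the snippet above (illustrative; the actual merged diff may differ in detail):

if (t == 0) {
    uint32_t maxpos  = min(bs->len, bs->pos + (BYTESTREAM_BFRSZ - 8u));
    uint32_t lastpos = bs->pos;
    uint32_t n;
    for (n = 0; n < numvals; n++) {
      uint32_t pos = lastpos;
      pos += varint_length<uint4>(bs, pos);
      // Check the bound first: if run n does not fit in the buffer, break
      // before vals[n] (which initially holds the scale) is overwritten.
      if (pos > maxpos) break;
      *(volatile int32_t *)&vals[n] = lastpos;
      lastpos = pos;
    }
    scratch->num_vals = n;
    bytestream_flush_bytes(bs, lastpos - bs->pos);
  }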

@rgsl888prabhu rgsl888prabhu added the 3 - Ready for Review Ready for review by team label Oct 9, 2020
@rgsl888prabhu
Contributor Author

GitHub is experiencing issues, which is why my commit is not yet shown in this PR.

@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner October 9, 2020 22:19
Contributor

@OlivierNV OlivierNV left a comment


LGTM

@rgsl888prabhu
Contributor Author

Please select the "Hide whitespace changes" option, as the normal diff shows a lot of unrelated changes.

@kkraus14
Collaborator

kkraus14 commented Oct 9, 2020

@rgsl888prabhu did you forget to check in the test files?

@vuule
Contributor

vuule commented Oct 9, 2020

> @rgsl888prabhu did you forget to check in the test files?

I see the Python test file change

@kkraus14
Collaborator

kkraus14 commented Oct 9, 2020

The Python test file refers to TestOrcFile.decimal.same.values.orc and TestOrcFile.decimal.multiple.values.orc, which aren't checked in as far as I could tell.

@rgsl888prabhu
Contributor Author

> @rgsl888prabhu did you forget to check in the test files?

@vuule He meant the ORC-format files for the tests.

@kkraus14 I just uploaded them.

@kkraus14
Collaborator

kkraus14 commented Oct 9, 2020

@vuule do you want to review before we mark this as ready to merge?

@vuule
Contributor

vuule commented Oct 9, 2020

@rgsl888prabhu Do you think we should check for a perf regression before merging this? Looks unlikely to me, but I don't know the context well.

Contributor

@vuule vuule left a comment


Good stuff. Thanks for finding smaller repro files.

@rgsl888prabhu
Contributor Author

> Good stuff. Thanks for finding smaller repro files.

Thanks to @OlivierNV.

@rgsl888prabhu
Contributor Author

> @rgsl888prabhu Do you think we should check for a perf regression before merging this? Looks unlikely to me, but I don't know the context well.

Will try it and let you know.

@rgsl888prabhu
Contributor Author

rgsl888prabhu commented Oct 10, 2020

Tried with an ORC dataset with just one column of 440,000,000 values; the ORC file size is 250 MB.

There is not much change in speed.

older

In [2]: %timeit cudf.read_orc("/home/rgsl888/orc_8GB_250MB.orc")
1.7 s ± 6.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

newer

In [2]: %timeit cudf.read_orc("/home/rgsl888/orc_8GB_250MB.orc")
1.64 s ± 260 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer labels Oct 10, 2020
@kkraus14 kkraus14 merged commit 64228dc into rapidsai:branch-0.16 Oct 10, 2020
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge) · cuIO (cuIO issue)

Successfully merging this pull request may close these issues:

[BUG] Incorrect values with read_orc (#5892)

8 participants