
[REVIEW] Fix ORC reader issue with decimal type #6466

Merged
9 commits merged into rapidsai:branch-0.16 on Oct 10, 2020

Conversation

rgsl888prabhu
Contributor

@rgsl888prabhu rgsl888prabhu commented Oct 7, 2020

The number of values being read wasn't accounting for the last update before breaking out of the loop, which caused subsequent calls to behave oddly, leading to improper access of memory segments.

closes #5892

@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@rgsl888prabhu
Contributor Author

@harrism should we push this to 0.16, as this is priority-0 and the change is just one line?

@rgsl888prabhu rgsl888prabhu changed the base branch from branch-0.17 to branch-0.16 October 7, 2020 23:54
@rgsl888prabhu rgsl888prabhu self-assigned this Oct 7, 2020
@kkraus14
Collaborator

kkraus14 commented Oct 8, 2020

@rgsl888prabhu is there a small test file we could add to cover the error case here?

@rgsl888prabhu
Contributor Author

> @rgsl888prabhu is there a small test file we could add to cover the error case here?

It is a 64 KB file with which we can reproduce the issue. Would it be fine to add it?

@vuule
Contributor

vuule commented Oct 8, 2020

> It is a 64 KB file with which we can reproduce the issue. Would it be fine to add it?

Is it possible to make a smaller repro?
Edit: 64 KB is not that large, but it looks like this issue should be reproducible with a much smaller file.

@rgsl888prabhu
Contributor Author

> It is a 64 KB file with which we can reproduce the issue. Would it be fine to add it?

> Is it possible to make a smaller repro?
> Edit: 64 KB is not that large, but it looks like this issue should be reproducible with a much smaller file.

I am trying to create a smaller file to test it.
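
A minimal sketch of how such a small repro file could be generated — the filename, row count, and pyarrow usage here are illustrative assumptions, not the files eventually checked in; it assumes pyarrow >= 4.0, whose ORC writer supports decimal columns:

# Hypothetical repro-file generator (not the script used in this PR).
from decimal import Decimal

import pyarrow as pa
from pyarrow import orc

# More than 1024 decimal values, so the reader's 1024-value batch
# boundary (discussed later in this thread) is crossed.
values = [Decimal(i) / 100 for i in range(2000)]
table = pa.table({"dec_col": pa.array(values, type=pa.decimal128(18, 2))})
orc.write_table(table, "decimal_repro.orc")

# cudf.read_orc("decimal_repro.orc") should then match the pyarrow values.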

@codecov

codecov bot commented Oct 8, 2020

Codecov Report

Merging #6466 into branch-0.16 will decrease coverage by 1.45%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.16    #6466      +/-   ##
===============================================
- Coverage        84.45%   83.00%   -1.46%     
===============================================
  Files               82       95      +13     
  Lines            13846    14901    +1055     
===============================================
+ Hits             11694    12368     +674     
- Misses            2152     2533     +381     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/io/orc.py 90.90% <0.00%> (-1.28%) ⬇️
python/dask_cudf/dask_cudf/io/parquet.py 91.35% <0.00%> (-0.85%) ⬇️
python/cudf/cudf/utils/cudautils.py 48.55% <0.00%> (-0.74%) ⬇️
python/cudf/cudf/_version.py 44.80% <0.00%> (-0.72%) ⬇️
python/cudf/cudf/core/frame.py 89.86% <0.00%> (-0.53%) ⬇️
python/cudf/cudf/io/parquet.py 91.73% <0.00%> (-0.50%) ⬇️
python/dask_cudf/dask_cudf/core.py 75.64% <0.00%> (-0.26%) ⬇️
python/cudf/cudf/core/column/categorical.py 93.11% <0.00%> (-0.26%) ⬇️
python/cudf/cudf/core/series.py 91.10% <0.00%> (-0.23%) ⬇️
python/cudf/cudf/core/multiindex.py 80.52% <0.00%> (-0.09%) ⬇️
... and 40 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffa2fba...6fdbf38.

@OlivierNV
Contributor

OlivierNV commented Oct 8, 2020

Hmm. That doesn't look right: n is the number of valid runs. It breaks out of the loop without incrementing it if the payload for this run would exceed the amount of data available in the buffer. What are the values in the failure case?
[Edit] I think we might need to limit the max # of scale values in RLE stream1 such that it can't overflow when decoding the secondary base128 stream, e.g. a 512 limit for max run = NTHREADS/2 at line 1551 (it might exceed the max internal buffer size, and I don't think that's a case where overflow can be properly handled, because of the cross-stream dependency).
Long-term, this kernel should really be rewritten with 512 threads and separate queues for the primary & secondary streams, as that would simplify things a lot and would likely be better for perf when an index is available (it's a big change though).
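
(A sketch of the cap being described, with hypothetical names; line 1551 refers to the kernel source file, not this page:)

// Hypothetical illustration, not the merged change: clamp the number of
// runs taken from the primary RLE stream so that decoding the secondary
// base128 stream cannot overflow the internal byte buffer.
numvals = min(numvals, NTHREADS / 2);  // e.g. 512 when NTHREADS == 1024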

@rgsl888prabhu
Contributor Author

rgsl888prabhu commented Oct 8, 2020

> Hmm. That doesn't look right: n is the number of valid runs. It breaks out of the loop without incrementing it if the payload for this run would exceed the amount of data available in the buffer. What are the values in the failure case?
> [Edit] I think we might need to limit the max # of scale values in RLE stream1 such that it can't overflow when decoding the secondary base128 stream, e.g. a 512 limit for max run = NTHREADS/2 at line 1551 (it might exceed the max internal buffer size, and I don't think that's a case where overflow can be properly handled, because of the cross-stream dependency).
> Long-term, this kernel should really be rewritten with 512 threads and separate queues for the primary & secondary streams, as that would simplify things a lot and would likely be better for perf when an index is available (it's a big change though).

@OlivierNV

if (t == 0) {
    // Clamp the scan to the bytes currently available in the byte stream.
    uint32_t maxpos  = min(bs->len, bs->pos + (BYTESTREAM_BFRSZ - 8u));
    uint32_t lastpos = bs->pos;
    uint32_t n;
    for (n = 0; n < numvals; n++) {
      uint32_t pos                  = lastpos;
      // The start offset of run n is stored before the bounds check below.
      *(volatile int32_t *)&vals[n] = lastpos;
      pos += varint_length<uint4>(bs, pos);
      // Run n does not fit in the buffer: break without incrementing n,
      // even though vals[n] has already been overwritten above.
      if (pos > maxpos) break;
      lastpos = pos;
    }
    scratch->num_vals = n;
    bytestream_flush_bytes(bs, lastpos - bs->pos);
  }

This issue occurs when numvals=1024: at n=1023, pos > maxpos, so the loop breaks. But at n=1023 we have already updated vals[n] with lastpos, and since this uses zero-based indexing, the number of runs would always be n+1, if I am not wrong.
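
A worked trace of that failure case (values illustrative):

// numvals = 1024
// n = 1023: vals[1023] = lastpos executes first, clobbering whatever
//           that slot held (initially the scale, per the comment below)
// pos = lastpos + varint_length(...) > maxpos  ->  break, n stays 1023
// scratch->num_vals = 1023, although vals[1023] was already overwritten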

@rgsl888prabhu
Contributor Author

> Hmm. That doesn't look right: n is the number of valid runs. It breaks out of the loop without incrementing it if the payload for this run would exceed the amount of data available in the buffer. What are the values in the failure case?
> [Edit] I think we might need to limit the max # of scale values in RLE stream1 such that it can't overflow when decoding the secondary base128 stream, e.g. a 512 limit for max run = NTHREADS/2 at line 1551 (it might exceed the max internal buffer size, and I don't think that's a case where overflow can be properly handled, because of the cross-stream dependency).
> Long-term, this kernel should really be rewritten with 512 threads and separate queues for the primary & secondary streams, as that would simplify things a lot and would likely be better for perf when an index is available (it's a big change though).

@OlivierNV And as you suggested, changing the max run = NTHREADS/2 also fixes the issue, but that creates issue in other scenarios as you have already mentioned.

@OlivierNV
Contributor

> This issue occurs when numvals=1024: at n=1023, pos > maxpos, so the loop breaks. But at n=1023 we have already updated vals[n] with lastpos, and since this uses zero-based indexing, the number of runs would always be n+1, if I am not wrong.

It's storing the last offset but not using it (which is likely a problem, since that same location is initially used to store the scale).
What about moving the line *(volatile int32_t *)&vals[n] = lastpos; to after if (pos > maxpos) break;?
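
A sketch of that reordering against the snippet above (illustrative; the actual merged diff may differ in detail):

if (t == 0) {
    uint32_t maxpos  = min(bs->len, bs->pos + (BYTESTREAM_BFRSZ - 8u));
    uint32_t lastpos = bs->pos;
    uint32_t n;
    for (n = 0; n < numvals; n++) {
      uint32_t pos = lastpos;
      pos += varint_length<uint4>(bs, pos);
      // Check the bound first: if run n does not fit in the buffer, break
      // before vals[n] (which initially holds the scale) is overwritten.
      if (pos > maxpos) break;
      *(volatile int32_t *)&vals[n] = lastpos;
      lastpos = pos;
    }
    scratch->num_vals = n;
    bytestream_flush_bytes(bs, lastpos - bs->pos);
  }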

@rgsl888prabhu rgsl888prabhu added the 3 - Ready for Review Ready for review by team label Oct 9, 2020
@rgsl888prabhu
Contributor Author

GitHub is experiencing issues, which is why my commit is not yet shown in this PR.

@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner October 9, 2020 22:19
Contributor

@OlivierNV OlivierNV left a comment


LGTM

@rgsl888prabhu
Contributor Author

Please select the "Hide whitespace changes" option, as the normal diff shows a lot of unrelated changes.

@kkraus14
Collaborator

kkraus14 commented Oct 9, 2020

@rgsl888prabhu did you forget to check in the test files?

@vuule
Contributor

vuule commented Oct 9, 2020

> @rgsl888prabhu did you forget to check in the test files?

I see the Python test file change

@kkraus14
Collaborator

kkraus14 commented Oct 9, 2020

The Python test file refers to TestOrcFile.decimal.same.values.orc and TestOrcFile.decimal.multiple.values.orc, which aren't checked in as far as I could tell.

@rgsl888prabhu
Contributor Author

> @rgsl888prabhu did you forget to check in the test files?

@vuule He meant the ORC-format files for the tests.

@kkraus14 I just uploaded them.

@kkraus14
Collaborator

kkraus14 commented Oct 9, 2020

@vuule do you want to review before we mark this as ready to merge?

@vuule
Contributor

vuule commented Oct 9, 2020

@rgsl888prabhu Do you think we should check for a perf regression before merging this? Looks unlikely to me, but I don't know the context well.

Contributor

@vuule vuule left a comment


Good stuff. Thanks for finding smaller repro files.

@rgsl888prabhu
Contributor Author

> Good stuff. Thanks for finding smaller repro files.

Thanks to @OlivierNV.

@rgsl888prabhu
Contributor Author

> @rgsl888prabhu Do you think we should check for a perf regression before merging this? Looks unlikely to me, but I don't know the context well.

Will try it and let you know.

@rgsl888prabhu
Contributor Author

rgsl888prabhu commented Oct 10, 2020

Tried with an ORC dataset with just one column of 440,000,000 values; the ORC file size is 250 MB.

There is not much change in speed.

older

In [2]: %timeit cudf.read_orc("/home/rgsl888/orc_8GB_250MB.orc")
1.7 s ± 6.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

newer

In [2]: %timeit cudf.read_orc("/home/rgsl888/orc_8GB_250MB.orc")
1.64 s ± 260 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer labels Oct 10, 2020
@kkraus14 kkraus14 merged commit 64228dc into rapidsai:branch-0.16 Oct 10, 2020
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge) · cuIO (cuIO issue)

Successfully merging this pull request may close these issues:

[BUG] Incorrect values with read_orc (#5892)

8 participants