
Adds type inference and type conversion for leaf-columns to the nested JSON parser #11574

Merged

Conversation


@elstehle commented Aug 22, 2022

Description

Adds type inference and type conversion for leaf-columns to the nested JSON parser

Note to the reviewers:
We are dealing with two different stages of quote-stripping here.

  1. Including/excluding quotes in the tokenizer stage (currently always set to true using a constexpr bool)
  2. Including/excluding quotes in the type conversion stage

Currently, we always include quotes in the tokenizer stage (1), so that the type-casting stage (2) can differentiate between string values and literals (e.g. [true, "true"]). Based on the user-provided json_reader_options::keep_quotes option, stage (2) then either strips the quotes or keeps them in the values returned to the user.
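
For illustration, here is a minimal sketch of that stage-(2) decision, assuming the field is visible as a plain character range; the helper name below is made up and this is not the actual libcudf implementation:

```cpp
#include <string_view>

// Sketch of the stage-(2) decision (NOT the actual libcudf code). Because the
// tokenizer always keeps the enclosing quotes, the string value "true" arrives
// here as "\"true\"" while the literal true arrives as "true", so the two can
// be told apart before casting.
std::string_view to_output_value(std::string_view field, bool keep_quotes)
{
  bool const is_quoted = field.size() >= 2 && field.front() == '"' && field.back() == '"';
  if (is_quoted && !keep_quotes) {
    return field.substr(1, field.size() - 2);  // strip the enclosing quotes
  }
  return field;  // literals, and quoted values when keep_quotes == true, pass through
}

// For the input [true, "true"]:
//   to_output_value("true", false)      -> true   (literal; later cast to bool)
//   to_output_value("\"true\"", false)  -> true   (string value, quotes stripped)
//   to_output_value("\"true\"", true)   -> "true" (string value, quotes kept)
```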

In addition to adding type inference and type casting:

  • Switches the logic for inferring nested columns: any column that contains at least one nested item (list or struct) is inferred as that nested type, and all other non-nested items of that column become invalid. E.g., [null,{"a":1},"foo"] => List<Struct<a:int>> with struct column validity 0, 1, 0 (see the sketch after this list).
  • Adds a keep_quotes option to differentiate between string values and numeric & literal values (e.g., 123.4, true, false, null).
  • Migrates the libcudf test to a cudf test to avoid having large byte BLOBs in the source file.
  • Changes the column order to match the behaviour of pandas and the existing JSON lines reader. That is, column order corresponds to the order in which columns were discovered: [{"b":1, "c":1}, {"a":1}] => order: <b, c, a>
  • Support for escape sequences (see below)
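
A minimal sketch of the nested-column inference rule from the first bullet, under the assumption that each row of a column has already been classified; the enum, struct, and function names are made up and this is not the actual libcudf code:

```cpp
#include <cstdint>
#include <vector>

// Illustrative per-row categories for one JSON column.
enum class row_kind : std::uint8_t { null, value, object, array };

struct inferred_column {
  row_kind type;               // inferred column type
  std::vector<bool> validity;  // per-row validity mask
};

// If any row of the column holds a nested item (object/struct or array/list), the
// whole column is inferred as that nested type, and every row that is not of that
// type is marked invalid.
inferred_column infer_column(std::vector<row_kind> const& rows)
{
  row_kind type = row_kind::value;
  for (auto r : rows) {
    if (r == row_kind::object || r == row_kind::array) {
      type = r;
      break;
    }
  }
  std::vector<bool> validity;
  validity.reserve(rows.size());
  for (auto r : rows) {
    validity.push_back(r == type);
  }
  return {type, validity};
}

// The child column of [null, {"a":1}, "foo"] has rows {null, object, value}, so it is
// inferred as a struct column with validity {0, 1, 0}.
```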

Performance comparison

Tokenizer

The following is a comparison of the JSON tokenizer stage before this PR and after:

Before

# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples | CPU Time  | Noise | GPU Time  | Noise |  Elem/s  |
|-------------------|---------|-----------|-------|-----------|-------|----------|
|    2^20 = 1048576 |   2176x |  2.489 ms | 9.62% |  2.480 ms | 9.61% | 422.729M |
|    2^21 = 2097152 |   1936x |  2.501 ms | 7.14% |  2.492 ms | 7.12% | 841.482M |
|    2^22 = 4194304 |   1152x |  2.612 ms | 5.43% |  2.604 ms | 5.42% |   1.611G |
|    2^23 = 8388608 |   1456x |  2.855 ms | 4.26% |  2.847 ms | 4.23% |   2.947G |
|   2^24 = 16777216 |   1104x |  3.395 ms | 5.34% |  3.387 ms | 5.33% |   4.954G |
|   2^25 = 33554432 |    560x |  4.410 ms | 2.25% |  4.402 ms | 2.25% |   7.623G |
|   2^26 = 67108864 |   1552x |  6.482 ms | 2.23% |  6.473 ms | 2.22% |  10.367G |
|  2^27 = 134217728 |   1435x | 10.430 ms | 2.70% | 10.422 ms | 2.70% |  12.879G |
|  2^28 = 268435456 |    815x | 18.396 ms | 1.95% | 18.387 ms | 1.95% |  14.599G |
|  2^29 = 536870912 |     15x | 34.389 ms | 0.42% | 34.381 ms | 0.42% |  15.615G |
| 2^30 = 1073741824 |     11x | 66.097 ms | 0.20% | 66.088 ms | 0.20% |  16.247G |

After

# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  |
|-------------------|---------|------------|--------|------------|--------|----------|
|    2^20 = 1048576 |   1408x |   2.600 ms | 11.28% |   2.592 ms | 11.26% | 404.547M |
|    2^21 = 2097152 |    800x |   2.838 ms |  7.68% |   2.829 ms |  7.67% | 741.243M |
|    2^22 = 4194304 |   2752x |   3.719 ms |  9.24% |   3.710 ms |  9.23% |   1.130G |
|    2^23 = 8388608 |    128x |   4.855 ms |  3.38% |   4.846 ms |  3.37% |   1.731G |
|   2^24 = 16777216 |    720x |   7.029 ms |  4.67% |   7.021 ms |  4.66% |   2.390G |
|   2^25 = 33554432 |    832x |  10.760 ms |  3.83% |  10.751 ms |  3.83% |   3.121G |
|   2^26 = 67108864 |    576x |  17.961 ms |  2.86% |  17.953 ms |  2.86% |   3.738G |
|  2^27 = 134217728 |    461x |  32.550 ms |  2.13% |  32.542 ms |  2.13% |   4.124G |
|  2^28 = 268435456 |    243x |  61.813 ms |  1.60% |  61.805 ms |  1.60% |   4.343G |
|  2^29 = 536870912 |    125x | 120.445 ms |  1.21% | 120.437 ms |  1.21% |   4.458G |
| 2^30 = 1073741824 |     66x | 228.833 ms |  0.75% | 228.825 ms |  0.75% |   4.692G |

JSON Parser

The overall parser performance is impacted, as expected, since we now also perform type conversion instead of just returning string columns.

Before

# Benchmark Results

## nested_json_gpu_parser

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  |
|-------------------|---------|------------|-------|------------|-------|----------|
|    2^20 = 1048576 |   1040x |   7.361 ms | 5.61% |   7.353 ms | 5.61% | 142.614M |
|    2^21 = 2097152 |    832x |  11.549 ms | 3.63% |  11.541 ms | 3.63% | 181.708M |
|    2^22 = 4194304 |    740x |  20.264 ms | 2.98% |  20.257 ms | 2.98% | 207.054M |
|    2^23 = 8388608 |    407x |  36.844 ms | 2.26% |  36.837 ms | 2.26% | 227.724M |
|   2^24 = 16777216 |     80x |  75.590 ms | 1.95% |  75.582 ms | 1.95% | 221.974M |
|   2^25 = 33554432 |     80x | 179.442 ms | 4.40% | 179.434 ms | 4.40% | 187.001M |
|   2^26 = 67108864 |     40x | 379.821 ms | 0.98% | 379.815 ms | 0.98% | 176.688M |
|  2^27 = 134217728 |     20x | 777.351 ms | 1.72% | 777.347 ms | 1.72% | 172.661M |
|  2^28 = 268435456 |     10x |    1.550 s | 0.99% |    1.550 s | 0.99% | 173.212M |
|  2^29 = 536870912 |      5x |    3.055 s | 0.41% |    3.055 s | 0.41% | 175.749M |
| 2^30 = 1073741824 |      3x |    6.315 s |  inf% |    6.315 s |  inf% | 170.018M |

After

|    string_size    | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  |
|-------------------|---------|------------|-------|------------|-------|----------|
|    2^20 = 1048576 |   1568x |   7.908 ms | 5.24% |   7.900 ms | 5.24% | 132.730M |
|    2^21 = 2097152 |    576x |  12.235 ms | 3.24% |  12.228 ms | 3.24% | 171.509M |
|    2^22 = 4194304 |    192x |  21.171 ms | 2.09% |  21.164 ms | 2.09% | 198.182M |
|    2^23 = 8388608 |     96x |  38.990 ms | 1.96% |  38.983 ms | 1.96% | 215.188M |
|   2^24 = 16777216 |    192x |  78.414 ms | 2.21% |  78.407 ms | 2.21% | 213.977M |
|   2^25 = 33554432 |     81x | 187.007 ms | 6.47% | 187.000 ms | 6.47% | 179.435M |
|   2^26 = 67108864 |     38x | 400.007 ms | 1.59% | 400.000 ms | 1.59% | 167.772M |
|  2^27 = 134217728 |     19x | 801.575 ms | 1.29% | 801.571 ms | 1.29% | 167.443M |
|  2^28 = 268435456 |     10x |    1.590 s | 0.42% |    1.590 s | 0.42% | 168.799M |
|  2^29 = 536870912 |      5x |    3.150 s | 0.40% |    3.150 s | 0.40% | 170.456M |
| 2^30 = 1073741824 |      3x |    6.402 s |  inf% |    6.402 s |  inf% | 167.712M |

Supported escape sequences:

\" represents the quotation mark character (U+0022).
\\ represents the reverse solidus character (U+005C).
\/ represents the solidus character (U+002F).
\b represents the backspace character (U+0008).
\f represents the form feed character (U+000C).
\n represents the line feed character (U+000A).
\r represents the carriage return character (U+000D).
\t represents the character tabulation (horizontal tab) character (U+0009).
\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, for code points in the Basic Multilingual Plane (BMP)
\uDDDD\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, a UTF-16 surrogate pair representing Unicode code points outside the BMP
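
As a worked example of the surrogate-pair form, here is a small sketch of how the two escapes above combine into a single code point; the function name is made up and this is not the libcudf escape-sequence handling:

```cpp
#include <cstdint>

// A single \uDDDD escape covers the Basic Multilingual Plane directly, while a
// high/low surrogate pair (\uD800-\uDBFF followed by \uDC00-\uDFFF) encodes code
// points above U+FFFF.
std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low)
{
  // Precondition: 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF.
  return 0x10000u + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10) +
         (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// Example: the escaped pair \uD83D\uDE00 decodes to U+1F600,
// i.e. combine_surrogates(0xD83D, 0xDE00) == 0x1F600.
```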

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot removed the CMake (CMake build issue) label Sep 15, 2022
Comment on lines -132 to +138
- /* TT_OOS */ {{{'{'}, {'['}, {'}'}, {']'}, {'x'}, {'x'}, {'x'}}},
- /* TT_STR */ {{{'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}}},
- /* TT_ESC */ {{{'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}}}}};
+ /* TT_OOS */ {{{'{'}, {'['}, {'}'}, {']'}, {}, {}, {}}},
+ /* TT_STR */ {{{}, {}, {}, {}, {}, {}, {}}},
+ /* TT_ESC */ {{{}, {}, {}, {}, {}, {}, {}}}}};
Contributor

Why is 'x' not needed anymore?

Contributor Author

The stack supports a sparse representation: (stack_op, op_index) pairs, where op_index gives the index at which that stack operation happens. All omitted indices are populated with whatever is on top of the stack at that point.

This is described in more detail in the logical stack PR description:
#11078 (comment)
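
For illustration, a small host-side sketch of that sparse-to-dense idea; the names and the exact push/pop convention are assumptions, not the actual cudf logical-stack kernel:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Expand the sparse (op_index, stack_op) representation into the dense per-character
// stack context: operations exist only at the indices where a brace/bracket is pushed
// or popped, and every omitted index takes whatever is on top of the stack. The ops
// vector is assumed to be sorted by index.
std::string expand_stack_context(std::vector<std::pair<std::size_t, char>> const& ops,
                                 std::size_t length)
{
  std::string dense(length, '_');  // '_' marks the root context (empty stack)
  std::vector<char> stack;
  std::size_t next = 0;
  for (std::size_t i = 0; i < length; ++i) {
    if (next < ops.size() && ops[next].first == i) {
      char const op = ops[next++].second;
      if (op == '{' || op == '[') {
        stack.push_back(op);
      } else if (!stack.empty()) {
        stack.pop_back();  // '}' or ']' pops the matching open symbol
      }
    }
    dense[i] = stack.empty() ? '_' : stack.back();
  }
  return dense;
}
```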

Comment on lines +1479 to +1491
  // Prepare iterator that returns (string_offset, string_length)-pairs needed by inference
  auto string_ranges_it =
    thrust::make_transform_iterator(offset_length_it, [] __device__(auto ip) {
      return thrust::pair<json_column::row_offset_t, std::size_t>{
        thrust::get<0>(ip), static_cast<std::size_t>(thrust::get<1>(ip))};
    });

  // Prepare iterator that returns (string_ptr, string_length)-pairs needed by type conversion
  auto string_spans_it = thrust::make_transform_iterator(
    offset_length_it, [data = d_input.data()] __device__(auto ip) {
      return thrust::pair<const char*, std::size_t>{
        data + thrust::get<0>(ip), static_cast<std::size_t>(thrust::get<1>(ip))};
    });
Contributor

Probably in another PR, unless it is trivial.

@karthikeyann left a comment

LGTM 👍

karthikeyann added a commit to karthikeyann/cudf that referenced this pull request Sep 18, 2022
@elstehle requested a review from vuule September 19, 2022 07:52
@karthikeyann mentioned this pull request Sep 19, 2022
@vuule left a comment

A few minor corrections. Approving to expedite the merge, with the assumption that these will be addressed.

@elstehle

@gpucibot merge

@rapids-bot bot merged commit 0ba4675 into rapidsai:branch-22.10 Sep 20, 2022
Labels
  • 3 - Ready for Review: Ready for review by team
  • cuIO: cuIO issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.