
Adds type inference and type conversion for leaf-columns to the nested JSON parser #11574

Merged

Conversation


@elstehle commented Aug 22, 2022

Description

Adds type inference and type conversion for leaf-columns to the nested JSON parser

Note to the reviewers:
We are dealing with two different stages of quote-stripping here.

  1. Including/excluding quotes in the tokenizer stage (currently always set to true using a constexpr bool)
  2. Including/excluding quotes in the type conversion stage

Currently, we always include quotes in the tokenizer stage (1), so that the type-casting stage (2) can differentiate between string values and literals (e.g. [true, "true"]). Based on the user-provided json_reader_options::keep_quotes option, stage (2) then either strips the quotes or keeps them in the values returned to the user.
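
For illustration, here is a minimal sketch of that stage-(2) decision, assuming the field is visible as a plain character range; the helper name below is made up and this is not the actual libcudf implementation:

```cpp
#include <string_view>

// Sketch of the stage-(2) decision (NOT the actual libcudf code). Because the
// tokenizer always keeps the enclosing quotes, the string value "true" arrives
// here as "\"true\"" while the literal true arrives as "true", so the two can
// be told apart before casting.
std::string_view to_output_value(std::string_view field, bool keep_quotes)
{
  bool const is_quoted = field.size() >= 2 && field.front() == '"' && field.back() == '"';
  if (is_quoted && !keep_quotes) {
    return field.substr(1, field.size() - 2);  // strip the enclosing quotes
  }
  return field;  // literals, and quoted values when keep_quotes == true, pass through
}

// For the input [true, "true"]:
//   to_output_value("true", false)      -> true   (literal; later cast to bool)
//   to_output_value("\"true\"", false)  -> true   (string value, quotes stripped)
//   to_output_value("\"true\"", true)   -> "true" (string value, quotes kept)
```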

In addition to adding type inference and type casting:

  • Switches the logic for inferring nested columns: any column that contains at least one nested item (list or struct) is inferred as that nested type, and all other non-nested items of that column become invalid. E.g., [null,{"a":1},"foo"] => List<Struct<a:int>> with struct column validity 0, 1, 0 (see the sketch after this list).
  • Adds a keep_quotes option to differentiate between string values and numeric & literal values (e.g., 123.4, true, false, null).
  • Migrates the libcudf test to a cudf test to avoid having large byte BLOBs in the source file.
  • Changes the column order to match the behaviour of pandas and the existing JSON lines reader. That is, column order corresponds to the order in which columns were discovered: [{"b":1, "c":1}, {"a":1}] => order: <b, c, a>
  • Support for escape sequences (see below)
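
A minimal sketch of the nested-column inference rule from the first bullet, under the assumption that each row of a column has already been classified; the enum, struct, and function names are made up and this is not the actual libcudf code:

```cpp
#include <cstdint>
#include <vector>

// Illustrative per-row categories for one JSON column.
enum class row_kind : std::uint8_t { null, value, object, array };

struct inferred_column {
  row_kind type;               // inferred column type
  std::vector<bool> validity;  // per-row validity mask
};

// If any row of the column holds a nested item (object/struct or array/list), the
// whole column is inferred as that nested type, and every row that is not of that
// type is marked invalid.
inferred_column infer_column(std::vector<row_kind> const& rows)
{
  row_kind type = row_kind::value;
  for (auto r : rows) {
    if (r == row_kind::object || r == row_kind::array) {
      type = r;
      break;
    }
  }
  std::vector<bool> validity;
  validity.reserve(rows.size());
  for (auto r : rows) {
    validity.push_back(r == type);
  }
  return {type, validity};
}

// The child column of [null, {"a":1}, "foo"] has rows {null, object, value}, so it is
// inferred as a struct column with validity {0, 1, 0}.
```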

Performance comparison

Tokenizer

The following is a comparison of the JSON tokenizer stage before this PR and after:

Before

# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples | CPU Time  | Noise | GPU Time  | Noise |  Elem/s  |
|-------------------|---------|-----------|-------|-----------|-------|----------|
|    2^20 = 1048576 |   2176x |  2.489 ms | 9.62% |  2.480 ms | 9.61% | 422.729M |
|    2^21 = 2097152 |   1936x |  2.501 ms | 7.14% |  2.492 ms | 7.12% | 841.482M |
|    2^22 = 4194304 |   1152x |  2.612 ms | 5.43% |  2.604 ms | 5.42% |   1.611G |
|    2^23 = 8388608 |   1456x |  2.855 ms | 4.26% |  2.847 ms | 4.23% |   2.947G |
|   2^24 = 16777216 |   1104x |  3.395 ms | 5.34% |  3.387 ms | 5.33% |   4.954G |
|   2^25 = 33554432 |    560x |  4.410 ms | 2.25% |  4.402 ms | 2.25% |   7.623G |
|   2^26 = 67108864 |   1552x |  6.482 ms | 2.23% |  6.473 ms | 2.22% |  10.367G |
|  2^27 = 134217728 |   1435x | 10.430 ms | 2.70% | 10.422 ms | 2.70% |  12.879G |
|  2^28 = 268435456 |    815x | 18.396 ms | 1.95% | 18.387 ms | 1.95% |  14.599G |
|  2^29 = 536870912 |     15x | 34.389 ms | 0.42% | 34.381 ms | 0.42% |  15.615G |
| 2^30 = 1073741824 |     11x | 66.097 ms | 0.20% | 66.088 ms | 0.20% |  16.247G |

After

# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  |
|-------------------|---------|------------|--------|------------|--------|----------|
|    2^20 = 1048576 |   1408x |   2.600 ms | 11.28% |   2.592 ms | 11.26% | 404.547M |
|    2^21 = 2097152 |    800x |   2.838 ms |  7.68% |   2.829 ms |  7.67% | 741.243M |
|    2^22 = 4194304 |   2752x |   3.719 ms |  9.24% |   3.710 ms |  9.23% |   1.130G |
|    2^23 = 8388608 |    128x |   4.855 ms |  3.38% |   4.846 ms |  3.37% |   1.731G |
|   2^24 = 16777216 |    720x |   7.029 ms |  4.67% |   7.021 ms |  4.66% |   2.390G |
|   2^25 = 33554432 |    832x |  10.760 ms |  3.83% |  10.751 ms |  3.83% |   3.121G |
|   2^26 = 67108864 |    576x |  17.961 ms |  2.86% |  17.953 ms |  2.86% |   3.738G |
|  2^27 = 134217728 |    461x |  32.550 ms |  2.13% |  32.542 ms |  2.13% |   4.124G |
|  2^28 = 268435456 |    243x |  61.813 ms |  1.60% |  61.805 ms |  1.60% |   4.343G |
|  2^29 = 536870912 |    125x | 120.445 ms |  1.21% | 120.437 ms |  1.21% |   4.458G |
| 2^30 = 1073741824 |     66x | 228.833 ms |  0.75% | 228.825 ms |  0.75% |   4.692G |

JSON Parser

The overall parser performance is impacted, as expected, since we now also perform type conversion instead of just returning string columns.

Before

# Benchmark Results

## nested_json_gpu_parser

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  |
|-------------------|---------|------------|-------|------------|-------|----------|
|    2^20 = 1048576 |   1040x |   7.361 ms | 5.61% |   7.353 ms | 5.61% | 142.614M |
|    2^21 = 2097152 |    832x |  11.549 ms | 3.63% |  11.541 ms | 3.63% | 181.708M |
|    2^22 = 4194304 |    740x |  20.264 ms | 2.98% |  20.257 ms | 2.98% | 207.054M |
|    2^23 = 8388608 |    407x |  36.844 ms | 2.26% |  36.837 ms | 2.26% | 227.724M |
|   2^24 = 16777216 |     80x |  75.590 ms | 1.95% |  75.582 ms | 1.95% | 221.974M |
|   2^25 = 33554432 |     80x | 179.442 ms | 4.40% | 179.434 ms | 4.40% | 187.001M |
|   2^26 = 67108864 |     40x | 379.821 ms | 0.98% | 379.815 ms | 0.98% | 176.688M |
|  2^27 = 134217728 |     20x | 777.351 ms | 1.72% | 777.347 ms | 1.72% | 172.661M |
|  2^28 = 268435456 |     10x |    1.550 s | 0.99% |    1.550 s | 0.99% | 173.212M |
|  2^29 = 536870912 |      5x |    3.055 s | 0.41% |    3.055 s | 0.41% | 175.749M |
| 2^30 = 1073741824 |      3x |    6.315 s |  inf% |    6.315 s |  inf% | 170.018M |

After

|    string_size    | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  |
|-------------------|---------|------------|-------|------------|-------|----------|
|    2^20 = 1048576 |   1568x |   7.908 ms | 5.24% |   7.900 ms | 5.24% | 132.730M |
|    2^21 = 2097152 |    576x |  12.235 ms | 3.24% |  12.228 ms | 3.24% | 171.509M |
|    2^22 = 4194304 |    192x |  21.171 ms | 2.09% |  21.164 ms | 2.09% | 198.182M |
|    2^23 = 8388608 |     96x |  38.990 ms | 1.96% |  38.983 ms | 1.96% | 215.188M |
|   2^24 = 16777216 |    192x |  78.414 ms | 2.21% |  78.407 ms | 2.21% | 213.977M |
|   2^25 = 33554432 |     81x | 187.007 ms | 6.47% | 187.000 ms | 6.47% | 179.435M |
|   2^26 = 67108864 |     38x | 400.007 ms | 1.59% | 400.000 ms | 1.59% | 167.772M |
|  2^27 = 134217728 |     19x | 801.575 ms | 1.29% | 801.571 ms | 1.29% | 167.443M |
|  2^28 = 268435456 |     10x |    1.590 s | 0.42% |    1.590 s | 0.42% | 168.799M |
|  2^29 = 536870912 |      5x |    3.150 s | 0.40% |    3.150 s | 0.40% | 170.456M |
| 2^30 = 1073741824 |      3x |    6.402 s |  inf% |    6.402 s |  inf% | 167.712M |

Supported escape sequences:

\" represents the quotation mark character (U+0022).
\\ represents the reverse solidus character (U+005C).
\/ represents the solidus character (U+002F).
\b represents the backspace character (U+0008).
\f represents the form feed character (U+000C).
\n represents the line feed character (U+000A).
\r represents the carriage return character (U+000D).
\t represents the character tabulation (horizontal tab) character (U+0009).
\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, for code points in the Basic Multilingual Plane (BMP)
\uDDDD\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, a UTF-16 surrogate pair representing Unicode code points outside the BMP
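
As a worked example of the surrogate-pair form, here is a small sketch of how the two escapes above combine into a single code point; the function name is made up and this is not the libcudf escape-sequence handling:

```cpp
#include <cstdint>

// A single \uDDDD escape covers the Basic Multilingual Plane directly, while a
// high/low surrogate pair (\uD800-\uDBFF followed by \uDC00-\uDFFF) encodes code
// points above U+FFFF.
std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low)
{
  // Precondition: 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF.
  return 0x10000u + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10) +
         (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// Example: the escaped pair \uD83D\uDE00 decodes to U+1F600,
// i.e. combine_surrogates(0xD83D, 0xDE00) == 0x1F600.
```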

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot removed the CMake (CMake build issue) label Sep 15, 2022
Comment on lines -132 to +138
- /* TT_OOS */ {{{'{'}, {'['}, {'}'}, {']'}, {'x'}, {'x'}, {'x'}}},
- /* TT_STR */ {{{'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}}},
- /* TT_ESC */ {{{'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}, {'x'}}}}};
+ /* TT_OOS */ {{{'{'}, {'['}, {'}'}, {']'}, {}, {}, {}}},
+ /* TT_STR */ {{{}, {}, {}, {}, {}, {}, {}}},
+ /* TT_ESC */ {{{}, {}, {}, {}, {}, {}, {}}}}};
Contributor

Why is 'x' not needed anymore?

Contributor Author

The stack supports a sparse representation: (stack_op, op_index) pairs, where op_index gives the index at which that stack operation happens. All omitted indices are populated with whatever is on top of the stack at that point.

This is described in more detail in the logical stack PR description:
#11078 (comment)
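
For illustration, a small host-side sketch of that sparse-to-dense idea; the names and the exact push/pop convention are assumptions, not the actual cudf logical-stack kernel:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Expand the sparse (op_index, stack_op) representation into the dense per-character
// stack context: operations exist only at the indices where a brace/bracket is pushed
// or popped, and every omitted index takes whatever is on top of the stack. The ops
// vector is assumed to be sorted by index.
std::string expand_stack_context(std::vector<std::pair<std::size_t, char>> const& ops,
                                 std::size_t length)
{
  std::string dense(length, '_');  // '_' marks the root context (empty stack)
  std::vector<char> stack;
  std::size_t next = 0;
  for (std::size_t i = 0; i < length; ++i) {
    if (next < ops.size() && ops[next].first == i) {
      char const op = ops[next++].second;
      if (op == '{' || op == '[') {
        stack.push_back(op);
      } else if (!stack.empty()) {
        stack.pop_back();  // '}' or ']' pops the matching open symbol
      }
    }
    dense[i] = stack.empty() ? '_' : stack.back();
  }
  return dense;
}
```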

Comment on lines +1479 to +1491
  // Prepare iterator that returns (string_offset, string_length)-pairs needed by inference
  auto string_ranges_it =
    thrust::make_transform_iterator(offset_length_it, [] __device__(auto ip) {
      return thrust::pair<json_column::row_offset_t, std::size_t>{
        thrust::get<0>(ip), static_cast<std::size_t>(thrust::get<1>(ip))};
    });

  // Prepare iterator that returns (string_ptr, string_length)-pairs needed by type conversion
  auto string_spans_it = thrust::make_transform_iterator(
    offset_length_it, [data = d_input.data()] __device__(auto ip) {
      return thrust::pair<const char*, std::size_t>{
        data + thrust::get<0>(ip), static_cast<std::size_t>(thrust::get<1>(ip))};
    });
Contributor

Probably in another PR, unless it is trivial.

@karthikeyann left a comment

LGTM 👍

karthikeyann added a commit to karthikeyann/cudf that referenced this pull request Sep 18, 2022
@elstehle requested a review from vuule September 19, 2022 07:52
@karthikeyann mentioned this pull request Sep 19, 2022
@vuule left a comment

A few minor corrections. Approving to expedite the merge, with the assumption that these will be addressed.

@elstehle

@gpucibot merge

@rapids-bot bot merged commit 0ba4675 into rapidsai:branch-22.10 Sep 20, 2022
Labels
  • 3 - Ready for Review: Ready for review by team
  • cuIO: cuIO issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.