Adds JSON tokenizer #11264

elstehle · 2022-07-14T13:40:07Z

This PR builds on the Finite-State Transducer (FST) algorithm and the Logical Stack to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section.

This PR builds on:
⛓️ #11242
⛓️ #11078

Specifically, the tokenizer comprises the following processing steps:

1. An FST emits a sequence of stack operations (i.e., push(LIST), push(STRUCT), pop(), read()). This FST transduces each occurrence of an opening semantic bracket or brace to a push(LIST) or push(STRUCT) operation, respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation.
2. The sequence of stack operations from (1) is fed into the logical stack, which resolves what is on top of the stack before each operation from (1) (i.e., STRUCT, LIST). After this stage, for every input character we know what is on top of the stack: either a STRUCT, a LIST, or ROOT if there is no symbol on top of the stack.
3. We use the top-of-stack information from (2) for a second FST. This part can be considered a full pushdown or DVPA (because now, we also have stack context). State transitions are caused by the combination of the input character and the top-of-stack for that character. The output of this stage is the token stream: {beginning-of, end-of} x {struct, list}, field name, value, etc.
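
To make the first two stages concrete, here is a minimal sequential sketch in plain C++. It is not the PR's GPU implementation: the names `stack_op`, `stack_top`, `to_stack_ops`, and `resolve_tops` are hypothetical, and a simple loop stands in for the parallel FST and logical stack. It only illustrates which operation each character maps to and which top-of-stack symbol each character observes.

```cpp
#include <string>
#include <vector>

// Hypothetical names for illustration; not the identifiers used in the PR.
enum class stack_op { push_list, push_struct, pop, read };
enum class stack_top { root, list, strct };

// Stage 1: transduce each input character to a stack operation.
// Brackets and braces inside quoted strings are not semantic, so they map to read.
std::vector<stack_op> to_stack_ops(std::string const& json)
{
  std::vector<stack_op> ops;
  ops.reserve(json.size());
  bool in_string = false;
  bool escaped   = false;
  for (char c : json) {
    stack_op op = stack_op::read;
    if (in_string) {
      if (escaped) {
        escaped = false;  // the escaped character never ends the string
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
    } else {
      switch (c) {
        case '"': in_string = true; break;
        case '[': op = stack_op::push_list; break;
        case '{': op = stack_op::push_struct; break;
        case ']':
        case '}': op = stack_op::pop; break;
        default: break;
      }
    }
    ops.push_back(op);
  }
  return ops;
}

// Stage 2: resolve what is on top of the stack *before* each operation.
std::vector<stack_top> resolve_tops(std::vector<stack_op> const& ops)
{
  std::vector<stack_top> tops;
  tops.reserve(ops.size());
  std::vector<stack_top> stack;
  for (stack_op op : ops) {
    tops.push_back(stack.empty() ? stack_top::root : stack.back());
    if (op == stack_op::push_list) {
      stack.push_back(stack_top::list);
    } else if (op == stack_op::push_struct) {
      stack.push_back(stack_top::strct);
    } else if (op == stack_op::pop && !stack.empty()) {
      stack.pop_back();
    }
  }
  return tops;
}
```

For the input `[{"a":"]"}]`, the `]` inside the string is transduced to a read, and every character between the opening `{` and its closing `}` observes STRUCT on top of the stack. The second FST from step (3) would then consume these (character, top-of-stack) pairs.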

elstehle · 2022-07-28T10:02:04Z

Looks good 👍 nit: should we get rid of the d_ prefix on almost all of the arguments and variables, since all of them are device data anyway (except a few in unit tests)?

If an argument is a pointer, the prefix is helpful to know where it points at a glance. I assume that we use device_spans instead, so the prefix is not required.

I've removed the d_ prefix for all variables where it's clear from the context that they're supposed to be device-accessible (e.g., for device_spans).

elstehle · 2022-07-28T11:01:41Z

rerun tests

PointKernel

LGreatTM!

bdice

This is awesome work @elstehle. I have a few minor comments, consider applying some of the suggested changes if you agree. Otherwise LGTM!

bdice · 2022-08-05T17:40:55Z

cpp/src/io/json/nested_json_gpu.cu

+/**
+ * @brief Definition of the symbol groups
+ */
+enum class dfa_symbol_group_id : uint32_t {


Why do we use char for some enum classes but uint32_t here? There are certainly fewer than 256 options, but perhaps there are reasons to prefer a 4-byte-wide type for thread read alignment?

Thanks, I've fixed the type for symbol group ids. Generally, we prefer to have a smaller type. We want to "compress" the transition & translation tables, as we keep those tables in shared memory and, with smaller types, we increase the chance of broadcasts (from the same bank to different threads) and decrease the chance of bank conflicts.
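
The size argument in the reply above can be illustrated with a small sketch. The enumerators, the state count, and the helper `table_bytes` are hypothetical and chosen only for the arithmetic; they are not the PR's actual symbol groups or table layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical symbol groups, for illustration only.
enum class narrow_symbol_group : uint8_t { opening, closing, quote, other, count };
enum class wide_symbol_group : uint32_t { opening, closing, quote, other, count };

// Hypothetical DFA dimensions: one table entry per (state, symbol group) pair.
constexpr std::size_t num_states = 8;
constexpr std::size_t num_groups = static_cast<std::size_t>(narrow_symbol_group::count);

// Number of bytes a dense transition table occupies for a given entry type.
template <typename EntryT>
constexpr std::size_t table_bytes()
{
  return num_states * num_groups * sizeof(EntryT);
}

static_assert(sizeof(narrow_symbol_group) == 1, "uint8_t-backed enum is one byte");
static_assert(sizeof(wide_symbol_group) == 4, "uint32_t-backed enum is four bytes");
static_assert(table_bytes<uint8_t>() == 32, "8 states x 4 groups x 1 byte");
static_assert(table_bytes<uint32_t>() == 128, "8 states x 4 groups x 4 bytes");
```

With one-byte entries the same table is four times smaller, so more lookups from different threads land in the same shared-memory word, which is the effect the reply describes.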

elstehle · 2022-08-06T10:21:24Z

@gpucibot merge

karthikeyann

Great work @elstehle and great code review!

This PR builds on the [JSON tokenizer](#11264) algorithm to implement an end-to-end JSON parser that parses to a `table_with_metadata`.

**Chained PR depending on:** ⛓️ #11264

Authors:
- Elias Stehle (https://github.com/elstehle)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- https://github.com/nvdbaranec
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #11388

Adds GPU implementation of JSON-token-stream to JSON-tree

Depends on PR [Adds JSON-token-stream to JSON-tree](#11291) #11291

<details>

---

This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264).

**This PR depends on:**
⛓️ #11264
⛓️ #11242
⛓️ #11078

**Each node has one of the following categories:**

```
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
```

**For each node, the tree representation stores the following information:**
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index of one past the last character from the original JSON input that this node demarcates)

**An example tree:**

The following is just an example print of the information represented in the tree generated by the algorithm.
- Each line prints the full path to the next node in the tree.
- For each node along the path we have the following format: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`

**The original JSON for this tree:**

```
[{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95}, {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}]
```

**The tree:**

```
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
```

**The original JSON pretty-printed for this tree:**

```
[
  {
    "category": "reference",
    "index:": [
      4,
      12,
      42
    ],
    "author": "Nigel Rees",
    "title": "[Sayings of the Century]",
    "price": 8.95
  },
  {
    "category": "reference",
    "index": [
      4,
      {},
      null,
      {
        "a": [
          {},
          {}
        ]
      }
    ],
    "author": "Nigel Rees",
    "title": "{}[], <=semantic-symbols-string",
    "price": 8.95
  }
]
```

</details>

---

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Michael Wang (https://github.com/isVoid)
- David Wendt (https://github.com/davidwendt)

URL: #11518
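
The per-node information listed in the #11518 description above (category, level, range begin, range end) could be held in a structure of arrays. The following host-side sketch is an assumption for illustration: the `json_tree` struct, `category_name`, and `format_node` are hypothetical names, not the PR's types. It reproduces the `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>` format for a single node.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Categories as listed in the PR description.
enum class node_category : uint8_t { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

// Hypothetical structure-of-arrays layout: entry i of each vector describes node i.
struct json_tree {
  std::vector<node_category> categories;
  std::vector<int32_t> levels;
  std::vector<int32_t> range_begin;  // index of the node's first character
  std::vector<int32_t> range_end;    // index one past the node's last character
};

// Short category label as used in the example print.
inline char const* category_name(node_category c)
{
  switch (c) {
    case node_category::NC_STRUCT: return "STRUCT";
    case node_category::NC_LIST: return "LIST";
    case node_category::NC_FN: return "FN";
    case node_category::NC_STR: return "STR";
    case node_category::NC_VAL: return "VAL";
    default: return "ERR";
  }
}

// Formats one node as <ID:CATEGORY:[BEGIN, END) 'TEXT'>.
inline std::string format_node(json_tree const& tree, std::string const& input, std::size_t node)
{
  auto const begin = static_cast<std::size_t>(tree.range_begin[node]);
  auto const end   = static_cast<std::size_t>(tree.range_end[node]);
  return "<" + std::to_string(node) + ":" + category_name(tree.categories[node]) + ":[" +
         std::to_string(begin) + ", " + std::to_string(end) + ") '" +
         input.substr(begin, end - begin) + "'>";
}
```

Because the range is half-open, the node's text is simply `input.substr(begin, end - begin)`, and its length is `end - begin`.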

This PR creates JSON columns from the traversed JSON tree. It has the following parts:
1. `reduce_to_column_tree` - Reduce node tree into column tree by aggregating each property of each column and number of rows in each column.
2. `make_json_column2` - creates the GPU json column tree structure from tree and column info
3. `json_column_to_cudf_column2` - converts this GPU json column to cudf column.
4. `parse_nested_json2` - combines all json tokenizer, json tree generation, traversal, json column creation, cudf column conversion together. All steps run on device.

Depends on PR #11518 #11610

For code-review, use PR karthikeyann#5 which contains only these changes.

### Overview
- PR #11264 Tokenizes the JSON string to Tokens
- PR #11518 Converts Tokens to Nodes (tree representation)
- PR #11610 Traverses this node tree --> assigns column id and row index to each node.
- This PR #11714 Converts this traversed tree into JSON Column, which in turn is translated to `cudf::column`

JSON has 5 categories of nodes: STRUCT, LIST, FIELD, VALUE, STRING. STRUCT and LIST are nested types. FIELD nodes are struct columns' keys. VALUE node is similar to STRING column but without double quotes. Actual datatype conversion happens in `json_column_to_cudf_column2`.

Tree Representation `tree_meta_t` has 4 data members.
1. node categories
2. node parents' id
3. node level
4. node's string range {begin, end} (as 2 vectors)

Currently supported JSON formats are records orient, and JSON lines.

### This PR - Detailed explanation
This PR has 3 steps.
1. `reduce_to_column_tree`
   - Required to compute total number of columns, column type, nested column structure, and number of rows in each column.
   - Generates `tree_meta_t` data members for column.
     - Sort node tree by col_id (stable sort)
     - reduce_by_key custom_op on node_categories, collapses to column category
     - unique_by_key_copy by col_id, copies first parent_node_id, string_ranges. This parent_node_id will be transformed to parent_column_id.
     - reduce_by_key max on row_offsets gives maximum row offset in each column. Propagate list column children's max row offset to their children because sometimes structs may miss entries, so parent list gives correct count.
2. `make_json_column2`
   - Converts nodes to GPU json columns in tree structure
     - get column tree, transfer column names to host.
     - Create `d_json_column` for non-field columns.
     - if 2 columns occur on the same path, and one of them is nested and the other is a string column, discard the string column.
     - For STRUCT, LIST, VALUE, STRING nodes, set the validity bits, and copy string {begin, end} range to string_offsets and string length.
     - Compute list offset
     - Perform scan max operation on offsets. (to fill 0's with previous offset value).
   - Now the `d_json_column` is nested, and contains offsets, validity bits, unparsed unconverted string information.
3. `json_column_to_cudf_column2` - converts this GPU json column to cudf column.
   - Recursively goes over each `d_json_column` and converts to `cudf::column` by inferring the type, parsing the string to type, and setting validity bits further.
4. `parse_nested_json2` - combines all json tokenizer, json tree generation, traversal, json column creation, cudf column conversion together. All steps run on device.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Elias Stehle (https://github.com/elstehle)
- Yunsong Wang (https://github.com/PointKernel)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Tobias Ribizel (https://github.com/upsj)
- https://github.com/nvdbaranec
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #11714
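
The "scan max operation on offsets" mentioned in the #11714 description above can be sketched on the host as a running maximum. This is an illustrative assumption rather than the PR's code: `fill_offsets` is a hypothetical helper, it uses `std::partial_sum` with a max operator in place of a device-side inclusive scan, and it assumes that the offsets already written are non-decreasing and that unset slots hold 0.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Replaces every unset (0) slot with the most recent offset before it,
// by computing a running maximum over the offsets.
std::vector<int32_t> fill_offsets(std::vector<int32_t> offsets)
{
  std::partial_sum(offsets.begin(),
                   offsets.end(),
                   offsets.begin(),
                   [](int32_t running_max, int32_t value) { return std::max(running_max, value); });
  return offsets;
}
```

For example, the offsets `0, 2, 0, 0, 5, 0` become `0, 2, 2, 2, 5, 5`: each slot that was never written inherits the previous offset.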

elstehle added 19 commits July 13, 2022 00:53

squashed with bracket/brace test

0557d41

clean up & addressing review comments

355d1e4

refactored lookup tables

39a6b65

put lookup tables into their own cudf file

239f138

Change interface for FST to not need temp storage

39cff80

removing unused var post-cleanup

e24a133

unified usage of pragma unrolls

caf6195

Adding hostdevice macros to in-reg array

ea79a81

making const vars const

17dcbfd

refactor lut sanity check

6fdd24a

fixes sg-count & uses rmm stream in fst tests

eccf970

minor doxygen fix

9fe8e4b

adopts suggested fst test changes

694a365

adopts device-side test data gen

f656f49

adopts c++17 namespaces declarations

485a1c6

removes state vector-wrapper in favor of vanilla array

5f1c4b5

some west-const remainders & unifies StateIndexT

e6f8def

adds check for state transition narrowing conversion

a798852

fixes logical stack test includes

eb24962

elstehle requested review from a team as code owners July 14, 2022 13:40

elstehle requested review from vyasr and rgsl888prabhu July 14, 2022 13:40

github-actions bot added the CMake and libcudf labels Jul 14, 2022

elstehle added the feature request, 3 - Ready for Review, and cuIO labels and removed the libcudf and CMake labels Jul 14, 2022

elstehle mentioned this pull request Jul 28, 2022

Adds the end-to-end JSON parser implementation #11388

Merged

3 tasks

PointKernel reviewed Jul 28, 2022

GregoryKimball added this to the Nested JSON reader milestone Jul 28, 2022

elstehle mentioned this pull request Jul 29, 2022

Feature/json to columnar elstehle/cudf#2

Closed

vuule changed the base branch from branch-22.08 to branch-22.10 August 2, 2022 03:49

elstehle added 6 commits August 2, 2022 04:01

Merge remote-tracking branch 'upstream/branch-22.08' into feature/jso…

c69fdfe

…n-tokenizer

enum class everything

753b2d6

Merge remote-tracking branch 'upstream/branch-22.10' into feature/jso…

45e4a6d

…n-tokenizer

adds test case for utf8 inputs

c1b6002

Merge remote-tracking branch 'upstream/branch-22.10' into feature/jso…

6ca52e7

…n-tokenizer

moves tables from vector to array

b5030b9

bdice reviewed Aug 5, 2022

PointKernel approved these changes Aug 5, 2022

bdice approved these changes Aug 5, 2022

elstehle added 2 commits August 5, 2022 22:50

Merge remote-tracking branch 'upstream/branch-22.10' into feature/jso…

4d7a80a

…n-tokenizer

improves a few code comments

ff3600b

rapids-bot bot merged commit e1a4e03 into rapidsai:branch-22.10 Aug 6, 2022

karthikeyann reviewed Aug 6, 2022

karthikeyann mentioned this pull request Aug 11, 2022

Adds GPU implementation of JSON-token-stream to JSON-tree #11518

Merged

3 tasks

karthikeyann mentioned this pull request Sep 24, 2022

JSON Column creation in GPU #11714

Merged

3 tasks
