Adds JSON tokenizer #11264

Merged
merged 77 commits into from
Aug 6, 2022

@elstehle elstehle commented Jul 14, 2022

This PR builds on the Finite-State Transducer (FST) algorithm and the Logical Stack to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section.

This PR builds on:
⛓️ #11242
⛓️ #11078

Specifically, the tokenizer comprises the following processing steps:

  1. An FST emits a sequence of stack operations (i.e., push(LIST), push(STRUCT), pop(), read()). This FST transduces each semantic opening bracket or brace to a push(LIST) or push(STRUCT) operation, respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation.
  2. The sequence of stack operations from (1) is fed into the logical stack, which resolves what is on top of the stack before each operation from (1). After this stage, we know the top of the stack for every input character: STRUCT, LIST, or ROOT if the stack is empty.
  3. The top-of-stack information from (2) feeds a second FST. This stage can be considered a full pushdown automaton or DVPA, because it now also has stack context. State transitions are driven by the combination of the input character and the top-of-stack for that character. The output of this stage is the token stream: {beginning-of, end-of} x {struct, list}, field name, value, etc.
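
As a rough illustration of steps (1) and (2), the following sequential C++ sketch classifies each input character as a push, pop, or read and records what is on top of the stack before that operation. It is not the PR's implementation, which runs the FST and the logical stack in parallel on the GPU. The names `stack_top` and `top_of_stack` are hypothetical, and the sketch assumes that brackets and braces inside quoted strings are not semantic.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stack context: what is on top of the stack *before* each character.
enum class stack_top : char { ROOT = '_', LIST = '[', STRUCT = '{' };

// Sequential stand-in for steps (1) and (2) (hypothetical helper).
std::vector<stack_top> top_of_stack(std::string const& json)
{
  std::vector<stack_top> stack;    // the explicit stack
  std::vector<stack_top> context;  // one entry per input character
  context.reserve(json.size());
  bool in_string = false;  // inside a quoted string, brackets are plain data
  bool escaped   = false;  // previous character was a backslash in a string

  for (char c : json) {
    // Record the top-of-stack seen before this character's operation.
    context.push_back(stack.empty() ? stack_top::ROOT : stack.back());
    if (in_string) {
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;  // read()
    }
    switch (c) {
      case '"': in_string = true; break;                    // read()
      case '[': stack.push_back(stack_top::LIST); break;    // push(LIST)
      case '{': stack.push_back(stack_top::STRUCT); break;  // push(STRUCT)
      case ']':
      case '}':
        if (!stack.empty()) { stack.pop_back(); }  // pop()
        break;
      default: break;  // read()
    }
  }
  assert(context.size() == json.size());
  return context;
}
```

For the input `{"a":[1]}`, the `1` sees LIST on top of the stack, while the closing `}` sees STRUCT. The second FST of step (3) would consume exactly this pairing of character and stack context.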

@elstehle elstehle requested review from a team as code owners July 14, 2022 13:40
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jul 14, 2022
@elstehle elstehle added feature request New feature or request 3 - Ready for Review Ready for review by team cuIO cuIO issue and removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jul 14, 2022
@elstehle
Contributor Author

Looks good 👍 nit: should we get rid of d_ prefix on almost all of the arguments and variables, since all of them are device data anyway? (except few in unit tests)

If an argument is a pointer, the prefix is helpful to know where it points at a glance. I assume that we use device_spans instead, so the prefix is not required.

I've removed the d_ prefix for all vars where it's clear from the context that they're supposed to be device-accessible (e.g., for device_spans).

@elstehle
Contributor Author

rerun tests

@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Jul 28, 2022
@vuule vuule changed the base branch from branch-22.08 to branch-22.10 August 2, 2022 03:49
Member

@PointKernel PointKernel left a comment

LGreatTM!

Contributor

@bdice bdice left a comment

This is awesome work @elstehle. I have a few minor comments, consider applying some of the suggested changes if you agree. Otherwise LGTM!

/**
* @brief Definition of the symbol groups
*/
enum class dfa_symbol_group_id : uint32_t {
Contributor

Why do we use char for some enum classes but uint32_t here? There are certainly fewer than 256 options, but perhaps there are reasons to prefer a 4-byte-wide type for thread read alignment?

Contributor Author

@elstehle elstehle Aug 6, 2022

Thanks, I've fixed the type for symbol group ids. Generally, we prefer to have a smaller type. We want to "compress" the transition & translation tables, as we keep those tables in shared memory and, with smaller types, we increase the chance of broadcasts (from the same bank to different threads) and decrease the chance of bank conflicts.
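
To make the size argument concrete, here is a small host-side C++ sketch that compares a 1-byte and a 4-byte symbol group id. It is only an illustration: the enumerators, the 256-entry lookup table (one entry per input byte value), and the helper names are assumptions, not taken from the PR. It relies on shared memory being organized in 32 banks of 4-byte words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical symbol groups; the enumerators are illustrative only.
enum class wide_group_id : std::uint32_t { OPENING, CLOSING, QUOTE, OTHER };
enum class narrow_group_id : std::uint8_t { OPENING, CLOSING, QUOTE, OTHER };

// Assumption: one lookup entry per possible input byte value.
constexpr std::size_t num_symbols = 256;
// Shared memory layout: 32 banks, each 4 bytes wide.
constexpr std::size_t bank_width_bytes = 4;
constexpr std::size_t num_banks        = 32;

// Total footprint of the lookup table for a given id type.
template <typename GroupId>
constexpr std::size_t lookup_table_bytes()
{
  return sizeof(GroupId) * num_symbols;
}

// Bank that holds the lookup entry for a given input byte.
template <typename GroupId>
std::size_t bank_of_entry(unsigned char symbol)
{
  std::size_t const byte_offset = sizeof(GroupId) * symbol;
  assert(byte_offset < lookup_table_bytes<GroupId>());
  return (byte_offset / bank_width_bytes) % num_banks;
}

// True if the entries for two input bytes share one 4-byte word, in which
// case two threads reading them can be served by a single broadcast.
template <typename GroupId>
bool same_word(unsigned char a, unsigned char b)
{
  return (sizeof(GroupId) * a) / bank_width_bytes == (sizeof(GroupId) * b) / bank_width_bytes;
}
```

Under these assumptions the table shrinks from 1024 to 256 bytes. The entries for `'a'` and `'b'` share one word with the 1-byte type but not with the 4-byte type, and `'a'` and `'A'` map to the same bank (different words) only with the 4-byte type.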

@elstehle
Contributor Author

elstehle commented Aug 6, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e1a4e03 into rapidsai:branch-22.10 Aug 6, 2022
Contributor

@karthikeyann karthikeyann left a comment

Great work @elstehle and great code review!

rapids-bot bot pushed a commit that referenced this pull request Aug 12, 2022
This PR builds on the [JSON tokenizer](#11264) algorithm to implement an end-to-end JSON parser that parses to a `table_with_metadata`. 

**Chained PR depending on:** 
⛓️ #11264

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - https://github.com/nvdbaranec
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11388
rapids-bot bot pushed a commit that referenced this pull request Sep 19, 2022
Adds a GPU implementation of the JSON-token-stream to JSON-tree conversion.
Depends on PR [Adds JSON-token-stream to JSON-tree](#11291).




<details>

---
This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.  

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264).

**This PR depends on:**
⛓️ #11264
⛓️ #11242
⛓️ #11078

**Each node has one of the following categories:**

```
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
```

**For each node, the tree representation stores the following information:**
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index one past the last character from the original JSON input that this node demarcates)
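
The half-open range convention can be sketched in C++ as follows. The struct `node_info` and the helper `node_text` are hypothetical names used only for illustration; the sketch assumes the range is `[range_begin, range_end)` into the original input, as listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class node_category { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

// Hypothetical per-node record holding the four pieces of information.
struct node_info {
  node_category category;
  int level;                // depth of the node in the tree
  std::size_t range_begin;  // index of the first character
  std::size_t range_end;    // index one past the last character
};

// Returns the characters of the original input that the node demarcates.
std::string node_text(std::string const& json, node_info const& node)
{
  assert(node.range_begin <= node.range_end && node.range_end <= json.size());
  return json.substr(node.range_begin, node.range_end - node.range_begin);
}
```

With an input that starts with two spaces followed by `[{"category": "reference"`, a field-name node with range `[5, 13)` yields `category` and a string node with range `[17, 26)` yields `reference`.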

**An example tree:**
The following is just an example print of the information represented in the tree generated by the algorithm.

- Each line prints the full path to the next node in the tree.
- For each node along the path we have the following format: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`
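
A minimal C++ sketch of how such a line could be produced is shown here. The helpers `format_node` and `format_path` are hypothetical and only mirror the format described above; they are not the printing code used by the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Formats one node as <ID:CATEGORY:[BEGIN, END) 'TEXT'>.
std::string format_node(int node_id,
                        std::string const& category,
                        std::size_t begin,
                        std::size_t end,
                        std::string const& json)
{
  assert(begin <= end && end <= json.size());
  return "<" + std::to_string(node_id) + ":" + category + ":[" + std::to_string(begin) + ", " +
         std::to_string(end) + ") '" + json.substr(begin, end - begin) + "'>";
}

// Joins the formatted nodes along a path, root first, with " -> ".
std::string format_path(std::vector<std::string> const& nodes)
{
  std::string line;
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    if (i > 0) { line += " -> "; }
    line += nodes[i];
  }
  return line;
}
```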


**The original JSON for this tree:**
```
  [{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95},  {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}] 
```

**The tree:**
```
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
```

**The original JSON pretty-printed for this tree:**
```
[
    {
        "category": "reference",
        "index:": [
            4,
            12,
            42
        ],
        "author": "Nigel Rees",
        "title": "[Sayings of the Century]",
        "price": 8.95
    },
    {
        "category": "reference",
        "index": [
            4,
            {},
            null,
            {
                "a": [
                    {},
                    {}
                ]
            }
        ],
        "author": "Nigel Rees",
        "title": "{}[], <=semantic-symbols-string",
        "price": 8.95
    }
]
```
</details>

---

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - David Wendt (https://github.com/davidwendt)

URL: #11518
@karthikeyann karthikeyann mentioned this pull request Sep 24, 2022
3 tasks
rapids-bot bot pushed a commit that referenced this pull request Sep 27, 2022
This PR creates JSON columns from the traversed JSON tree. It has the following parts:
1. `reduce_to_column_tree` - reduces the node tree to a column tree by aggregating each property of each column and the number of rows in each column.
2. `make_json_column2` - creates the GPU JSON column tree structure from the tree and column info.
3. `json_column_to_cudf_column2` - converts this GPU JSON column to a cudf column.
4. `parse_nested_json2` - combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on the device.

Depends on PRs #11518 and #11610.
For code review, use PR karthikeyann#5, which contains only the changes of this PR.

### Overview

- PR #11264 Tokenizes the JSON string to Tokens
- PR #11518 Converts Tokens to Nodes (tree representation)
- PR #11610 Traverses this node tree --> assigns column id and row index to each node.
- This PR #11714 Converts this traversed tree into JSON Column, which in turn is translated to `cudf::column`

JSON has 5 categories of nodes: STRUCT, LIST, FIELD, VALUE, and STRING.
STRUCT and LIST are nested types.
FIELD nodes are the keys of struct columns.
A VALUE node is similar to a STRING node, but without double quotes. The actual data type conversion happens in `json_column_to_cudf_column2`.

The tree representation `tree_meta_t` has 4 data members:
1. node categories
2. parent node ids
3. node levels
4. node string ranges {begin, end} (as 2 vectors)
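
The struct-of-arrays layout can be pictured with the following host-side C++ sketch. It is an approximation for illustration only: `tree_meta_sketch`, its member names, the use of `std::vector` instead of device memory, and the `-1` sentinel for the root's parent are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the four data members (ranges as 2 vectors).
struct tree_meta_sketch {
  std::vector<char> node_categories;          // e.g. 'L', 'S', 'F', 'V'
  std::vector<int> parent_node_ids;           // assumed: -1 marks the root
  std::vector<int> node_levels;               // depth of each node
  std::vector<std::size_t> node_range_begin;  // first character of each node
  std::vector<std::size_t> node_range_end;    // one past the last character
};

// Follows parent ids from a node up to the root; returns the path root-first.
std::vector<int> path_to_node(tree_meta_sketch const& tree, int node_id)
{
  assert(node_id >= 0 && static_cast<std::size_t>(node_id) < tree.parent_node_ids.size());
  std::vector<int> path;
  for (int id = node_id; id != -1; id = tree.parent_node_ids[id]) {
    path.push_back(id);
  }
  std::reverse(path.begin(), path.end());
  return path;
}
```

For the input `[{"a":1}]`, the four nodes (list, struct, field name `a`, value `1`) form a single chain, so the path to the value node visits every node and its length equals the node's level plus one.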

The currently supported JSON formats are records-oriented JSON and JSON Lines.

### This PR - Detailed explanation
This PR has 3 steps, plus a top-level function that combines them.
1. `reduce_to_column_tree`
    - Required to compute the total number of columns, the column types, the nested column structure, and the number of rows in each column.
    - Generates the `tree_meta_t` data members for the column tree:
        - Sort the node tree by col_id (stable sort).
        - reduce_by_key with a custom op on node_categories collapses each column's node categories to a single column category.
        - unique_by_key_copy by col_id copies the first parent_node_id and string_ranges of each column. This parent_node_id is then transformed to parent_column_id.
        - reduce_by_key max on row_offsets gives the maximum row offset in each column. The max row offset of a list column's children is propagated to their children, because structs may sometimes miss entries, so the parent list gives the correct count.
2. `make_json_column2`
    - Converts nodes to GPU JSON columns in a tree structure:
        - Get the column tree and transfer the column names to the host.
        - Create a `d_json_column` for each non-field column.
        - If 2 columns occur on the same path, and one of them is nested and the other is a string column, discard the string column.
        - For STRUCT, LIST, VALUE, and STRING nodes, set the validity bits, and copy the string {begin, end} range to string_offsets and string length.
        - Compute the list offsets.
        - Perform a max scan on the offsets (to fill 0's with the previous offset value).
    - Now the `d_json_column` is nested and contains offsets, validity bits, and unparsed, unconverted string information.
3. `json_column_to_cudf_column2` - converts this GPU JSON column to a cudf column.
    - Recursively goes over each `d_json_column` and converts it to a `cudf::column` by inferring the type, parsing the string to that type, and setting further validity bits.
4. `parse_nested_json2` - combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on the device.
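
Two of the building blocks named above, the keyed max reduction on row offsets and the max scan that fills zero offsets, can be sketched sequentially in C++ as follows. This is a host-side stand-in under stated assumptions, not the PR's code: the PR describes device-side sort, reduce_by_key, and scan primitives, whereas the helpers `max_row_offset_per_column` and `fill_with_running_max` are hypothetical and use the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Sequential stand-in for "sort by col_id, then reduce_by_key max on row_offsets":
// returns the largest row offset seen for each column id.
std::map<int, int> max_row_offset_per_column(std::vector<int> const& col_ids,
                                             std::vector<int> const& row_offsets)
{
  assert(col_ids.size() == row_offsets.size());
  std::map<int, int> result;
  for (std::size_t i = 0; i < col_ids.size(); ++i) {
    std::map<int, int>::iterator it = result.find(col_ids[i]);
    if (it == result.end()) {
      result[col_ids[i]] = row_offsets[i];
    } else {
      it->second = std::max(it->second, row_offsets[i]);
    }
  }
  return result;
}

// Inclusive max scan: every entry becomes the maximum of itself and all
// earlier entries, so zero entries inherit the previous offset value.
std::vector<int> fill_with_running_max(std::vector<int> offsets)
{
  std::partial_sum(offsets.begin(), offsets.end(), offsets.begin(),
                   [](int a, int b) { return std::max(a, b); });
  return offsets;
}
```

For example, the offsets `0, 2, 0, 0, 5, 0` become `0, 2, 2, 2, 5, 5` after the scan. This only works because offsets are non-decreasing wherever they are set.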

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Elias Stehle (https://github.com/elstehle)
  - Yunsong Wang (https://github.com/PointKernel)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Tobias Ribizel (https://github.com/upsj)
  - https://github.com/nvdbaranec
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11714
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
8 participants