Adds type inference and type conversion for leaf-columns to the nested JSON parser

**Note to the reviewers**:
Two different stages of quote-stripping are involved here:

1. Including/excluding quotes in the tokenizer stage (currently always set to `true` using a `constexpr bool`)
2. Including/excluding quotes in the type conversion stage

Currently, the tokenizer stage (1) always includes quotes, so that the type conversion stage (2) can differentiate between string values and literals (e.g. `[true, "true"]`). Based on the user-provided choice in `json_reader_options::keep_quotes`, stage (2) then either strips the quotes or keeps them in the values returned to the user.
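
The interplay of the two stages can be illustrated with a small, self-contained sketch. It is not the libcudf implementation: `value_kind`, `classify`, and `convert` are hypothetical names introduced here, and the sketch assumes each token reaches the conversion stage as a plain string that still carries its quotes.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper types, not part of libcudf.
enum class value_kind { String, Literal };

// A token that still carries its quotes from the tokenizer stage is a string value;
// everything else (numbers, true, false, null) is treated as a literal.
inline value_kind classify(std::string_view token)
{
  bool const quoted = token.size() >= 2 && token.front() == '"' && token.back() == '"';
  return quoted ? value_kind::String : value_kind::Literal;
}

// Returns the value as it would be handed back to the user: string values lose
// their quotes unless keep_quotes is set, literals pass through unchanged.
inline std::string convert(std::string_view token, bool keep_quotes)
{
  if (classify(token) == value_kind::String && !keep_quotes) {
    return std::string{token.substr(1, token.size() - 2)};
  }
  return std::string{token};
}
```

With `keep_quotes` disabled, `convert("\"true\"", false)` and `convert("true", false)` both yield `true`, which is why the distinction has to be made before the quotes are stripped.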

**In addition to adding type inference and type casting:**
- Switches the logic for inferring nested columns: any column with at least one nested item (list or struct) is inferred as that nested column type, and all other _non-nested_ items of that column become invalid. E.g., `[null,{"a":1},"foo"] => List<Struct<a:int>> with struct col validity: 0, 1, 0`
- Adds the `keep_quotes` option to differentiate between string values and numeric & literal values, such as `123.4`, `true`, `false`, `null`.
- Migrates the libcudf test to a cudf test to avoid having large byte BLOBs in the source file
- Changes the column order to match the behaviour of pandas and the existing JSON lines reader. That is, columns appear in the order in which they were discovered: `[{"b":1, "c":1}, {"a":1}] => order: <b, c, a>`
- Support for escape sequences (see below)
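
The nested-column inference rule from the first bullet can be sketched as follows. This is a simplified host-side illustration, not the parser's code: `item_kind`, `inferred_column`, and `infer_column` are hypothetical names, and picking the first nested kind when a column mixes lists and structs is an assumption the description does not cover.

```cpp
#include <vector>

// Hypothetical row classification, not part of libcudf.
enum class item_kind { Null, Value, Struct, List };

struct inferred_column {
  item_kind type;              // inferred column type
  std::vector<bool> validity;  // one entry per row
};

inline bool is_nested(item_kind k) { return k == item_kind::Struct || k == item_kind::List; }

inline inferred_column infer_column(std::vector<item_kind> const& rows)
{
  // A column is a leaf (value) column unless at least one nested item shows up.
  item_kind type = item_kind::Value;
  for (auto k : rows) {
    if (is_nested(k)) {
      type = k;  // assumption: the first nested kind encountered wins
      break;
    }
  }

  // Nulls are never valid; any other row is valid only if it matches the column type.
  std::vector<bool> validity;
  validity.reserve(rows.size());
  for (auto k : rows) {
    validity.push_back(k != item_kind::Null && k == type);
  }
  return {type, validity};
}
```

For the rows of `[null,{"a":1},"foo"]`, that is `{Null, Struct, Value}`, this yields a struct column with validity `0, 1, 0`, matching the example in the list above.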

## Performance comparison

### Tokenizer
The following is a comparison of the **JSON tokenizer** stage before this PR and after:

#### Before
```
# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples | CPU Time  | Noise | GPU Time  | Noise |  Elem/s  |
|-------------------|---------|-----------|-------|-----------|-------|----------|
|    2^20 = 1048576 |   2176x |  2.489 ms | 9.62% |  2.480 ms | 9.61% | 422.729M |
|    2^21 = 2097152 |   1936x |  2.501 ms | 7.14% |  2.492 ms | 7.12% | 841.482M |
|    2^22 = 4194304 |   1152x |  2.612 ms | 5.43% |  2.604 ms | 5.42% |   1.611G |
|    2^23 = 8388608 |   1456x |  2.855 ms | 4.26% |  2.847 ms | 4.23% |   2.947G |
|   2^24 = 16777216 |   1104x |  3.395 ms | 5.34% |  3.387 ms | 5.33% |   4.954G |
|   2^25 = 33554432 |    560x |  4.410 ms | 2.25% |  4.402 ms | 2.25% |   7.623G |
|   2^26 = 67108864 |   1552x |  6.482 ms | 2.23% |  6.473 ms | 2.22% |  10.367G |
|  2^27 = 134217728 |   1435x | 10.430 ms | 2.70% | 10.422 ms | 2.70% |  12.879G |
|  2^28 = 268435456 |    815x | 18.396 ms | 1.95% | 18.387 ms | 1.95% |  14.599G |
|  2^29 = 536870912 |     15x | 34.389 ms | 0.42% | 34.381 ms | 0.42% |  15.615G |
| 2^30 = 1073741824 |     11x | 66.097 ms | 0.20% | 66.088 ms | 0.20% |  16.247G |
```


#### After
```
# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  |
|-------------------|---------|------------|--------|------------|--------|----------|
|    2^20 = 1048576 |   1408x |   2.600 ms | 11.28% |   2.592 ms | 11.26% | 404.547M |
|    2^21 = 2097152 |    800x |   2.838 ms |  7.68% |   2.829 ms |  7.67% | 741.243M |
|    2^22 = 4194304 |   2752x |   3.719 ms |  9.24% |   3.710 ms |  9.23% |   1.130G |
|    2^23 = 8388608 |    128x |   4.855 ms |  3.38% |   4.846 ms |  3.37% |   1.731G |
|   2^24 = 16777216 |    720x |   7.029 ms |  4.67% |   7.021 ms |  4.66% |   2.390G |
|   2^25 = 33554432 |    832x |  10.760 ms |  3.83% |  10.751 ms |  3.83% |   3.121G |
|   2^26 = 67108864 |    576x |  17.961 ms |  2.86% |  17.953 ms |  2.86% |   3.738G |
|  2^27 = 134217728 |    461x |  32.550 ms |  2.13% |  32.542 ms |  2.13% |   4.124G |
|  2^28 = 268435456 |    243x |  61.813 ms |  1.60% |  61.805 ms |  1.60% |   4.343G |
|  2^29 = 536870912 |    125x | 120.445 ms |  1.21% | 120.437 ms |  1.21% |   4.458G |
| 2^30 = 1073741824 |     66x | 228.833 ms |  0.75% | 228.825 ms |  0.75% |   4.692G |
```
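
To read the two tokenizer tables side by side, it helps to recompute the throughput column and the relative slowdown. The helpers below are hypothetical and only restate the arithmetic; they assume `Elem/s` is the input size in bytes divided by the GPU time, which is consistent with the reported numbers.

```cpp
// Hypothetical helpers for interpreting the benchmark tables.

// Throughput in elements (bytes) per second for a given GPU time in milliseconds.
inline double elements_per_second(double bytes, double gpu_time_ms)
{
  return bytes / (gpu_time_ms * 1e-3);
}

// Ratio of the "after" time to the "before" time; values above 1 mean slower.
inline double slowdown(double before_ms, double after_ms) { return after_ms / before_ms; }
```

For the largest input (2^30 bytes), `elements_per_second(1073741824.0, 66.088)` gives about 16.2G and `slowdown(66.088, 228.825)` gives about 3.5, so the tokenizer takes roughly 3.5 times as long after this change.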


### JSON Parser

The overall parser is slightly slower (roughly 1% to 7% in CPU time across the sizes below), as it now also performs type conversion instead of just returning string columns.

#### Before
```
# Benchmark Results

## nested_json_gpu_parser

### [0] Tesla V100-SXM2-32GB

|    string_size    | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  |
|-------------------|---------|------------|-------|------------|-------|----------|
|    2^20 = 1048576 |   1040x |   7.361 ms | 5.61% |   7.353 ms | 5.61% | 142.614M |
|    2^21 = 2097152 |    832x |  11.549 ms | 3.63% |  11.541 ms | 3.63% | 181.708M |
|    2^22 = 4194304 |    740x |  20.264 ms | 2.98% |  20.257 ms | 2.98% | 207.054M |
|    2^23 = 8388608 |    407x |  36.844 ms | 2.26% |  36.837 ms | 2.26% | 227.724M |
|   2^24 = 16777216 |     80x |  75.590 ms | 1.95% |  75.582 ms | 1.95% | 221.974M |
|   2^25 = 33554432 |     80x | 179.442 ms | 4.40% | 179.434 ms | 4.40% | 187.001M |
|   2^26 = 67108864 |     40x | 379.821 ms | 0.98% | 379.815 ms | 0.98% | 176.688M |
|  2^27 = 134217728 |     20x | 777.351 ms | 1.72% | 777.347 ms | 1.72% | 172.661M |
|  2^28 = 268435456 |     10x |    1.550 s | 0.99% |    1.550 s | 0.99% | 173.212M |
|  2^29 = 536870912 |      5x |    3.055 s | 0.41% |    3.055 s | 0.41% | 175.749M |
| 2^30 = 1073741824 |      3x |    6.315 s |  inf% |    6.315 s |  inf% | 170.018M |
```

#### After
```
|    string_size    | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  |
|-------------------|---------|------------|-------|------------|-------|----------|
|    2^20 = 1048576 |   1568x |   7.908 ms | 5.24% |   7.900 ms | 5.24% | 132.730M |
|    2^21 = 2097152 |    576x |  12.235 ms | 3.24% |  12.228 ms | 3.24% | 171.509M |
|    2^22 = 4194304 |    192x |  21.171 ms | 2.09% |  21.164 ms | 2.09% | 198.182M |
|    2^23 = 8388608 |     96x |  38.990 ms | 1.96% |  38.983 ms | 1.96% | 215.188M |
|   2^24 = 16777216 |    192x |  78.414 ms | 2.21% |  78.407 ms | 2.21% | 213.977M |
|   2^25 = 33554432 |     81x | 187.007 ms | 6.47% | 187.000 ms | 6.47% | 179.435M |
|   2^26 = 67108864 |     38x | 400.007 ms | 1.59% | 400.000 ms | 1.59% | 167.772M |
|  2^27 = 134217728 |     19x | 801.575 ms | 1.29% | 801.571 ms | 1.29% | 167.443M |
|  2^28 = 268435456 |     10x |    1.590 s | 0.42% |    1.590 s | 0.42% | 168.799M |
|  2^29 = 536870912 |      5x |    3.150 s | 0.40% |    3.150 s | 0.40% | 170.456M |
| 2^30 = 1073741824 |      3x |    6.402 s |  inf% |    6.402 s |  inf% | 167.712M |
```


## Supported escape sequences:
```
\" represents the quotation mark character (U+0022).
\\ represents the reverse solidus character (U+005C).
\/ represents the solidus character (U+002F).
\b represents the backspace character (U+0008).
\f represents the form feed character (U+000C).
\n represents the line feed character (U+000A).
\r represents the carriage return character (U+000D).
\t represents the character tabulation character (U+0009).
\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, for code points in the BMP (Basic Multilingual Plane)
\uDDDD\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, representing UTF-16 surrogate pairs for the remaining Unicode code points
```
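
As an illustration of the last two entries, the sketch below combines a UTF-16 surrogate pair into a single code point and encodes it as UTF-8. It is a host-side sketch of the standard Unicode arithmetic, not the parser's device code; `combine_surrogates` and `to_utf8` are hypothetical names, and input validation (for example rejecting unpaired surrogates) is left out.

```cpp
#include <cstdint>
#include <string>

// Combines a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF)
// into the code point they represent. The inputs are assumed to be valid.
inline char32_t combine_surrogates(std::uint16_t high, std::uint16_t low)
{
  return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10) +
         (static_cast<char32_t>(low) - 0xDC00);
}

// Encodes a single code point as a UTF-8 byte sequence (1 to 4 bytes).
inline std::string to_utf8(char32_t cp)
{
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For example, the escaped pair `\uD83D\uDE00` combines to U+1F600, which takes four bytes in UTF-8, while a BMP escape such as `\u00E9` maps directly to its code point and takes two bytes.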

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11574
elstehle authored Sep 20, 2022
1 parent bf2c751 commit 0ba4675
Showing 6 changed files with 404 additions and 205 deletions.
31 changes: 31 additions & 0 deletions cpp/include/cudf/io/json.hpp
@@ -83,6 +83,9 @@ class json_reader_options {
// Whether to use the experimental reader
bool _experimental = false;

+// Whether to keep the quote characters of string values
+bool _keep_quotes = false;
+
/**
* @brief Constructor from source info.
*
@@ -203,6 +206,13 @@ class json_reader_options {
*/
bool is_enabled_experimental() const { return _experimental; }

+/**
+* @brief Whether the experimental reader should keep quotes of string values.
+*
+* @returns true if the experimental reader should keep quotes, false otherwise
+*/
+bool is_enabled_keep_quotes() const { return _keep_quotes; }
+
/**
* @brief Set data types for columns to be read.
*
@@ -258,6 +268,14 @@ class json_reader_options {
* @param val Boolean value to enable/disable the experimental reader
*/
void enable_experimental(bool val) { _experimental = val; }
+
+/**
+* @brief Set whether the experimental reader should keep quotes of string values.
+*
+* @param val Boolean value to indicate whether the experimental reader should keep quotes
+* of string values
+*/
+void enable_keep_quotes(bool val) { _keep_quotes = val; }
};

/**
@@ -377,6 +395,19 @@ class json_reader_options_builder {
return *this;
}

+/**
+* @brief Set whether the experimental reader should keep quotes of string values.
+*
+* @param val Boolean value to indicate whether the experimental reader should keep quotes
+* of string values
+* @return this for chaining
+*/
+json_reader_options_builder& keep_quotes(bool val)
+{
+options._keep_quotes = val;
+return *this;
+}
+
/**
* @brief move json_reader_options member once it's built.
*/
73 changes: 6 additions & 67 deletions cpp/src/io/json/nested_json.hpp
@@ -21,6 +21,7 @@
#include <cudf/types.hpp>
#include <cudf/utilities/bit.hpp>
#include <cudf/utilities/default_stream.hpp>
+#include <cudf/utilities/error.hpp>
#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_view.hpp>
@@ -127,6 +128,7 @@ struct json_column {
// Following "items" as the default child column's name of a list column
// Using the struct's field names
std::map<std::string, json_column> child_columns;
+std::vector<std::string> column_order;

// Counting the current number of items in this column
row_offset_t current_offset = 0;
@@ -142,46 +144,15 @@ struct json_column {
*
* @param up_to_row_offset The row offset up to which to fill with nulls.
*/
-void null_fill(row_offset_t up_to_row_offset)
-{
-// Fill all the rows up to up_to_row_offset with "empty"/null rows
-validity.resize(word_index(up_to_row_offset) + 1);
-std::fill_n(std::back_inserter(string_offsets),
-up_to_row_offset - string_offsets.size(),
-(string_offsets.size() > 0) ? string_offsets.back() : 0);
-std::fill_n(std::back_inserter(string_lengths), up_to_row_offset - string_lengths.size(), 0);
-std::fill_n(std::back_inserter(child_offsets),
-up_to_row_offset + 1 - child_offsets.size(),
-(child_offsets.size() > 0) ? child_offsets.back() : 0);
-current_offset = up_to_row_offset;
-}
+void null_fill(row_offset_t up_to_row_offset);

/**
* @brief Recursively iterates through the tree of columns making sure that all child columns of a
* struct column have the same row count, filling missing rows with nulls.
*
* @param min_row_count The minimum number of rows to be filled.
*/
-void level_child_cols_recursively(row_offset_t min_row_count)
-{
-// Fill this columns with nulls up to the given row count
-null_fill(min_row_count);
-
-// If this is a struct column, we need to level all its child columns
-if (type == json_col_t::StructColumn) {
-for (auto it = std::begin(child_columns); it != std::end(child_columns); it++) {
-it->second.level_child_cols_recursively(min_row_count);
-}
-}
-// If this is a list column, we need to make sure that its child column levels its children
-else if (type == json_col_t::ListColumn) {
-auto it = std::begin(child_columns);
-// Make that child column fill its child columns up to its own row count
-if (it != std::end(child_columns)) {
-it->second.level_child_cols_recursively(it->second.current_offset);
-}
-}
-}
+void level_child_cols_recursively(row_offset_t min_row_count);

/**
* @brief Appends the row at the given index to the column, filling all rows between the column's
@@ -195,42 +166,10 @@ struct json_column {
* the offsets
*/
void append_row(uint32_t row_index,
-json_col_t const& row_type,
+json_col_t row_type,
uint32_t string_offset,
uint32_t string_end,
-uint32_t child_count)
-{
-// If, thus far, the column's type couldn't be inferred, we infer it to the given type
-if (type == json_col_t::Unknown) { type = row_type; }
-
-// We shouldn't run into this, as we shouldn't be asked to append an "unknown" row type
-// CUDF_EXPECTS(type != json_col_t::Unknown, "Encountered invalid JSON token sequence");
-
-// Fill all the omitted rows with "empty"/null rows (if needed)
-null_fill(row_index);
-
-// Table listing what we intend to use for a given column type and row type combination
-// col type | row type => {valid, FAIL, null}
-// -----------------------------------------------
-// List | List => valid
-// List | Struct => FAIL
-// List | String => null
-// Struct | List => FAIL
-// Struct | Struct => valid
-// Struct | String => null
-// String | List => null
-// String | Struct => null
-// String | String => valid
-bool const is_valid = (type == row_type);
-if (static_cast<size_type>(validity.size()) < word_index(current_offset))
-validity.push_back({});
-set_bit_unsafe(&validity.back(), intra_word_index(current_offset));
-valid_count += (is_valid) ? 1U : 0U;
-string_offsets.push_back(string_offset);
-string_lengths.push_back(string_end - string_offset);
-child_offsets.push_back((child_offsets.size() > 0) ? child_offsets.back() + child_count : 0);
-current_offset++;
-};
+uint32_t child_count);
};

/**