Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds type inference and type conversion for leaf-columns to the nested JSON parser #11574

Merged
Merged
Show file tree
Hide file tree
Changes from 175 commits
Commits
Show all changes
226 commits
Select commit Hold shift + click to select a range
5b4a6fc
Squashed with initial test set
elstehle Mar 28, 2022
377358a
style fix & additional test scenario
elstehle Mar 29, 2022
4186004
removed forceinline
elstehle Mar 29, 2022
a921c66
tagging host device function
elstehle Mar 30, 2022
75a1853
Added utility to debug print & instrumented code to use it
elstehle Mar 31, 2022
a23668a
switched to using rmm also inside algorithm
elstehle Mar 31, 2022
aa5f5c4
header include order & SFINAE macro
elstehle Mar 31, 2022
4ee2253
debug print cleanups
elstehle Apr 4, 2022
0f35852
renaming key-value store op to stack_op
elstehle Apr 4, 2022
ca5d465
device_span
elstehle Apr 4, 2022
f5960bd
addressing review comments & minor cleanups
elstehle Apr 6, 2022
80226b7
error on unsupported unsigned_t and fixed typos
elstehle Apr 7, 2022
e8bc8a5
minor style changes addressing review comments
elstehle Apr 13, 2022
c5274b5
squashed with bracket/brace test
elstehle Apr 11, 2022
bb16254
clean up & addressing review comments
elstehle Apr 20, 2022
4e42d0e
refactored lookup tables
elstehle Apr 25, 2022
e439320
put lookup tables into their own cudf file
elstehle Apr 25, 2022
05840b3
Change interface for FST to not need temp storage
elstehle Apr 27, 2022
6da9360
removing unused var post-cleanup
elstehle May 4, 2022
702dfa1
unified usage of pragma unrolls
elstehle May 9, 2022
26a39ea
Adding hostdevice macros to in-reg array
elstehle May 9, 2022
8c685c0
making const vars const
elstehle May 9, 2022
5c94521
refactor lut sanity check
elstehle May 9, 2022
03b2c20
rebase on latest FST
elstehle May 3, 2022
ff22f19
squash & rebase on latest tokenizer version
elstehle May 13, 2022
365b839
start
vuule Jun 11, 2022
0a495ed
Merge branch 'fea-json-col-cast' of https://github.com/vuule/cudf int…
Jun 13, 2022
131b19b
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
vuule Jun 17, 2022
419d3c8
Add type inference test in CMake
PointKernel Jun 17, 2022
d2e2a0b
Merge remote-tracking branch 'upstream/branch-22.08' into type-inference
PointKernel Jun 17, 2022
f32b9b9
Add type inference prototype and basic test
PointKernel Jun 17, 2022
35bcde7
Code formatting
PointKernel Jun 17, 2022
ec856a7
rework API
vuule Jun 17, 2022
bf24ef5
tune up API; set up test inputs
vuule Jun 17, 2022
54cda73
comments
vuule Jun 17, 2022
9f2247f
add placeholder experimental JSON reader
vuule Jul 22, 2022
76b2834
doc fix
vuule Jul 22, 2022
f5464f6
copyright year
vuule Jul 22, 2022
a76d7bc
Merge remote-tracking branch 'upstream/branch-22.08' into type-inference
PointKernel Jul 25, 2022
fa2c3a6
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
vuule Jul 25, 2022
2ca0ac0
newline
vuule Jul 25, 2022
7ee4b41
Merge branch 'fea-read_json-experimental' of http://github.com/vuule/…
vuule Jul 25, 2022
3ee7a5a
use span
vuule Jul 25, 2022
fa9ab89
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
vuule Jul 26, 2022
fcc90c5
options check + decompression
vuule Jul 27, 2022
9d272fb
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 8, 2022
3380a10
Merge branch 'fea-json-col-cast' of http://github.com/vuule/cudf into…
vuule Aug 8, 2022
83c12d5
basic string column creation
vuule Aug 9, 2022
84210c8
basic conversion
vuule Aug 10, 2022
1db20e1
remove duplicated `decode_value`s
vuule Aug 10, 2022
61e42ec
more redundancy removed
vuule Aug 10, 2022
db505d8
fixed width null/true/false support
vuule Aug 11, 2022
179a627
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 11, 2022
988f788
skip input nulls
vuule Aug 11, 2022
f69c7ea
pass options
vuule Aug 11, 2022
41c221f
basic string column rewrite
vuule Aug 12, 2022
5d73a8a
string null literal + null pass throught
vuule Aug 12, 2022
b481d66
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 12, 2022
bdbc111
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 13, 2022
22b5a46
adds support for ndjson
elstehle Aug 15, 2022
907a846
Merge remote-tracking branch 'upstream/branch-22.10' into feature/nes…
elstehle Aug 15, 2022
87fce7d
addresses outstanding todo
elstehle Aug 15, 2022
a88c9dd
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 15, 2022
c41f178
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 15, 2022
9669c6a
C++ side changes + test
vuule Aug 15, 2022
c9fb5b2
working Python + test
vuule Aug 15, 2022
2de91e1
clean up
vuule Aug 16, 2022
70dd9b1
stop using column_names
vuule Aug 16, 2022
b1afef0
adds documentation for mr parameter
elstehle Aug 16, 2022
8409214
minor documentation fixes
elstehle Aug 16, 2022
d0e0def
fixes parameter order
elstehle Aug 16, 2022
6294dc8
adds option to include quote chars for string values
elstehle Aug 16, 2022
574ac43
fix copy-paste error
vuule Aug 16, 2022
2de80b7
raw string
vuule Aug 16, 2022
81cab67
Merge branch 'exp-read_json-adapter' of http://github.com/vuule/cudf …
vuule Aug 16, 2022
bc14a1d
remove print in Python test
vuule Aug 16, 2022
bca2e83
addressing reviews
vuule Aug 16, 2022
ba28571
Java fix
vuule Aug 16, 2022
a6d5ab7
style
vuule Aug 16, 2022
397e00f
Merge remote-tracking branch 'upstream/pull-request/11364' into featu…
elstehle Aug 17, 2022
a0bd229
integrates upstream interface changes
elstehle Aug 17, 2022
c822942
Merge remote-tracking branch 'fea-json-col-cast' into feature/json-pr…
elstehle Aug 17, 2022
46a3c44
Merge remote-tracking branch 'upstream/branch-22.10' into feature/nes…
elstehle Aug 17, 2022
f3bba9d
enables lines option in the nested reader
elstehle Aug 17, 2022
21b4023
migrates test from details api to reader api
elstehle Aug 17, 2022
cdc4441
improves code comment
elstehle Aug 17, 2022
836e0d1
Merge branch 'feature/nested-json-lines' into feature/json-primitive-…
elstehle Aug 18, 2022
7479b63
adds inference and type conversion
elstehle Aug 19, 2022
29c6525
Merge remote-tracking branch 'upstream/branch-22.10' into type-inference
PointKernel Aug 19, 2022
a659817
Move type inference to utilities
PointKernel Aug 19, 2022
4410488
Resolve conflicts + relocate type inference test file
PointKernel Aug 19, 2022
6409a5f
Get rid of narrow conversion + add string handling
PointKernel Aug 19, 2022
ec07bca
Updates: make column string iter compatible with zip iterator
PointKernel Aug 19, 2022
640eb00
Minor updates
PointKernel Aug 19, 2022
b0fac83
Add missing header
PointKernel Aug 19, 2022
67fcaf5
Fix the infinite loop bug with while
PointKernel Aug 19, 2022
51997be
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Aug 20, 2022
a5e50d6
patches data casting for escape handling
elstehle Aug 22, 2022
b2805a5
Merge remote-tracking branch 'upstream/pull-request/11121' into featu…
elstehle Aug 22, 2022
fe70ac2
resolves downstream inference conflicts
elstehle Aug 22, 2022
779b638
removes debug prints from casting
elstehle Aug 22, 2022
05b506f
removes local test
elstehle Aug 22, 2022
bcf4b86
adds new logic for inferring nested columns
elstehle Aug 23, 2022
ff87f3b
fixes issue for two subsequent non-UTF-16 unicode esc sequences
elstehle Aug 23, 2022
123cb69
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Aug 25, 2022
052fdfb
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Aug 26, 2022
e2fae02
resolves merge conflicts
elstehle Aug 26, 2022
872c332
fixes nullable behaviour to match nested json reader
elstehle Aug 26, 2022
f123118
migrates test to pytest
elstehle Aug 26, 2022
b23311d
fixes pytest style
elstehle Aug 26, 2022
d1949ca
adds option to keep quotes for string values
elstehle Aug 26, 2022
6c42351
fixes doxygen
elstehle Aug 26, 2022
b48da1a
parses to rows to null for failing value casting
elstehle Aug 27, 2022
7fcfcbb
adds test for escape sequences
elstehle Aug 27, 2022
eceefe9
adds string value handling
elstehle Aug 27, 2022
ae23d21
Merge remote-tracking branch 'upstream/branch-22.10' into feature/gen…
elstehle Aug 27, 2022
e4abc5b
Merge remote-tracking branch 'upstream/branch-22.10' into feature/gen…
elstehle Aug 27, 2022
70d597d
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Aug 27, 2022
5b60cd8
Merge branch 'feature/generic-type-casting' into feature/leaf-column-…
elstehle Aug 27, 2022
9d71c85
adds NVTX range annotation
elstehle Aug 29, 2022
27643af
moves unicode escape tests to type conversion
elstehle Aug 29, 2022
d51337a
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Aug 29, 2022
39710de
Merge remote-tracking branch 'upstream/branch-22.10' into feature/gen…
elstehle Aug 29, 2022
1778bdb
cudf default stream and escape seq tests
elstehle Aug 30, 2022
69048f4
test case for null string versus null literal
elstehle Aug 30, 2022
955f480
Merge remote-tracking branch 'upstream/branch-22.10' into type-inference
PointKernel Aug 30, 2022
cc1a04c
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Aug 30, 2022
756a3e2
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Aug 30, 2022
8961126
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Aug 30, 2022
bb69c14
Minor doc updates
PointKernel Aug 30, 2022
baff56f
Merge branch 'type-inference' of github.com:PointKernel/cudf into typ…
PointKernel Aug 30, 2022
ec593ba
Update docs
PointKernel Aug 30, 2022
015d83d
fixes cmake order
elstehle Aug 31, 2022
c0db3cb
refactors string handling
elstehle Aug 31, 2022
44daa5e
fixes corner case for non-surrogate pair unicode escape
elstehle Aug 31, 2022
bc273ec
removes empty src file and improves comments
elstehle Aug 31, 2022
4afbd8b
few style changes
elstehle Sep 1, 2022
8b600aa
avoids superfluous decode definitions
elstehle Sep 1, 2022
695aefc
using CUDF_ENABLE_IF macro
elstehle Sep 1, 2022
f625a7b
fixes a few remaining west consts
elstehle Sep 1, 2022
ce5ab55
adds test coverage for remaining esc sequences
elstehle Sep 1, 2022
bb82f7e
adds more comments
elstehle Sep 1, 2022
b431641
fixes style
elstehle Sep 1, 2022
8d50099
moves data_casting to include
elstehle Sep 1, 2022
04154fb
Merge remote-tracking branch 'upstream/branch-22.10' into feature/gen…
elstehle Sep 1, 2022
409d5b5
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Sep 1, 2022
c6931c5
Merge branch 'feature/generic-type-casting' into feature/leaf-column-…
elstehle Sep 1, 2022
a1c1872
Merge remote-tracking branch 'upstream/branch-22.10' into feature/gen…
elstehle Sep 1, 2022
975f30c
more cleanups
elstehle Sep 1, 2022
70efb99
switches to return struct instead of tuple
elstehle Sep 2, 2022
4d708d1
Merge branch 'feature/generic-type-casting' into feature/leaf-column-…
elstehle Sep 2, 2022
fb78f6b
integrates upstream interface changes
elstehle Sep 2, 2022
9f27a86
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Sep 2, 2022
73b20f5
minor fixes and clarifications
elstehle Sep 2, 2022
95ea2d4
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Sep 5, 2022
0164396
rewrites switch statement in tokenizer stage
elstehle Sep 5, 2022
053eca8
fixes column order for deeply nested JSON
elstehle Sep 6, 2022
2edccc8
Merge remote-tracking branch 'upstream/branch-22.10' into type-inference
PointKernel Sep 6, 2022
b461fd8
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Sep 6, 2022
c2b0393
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Sep 6, 2022
529d5f2
Cleanups: comment, remove unused header + use const var
PointKernel Sep 6, 2022
630c94c
Merge branch 'type-inference' of github.com:PointKernel/cudf into typ…
PointKernel Sep 6, 2022
9f4907c
Code formatting
PointKernel Sep 6, 2022
be98e84
Use fixed-width integers
PointKernel Sep 6, 2022
f864f19
Add omission null count handling
PointKernel Sep 6, 2022
ea8493a
Update tests
PointKernel Sep 6, 2022
5b46d33
Add null test
PointKernel Sep 6, 2022
0d81a33
Add more tests
PointKernel Sep 6, 2022
255b013
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Sep 7, 2022
9c112ee
schema meta data for non-string leaf columns
elstehle Sep 7, 2022
e3bb216
Cleanups: renaming, const parameter + namespace
PointKernel Sep 7, 2022
b71f665
Code cleanup
PointKernel Sep 7, 2022
0a56800
Cleanup: use parse_options consistently
PointKernel Sep 7, 2022
b19b8f5
Merge remote-tracking branch 'upstream/pull-request/11121' into featu…
elstehle Sep 7, 2022
717eacd
resolves upstream interface changes
elstehle Sep 7, 2022
07b2fff
Update cpp/tests/io/type_inference_test.cu
PointKernel Sep 9, 2022
5ecff99
Revert ommited null changes
PointKernel Sep 9, 2022
9bfbf38
Merge branch 'type-inference' of github.com:PointKernel/cudf into typ…
PointKernel Sep 9, 2022
ee588dd
Minor cleanups
PointKernel Sep 9, 2022
be717a7
Add all null test
PointKernel Sep 9, 2022
168efdd
Renaming json inference options view struct
PointKernel Sep 9, 2022
eb930a2
Minor improvement: use block reduce to minimize global atomic
PointKernel Sep 9, 2022
6e5062f
Minor cleanup
PointKernel Sep 9, 2022
223cd37
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Sep 9, 2022
13fb10d
Code formatting
PointKernel Sep 9, 2022
32bbb00
Merge remote-tracking branch 'upstream/pull-request/11121' into featu…
elstehle Sep 12, 2022
a1ca7fa
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Sep 12, 2022
52e79af
omissions in inference as zero-length strings
elstehle Sep 12, 2022
ee65f28
Treat all date-like input as string
PointKernel Sep 12, 2022
114c7e8
Merge remote-tracking branch 'upstream/pull-request/11121' into featu…
elstehle Sep 12, 2022
0a7bdfa
Add invalid input test
PointKernel Sep 12, 2022
0ba1e26
adds explicit check for mix of struct and list values in same col
elstehle Sep 14, 2022
47f2c14
renames keep_quotes option
elstehle Sep 14, 2022
0c48547
renames helper function for parse options
elstehle Sep 14, 2022
3839329
fixes keep_quotes in tests
elstehle Sep 14, 2022
fb9ac99
Reinforce string condition
PointKernel Sep 14, 2022
c595ac0
Use per-thread histogram with custom sum reduction
PointKernel Sep 14, 2022
107c0cc
Use string scalar instead of char array
PointKernel Sep 14, 2022
80840b1
Add default member initializer to column_type_histogram
PointKernel Sep 14, 2022
8deec54
Minor updates
PointKernel Sep 14, 2022
b0be37b
removes superfluous raw from string
elstehle Sep 14, 2022
6afea15
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Sep 14, 2022
1185b03
fixes pytest by accounting for pyarrow misbehaving
elstehle Sep 14, 2022
55f9366
adds pytest for more types
elstehle Sep 14, 2022
c0123cf
style fixes
elstehle Sep 14, 2022
1a79f37
Merge remote-tracking branch 'upstream/branch-22.10' into type-inference
PointKernel Sep 14, 2022
6bb002e
Merge remote-tracking branch 'upstream/pull-request/11121' into featu…
elstehle Sep 15, 2022
e1fde8e
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Sep 15, 2022
e627211
use sparse feature of logical stack
elstehle Sep 15, 2022
6cef4eb
moves from hostdevice_v to d_scalar
elstehle Sep 15, 2022
b7e5eb6
adds test for nested column order
elstehle Sep 15, 2022
36bf571
fixes style
elstehle Sep 15, 2022
206ed8d
removes giving names to trivial values
elstehle Sep 16, 2022
167f215
more tests on nested column inference
elstehle Sep 16, 2022
226f42d
removes unused var
elstehle Sep 16, 2022
c70831e
moves json_col member function definitions to source
elstehle Sep 16, 2022
633f57b
pass by values for simple type
elstehle Sep 16, 2022
dd5224f
Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…
elstehle Sep 19, 2022
c5750ba
style fix
elstehle Sep 19, 2022
e5050e6
revert style change
elstehle Sep 19, 2022
4c1d96c
removes cudf_fail in favor of cudf_expects
elstehle Sep 19, 2022
568cc53
re-enables extra logical check
elstehle Sep 19, 2022
74445ab
fixes comment typos
elstehle Sep 19, 2022
2bfb987
canonical way for returning empty null mask
elstehle Sep 19, 2022
028d5ac
removes debug prints from pytest
elstehle Sep 19, 2022
95d6f3c
Merge branch 'branch-22.10' into feature/leaf-column-type-conversion
karthikeyann Sep 19, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 31 additions & 0 deletions cpp/include/cudf/io/json.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,9 @@ class json_reader_options {
// Whether to use the experimental reader
bool _experimental = false;

// Whether to keep the quote characters of string values
bool _keep_quotes = false;

/**
* @brief Constructor from source info.
*
Expand Down Expand Up @@ -203,6 +206,13 @@ class json_reader_options {
*/
bool is_enabled_experimental() const { return _experimental; }

/**
* @brief Whether the experimental reader should keep quotes of string values.
*
* @returns true if the experimental reader should keep quotes, false otherwise
*/
bool is_keeping_quotes() const { return _keep_quotes; }
elstehle marked this conversation as resolved.
Show resolved Hide resolved

/**
* @brief Set data types for columns to be read.
*
Expand Down Expand Up @@ -258,6 +268,14 @@ class json_reader_options {
* @param val Boolean value to enable/disable the experimental reader
*/
void enable_experimental(bool val) { _experimental = val; }

/**
* @brief Set whether the experimental reader should keep quotes of string values.
*
* @param val Boolean value to indicate whether the experimental reader should keep quotes
* of string values
*/
void keep_quotes(bool val) { _keep_quotes = val; }
elstehle marked this conversation as resolved.
Show resolved Hide resolved
};

/**
Expand Down Expand Up @@ -377,6 +395,19 @@ class json_reader_options_builder {
return *this;
}

/**
* @brief Set whether the experimental reader should keep quotes of string values.
*
* @param val Boolean value to indicate whether the experimental reader should keep quotes
* of string values
* @return this for chaining
*/
json_reader_options_builder& keep_quotes(bool val)
{
options._keep_quotes = val;
return *this;
}

/**
* @brief move json_reader_options member once it's built.
*/
Expand Down
2 changes: 1 addition & 1 deletion cpp/src/io/fst/agent_dfa.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -422,7 +422,7 @@ struct AgentDFA {
OffsetT const num_total_symbols,
StateIndexT& state,
CallbackOpT& callback_op,
cub::Int2Type<BYPASS_LOAD> /**/)
cub::Int2Type<BYPASS_LOAD>)
{
using StateTransitionOpT = StateTransitionOp<CallbackOpT, TransitionTableT>;

Expand Down
23 changes: 19 additions & 4 deletions cpp/src/io/json/nested_json.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,7 @@ struct json_column {
// Following "items" as the default child column's name of a list column
// Using the struct's field names
std::map<std::string, json_column> child_columns;
std::vector<std::string> column_order;

// Counting the current number of items in this column
row_offset_t current_offset = 0;
Expand Down Expand Up @@ -199,7 +200,21 @@ struct json_column {
uint32_t child_count)
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
{
// If, thus far, the column's type couldn't be inferred, we infer it to the given type
if (type == json_col_t::Unknown) { type = row_type; }
if (type == json_col_t::Unknown) {
type = row_type;
}
// If, at some point within a column, we encounter a nested type (list or struct),
// we change that columns type to that respective nested type and invalidate all previous rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we at least warn about this case? It seems easy to lose data that way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any warning mechanisms in libcudf (AFAIK). There's a plan to add logging similar (or same) to what RMM has.

else if (type == json_col_t::StringColumn &&
(row_type == json_col_t::ListColumn || row_type == json_col_t::StructColumn)) {
// Change the column type
type = row_type;

// Invalidate all previous entries, as they were _not_ of the nested type to which we just
// converted
std::fill_n(validity.begin(), validity.size(), 0);
valid_count = 0U;
}

// We shouldn't run into this, as we shouldn't be asked to append an "unknown" row type
// CUDF_EXPECTS(type != json_col_t::Unknown, "Encountered invalid JSON token sequence");
Expand All @@ -216,13 +231,13 @@ struct json_column {
// Struct | List => FAIL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this "FAIL" happen?

Copy link
Contributor Author

@elstehle elstehle Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, let me add a more explicit check and message for these two cases and also a test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out, I've added explicit check and error message and test for this case!

// Struct | Struct => valid
// Struct | String => null
// String | List => null
// String | Struct => null
// String | List => valid (we switch col type to list, null'ing all previous rows)
// String | Struct => valid (we switch col type to list, null'ing all previous rows)
// String | String => valid
bool const is_valid = (type == row_type);
if (static_cast<size_type>(validity.size()) < word_index(current_offset))
validity.push_back({});
set_bit_unsafe(&validity.back(), intra_word_index(current_offset));
if (is_valid) { set_bit_unsafe(&validity.back(), intra_word_index(current_offset)); }
valid_count += (is_valid) ? 1U : 0U;
string_offsets.push_back(string_offset);
string_lengths.push_back(string_end - string_offset);
Expand Down
Loading