Introduce benchmark suite for JSON reader options #15124

shrshi · 2024-02-23T01:42:26Z

Description

The goal of this piece of work is to analyze the performance of the reader for JSON lines. This PR establishes a baseline for the performance of the single quote normalization, whitespace normalization, mixed types as string parsing, and recovery mode options when the input JSON is valid and contains no single quotes.
Modifying the data generation to produce inputs with single quotes/mixed types/invalid lines will be the focus of follow-on PRs.
Addresses #15041
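
To make the baseline input concrete, here is a minimal sketch of the kind of data the description refers to: valid JSON lines with no single quotes. The helper name `make_json_lines` and the column names `a` and `b` are hypothetical and chosen only for illustration; the benchmark in this PR uses its own data generation, which is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the PR): builds `num_rows` JSON lines
// records, one object per line, using double quotes only.
std::string make_json_lines(std::size_t num_rows)
{
  std::string out;
  for (std::size_t i = 0; i < num_rows; ++i) {
    // Each record has an integer column "a" and a string column "b".
    out += "{\"a\": " + std::to_string(i) + ", \"b\": \"row" + std::to_string(i) + "\"}\n";
  }
  // The baseline input must not contain any single quote characters.
  assert(out.find('\'') == std::string::npos);
  return out;
}

// Returns true when the buffer contains no single quote characters.
bool has_no_single_quotes(std::string const& buf)
{
  return buf.find('\'') == std::string::npos;
}
```

With input like this, the normalization options have nothing to rewrite, so any measured difference reflects the cost of enabling the option rather than the cost of fixing the data.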

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

GregoryKimball · 2024-02-28T17:27:21Z

cpp/benchmarks/io/json/json_reader_option.cpp

+  auto const view = tbl->view();
+  cudf::io::json_writer_options const write_opts =
+    cudf::io::json_writer_options::builder(source_sink.make_sink_info(), view)
+      .lines(true)


While we are testing the options, I would recommend making lines into an nvbench enum axis as well. However, since some of the options require lines to be true, benchmarking lines true/false should perhaps be a separate benchmark.

nvbench::enum_type_list<row_selection::ALL, row_selection::BYTE_RANGE>, lines=True
nvbench::enum_type_list<normalize_single_quotes::NO, normalize_single_quotes::YES>, lines=True/False
nvbench::enum_type_list<mixed_types_as_string::NO, mixed_types_as_string::YES>, lines=True/False
nvbench::enum_type_list<recovery_mode::RECOVER_WITH_NULL, recovery_mode::FAIL>, lines=True

After thinking through the options, I don't think we need to test normalize_single_quotes and mixed_types_as_string with lines=false. It still might be useful to add a lines true/false benchmark without any additional options. If others agree then that could be a follow-on PR.
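
The idea of skipping option combinations that are not worth benchmarking can be sketched in plain C++, without nvbench. This is only an illustration under stated assumptions: the `config` struct and the `is_valid` rule are hypothetical, and the rule (byte range selection and `RECOVER_WITH_NULL` are only paired with lines=true) is read off the list above rather than taken from the reader's implementation.

```cpp
#include <cassert>
#include <vector>

enum class row_selection { ALL, BYTE_RANGE };
enum class recovery_mode { FAIL, RECOVER_WITH_NULL };

// Hypothetical description of one benchmark configuration.
struct config {
  bool lines;
  row_selection rows;
  recovery_mode recovery;
};

// Assumption from the list above: BYTE_RANGE and RECOVER_WITH_NULL are only
// benchmarked together with lines=true.
bool is_valid(config const& c)
{
  if (c.lines) { return true; }
  return c.rows == row_selection::ALL && c.recovery == recovery_mode::FAIL;
}

// Enumerates the full cross product and keeps only the valid configurations.
std::vector<config> enumerate_configs()
{
  std::vector<config> out;
  for (bool lines : {true, false}) {
    for (auto rows : {row_selection::ALL, row_selection::BYTE_RANGE}) {
      for (auto rec : {recovery_mode::FAIL, recovery_mode::RECOVER_WITH_NULL}) {
        config const c{lines, rows, rec};
        if (is_valid(c)) { out.push_back(c); }
      }
    }
  }
  assert(!out.empty());
  return out;
}
```

Under this rule the cross product of eight configurations shrinks to five: four with lines=true and one with lines=false. In an actual benchmark the same predicate would decide which axis combinations to skip.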

GregoryKimball · 2024-02-28T17:40:07Z

Would you please post some early benchmark results? (similar to this example)

cpp/benchmarks/io/json/json_reader_option.cpp

shrshi · 2024-02-29T19:38:53Z

Benchmarks were run on an A100 80GB GPU, with all combinations of the normalize_single_quotes, row_selection, recovery_mode, and mixed_types_as_string options enabled/disabled.

The figure above shows a slight drop in performance when byte range reading is enabled. Note that none of the runs above enable the mixed_types_as_string option.

Figure showing significant performance degradation when mixed_types_as_string is enabled.
Performance tracking issue

@karthikeyann

…n is enabled (#15236)

Addresses #15196 by applying a patch from @karthikeyann to skip the `infer_column_type_kernel` by forcing the mixed types column to be a string. With this optimization, we see a significant improvement in performance. Please refer to the [comment](#15236 (comment)) for a visualization of the results before and after applying this optimization as obtained from the [JSON lines benchmarking exercise](#15124).

Authors:
- Shruti Shivakumar (https://github.com/shrshi)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #15236

robertmaynard

Approving CMake changes

vuule

Great stuff!
Some minor comments/questions.

cpp/benchmarks/io/json/json_reader_option.cpp

shrshi · 2024-03-11T19:58:44Z

After updates to the mixed types as string handling and the byte range reader, here are the most recent benchmarks. Note that because of the data generation bug (the large last row issue), the results below may change once that bug is fixed.

row_selection=ALL, recovery_mode=FAIL

row_selection=ALL, recovery_mode=RECOVER_WITH_NULL

row_selection=BYTE_RANGE, recovery_mode=FAIL

row_selection=BYTE_RANGE, recovery_mode=RECOVER_WITH_NULL

GregoryKimball · 2024-03-12T17:49:32Z

These plots are great! Please check out #15185 for an example of where the byte range support is not working optimally.

PointKernel

One non-blocking nit.

Very clean code, nice work!

cpp/benchmarks/io/json/json_reader_option.cpp

Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>

shrshi · 2024-03-19T21:11:55Z

/ok to test

ttnghia · 2024-03-20T05:57:08Z

cpp/benchmarks/io/json/json_reader_option.cpp

+template <row_selection RowSelection,
+          normalize_single_quotes NormalizeSingleQuotes,
+          normalize_whitespace NormalizeWhitespace,
+          mixed_types_as_string MixedTypesAsString,
+          recovery_mode RecoveryMode>


Oh no, this is too many template params. Why not use run-time parameters instead? That would reduce compile time a lot.

shrshi · 2024-04-08

I followed the design for benchmarking reader options in ORC and Parquet. Would we have to modify those benchmarks as well to maintain a similar design?

PointKernel · 2024-04-08

"Oh no, this is too many template params. Why not use run-time parameters instead?"

This makes results more readable, and the build time for benchmarks doesn't really matter IMO.

ttnghia · 2024-04-08

Yeah, that's fine with me 👍
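
For readers following the trade-off discussed in this thread, here is a minimal sketch contrasting the two styles. It is not the benchmark code: `reader_config`, `make_config_static`, `make_config_dynamic`, and `describe` are hypothetical names, and only two of the five options are shown. With compile-time parameters each combination becomes its own instantiation, so five two-valued options would give 2^5 = 32 instantiations, while a run-time version compiles once.

```cpp
#include <cassert>
#include <string>

enum class row_selection { ALL, BYTE_RANGE };
enum class recovery_mode { FAIL, RECOVER_WITH_NULL };

// Hypothetical bundle of reader options used only for this illustration.
struct reader_config {
  row_selection rows;
  recovery_mode recovery;
};

// Compile-time style: every (RowSelection, RecoveryMode) pair is a separate
// template instantiation, which costs build time but names each case.
template <row_selection RowSelection, recovery_mode RecoveryMode>
reader_config make_config_static()
{
  return {RowSelection, RecoveryMode};
}

// Run-time style: a single function handles every combination.
reader_config make_config_dynamic(row_selection rows, recovery_mode recovery)
{
  return {rows, recovery};
}

// Produces a label in the same form as the plot titles in this thread.
std::string describe(reader_config const& c)
{
  std::string const rows = (c.rows == row_selection::ALL) ? "ALL" : "BYTE_RANGE";
  std::string const rec  = (c.recovery == recovery_mode::FAIL) ? "FAIL" : "RECOVER_WITH_NULL";
  return "row_selection=" + rows + ", recovery_mode=" + rec;
}

// Usage check: both styles describe the same configuration.
void check_equivalence()
{
  auto const a = make_config_static<row_selection::BYTE_RANGE, recovery_mode::FAIL>();
  auto const b = make_config_dynamic(row_selection::BYTE_RANGE, recovery_mode::FAIL);
  assert(describe(a) == describe(b));
}
```

Both styles yield the same configuration, so the choice mostly affects build time and how results are labeled, which matches the conclusion reached in the replies above.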

shrshi · 2024-04-09T00:50:33Z

/merge

The goal of this piece of work is to analyze the performance of the reader for JSON lines. This PR establishes a baseline for the performance of single quote normalization, white space normalization, mixed type as string parsing and recovery mode options when the input JSON is valid, and does not have any single quotes. Modifying the data generation to produce inputs with single quotes/mixed types/invalid lines will be the focus of follow-on PRs.

Addresses rapidsai#15041

Authors:
- Shruti Shivakumar (https://github.com/shrshi)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#15124

simple json lines benchmark - no fancy data generation

b2a7c45

github-actions bot added the libcudf and CMake labels Feb 23, 2024

shrshi added the non-breaking, 2 - In Progress, and feature request labels and removed the libcudf and CMake labels Feb 23, 2024

style fixes

bc42243

github-actions bot added the libcudf and CMake labels Feb 23, 2024

shrshi added 2 commits February 28, 2024 02:25

added lines; table size as param

f2dd994

Merge branch 'branch-24.04' into json-benchmark

2ae3633

GregoryKimball reviewed Feb 28, 2024

vuule reviewed Feb 29, 2024

shrshi mentioned this pull request Feb 29, 2024

[PERF] Performance impact of mixed_type_as_string JSON reader option in reading JSON lines #15196

Closed

Merge branch 'branch-24.04' into json-benchmark

ea4d9b7

shrshi mentioned this pull request Mar 5, 2024

Improve performance in JSON reader when mixed_types_as_string option is enabled #15236

Merged

3 tasks

GregoryKimball assigned shrshi Mar 6, 2024

Merge branch 'branch-24.04' into json-benchmark

56317bd

GregoryKimball changed the title ~~[WIP] JSON lines benchmark - studying reader options~~ [WIP] Introduce benchmark suite for JSON reader options Mar 7, 2024

shrshi added 6 commits March 7, 2024 20:31

Merge branch 'branch-24.04' into json-benchmark

a1bc431

partial work commit

5c0de81

adding whitespace normalization and lines axes

72b070d

separated out the lines benchamrk

e04db8b

Merge branch 'json-benchmark' of github.com:shrshi/cudf into json-benchmark

b470334

Merge branch 'branch-24.04' into json-benchmark

dc799c2

shrshi marked this pull request as ready for review March 8, 2024 01:42

shrshi requested review from robertmaynard and karthikeyann March 8, 2024 01:42

shrshi changed the title ~~[WIP] Introduce benchmark suite for JSON reader options~~ Introduce benchmark suite for JSON reader options Mar 8, 2024

robertmaynard approved these changes Mar 11, 2024

shrshi added 2 commits March 11, 2024 17:41

skipping some param configs

f926172

Merge branch 'json-benchmark' of github.com:shrshi/cudf into json-benchmark

6c75aee

vuule reviewed Mar 11, 2024

Merge branch 'branch-24.04' into json-benchmark

14435e0

partially addressing PR reviews

5f331e2

GregoryKimball mentioned this pull request Mar 13, 2024

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open

Merge branch 'branch-24.04' into json-benchmark

6c3604b

shrshi requested a review from vuule March 19, 2024 18:26

vuule approved these changes Mar 19, 2024

PointKernel approved these changes Mar 19, 2024

Update cpp/benchmarks/io/json/json_reader_option.cpp

5661a4a

Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>

shrshi added 3 commits March 19, 2024 19:17

Merge branch 'branch-24.04' into json-benchmark

12cfaba

Merge branch 'json-benchmark' of github.com:shrshi/cudf into json-benchmark

300753c

formatting fix

3a79368

ttnghia reviewed Mar 20, 2024

shrshi changed the base branch from branch-24.04 to branch-24.06 April 8, 2024 19:34

Merge branch 'branch-24.06' into json-benchmark

66552a5

ttnghia approved these changes Apr 8, 2024

rapids-bot bot merged commit 1862cdc into rapidsai:branch-24.06 Apr 9, 2024
70 checks passed

bdice mentioned this pull request Apr 17, 2024

Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. #15553

Merged

3 tasks
