Feature/json to columnar #2

Closed
wants to merge 111 commits into from

Conversation

elstehle
Owner

@elstehle elstehle commented Jul 29, 2022

Temporary PR serving as a surrogate for rapidsai#11388, so that the diff can be viewed properly, and optionally also as a forum until rapidsai#11264 is merged and we can have a proper diff on the PR open at rapidsai/cudf: rapidsai#11388

raydouglass and others added 30 commits July 22, 2022 10:53
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
Closes rapidsai#10296

These _should_ actually just work if the following PRs get merged, after which this diff might be really small:

rapidsai#10815
rapidsai#10838
dask/dask#9074

Authors:
  - https://github.com/brandon-b-miller
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: rapidsai#10889
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
+ fills/levels child columns
…w group. (rapidsai#11353)

There is a particularly odd corner case that can be constructed where a column in a parquet file has more rows in it than the associated row group specifies.  Previously we were inadvertently handling this, however this optimization broke that support:

rapidsai#11252

The solution is to cap the size of any non-list-child columns to the size of the selected row groups.

<s>Leaving this as a draft while the changes percolate through the spark tests.</s>

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#11353
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
This PR fixes timeouts like the following that are happening across all our s3 tests in code-base:
```python
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/lib/python3.9/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/..pyenv/versions/3.9.6/lib/python3.9/http/client.py", line 1349, in getresponse
    response.begin()
  File "/..pyenv/versions/3.9.6/lib/python3.9/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/..pyenv/versions/3.9.6/lib/python3.9/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/..pyenv/versions/3.9.6/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
```
This seems to be an issue with the `werkzeug` package, which was updated 1 day ago. Until there is an upstream fix, we need to pin this package temporarily.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#11369
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
I wanted to get this up in a PR of its own so we could get some discussion going if necessary. This adds a `byte_array_view`, which is almost identical to a `string_view`. The goal is to be able to get these for list columns of bytes, so `list<uint8>` and `list<int8>`. I didn't template it on the type, but instead selected `uint8_t` because `std::byte` is backed by an unsigned 8-bit type, like `uint8_t`. My PR for writing byte arrays in parquet will use this to get the rows of byte data for statistics and writing. That PR is forthcoming. I left this code down in cuio statistics due to the usage and the previous discussions regarding `.element`. I needed to wrap the `device_span` because I need comparison operators for the cub reduce.
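To illustrate why a byte view needs comparison operators before it can feed a min/max reduction, here is a minimal Python sketch. The class name `ByteArrayView` and the helper `min_max` are hypothetical stand-ins, not the libcudf types; the sketch assumes rows are compared lexicographically as unsigned 8-bit values.

```python
from functools import reduce, total_ordering


@total_ordering
class ByteArrayView:
    """Hypothetical stand-in for a non-owning view over one row of bytes."""

    def __init__(self, data: bytes):
        self._data = memoryview(data)

    def __eq__(self, other):
        return self._data.tobytes() == other._data.tobytes()

    def __lt__(self, other):
        # bytes compare lexicographically as unsigned 8-bit values;
        # a proper prefix orders before the longer sequence.
        return self._data.tobytes() < other._data.tobytes()

    def tobytes(self) -> bytes:
        return self._data.tobytes()


def min_max(rows):
    """Reduce the rows of a list<uint8>-like column to (min, max)."""
    views = [ByteArrayView(r) for r in rows]
    lo = reduce(lambda a, b: b if b < a else a, views)
    hi = reduce(lambda a, b: b if b > a else a, views)
    return lo.tobytes(), hi.tobytes()
```

With unsigned comparison, a row starting with `0xff` sorts last; a signed interpretation would place it first, which is one reason to fix the element type.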

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#11322
This PR is needed for PR rapidsai#11160 and another PR for writing list<int8>'s as byte arrays as well.

This PR will be the basis for the statistics written by rapidsai#11160, but it should be noted that it works because of an implementation detail: within the statistics union, the string type and the byte array type have the same layout. This means the string statistics stored from the string column can be read as byte array statistics later when writing out the file, simply because the two can alias each other.

Alternatives considered included changing the incoming table type in the case of strings being written as a byte array to be a byte array, which would only work because of the implementation detail that the columns are identical data layouts. Also considered was templating out multiple paths in [column_statistics.cuh:107](https://github.com/rapidsai/cudf/blob/e98feab966f6d1b9eba83323e851c955a1691865/cpp/src/io/statistics/column_statistics.cuh#L107) that we could use to branch based on the data type.

The latter seems the more correct approach, but it comes with performance and complexity costs. The method used in the PR was chosen mainly for its simplicity and because the code already relies throughout on the data formats being identical. It seems unlikely these will ever decouple.

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - MithunR (https://github.com/mythrocks)
  - https://github.com/nvdbaranec

URL: rapidsai#11303
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
This PR adds Java bindings for the set-like operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`
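As a rough illustration of what these four operations compute, the following Python sketch applies them row by row to two list columns modeled as lists of lists. It is an assumption-laden model: the ordering of output elements and the handling of nulls and NaNs in libcudf are not represented here.

```python
def _distinct(row):
    """Drop duplicates from one list row, keeping first-seen order."""
    return list(dict.fromkeys(row))


def have_overlap(lhs, rhs):
    """True for each row pair that shares at least one element."""
    return [bool(set(a) & set(b)) for a, b in zip(lhs, rhs)]


def intersect_distinct(lhs, rhs):
    """Distinct elements present in both rows of each pair."""
    return [[x for x in _distinct(a) if x in set(b)] for a, b in zip(lhs, rhs)]


def union_distinct(lhs, rhs):
    """Distinct elements present in either row of each pair."""
    return [_distinct(a + b) for a, b in zip(lhs, rhs)]


def difference_distinct(lhs, rhs):
    """Distinct elements of the left row that are absent from the right row."""
    return [[x for x in _distinct(a) if x not in set(b)] for a, b in zip(lhs, rhs)]
```

Each function returns one result per input row, so the output has the same number of rows as the inputs.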

Depends on:
 * rapidsai#11043
 * rapidsai#11220

New Java APIs start here: https://github.com/rapidsai/cudf/pull/11143/files#diff-50ba2711690aca8e4f28d7b491373a4dd76443127c8b452a77b6c1fe2388d9e3R3545

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: rapidsai#11143
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
Closes rapidsai#10378. This PR provides Spark-compliant hash values for list columns.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Nghia Truong (https://github.com/ttnghia)
  - Ryan Lee (https://github.com/rwlee)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#11292
… in device operators `min` and `max` (rapidsai#11357)

This fixes a bug of device operators `min` and `max` in generating the `identity` value for floating-point numbers. In particular:
 * `min::identity()` should return `cuda::std::numeric_limits<T>::infinity()` instead of `cuda::std::numeric_limits<T>::max()`, and
 * `max::identity()` should return `-cuda::std::numeric_limits<T>::infinity()` instead of `cuda::std::numeric_limits<T>::lowest()`.
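A small Python sketch shows why the finite extremes are not valid identities once infinities can appear in the input. The helper names are illustrative only; `sys.float_info.max` plays the role of `numeric_limits<T>::max()` and its negation the role of `lowest()`.

```python
import math
import sys
from functools import reduce

FLOAT_MAX = sys.float_info.max  # largest finite double


def reduce_min(values, identity):
    """Fold `min` over the values, starting from the given identity."""
    return reduce(min, values, identity)


def reduce_max(values, identity):
    """Fold `max` over the values, starting from the given identity."""
    return reduce(max, values, identity)


# Starting from the largest finite value, a column holding only +inf
# reduces to FLOAT_MAX, a value that never appeared in the input.
wrong_min = reduce_min([math.inf, math.inf], FLOAT_MAX)

# Starting from +inf, the result is the true minimum.
right_min = reduce_min([math.inf, math.inf], math.inf)

# The same problem, mirrored, for max with the lowest finite value.
wrong_max = reduce_max([-math.inf], -FLOAT_MAX)
right_max = reduce_max([-math.inf], -math.inf)
```

An identity must leave every possible input unchanged, and only the infinities satisfy that for floating-point `min` and `max`.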

Closes rapidsai#11352.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#11357
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
This PR adds `cudf.options`, a global dictionary to store configurations. A set of helper functions to manage the registries are also included. See documentation included in the PR for detail.
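A global option registry of this kind might look roughly like the sketch below. All names here (`register_option`, `get_option`, `set_option`, `describe_option`, and the `example_width` option) are chosen for illustration and should not be read as the actual cuDF API; the PR's own documentation is the authority on that.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Option:
    default: Any
    value: Any
    description: str
    validator: Callable[[Any], None]


# Hypothetical module-level registry: option name -> Option record.
_OPTIONS: Dict[str, Option] = {}


def register_option(name, default, description, validator):
    """Add an option; the default must pass its own validator."""
    validator(default)
    _OPTIONS[name] = Option(default, default, description, validator)


def get_option(name):
    return _OPTIONS[name].value


def set_option(name, value):
    """Validate first, so a rejected value never replaces the current one."""
    option = _OPTIONS[name]
    option.validator(value)
    option.value = value


def describe_option(name):
    option = _OPTIONS[name]
    return (f"{name}: {option.description} "
            f"[default: {option.default}] [current: {option.value}]")
```

Keeping the validator next to each option means invalid settings fail at `set_option` time rather than deep inside whatever code later reads the value.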

See demonstration use in: rapidsai#11272

Closes rapidsai#5311

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#11193
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
…11354)

This PR fixes a subtle bug leading to a segfault when constructing a `Column` from a `column_view`.

Closes rapidsai#11349

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - https://github.com/brandon-b-miller
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#11354
hyperbolic2346 and others added 5 commits August 3, 2022 21:20
When reviewing PR rapidsai#11322 it was noted that it would be preferable to use `std::byte` for the data type, but at the time that didn't work out, so the plan was to address it later and issue rapidsai#11362 was created to track it.

Fixes rapidsai#11362

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Tobias Ribizel (https://github.com/upsj)
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#11424
Closes rapidsai#11115 

This PR adds a `column` constructor that takes a `device_uvector&&`, so a `column` can be built from a `device_uvector` using move semantics.

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: rapidsai#11356
… option (rapidsai#11446)

Changes are mostly equivalent to Parquet changes in rapidsai#11018.

Store the `columns` option as `optional`:

- `nullopt` when columns are not passed by caller - read all columns.
- Empty vector when caller explicitly passes an empty list/vector - return empty dataframe.
- Vector of column names - read columns with given names.
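The three cases map naturally onto an optional value. The Python sketch below models them with `None` in place of `nullopt`; the function name `select_columns` and the dictionary-based table are assumptions made for the example, not reader code.

```python
from typing import Dict, List, Optional


def select_columns(available: Dict[str, list],
                   columns: Optional[List[str]] = None) -> Dict[str, list]:
    """Pick columns from a table modeled as {name: values}."""
    if columns is None:
        # Caller did not pass `columns`: read everything.
        return dict(available)
    # Caller passed a list: an empty list yields an empty table,
    # otherwise only the named columns are read (in the requested order).
    return {name: available[name] for name in columns}
```

The point of the optional wrapper is that "not specified" and "explicitly empty" stay distinguishable, which a plain list with an empty default cannot express.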

Also includes a small cleanup of the code equivalent in the Parquet reader.

Fixes rapidsai#11021

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#11446
As noted in rapidsai#11368 we should strive towards not having thrust types in our 'public' API. 
This removes occurrences of `thrust::optional` from cudf/io host classes in favor of `std::optional`.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Tobias Ribizel (https://github.com/upsj)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#11455
@github-actions github-actions bot added the gpuCI label Aug 4, 2022
vyasr and others added 21 commits August 4, 2022 16:21
The hooks for cmake-format and cmake-lint can fail silently if the necessary config files are not available. When creating these hooks we chose this behavior because depending on where and how people build the libraries the location of the format file may not be discoverable. However, this often leads to user confusion where the hooks appear to pass locally when in fact they never ran. This PR changes the hooks to be verbose so that they can provide more useful diagnostic output. In order to leave that output at a maintainable level, it forces these hooks to run serially. On my machine, this results in the cmake-format hook taking ~3.5s instead of ~1.2s to run on all files, which is an acceptable compromise for readable output.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#11456
Adds regex compile logic to check that a quantifier can be applied to the previous item, even if it's within a capture group.
This prevents an infinite loop occurring when evaluating the expression.
Additional gtests are included to check for this condition which should throw an error.
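The kind of check described can be sketched as a single pass that tracks whether the preceding item is something a quantifier may apply to. This is a simplified, hypothetical checker, not the libcudf regex compiler: it handles only `*` and `+`, treats stacked quantifiers as an error (an assumption), and does not model anchors, `?`, `{m,n}`, or escapes inside character classes.

```python
def check_quantifiers(pattern: str) -> None:
    """Raise ValueError if `*` or `+` has no preceding item to repeat."""
    can_quantify = False  # is there a preceding item a quantifier can apply to?
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 1  # the escape and the next character form one literal item
            can_quantify = True
        elif ch == "[":
            # Skip the whole character class; it is one quantifiable item.
            close = pattern.find("]", i + 1)
            if close == -1:
                raise ValueError("unterminated character class")
            i = close
            can_quantify = True
        elif ch in "(|":
            # Right after '(' or '|' there is nothing to repeat yet.
            can_quantify = False
        elif ch in "*+":
            if not can_quantify:
                raise ValueError(f"nothing to repeat at position {i}")
            can_quantify = False
        else:
            can_quantify = True  # literal, '.', or ')' closing a group
        i += 1
```

Rejecting a pattern such as `(*a)` at compile time means the evaluator never sees a repetition with an empty operand, which is the situation that can otherwise loop forever.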

Closes rapidsai#11311

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Tobias Ribizel (https://github.com/upsj)
  - Elias Stehle (https://github.com/elstehle)

URL: rapidsai#11373
Thrust 1.16 removed internal header inclusions that libcudf relied on. This PR adds missing `#include`s that were found automatically by a script I wrote. See notes on rapidsai#10489. This was previously applied in rapidsai#10489 but the script became more sophisticated (and libcudf has changed) since I last applied it, so more missing `#include`s were found.

Required for rapidsai#11437 to upgrade to Thrust 1.17. This change has been separated from rapidsai#11437 to minimize that PR's diff. Some additional changes will be needed on that PR but we don't want to hold off on fixing these includes, as recommended by @davidwendt.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#11457
This adds a simple benchmark for groupby `max` aggregation.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - David Wendt (https://github.com/davidwendt)

URL: rapidsai#11464
This PR switches the loading of `custom.js` to `defer`, because the entire page needs to be loaded before the methods in this script can execute correctly.

Authors:
   - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
   - AJ Schmidt (https://github.com/ajschmidt8)
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
This PR removes the Dremel encoding logic from Parquet-specific files and places it into a separate set of files for consumption by non-Parquet code. This PR also includes a minor rename of `utilities/column.hpp`->`utilities/linked_column.hpp` to more accurately reflect the contents of that file.

These changes were split out from rapidsai#11129 to minimize future conflicts with Parquet development (which is very active at present) and to allow further refactoring and other improvements on this Dremel code to proceed independently of the list lexicographic comparator.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#11461
This PR adds a primary developer guide for Python. It provides a more complete and informative landing page for new developers. When rapidsai#11217, rapidsai#11199, and rapidsai#11122 are merged, they will all be linked from this page to provide a complete set of developer documentation.

There is one main point of discussion that I would like reviewer comments on, and that is the section on directory and file organization. How do we want that aspect of cuDF to look?

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Lawrence Mitchell (https://github.com/wence-)
  - Ashwin Srinath (https://github.com/shwina)

URL: rapidsai#11235
This PR documents best practices for writing cuDF Python benchmarks. It includes an overview of the various fixtures provided by our benchmarking suite to all benchmarks and indicates how best to make use of them. It also discusses the various features of our benchmarking suite (including easy comparison to pandas and running in CI) and what developers must do to maintain compatibility with those features.

A PR to incorporate the [cudf_benchmarks](https://github.com/vyasr/cudf_benchmarks) repo into cudf proper is imminent, but this documentation PR can be reviewed (and merged) independently.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Ashwin Srinath (https://github.com/shwina)

URL: rapidsai#11122
…pidsai#11480)

This PR removes support for `skiprows` & `num_rows` in parquet reader. A continuation of rapidsai#11218

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#11480
…unctor (rapidsai#11482)

Refactored the `group_nunique.cu` source to use the `nullate::DYNAMIC` for the equal operator and the unique-iterator. This improves the compile time by almost 2x without much change to performance by reducing the number of calls to `thrust::reduce_by_key`.

Found while investigating compile issues for rapidsai#11437

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Yunsong Wang (https://github.com/PointKernel)

URL: rapidsai#11482
…pidsai#11365)

release() sets the null_count of a column to zero, so asking for the
null_count afterwards previously returned an incorrect value. Fortunately
this never surfaced in the final column, since Column.__init__ always
ignores the provided null_count and computes it from the null_mask (if
one is given).
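Computing a null count from a mask can be sketched as follows. The sketch assumes an Arrow-style validity bitmask (a set bit means the row is valid, least-significant bit first within each byte); the helper name is hypothetical and this is not the code inside `Column.__init__`.

```python
def null_count_from_mask(mask: bytes, size: int) -> int:
    """Count rows whose validity bit is unset among the first `size` rows."""
    valid = 0
    for row in range(size):
        byte, bit = divmod(row, 8)
        valid += (mask[byte] >> bit) & 1
    # Padding bits past `size` are never inspected.
    return size - valid
```

Because the mask is the source of truth, a stale or zeroed `null_count` passed alongside it can simply be ignored and recomputed.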

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Nghia Truong (https://github.com/ttnghia)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#11365
This document aims to give guidance on the following two things:
- What to throw given invalid user inputs
- How should cuDF handle exceptions from libcudf

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#7917
This PR builds on the _Finite-State Transducer_ (_FST_) algorithm and the _Logical Stack_ to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section.

**This PR builds on:**
⛓️ rapidsai#11242
⛓️ rapidsai#11078

Specifically, the tokenizer comprises the following processing steps:
1. FST to emit sequence of stack operations (i.e., emit push(LIST), push(STRUCT), pop(), read()). This FST does transduce each occurrence of an opening semantic bracket or brace to the respective push(LIST) and push(STRUCT) operation, respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation.
2. The sequence of stack operations from (1) is fed into the logical stack that resolves what is on top of the stack before each operation from (1) (i.e., STRUCT, LIST). After this stage, for every input character we know what is on top of the stack: either a STRUCT or LIST or ROOT, if there is no symbol on top of the stack.
3. We use the top-of-stack information from (2) for a second FST. This part can be considered a full pushdown or DVPA (because now, we also have stack context). State transitions are caused by the combination of the input character + the top-of-stack for that character. The output of this stage is the token stream: ({beginning-of, end-of}x{struct, list}, field name, value, etc.).
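The first two steps above can be sketched sequentially in Python. This is only a model of what the stages compute: the PR describes data-parallel GPU algorithms, whereas the hypothetical functions below walk the input one character at a time, and step (3) is omitted.

```python
def stack_operations(json_text: str):
    """Stage 1: emit one stack operation per input character."""
    ops = []
    in_string = False
    escaped = False
    for ch in json_text:
        if in_string:
            # Brackets and braces inside strings are not semantic.
            ops.append("read()")
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            ops.append("read()")
        elif ch == "{":
            ops.append("push(STRUCT)")
        elif ch == "[":
            ops.append("push(LIST)")
        elif ch in "}]":
            ops.append("pop()")
        else:
            ops.append("read()")
    return ops


def top_of_stack(ops):
    """Stage 2: what is on top of the stack before each operation."""
    stack = []
    tops = []
    for op in ops:
        tops.append(stack[-1] if stack else "ROOT")
        if op == "push(STRUCT)":
            stack.append("STRUCT")
        elif op == "push(LIST)":
            stack.append("LIST")
        elif op == "pop()" and stack:
            stack.pop()
    return tops
```

For the input `{"a":[1,"]"]}`, the `]` inside the string is transduced to `read()`, so the top of stack stays `LIST` until the real closing bracket is reached.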

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Tobias Ribizel (https://github.com/upsj)
  - Karthikeyan (https://github.com/karthikeyann)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#11264
@elstehle elstehle closed this Aug 10, 2022
elstehle pushed a commit that referenced this pull request Jun 13, 2023
This implements stacktrace and adds a stacktrace string into any exception thrown by cudf. By doing so, the exception carries information about where it originated, allowing the downstream application to trace back with much less effort.

Closes rapidsai#12422.

### Example:
```
#0: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::sorted_order<false>(cudf::table_view, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x446
#1: cudf/cpp/build/libcudf.so : cudf::detail::sorted_order(cudf::table_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x113
#2: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::segmented_sorted_order_common<(cudf::detail::sort_method)1>(cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x66e
#3: cudf/cpp/build/libcudf.so : cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x88
#4: cudf/cpp/build/libcudf.so : cudf::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::mr::device_memory_resource*)+0xb9
#5: cudf/cpp/build/gtests/SORT_TEST : ()+0xe3027
#6: cudf/cpp/build/lib/libgtest.so.1.13.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f
#7: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::Test::Run()+0xd6
#8: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestInfo::Run()+0x195
#9: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestSuite::Run()+0x109
#10: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::internal::UnitTestImpl::RunAllTests()+0x44f
#11: cudf/cpp/build/lib/libgtest.so.1.13.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x87
#12: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::UnitTest::Run()+0x95
#13: cudf/cpp/build/gtests/SORT_TEST : ()+0xdb08c
#14: /lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90
#15: /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80
#16: cudf/cpp/build/gtests/SORT_TEST : ()+0xdf3d5
```

### Usage

In order to retrieve a stacktrace with fully human-readable symbols, some compiling options must be adjusted. To make such adjustment convenient and effortless, a new cmake option (`CUDF_BUILD_STACKTRACE_DEBUG`) has been added. Just set this option to `ON` before building cudf and it will be ready to use.

Whenever a cudf-type exception is thrown, a downstream application can retrieve the stored stacktrace and use it as needed. For example:
```
try {
  // cudf API calls
} catch (cudf::logic_error const& e) {
  std::cout << e.what() << std::endl;
  std::cout << e.stacktrace() << std::endl;
  throw;  // rethrow the original exception rather than a copy of it
} 
// similar with catching other exception types
```

### Follow-up work

The next step would be patching `rmm` to attach stacktrace into `rmm::` exceptions. Doing so will allow debugging various memory exceptions thrown from libcudf using their stacktrace.


### Note:
 * This feature doesn't require libcudf to be built in Debug mode.
 * The flag `CUDF_BUILD_STACKTRACE_DEBUG` should not be turned on in production as it may affect code optimization. Instead, a libcudf build with that flag turned on should be used only when needed, such as when debugging exceptions thrown by cudf.
 * This flag removes the current optimization flag from compilation (such as `-O2` or `-O3`, if in Release mode) and replaces it with `-Og` (optimize for debugging).
 * If this option is not set to `ON`, the stacktrace will not be available. This avoids an expensive stacktrace retrieval when the thrown exception is expected.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#13298
elstehle pushed a commit that referenced this pull request Oct 5, 2023
Pin conda packages to `aws-sdk-cpp<1.11`. The recent upgrade to version `1.11.*` has caused several issues with cleaning up (more details on the changes can be read in [this link](https://github.com/aws/aws-sdk-cpp#version-111-is-now-available)), causing Distributed and Dask-CUDA processes to segfault. The stack for one of those crashes looks like the following:

```
(gdb) bt
#0  0x00007f5125359a0c in Aws::Utils::Logging::s_aws_logger_redirect_get_log_level(aws_logger*, unsigned int) () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so
#1  0x00007f5124968f83 in aws_event_loop_thread () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#2  0x00007f5124ad9359 in thread_fn () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1
#3  0x00007f519958f6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f5198b1361f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```

Such segfaults now manifest frequently in CI, and in some cases are reproducible with a hit rate of ~30%. Given the approaching release time, the safest option is probably to pin to an older version of the package until we pinpoint the exact cause of the issue and a patched build is released upstream.

The `aws-sdk-cpp` is statically-linked in the `pyarrow` pip package, which prevents us from using the same pinning technique. cuDF is currently pinned to `pyarrow=12.0.1` which seems to be built against `aws-sdk-cpp=1.10.*`, as per [recent build logs](https://github.com/apache/arrow/actions/runs/6276453828/job/17046177335?pr=37792#step:6:1372).

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#14173