Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpcio >=1.51 python stack is segfaults on osx (x86-64) #281

Closed
1 task done
s22chan opened this issue Feb 21, 2023 · 25 comments
Closed
1 task done

grpcio >=1.51 python stack is segfaults on osx (x86-64) #281

s22chan opened this issue Feb 21, 2023 · 25 comments
Labels
bug Something isn't working

Comments

@s22chan
Copy link

s22chan commented Feb 21, 2023

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

On 1.51/2, I get the following making a unary call:

>>> import grpc
>>> import helloworld_pb2
>>> import helloworld_pb2_grpc
>>> with grpc.insecure_channel('localhost:50051') as channel:
        stub = helloworld_pb2_grpc.GreeterStub(channel)
        response = stub.SayHello(helloworld_pb2.HelloRequest(name='you'))

E0221 15:33:42.187235000 140704340240576 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.187294000 140704340240576 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.187544000 140704340240576 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.191049000 123145385762816 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.191257000 123145385762816 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value
E0221 15:33:42.195639000 140704340240576 client_channel.cc:625]        chand=0x7f8c0182d360: Illegal keepalive throttling value

then it randomly can segfault.

Running under lldb shows it's there's some kind of error (linkage to libstdc++?) in abseil (20230125.0):

 thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1)
    frame #0: 0x0000000104bc7c2e libabsl_cord.2301.0.0.dylib`absl::lts_20230125::CopyCordToString(absl::lts_20230125::Cord const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) + 46
libabsl_cord.2301.0.0.dylib`absl::lts_20230125::CopyCordToString:
->  0x104bc7c2e <+46>: movq   (%rax), %r15
    0x104bc7c31 <+49>: movzbl (%rbx), %ecx
    0x104bc7c34 <+52>: movl   %ecx, %eax
    0x104bc7c36 <+54>: andb   $0x1, %al

Installed packages

# Name                    Version                   Build  Channel
bzip2                     1.0.8                h0d85af4_4    conda-forge
c-ares                    1.18.1               h0d85af4_0    conda-forge
ca-certificates           2022.12.7            h033912b_0    conda-forge
grpcio                    1.52.1          py311h814d153_0    conda-forge
libabseil                 20230125.0      cxx17_hf0c8a7f_1    conda-forge
libcxx                    14.0.6               hccf4f1f_0    conda-forge
libffi                    3.4.2                h0d85af4_5    conda-forge
libgrpc                   1.52.1               h493e69f_0    conda-forge
libprotobuf               3.21.12              hbc0c0cd_0    conda-forge
libsqlite                 3.40.0               ha978bb4_0    conda-forge
libzlib                   1.2.13               hfd90126_4    conda-forge
ncurses                   6.3                  h96cf925_1    conda-forge
openssl                   3.0.8                hfd90126_0    conda-forge
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
protobuf                  4.21.12         py311h814d153_0    conda-forge
python                    3.11.0          he7542f4_1_cpython    conda-forge
python_abi                3.11                    3_cp311    conda-forge
re2                       2023.02.01           hf0c8a7f_0    conda-forge
readline                  8.1.2                h3899abd_0    conda-forge
setuptools                67.3.2             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
tk                        8.6.12               h5dbffcc_0    conda-forge
tzdata                    2022g                h191b570_0    conda-forge
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h775f41a_0    conda-forge
zlib                      1.2.13               hfd90126_4    conda-forge

Environment info

active environment : test
    active env location : /conda/envs/test
            shell level : 2
       user config file : /.condarc
 populated config files : /.condarc
          conda version : 22.11.1
    conda-build version : 3.23.3
         python version : 3.11.0.final.0
       virtual packages : __archspec=1=x86_64
                          __osx=13.1=0
                          __unix=0=0
       base environment : /conda  (writable)
      conda av data dir : /conda/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/bioconda/osx-64
                          https://conda.anaconda.org/bioconda/noarch
          package cache : /conda/pkgs
                          /.conda/pkgs
       envs directories : /conda/envs
                          /.conda/envs
               platform : osx-64
             user-agent : conda/22.11.1 requests/2.28.2 CPython/3.11.0 Darwin/22.2.0 OSX/13.1
                UID:GID : 501:20
             netrc file : None
           offline mode : False
@s22chan s22chan added the bug Something isn't working label Feb 21, 2023
@s22chan s22chan changed the title grpcio >=1.51 python stack is broken on osx (x86-64) grpcio >=1.51 python stack is segfauktss on osx (x86-64) Feb 22, 2023
@s22chan s22chan changed the title grpcio >=1.51 python stack is segfauktss on osx (x86-64) grpcio >=1.51 python stack is segfaults on osx (x86-64) Feb 22, 2023
@zeroshade
Copy link

I just wanted to chime in a finding here. We ran into this issue with the arrow nightly builds and was able to confirm that gRPC v1.54.0 appears to alleviate this crash.

See apache/arrow#35090 for more info. For now we have to bump back down a version, but if v1.54 can get released on conda-forge arrow would switch to that.

Thanks!

@h-vetinari
Copy link
Member

Thanks. Could you try 1.53 as well? 1.54 is pretty fresh and we'd have to either skip 1.53 or migrate twice in a row.

@h-vetinari
Copy link
Member

Now that I see that 1.53 broke a version pattern scheme that's held for a while, I might go to 1.54 directly with the next migration...

@h-vetinari h-vetinari mentioned this issue Apr 15, 2023
3 tasks
@s22chan
Copy link
Author

s22chan commented May 12, 2023

Still broken for me (and seems so for arrow as well: apache/arrow#35089)

@h-vetinari
Copy link
Member

For now we have to bump back down a version, but if v1.54 can get released on conda-forge arrow would switch to that.

grpc 1.54 is in conda-forge for a while now (there's also 1.55 already, but that will need more time due to breaking changes around protobuf). I'd still be very interested in knowing what broke and how!

@s22chan
Copy link
Author

s22chan commented May 24, 2023

I think the crash is fixed, so this can be closed but requests still aren't being sent correctly.

I'd be interested in this portion of the code, but getting a debug build working of grpc and abseil might be tricky for someone outside these projects (I wasn't able to get abseil cmake to keep the debug symbols):

https://github.com/grpc/grpc/blob/33d0afd316d43ea89c4a5cda680405f2b4033194/src/core/ext/filters/client_channel/client_channel.cc#L650C7-L670

@h-vetinari
Copy link
Member

From @lidavidm's work in apache/arrow#36908:

For posterity, to get a debug build of gRPC:

  • git clean -fdx . in the grpc-cpp-feedstock
  • env PATH=... arch -arch x86_64 /bin/bash (clean your $PATH of any gunk)
  • Edit .scripts/run_osx_build.sh and add --keep-old-work to conda mambabuild
  • Edit the recipe to set CMAKE_BUILD_TYPE and also clean CMAKE_BUILD_TYPE out of CMAKE_ARGS (since something in the conda build setup also sets it there)
  • python3 build-locally.py
  • Symlink ln -s .../grpc-cpp-feedstock/miniforge3/conda-bld/grpc-split_1692042959526/work_moved_libgrpc-1.56.2-h162c7d8_0_osx-64/ back to the original path .../grpc-cpp-feedstock/miniforge3/conda-bld/grpc-split_1692042959526/work
  • Install the resulting package, lldb should be able to find debug symbols now

In the meantime, I've started looking at enabling the C++ test suite here as well: #311

@s22chan
Copy link
Author

s22chan commented Aug 24, 2023

this doesn't generate abseil as debug but I'll take a quick look

@s22chan
Copy link
Author

s22chan commented Aug 27, 2023

I built both abseil and grpcio with debugging, and it's clear that there is some kind of linkage issue. Somehow the absl::optional<absl::Cord> definitions are different (code patches? different #define?) starting with how grpcio=1.51's build was changed.

abseil is returning a nullopt in various GetPayload places, but grpc is interpreting that as something that has a value.

lldb trace
libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload:
    0x10384927e <+94>: testb  $0x1, %r13b
->  0x103849282 <+98>: jne    0x1038493ba               ; <+410> [inlined] absl::lts_20230125::optional_internal::optional_data_dtor_base<absl::lts_20230125::Cord, false>::optional_data_dtor_base() at optional.h:108:50
Process 32508 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x0000000103849282 libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload(this=0x0000600000201c38, type_url=Summary Unavailable) const at status.cc:128:7 [opt]
Target 0: (python) stopped.
(lldb)
libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload:
->  0x1038493ba <+410>: movb   $0x0, (%r15)


    0x1038493be <+414>: movq   %r15, %rax
    0x1038493c1 <+417>: addq   $0x28, %rsp
    0x1038493c5 <+421>: popq   %rbx
    0x1038493c6 <+422>: popq   %r12
    0x1038493c8 <+424>: popq   %r13
    0x1038493ca <+426>: popq   %r14
    0x1038493cc <+428>: popq   %r15
    0x1038493ce <+430>: popq   %rbp
    0x1038493cf <+431>: retq


    0x1038493d0 <+432>: leaq   0x3825(%rip), %rdi        ; "operator()"
Process 32508 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x00000001038493ba libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload(this=<unavailable>, type_url=Summary Unavailable) const at status.cc:131:10 [opt]
Target 0: (python) stopped.
(lldb)
libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload:
->  0x1038493be <+414>: movq   %r15, %rax
    0x1038493c1 <+417>: addq   $0x28, %rsp
    0x1038493c5 <+421>: popq   %rbx
    0x1038493c6 <+422>: popq   %r12
    0x1038493c8 <+424>: popq   %r13
    0x1038493ca <+426>: popq   %r14
    0x1038493cc <+428>: popq   %r15
    0x1038493ce <+430>: popq   %rbp
    0x1038493cf <+431>: retq


 thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x00000001038493cf libabsl_status.2301.0.0.dylib`absl::lts_20230125::Status::GetPayload(this=<unavailable>, type_url=Summary Unavailable) const at status.cc:132:1 [opt]
Target 0: (python) stopped.
(lldb)
libgrpc.34.0.0.dylib`grpc_core::ClientChannel::SubchannelWrapper::WatcherWrapper::ApplyUpdateInControlPlaneWorkSerializer:
->  0x1045911d7 <+87>:  cmpb   $0x0, -0x38(%rbp)


    0x1045911db <+91>:  je     0x10459130c               ; <+396> at new


    0x1045911e1 <+97>:  leaq   -0x68(%rbp), %rdi
    0x1045911e5 <+101>: leaq   -0x48(%rbp), %rsi


    0x1045911e9 <+105>: callq  0x1048addd2               ; symbol stub for: absl::lts_20230125::Cord::operator std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >() const
    0x1045911ee <+110>: movzbl -0x68(%rbp), %esi
    0x1045911f2 <+114>: testb  $0x1, %sil
Process 32508 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x00000001045911d7 libgrpc.34.0.0.dylib`grpc_core::ClientChannel::SubchannelWrapper::WatcherWrapper::ApplyUpdateInControlPlaneWorkSerializer(this=0x000060000020c400, state=GRPC_CHANNEL_IDLE, status=0x0000600000201c38) at client_channel.cc:654:32 [opt]
(lldb) p keepalive_throttling
(std::optional<absl::Cord>) $4 =  Has Value=true  {
  Value = {
    contents_ = {
      data_ = {
        rep_ = {
           = {
            data ={...}
            as_tree ={...}
          }
        }
      }
    }
  }
}

@h-vetinari
Copy link
Member

Thanks for digging into this!

starting with how grpcio=1.51's build was changed.

So grpc-cpp=1.51.1=*_0 only bumped the version, but shortly after we rebuilt for a new re2 (build *_1) and more importantly: the newest abseil at the time (build *_2).

It's possible that grpc was using abseil in a way that wasn't compatible with that newer version, however, they definitely followed suit as of grpc 1.53 (which doesn't look like it needed source changes).

Which grpc & abseil version did you use for your debug builds?

@s22chan
Copy link
Author

s22chan commented Aug 28, 2023

in this case grcp-cpp-feedstock 2e04d51 (1.57.0) and abseil-cpp-feedstock 6117048 (20230125.3).

But this issue has persisted since 1.51 as per the title. I'd dig in more but grcp takes forever to build.

@s22chan
Copy link
Author

s22chan commented Aug 28, 2023

step-by-step repro for posterity (note: -DCMAKE_BUILD_TYPE=Debug will not show the bug):

grpc:

diff --git a/recipe/build-cpp.sh b/recipe/build-cpp.sh
index 51b2f3f..7a137d9 100755
--- a/recipe/build-cpp.sh
+++ b/recipe/build-cpp.sh
@@ -60,6 +60,7 @@ cmake ${CMAKE_ARGS} ..  \
       -GNinja \
       -DBUILD_SHARED_LIBS=ON \
       -DCMAKE_BUILD_TYPE=Release \
+      -DCMAKE_CXX_FLAGS_RELEASE="${CMAKE_CXX_FLAGS_RELEASE:-} -O1 -g -DNDEBUG" \
       -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
       -DCMAKE_PREFIX_PATH=$PREFIX \
       -DCMAKE_INSTALL_PREFIX=$PREFIX \

abseil:

iff --git a/recipe/build-abseil.sh b/recipe/build-abseil.sh
index a1533e0..3fdc43f 100644
--- a/recipe/build-abseil.sh
+++ b/recipe/build-abseil.sh
@@ -24,6 +24,7 @@ fi
 cmake -G Ninja \
     ${CMAKE_ARGS} \
     -DCMAKE_BUILD_TYPE=Release \
+    -DCMAKE_CXX_FLAGS_RELEASE="${CMAKE_CXX_FLAGS_RELEASE:-} -O1 -g -DNDEBUG" \
     -DCMAKE_CXX_STANDARD=17 \
     -DCMAKE_INSTALL_LIBDIR=lib \
     -DCMAKE_PREFIX_PATH=${PREFIX} \
  1. build both with EXTRA_CB_OPTIONS="--dirty --no-test" OSX_SDK_DIR=... ./build-locally.py. This doesn't use the abseil local build but -c doesn't seem to work for mambabuild.
  2. mamba create -n test grpcio protobuf --override-channels -c /.../grpc-cpp-feedstock/miniforge3/conda-bld -c /.../abseil-cpp-feedstock/miniforge3/conda-bld -c conda-forge
  3. mamba activate test
  4. lldb python test.py note: lldb will complain that conda-forge touched the libabseil .o files and won't load symbols. I was able to get around, that but I forget now what I did.

@h-vetinari
Copy link
Member

I tried to add a test for OP example in #312, but either I cannot get the server process to work correctly, our CI setup doesn't allow accessing accessing the default port of 127.0.0.1. On all platforms, it runs into a variant of

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = ""
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051:  {grpc_status:14, created_time:"2023-08-28T09:34:02.207699+00:00"}"
>

Interestingly, despite the failing startup, the segfault seems to happen on osx.

@s22chan
Copy link
Author

s22chan commented Aug 29, 2023

you don't need a server, since the segfault happens due to the connection state being updated in callbacks for the client (eg IDLE -> CONNECTING)

@h-vetinari
Copy link
Member

I had tried it without the server, and it gets the same error (on linux). If we are to integrate this into CI (best way to fix it and keep it fixed), we're going to need a form of the test that passes.

@s22chan
Copy link
Author

s22chan commented Aug 29, 2023

I tried to add a test for OP example in #312, but either I cannot get the server process to work correctly, our CI setup doesn't allow accessing accessing the default port of 127.0.0.1. On all platforms, it runs into a variant of

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = ""
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50051:  {grpc_status:14, created_time:"2023-08-28T09:34:02.207699+00:00"}"
>

Interestingly, despite the failing startup, the segfault seems to happen on osx.

I don't think the server is started? the https://github.com/conda-forge/grpc-cpp-feedstock/pull/312/files#diff-ddad1f9c7894b9921ef910104dd72b393dcb3d052923fdc85a5f0eaad8bdb450 differs significantly from https://github.com/grpc/grpc/blob/master/examples/python/helloworld/greeter_server.py

also while the crash is non-deterministic, the keepalive error is 100% deterministic and you can just run it once and verify no warnings/errors were emitted.

@h-vetinari
Copy link
Member

Good point, thanks. I had just stupidly copied stuff together from their docs. Updated the PR, let's see how it goes.

@s22chan
Copy link
Author

s22chan commented Aug 30, 2023

@h-vetinari I saw you tried to downgrade abseil to test if it failed, have you tried new abseil + old (1.50) grpc?

@h-vetinari
Copy link
Member

have you tried new abseil + old (1.50) grpc?

#313 for now does 1.50 as published. We can then try to bump abseil and see what happens.

@h-vetinari
Copy link
Member

h-vetinari commented Sep 1, 2023

Small update from #313:
Test works with 1.50 & old abseil (20220623.3), fails on 1.50 with newer abseil (20230125; which our builds since 1.51 are built against), and passes again with newest abseil (20230802; not yet migrated)

@h-vetinari
Copy link
Member

Alright, the good news is that both on 1.50 as well as on 1.56, the newest abseil (20230802) fixes the error.

The bad news is that the migration for this is stuck in purgatory until we figure out how to move to a Macos>=10.13 world

@s22chan
Copy link
Author

s22chan commented Sep 3, 2023

thanks for the investigation @h-vetinari. I'd be suspicious if 20230125 actually ever worked with OSX<10.13 at this point... although I didn't see any C++ABI code in abseil's optional code

@h-vetinari
Copy link
Member

Do you think it has something to do with having compiled abseil 20230125 against 10.9? That possibility did not occur to me, I would have guessed it's an abseil bug, but perhaps you're right. In that case, we could fix all the existing grpc builds by just recompiling abseil against 10.13, which would be very nice...

@s22chan
Copy link
Author

s22chan commented Sep 5, 2023

That was my initial thought before debugging (since that was the big change other than the version bump), but the optional code didn't have/fallback on libc++, so not sure if it's worth your time.

@h-vetinari
Copy link
Member

Closing this as fixed by #315

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants