Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

envoy stuck initializing cluster after we updated to 6ea29c7 #7252

Closed
rgs1 opened this issue Jun 12, 2019 · 12 comments
Closed

envoy stuck initializing cluster after we updated to 6ea29c7 #7252

rgs1 opened this issue Jun 12, 2019 · 12 comments
Labels
Milestone

Comments

@rgs1
Copy link
Member

rgs1 commented Jun 12, 2019

Since we synced with 6ea29c7, we are seeing that Envoy never finishes initializing clusters that depend on SDS. E.g.:

[2019-06-12 20:46:45.051][17][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:74] cm init: adding: cluster=foo primary=1 secondary=17

but something like:

[2019-06-12 20:46:45.246][17][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:102] cm init: init complete: cluster=foo primary=0 secondary=18

never happens. The TLS context for the foo cluster looks like this:

    "tls_context": {
      "common_tls_context": {
        "tls_certificate_sds_secret_configs": [
          {
            "name": "foo_certificate_chain",
            "sds_config": {
              "api_config_source": {
                "api_type": "GRPC",
                "grpc_services": [
                  {
                    "envoy_grpc": {
                      "cluster_name": "local_sds"
                    }
                  }
                ]
              }
            }
          }
        ],
        "validation_context_sds_secret_config": {
          "name": "foo_root_ca",
          "sds_config": {
            "api_config_source": {
              "api_type": "GRPC",
              "grpc_services": [
                {
                  "envoy_grpc": {
                    "cluster_name": "local_sds"
                  }
                }
              ]
            }
          }
        }
      }
    },

cc: @fishcakez

@mattklein123
Copy link
Member

cc @htuch @fredlas

@rgs1
Copy link
Member Author

rgs1 commented Jun 12, 2019

Err, sorry for the confusion but #7200 is actually newer. I'll look around a bit more.

@rgs1
Copy link
Member Author

rgs1 commented Jun 12, 2019

Hmm a bit suspicious about #6525 now, given that we use gRPC to talk to our SDS sidecar.

@mattklein123 mattklein123 added this to the 1.11.0 milestone Jun 12, 2019
@rgs1
Copy link
Member Author

rgs1 commented Jun 13, 2019

Hmm, taking a closer look this morning it seems to be between these changes:

9d47062...6ea29c7

Bisecting now.

@rgs1 rgs1 changed the title envoy stuck initializing cluster after we updated to 9d4706251 envoy stuck initializing cluster after we updated to 6ea29c7 Jun 13, 2019
@mattklein123
Copy link
Member

I would potentially suspect 1e34a2e

@rgs1
Copy link
Member Author

rgs1 commented Jun 13, 2019

@mattklein123 literally just reverted it and trying that :-)

@rgs1
Copy link
Member Author

rgs1 commented Jun 13, 2019

stay tuned.

@rgs1
Copy link
Member Author

rgs1 commented Jun 13, 2019

Alas, 1e34a2e wasn't the problem. Still stuck at:

[2019-06-13 16:23:59.650][18][info][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:123] cm init: initializing secondary clusters

@rgs1
Copy link
Member Author

rgs1 commented Jun 13, 2019

Hmm what's going on here:

(gdb) thread apply all bt
Thread 7 (LWP 32616):
#0  0x00007fb41b1504d9 in syscall () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x000000000103d719 in absl::synchronization_internal::Futex::WaitUntil (t=..., val=0, v=0x7fb41bf3f150) at external/com_google_absl/absl/synchronization/internal/waiter.cc:106                        
#2  absl::synchronization_internal::Waiter::Wait (this=this@entry=0x7fb41bf3f150, t=t@entry=...) at external/com_google_absl/absl/synchronization/internal/waiter.cc:143                                   
#3  0x000000000103d5c2 in AbslInternalPerThreadSemWait (t=t@entry=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.cc:87                                                      
#4  0x000000000103d1e0 in absl::synchronization_internal::PerThreadSem::Wait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.h:105                                        
#5  absl::Mutex::DecrementSynchSem (mu=0x3bda380, w=<optimized out>, t=...) at external/com_google_absl/absl/synchronization/mutex.cc:562                                                                  
#6  absl::CondVar::WaitCommon (this=this@entry=0x3bda390, mutex=mutex@entry=0x3bda380, t=t@entry=...) at external/com_google_absl/absl/synchronization/mutex.cc:2534                                       
#7  0x000000000103d2cc in absl::CondVar::Wait (this=this@entry=0x3bda390, mu=mu@entry=0x3bda380) at external/com_google_absl/absl/synchronization/mutex.cc:2569                                            
#8  0x00000000009631c3 in Envoy::Thread::CondVar::wait (mutex=..., this=0x3bda390) at bazel-out/k8-opt/bin/external/envoy/source/common/common/_virtual_includes/thread_lib/common/common/thread.h:56      
#9  Envoy::AccessLog::AccessLogFileImpl::flushThreadFunc (this=0x3bda350) at external/envoy/source/common/access_log/access_log_manager_impl.cc:116                                                        
#10 0x0000000000ef3f65 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706                                                                  
#11 Envoy::Thread::ThreadImplPosix::<lambda(void*)>::operator() (__closure=0x0, arg=<optimized out>) at external/envoy/source/common/common/posix/thread_impl.cc:39                                        
#12 Envoy::Thread::ThreadImplPosix::<lambda(void*)>::_FUN(void *) () at external/envoy/source/common/common/posix/thread_impl.cc:41                                                                        
#13 0x00007fb41b7296ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#14 0x00007fb41b15641d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6

Thread 6 (LWP 32615):
#0  0x00007fb41b156a13 in epoll_wait () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x0000000000cc1f7a in epoll_dispatch (base=0x3d51340, tv=<optimized out>) at /build/tmp/_bazel_bazel/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/external/com_github_libevent_libevent/epoll.c:465 
#2  0x0000000000cbbd7e in event_base_loop (base=0x3d51340, flags=<optimized out>)
    at /build/tmp/_bazel_bazel/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/external/com_github_libevent_libevent/event.c:2019                                                                          
#3  0x0000000000976185 in Envoy::Event::LibeventScheduler::run (this=this@entry=0x3d535a8, mode=mode@entry=Envoy::Event::Dispatcher::RunType::RunUntilExit)                                                
    at external/envoy/source/common/event/libevent_scheduler.cc:47
#4  0x000000000096479a in Envoy::Event::DispatcherImpl::run (this=0x3d53560, type=Envoy::Event::Dispatcher::RunType::RunUntilExit) at external/envoy/source/common/event/dispatcher_impl.cc:178            
#5  0x0000000000ef3f65 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706                                                                  
#6  Envoy::Thread::ThreadImplPosix::<lambda(void*)>::operator() (__closure=0x0, arg=<optimized out>) at external/envoy/source/common/common/posix/thread_impl.cc:39                                        
#7  Envoy::Thread::ThreadImplPosix::<lambda(void*)>::_FUN(void *) () at external/envoy/source/common/common/posix/thread_impl.cc:41                                                                        
#8  0x00007fb41b7296ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007fb41b15641d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6

Thread 5 (LWP 32614):
#0  0x00007fb41b156a13 in epoll_wait () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x0000000000a90868 in pollable_epoll (deadline=<optimized out>, p=0x3b24000) at external/com_github_grpc_grpc/src/core/lib/iomgr/ev_epollex_linux.cc:956                                               
#2  pollset_work (pollset=0x3cd1108, worker_hdl=<optimized out>, deadline=<optimized out>) at external/com_github_grpc_grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1144                                    
#3  0x0000000000a7045a in cq_next (cq=0x3cd1040, deadline=..., reserved=<optimized out>) at external/com_github_grpc_grpc/src/core/lib/surface/completion_queue.cc:1040                                    
#4  0x00000000009e4481 in grpc_impl::CompletionQueue::AsyncNextInternal (this=this@entry=0x3d16d98, tag=tag@entry=0x7fb4198372e8, ok=ok@entry=0x7fb4198372e7, deadline=...)                                
    at external/com_github_grpc_grpc/src/cpp/common/completion_queue_cc.cc:48
#5  0x00000000009d8fd1 in grpc_impl::CompletionQueue::Next (ok=0x7fb4198372e7, tag=0x7fb4198372e8, this=0x3d16d98) at external/com_github_grpc_grpc/include/grpcpp/impl/codegen/completion_queue_impl.h:179
#6  Envoy::Grpc::GoogleAsyncClientThreadLocal::completionThread (this=0x3d16d90) at external/envoy/source/common/grpc/google_async_client_impl.cc:46                                                       
#7  0x0000000000ef3f65 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706                                                                  
#8  Envoy::Thread::ThreadImplPosix::<lambda(void*)>::operator() (__closure=0x0, arg=<optimized out>) at external/envoy/source/common/common/posix/thread_impl.cc:39                                        
#9  Envoy::Thread::ThreadImplPosix::<lambda(void*)>::_FUN(void *) () at external/envoy/source/common/common/posix/thread_impl.cc:41                                                                        
#10 0x00007fb41b7296ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007fb41b15641d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6

Thread 4 (LWP 32613):
#0  0x00007fb41b72f709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0                                                                                         
#1  0x0000000000a9e41f in gpr_cv_wait (cv=cv@entry=0x15fa300 <g_cv_wait>, mu=mu@entry=0x15fa340 <_ZL4g_mu>, abs_deadline=...) at external/com_github_grpc_grpc/src/core/lib/gpr/sync_posix.cc:233          
#2  0x0000000000a7fb41 in wait_until (next=<optimized out>) at external/com_github_grpc_grpc/src/core/lib/iomgr/timer_manager.cc:211                                                                       
#3  timer_main_loop () at external/com_github_grpc_grpc/src/core/lib/iomgr/timer_manager.cc:265
#4  timer_thread (completed_thread_ptr=0x3b72140) at external/com_github_grpc_grpc/src/core/lib/iomgr/timer_manager.cc:294                                                                                 
#5  0x0000000000a9ce1a in grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::operator() (__closure=0x0, v=<optimized out>)                                                           
    at external/com_github_grpc_grpc/src/core/lib/gprpp/thd_posix.cc:114
#6  grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::_FUN(void *) () at external/com_github_grpc_grpc/src/core/lib/gprpp/thd_posix.cc:119                                          
#7  0x00007fb41b7296ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007fb41b15641d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6

Thread 3 (LWP 32612):
#0  0x00007fb41b72f360 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0                                                                                              
--Type <RET> for more, q to quit, c to continue without paging--
#1  0x0000000000a9e3c7 in gpr_cv_wait (cv=cv@entry=0x3bdf038, mu=mu@entry=0x3bdf000, abs_deadline=...) at external/com_github_grpc_grpc/src/core/lib/gpr/sync_posix.cc:217
#2  0x0000000000a8af87 in grpc_core::Executor::ThreadMain (arg=0x3bdf000) at external/com_github_grpc_grpc/src/core/lib/iomgr/executor.cc:245
#3  0x0000000000a9ce1a in grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::operator() (__closure=0x0, v=<optimized out>)
    at external/com_github_grpc_grpc/src/core/lib/gprpp/thd_posix.cc:114
#4  grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::_FUN(void *) () at external/com_github_grpc_grpc/src/core/lib/gprpp/thd_posix.cc:119
#5  0x00007fb41b7296ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007fb41b15641d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6

Thread 2 (LWP 32611):
#0  0x00007fb41b72f360 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000a9e3c7 in gpr_cv_wait (cv=cv@entry=0x3bdc038, mu=mu@entry=0x3bdc000, abs_deadline=...) at external/com_github_grpc_grpc/src/core/lib/gpr/sync_posix.cc:217
#2  0x0000000000a8af87 in grpc_core::Executor::ThreadMain (arg=0x3bdc000) at external/com_github_grpc_grpc/src/core/lib/iomgr/executor.cc:245
#3  0x0000000000a9ce1a in grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::operator() (__closure=0x0, v=<optimized out>)
    at external/com_github_grpc_grpc/src/core/lib/gprpp/thd_posix.cc:114
#4  grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::_FUN(void *) () at external/com_github_grpc_grpc/src/core/lib/gprpp/thd_posix.cc:119
#5  0x00007fb41b7296ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007fb41b15641d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6

Thread 1 (LWP 32609):
#0  0x00007fb41b156a13 in epoll_wait () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x0000000000cc1f7a in epoll_dispatch (base=0x3b642c0, tv=<optimized out>) at /build/tmp/_bazel_bazel/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/external/com_github_libevent_libevent/epoll.c:465
#2  0x0000000000cbbd7e in event_base_loop (base=0x3b642c0, flags=<optimized out>)
    at /build/tmp/_bazel_bazel/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/external/com_github_libevent_libevent/event.c:2019
#3  0x0000000000976185 in Envoy::Event::LibeventScheduler::run (this=this@entry=0x3b1b008, mode=mode@entry=Envoy::Event::Dispatcher::RunType::Block)
    at external/envoy/source/common/event/libevent_scheduler.cc:47
#4  0x000000000096479a in Envoy::Event::DispatcherImpl::run (this=0x3b1afc0, type=Envoy::Event::Dispatcher::RunType::Block) at external/envoy/source/common/event/dispatcher_impl.cc:178
#5  0x000000000091e324 in Envoy::Server::InstanceImpl::run (this=0x3beb800) at external/envoy/source/server/server.cc:491
#6  0x00000000004bd054 in Envoy::MainCommonBase::run (this=this@entry=0x3b3f6f0) at external/envoy/source/exe/main_common.cc:102
#7  0x0000000000422103 in Envoy::MainCommon::run (this=<optimized out>) at bazel-out/k8-opt/bin/external/envoy/source/exe/_virtual_includes/envoy_main_common_lib/exe/main_common.h:91
#8  main (argc=17, argv=0x7ffd5e72b998) at external/envoy/source/exe/main.cc:39
(gdb) 

@rgs1
Copy link
Member Author

rgs1 commented Jun 13, 2019

Ok, definitely related to our tls_context. I remove it, it starts without a problem. So might be something in our setup and not an upstream thing.

@rgs1
Copy link
Member Author

rgs1 commented Jun 13, 2019

So the issue gets triggered by running two instances of our sds sidecar [unclear why this is bad in itself, but reverting fixes it].

Leaving this open for now, since there might be room for improving how Envoy handles a misbehaving sds cluster/instance.

@rgs1
Copy link
Member Author

rgs1 commented Jun 14, 2019

Turns out, the problem was our sds service was returning incorrect secrets to Envoy, which Envoy would ignore and request again. Envoy would then block setting up clusters/listeners until it got a secret response it expected. Since this never happened, Envoy would block forever...

Sorry for the false alarm.

@rgs1 rgs1 closed this as completed Jun 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants