-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
configs which contain jaeger config will hang the router when updated #337
Conversation
If a configuration is updated which contains jaeger configuration, then at some point a synchronous call to: opentelemetry::global::set_tracer_provider() will arise. Even though this call is made from a synchronous function, the root context is asynchronous and it seems to cause the runtime to mis-behave: i.e.: hang... If we make the call from a newly spawned thread (no performance concerns here because re-configuration is not hot code) then the problem goes away.
apollo-router/src/trace.rs
Outdated
// async context. If we don't do this in a separate thread, | ||
// it will cause issues with the async runtime that prevents | ||
// the router from working correctly. | ||
let _ = std::thread::spawn(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
joining on a thread from inside an async context might give the same result as waiting on a RwLock, since joining is blocking. But I don't see another way to do it (except maybe through channels with that thread)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not ideal, but, for whatever reason, it's certainly better behaved with this fix in than otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would tokio::spawn_blocking work better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the thread spawn looked cleaner. I need a way to wait on the spawn_blocking and the code looks a little bit messier (to my eyes).
e.g.:
let jh = tokio::task::spawn_blocking(|| {
opentelemetry::global::set_tracer_provider(provider);
});
futures::executor::block_on(jh).expect("block on set tracer");
I don't have strong feelings. I don't think either is more "correct" than the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I used spawn_blocking(), since that's what it's designed for.
resolved in: bef9051
@@ -62,7 +62,16 @@ pub(crate) fn try_initialize_subscriber( | |||
let provider = builder.with_span_processor(batch).build(); | |||
|
|||
let tracer = provider.tracer("opentelemetry-jaeger", Some(env!("CARGO_PKG_VERSION"))); | |||
let _ = opentelemetry::global::set_tracer_provider(provider); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh geez I should have seen that xD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's a tricky one. It's hard to know what the correct thing to do is in this scenario.
apollo-router/src/trace.rs
Outdated
// The call to set_tracer_provider() manipulate a sync RwLock. | ||
// Even though this code is sync, it is called from within an | ||
// async context. If we don't do this in a separate thread, | ||
// it will cause issues with the async runtime that prevents | ||
// the router from working correctly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧶 The comment should actually mention the ticket ID, a link and summarize what the issue is exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Less of "prevents the router from working correctly"
More like "makes the async runtime hang for some reason"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved in: bef9051
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm not entirely sure this issue is a apollo-router issue. It would be nice if you could ask the question to the people of opentelemetry and seek for advice. My reason is that their documentation explicitly says that this function is intended to be called multiple time. It's not like tracing's global subscriber which can be called only once otherwise it panics.
- Would it be cleaner to spawn a thread in main.rs that will loop on a synchronous channel for new providers? cc @Geal
fn main() { let (tx, rx) = some_channel(); let otlp_global_tracer_thread = std::thread::spawn(move || { while let Some(provider) = rx.next() { opentelemetry::global::set_tracer_provider(provider); } }); // ... tx.close(); otlp_global_tracer_thread.join(); Ok(()) }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, I guess if it comes up again we dig in further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also approved but more as a temporary fix.
- use spawn_blocking() in preference to spawning a thread - update the comment block to refer directly to issue #331
If a configuration is updated which contains jaeger configuration, then
at some point a synchronous call to:
opentelemetry::global::set_tracer_provider()
will arise. Even though this call is made from a synchronous function,
the root context is asynchronous and it seems to cause the runtime to
mis-behave: i.e.: hang...
If we make the call from a newly spawned thread (no performance concerns
here because re-configuration is not hot code) then the problem goes
away.
fixes: #331