Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

configs which contain jaeger config will hang the router when updated #337

Merged
merged 2 commits into from
Jan 19, 2022

Conversation

garypen
Copy link
Contributor

@garypen garypen commented Jan 18, 2022

If a configuration is updated which contains jaeger configuration, then
at some point a synchronous call to:
opentelemetry::global::set_tracer_provider()
will arise. Even though this call is made from a synchronous function,
the root context is asynchronous and it seems to cause the runtime to
mis-behave: i.e.: hang...

If we make the call from a newly spawned thread (no performance concerns
here because re-configuration is not hot code) then the problem goes
away.
fixes: #331

If a configuration is updated which contains jaeger configuration, then
at some point a synchronous call to:
  opentelemetry::global::set_tracer_provider()
will arise. Even though this call is made from a synchronous function,
the root context is asynchronous and it seems to cause the runtime to
mis-behave: i.e.: hang...

If we make the call from a newly spawned thread (no performance concerns
here because re-configuration is not hot code) then the problem goes
away.
// async context. If we don't do this in a separate thread,
// it will cause issues with the async runtime that prevents
// the router from working correctly.
let _ = std::thread::spawn(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

joining on a thread from inside an async context might give the same result as waiting on a RwLock, since joining is blocking. But I don't see another way to do it (except maybe through channels with that thread)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not ideal, but, for whatever reason, it's certainly better behaved with this fix in than otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would tokio::spawn_blocking work better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the thread spawn looked cleaner. I need a way to wait on the spawn_blocking and the code looks a little bit messier (to my eyes).
e.g.:

            let jh = tokio::task::spawn_blocking(|| {
                opentelemetry::global::set_tracer_provider(provider);
            });
            futures::executor::block_on(jh).expect("block on set tracer");

I don't have strong feelings. I don't think either is more "correct" than the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I used spawn_blocking(), since that's what it's designed for.
resolved in: bef9051

@@ -62,7 +62,16 @@ pub(crate) fn try_initialize_subscriber(
let provider = builder.with_span_processor(batch).build();

let tracer = provider.tracer("opentelemetry-jaeger", Some(env!("CARGO_PKG_VERSION")));
let _ = opentelemetry::global::set_tracer_provider(provider);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh geez I should have seen that xD

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a tricky one. It's hard to know what the correct thing to do is in this scenario.

Comment on lines 66 to 70
// The call to set_tracer_provider() manipulate a sync RwLock.
// Even though this code is sync, it is called from within an
// async context. If we don't do this in a separate thread,
// it will cause issues with the async runtime that prevents
// the router from working correctly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧶 The comment should actually mention the ticket ID, a link and summarize what the issue is exactly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less of "prevents the router from working correctly"
More like "makes the async runtime hang for some reason"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved in: bef9051

Copy link
Contributor

@cecton cecton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm not entirely sure this issue is a apollo-router issue. It would be nice if you could ask the question to the people of opentelemetry and seek for advice. My reason is that their documentation explicitly says that this function is intended to be called multiple time. It's not like tracing's global subscriber which can be called only once otherwise it panics.
  2. Would it be cleaner to spawn a thread in main.rs that will loop on a synchronous channel for new providers? cc @Geal
    fn main() {
        let (tx, rx) = some_channel();
        let otlp_global_tracer_thread = std::thread::spawn(move || {
            while let Some(provider) = rx.next() {
                opentelemetry::global::set_tracer_provider(provider);
            }
        });
        
        // ...
        
        tx.close();
        otlp_global_tracer_thread.join();
        Ok(())
    }

Copy link
Contributor

@BrynCooke BrynCooke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, I guess if it comes up again we dig in further.

Copy link
Contributor

@cecton cecton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also approved but more as a temporary fix.

 - use spawn_blocking() in preference to spawning a thread
 - update the comment block to refer directly to issue #331
@garypen garypen requested a review from Geal January 19, 2022 14:27
@garypen garypen merged commit 6dc8a36 into main Jan 19, 2022
@garypen garypen deleted the garypen/331-fix branch January 19, 2022 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

configs which contain jaeger config will hang the router when updated
4 participants