Jaeger ingester panics on convertProcess function #3578
Comments
It's certainly possible to add a defensive check, but I would rather not do that because process=null is malformed span data that will cause problems later anyway, since all processing in Jaeger expects process.serviceName to be defined.
There are a few places that use the parent function (FromDomainEmbedProcess). Couldn't we just add a second return value of "error" and bubble it up to the function that is trying to write the span? The goal wouldn't be to pass the bad data along or even transform it. We just want to inform the logs that a bad span made it to the Ingester and have the original caller of FromDomainEmbedProcess skip the write.
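The suggestion above can be sketched roughly as follows. This is a minimal illustration, not Jaeger's actual code: the `Span`/`Process` types here are simplified stand-ins, and `fromDomainEmbedProcess` only mimics the shape of the real function.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the domain model; the real Jaeger types differ.
type Process struct{ ServiceName string }
type Span struct{ Process *Process }

var errNilProcess = errors.New("span has nil Process")

// fromDomainEmbedProcess sketches the suggested change: return an error
// instead of panicking when the span carries no process, so the caller
// can log the bad span and skip the write.
func fromDomainEmbedProcess(s *Span) (string, error) {
	if s.Process == nil {
		return "", errNilProcess
	}
	return s.Process.ServiceName, nil
}

func main() {
	// The caller logs and skips instead of crashing the ingester.
	if _, err := fromDomainEmbedProcess(&Span{}); err != nil {
		fmt.Println("skipping bad span:", err)
	}
}
```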
+1
Hi, thanks.
Please describe your setup - which SDK (lang, version) you're using to emit traces, and what the pipeline looks like (OTEL collector? Jaeger collector? Jaeger ingester? which protocol between the SDK and the backend?)
We are using Jaeger for Thanos tracing. The current structure is thanos -> jaeger agent -> jaeger collector -> AWS MSK -> jaeger ingester -> elasticsearch. When the ingester pod failed, the logs showed that the key and value were null.
What tracing SDK does Thanos code use?
Are you talking about the packages that Thanos is using? It uses the packages below.
|
No, we saw the same behavior a few days after adding tags everywhere. We decided to downgrade back to 1.20 and slowly upgrade over time to see when it was introduced. As of writing we haven't experienced the behavior again, although our collectors are experiencing a different kind of panic. It happens infrequently enough not to cause problems. I spoke to my team about this and I don't have the bandwidth to work on a solution for now. Hopefully someone from the community picks this up (at least until I find the time).
Can someone provide me with guidance?
|
I put up a PR to sanitize spans with an empty service name or null process. This should protect the pipeline from malformed spans submitted externally. It will not protect against malformed spans written directly to Kafka - we could do that too, but it changes the expectation that Kafka is a public API into the ingester, which would be a new guarantee and would require the ingester to run preprocessing logic similar to the collector's.
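The sanitization idea can be sketched like this. The types and the `missing-service-name` placeholder are assumptions for illustration, not the values the actual PR uses:

```go
package main

import "fmt"

// Simplified span model for illustration; the real Jaeger domain types differ.
type Process struct{ ServiceName string }
type Span struct{ Process *Process }

// Placeholder value; the name used by the actual fix is an assumption here.
const missingServiceName = "missing-service-name"

// sanitize replaces a nil Process or empty service name so that
// downstream code assuming process.ServiceName is set cannot panic.
func sanitize(s *Span) *Span {
	if s.Process == nil {
		s.Process = &Process{ServiceName: missingServiceName}
	} else if s.Process.ServiceName == "" {
		s.Process.ServiceName = missingServiceName
	}
	return s
}

func main() {
	fmt.Println(sanitize(&Span{}).Process.ServiceName)
}
```

Running such a sanitizer in the collector protects externally submitted spans, but, as noted above, not spans written straight into Kafka.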
I also experienced this issue. It occurred twice, and both times were during Kafka maintenance which caused the collector to buffer and unfortunately OOM. After the Kafka queue was cleared, both times, everything worked. No changes on the client library side.
@seanw75 thanks for the update. Did the Kafka maintenance cause data corruption in the queue? It seems that the collector OOM-ing should only result in lost data, not corrupted data in Kafka. Also, you're probably running with too high a setting for the collector's internal queue capacity; you may want to tune it so that the collector starts dropping spans prior to enqueueing them rather than causing an OOM.
@yurishkuro - Yeah, I had everything at the default from the operator for the collector regarding memory and internal queue - which obviously wasn't enough. Already took steps to avoid OOM. As for whether it was Kafka data corruption or the collector writing a corrupt record - hard to tell. What I can say is that the Kafka server has many other topics on it, and this was the only one that experienced any type of corruption from what I could see.
The processing in the ingester is really barebones (see jaeger/cmd/ingester/app/processor/span_processor.go, lines 61 to 68 at cdbdb92).
It feels weird to me to be adding data format validation at this point, especially if the root cause is data corruption, which can affect any other aspect of the span message, not just the null process that came up in the exception. |
The collector seems to be writing a null message to the Kafka topic, and that seems to cause the panic in the ingester.
|
The ingester, if it sees an invalid message, should just log it and skip.
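The "log and skip" behavior could look roughly like this. A minimal sketch with simplified stand-in types, not the actual span processor:

```go
package main

import (
	"errors"
	"log"
)

// Simplified stand-ins; the real ingester types differ.
type Process struct{ ServiceName string }
type Span struct{ Process *Process }

// processMessage validates the deserialized span before handing it to
// the storage writer, so a nil span or nil Process is logged and
// skipped instead of crashing the consumer loop.
func processMessage(s *Span, write func(*Span) error) error {
	if s == nil || s.Process == nil {
		log.Println("skipping invalid span: nil span or nil Process")
		return nil // skipped; not treated as a consumer error
	}
	return write(s)
}

func main() {
	// An invalid span is skipped; the writer is never called.
	_ = processMessage(&Span{}, func(*Span) error { return errors.New("unused") })
}
```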
We are rolling out tracing to our production environment and seeing more and more panics in the jaeger ingester (current backlog is ~25M messages). How can I help to fix this bug?
@HaroonSaid - the only solution I know about is to lower the TTL on your Kafka topic and have the corrupt Kafka messages removed until there is a fix for the ingester to do some validation. In my own experience, we only got this issue when we were doing maintenance on our Kafka servers, which then caused the collector to OOM. The segfault continued until we cleared out the Kafka queue, and then things stabilized. That being said, something that may help Yuri is:
|
At this point I don't know what is wrong with the messages that cause the panic. We already skip messages that cannot be deserialized. I have a fix for the null Process that you can try, but if it's random corruption, it won't help. Being isolated to the Process field does not sound random to me.
Is there a way for me to deserialize the base64 message that looks like it might be a problem? My current workaround is to change the offset (increase by 1), re-run in debug mode, find the next bad message, skip it, and rerun until all bad messages have been skipped.
Why is it base64? We could certainly add a util that would read a message from a file, try to parse it into the data model, and then re-serialize that data model into JSON for inspection.
I guess my question is: if I have an input message, is there any way to feed it into the ingester to find out what it's complaining about? Did any developer create a simple program for debugging purposes? If not, please provide some hints so that I can write it, call the appropriate library, find the offending message, and write a workaround.
I am not aware of such a single-message testing setup (sounds like a good idea), but you can always reproduce it by running Kafka locally and using the Kafka CLI to write a message to a topic and letting the ingester process it: `bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test`
would be good to have a docker-compose file for that ^ |
For me to run the ingester locally, I still need to create a bad message. The only bad message(s) that I have are from the debug trace. I still need a way to convert the debug trace message into a real message and stick it into Kafka to reproduce. Just want to make sure we are on the same page: no one has written a debug trace (JSON) message to protobuf message converter? I am sure I can write a program to do that and debug the issue.
I thought you said you had a base64-encoded message, which I assumed would be binary protobuf (otherwise why base64?). We don't have any standalone conversion utils.
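A starting point for such a util could be as simple as the following sketch. It only decodes the base64 capture back to raw bytes; unmarshaling those bytes with Jaeger's protobuf model and re-serializing to JSON would be the next step, omitted here to keep the sketch dependency-free.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeMessage decodes a base64-captured Kafka message back to raw bytes.
// A real debugging tool would then unmarshal these bytes with the Jaeger
// model (protobuf) and print them as JSON for inspection.
func decodeMessage(b64 string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(b64)
}

func main() {
	raw, err := decodeMessage("aGVsbG8=") // example input, decodes to "hello"
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: % x\n", len(raw), raw)
}
```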
Tracing was going great in our production environment; we were processing several billion spans a day, and suddenly we started seeing:
|
There was a fix for this specific NPE in 1.34.
We have deployed 1.35.2 to all of our environments and are seeing no issues so far. Thanks again for implementing the fix @yurishkuro!
@juan-ramirez-sp Glad to hear, but to be clear, the fix is just preventing the crash by replacing empty Process with |
Also, people still seem to be getting panics with 1.35 - #3765 |
I've confirmed we are getting the same error again.
|
We are having the same issue with the ElasticSearch backend setup after upgrading it from 7.17.2 (Jaeger 1.30.0 works great with this) to 7.17.5. It seems like either the process or process.Tags was null. Adding a safeguard there might solve this? Update: my naive attempt: locmai@32a8e31. Another approach could also be good. Instead of using:
We could:
The |
But in 1.35 & 1.36 it still occurs; I get the same error (#3829). I can ensure my spans in Kafka are good, but it's hard to tell which span causes the issue because of the lack of trace info. Can you please catch it and print it out?
We've deployed 1.37 to our clusters and confirmed it catches the bad message correctly. Thanks again for working on this! |
@juan-ramirez-sp Does catching mean that it also clears those bad messages from Kafka, or are those kept in the Kafka cluster? |
Describe the bug
The Jaeger ingester is panicking when taking data from kafka and attempting to write it into elasticsearch.
I believe the following code is the reason for the panic: jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go, line 123 at 4f55a70.
There seem to be no safeguards for a null process or null tags.
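The missing safeguard could look something like the sketch below. The types and the `unknown-service` placeholder are assumptions for illustration; the real dbmodel conversion code differs.

```go
package main

import "fmt"

// Simplified stand-ins; the real dbmodel types differ.
type KeyValue struct{ Key, Value string }
type Process struct {
	ServiceName string
	Tags        []KeyValue
}

// convertProcess handles a nil process or nil tag slice instead of
// dereferencing unconditionally, which is what causes the panic.
func convertProcess(p *Process) (string, []KeyValue) {
	if p == nil {
		// Placeholder service name; the actual value is an assumption.
		return "unknown-service", []KeyValue{}
	}
	tags := p.Tags
	if tags == nil {
		tags = []KeyValue{}
	}
	return p.ServiceName, tags
}

func main() {
	svc, tags := convertProcess(nil)
	fmt.Println(svc, len(tags))
}
```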
This is the resulting stack trace:
To Reproduce
We aren't sure how to reproduce this as the only component panicking is the ingester.
We also haven't seen what set of traces are causing this.
The best way to attempt to reproduce this could be to send a trace to the collector without process tags. We just haven't tried that yet.
Expected behavior
We expect the Jaeger ingester to gracefully handle any "bad" data and give us the configuration to either panic or discard the data.
Screenshots
N/A
Version (please complete the following information):
What troubleshooting steps did you try?
We tried downgrading the ingesters back to 1.20.0 and saw the same issue.
We recreated the Jaeger deployment and still saw the issue.
The only temporary fix was clearing the kafka queue, which implies bad data is making it to the queue.
Additional context
We are going to add collector tags to everything which should force process tags if I am reading the doc correctly.
https://www.jaegertracing.io/docs/1.32/cli/#jaeger-collector-kafka