Jaeger ingester panics on convertProcess function #3578

Closed
juan-ramirez-sp opened this issue Mar 10, 2022 · 38 comments · Fixed by #3819

@juan-ramirez-sp

juan-ramirez-sp commented Mar 10, 2022

Describe the bug
The Jaeger ingester is panicking when taking data from Kafka and attempting to write it into Elasticsearch.

I believe the following code is the reason for the panic.

tags, tagsMap := fd.convertKeyValuesString(process.Tags)

There seem to be no safeguards for a nil process or nil tags.
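For illustration, here is a self-contained sketch of the failure mode and the kind of guard in question (stand-in types, not the actual dbmodel code):

package main

import "fmt"

// Process stands in for model.Process; in a span it is carried as a pointer.
type Process struct {
	ServiceName string
	Tags        []string
}

// tagsUnsafe mirrors the failing pattern: it reads process.Tags without a nil check.
func tagsUnsafe(process *Process) []string {
	return process.Tags // nil pointer dereference if process == nil
}

// tagsSafe is the kind of safeguard being asked about.
func tagsSafe(process *Process) []string {
	if process == nil {
		return nil
	}
	return process.Tags
}

func main() {
	var p *Process                  // a span whose Process was never populated
	fmt.Println(len(tagsSafe(p)))   // 0, no panic
	fmt.Println(len(tagsUnsafe(p))) // panics: invalid memory address or nil pointer dereference
}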

This is the stack trace:

{"level":"info","ts":1646779105.5014815,"caller":"consumer/processor_factory.go:65","msg":"Creating new processors","partition":31}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xd94dc6]

goroutine 84516 [running]:
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:123
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertSpanEmbedProcess({0x0, 0xc0003c48a0, {0x1499618, 0xf23872}}, 0xc00c544690)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:64 +0x126
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.FromDomainEmbedProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:43
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore.(*SpanWriter).WriteSpan(0xc0005f8cc0, {0x0, 0x0}, 0xc00c544690)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/writer.go:152 +0x7a
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.KafkaSpanProcessor.Process({{0x14ad480, 0x1dee2a8}, {0x14ad340, 0xc0005f8cc0}, {0x0, 0x0}}, {0x14af760, 0xc0111950e0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/span_processor.go:67 +0xd3
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator.(*retryDecorator).Process(0xc00008f180, {0x14af760, 0xc0111950e0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator/retry.go:110 +0x37
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*comittingProcessor).Process(0xc0085cfaa0, {0x14af760, 0xc0111950e0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/committing_processor.go:44 +0x5e
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*metricsDecorator).Process(0xc00f38fa00, {0x14af760, 0xc0111950e0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/metrics_decorator.go:44 +0x5b
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start.func1()
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:57 +0x42
created by github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:53 +0x10c

To Reproduce

We aren't sure how to reproduce this, as the only component panicking is the ingester.

We also haven't seen what set of traces are causing this.

The best way to attempt to reproduce this could be to send a trace to the collector without process tags. We just haven't tried that yet.

Expected behavior
We expect the Jaeger ingester to gracefully handle any "bad" data and give us a configuration option to either panic or discard the data.

Screenshots
N/A

Version (please complete the following information):

  • OS: Amazon Linux 2 (5.4.176-91.338.amzn2.x86_64)
  • Jaeger version: 1.31.0 (we were on 1.20.0 and didn't see this issue)
  • Elastic version: 7.16.2
  • Kafka version: quay.io/strimzi/kafka:0.25.0-kafka-2.8.0
  • K8s version: 1.19 (we are also seeing this on 1.21)
  • Deployment: Kubernetes via Operator

What troubleshooting steps did you try?

We tried downgrading the ingesters back to 1.20.0 and saw the same issue.

We recreated the Jaeger deployment and still saw the issue.

The only temporary fix was clearing the Kafka queue, which implies bad data is making it to the queue.

Additional context
We are going to add collector tags to everything, which should force process tags, if I am reading the doc correctly.

https://www.jaegertracing.io/docs/1.32/cli/#jaeger-collector-kafka

--collector.tags  
One or more tags to be added to the Process tags of all spans passing through this collector. Ex: key1=value1,key2=${envVar:defaultValue}
@yurishkuro
Member

It's certainly possible to add a defensive check, but I would rather not do that, because process=null is malformed span data that will cause problems later anyway, since all processing in Jaeger expects process.serviceName to be defined.

@juan-ramirez-sp
Author

There are a few places that use the parent function (FromDomainEmbedProcess).

Couldn't we just add a second return value of error and bubble it up to the function that is trying to write the span?

The goal wouldn't be to pass the bad data through or even transform it. We just want to inform the logs that a bad span made it to the ingester and have the original caller of FromDomainEmbedProcess skip the write.
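In rough sketch form, the proposal would look something like this (simplified stand-in types, not the actual Jaeger signatures):

package main

import (
	"errors"
	"fmt"
	"log"
)

type Process struct{ ServiceName string }
type Span struct{ Process *Process }

var errNilProcess = errors.New("span has nil Process")

// convertSpan stands in for FromDomainEmbedProcess with the proposed second
// return value: instead of panicking on a nil Process, it reports an error.
func convertSpan(span *Span) (string, error) {
	if span.Process == nil {
		return "", errNilProcess
	}
	return span.Process.ServiceName, nil
}

// writeSpan stands in for the caller that writes the span: it logs and skips.
func writeSpan(span *Span) {
	svc, err := convertSpan(span)
	if err != nil {
		log.Printf("skipping malformed span: %v", err)
		return
	}
	fmt.Println("writing span for service", svc)
}

func main() {
	writeSpan(&Span{Process: &Process{ServiceName: "frontend"}})
	writeSpan(&Span{}) // logged and skipped instead of panicking the ingester
}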

@yurishkuro
Member

+1

@JungBin-Eom

Hi,
I'm having the exact same problem.
Did you solve it with the --collector.tags you mentioned?
I'm not sure which tag to add.

Thanks.

@yurishkuro
Member

Please describe your setup - which SDK (lang, version) you're using to emit traces, and what the pipeline looks like (OTEL collector? Jaeger collector? Jaeger ingester? which protocol between the SDK and the backend?)

@JungBin-Eom

We are using Jaeger for Thanos tracing. The current structure is Thanos -> Jaeger agent -> Jaeger collector -> AWS MSK -> Jaeger ingester -> Elasticsearch.

When the ingester pod failed, the log showed that the key and value were null and that panic: runtime error: invalid memory address or nil pointer dereference occurred.

@yurishkuro
Member

What tracing SDK does Thanos code use?

@JungBin-Eom

Are you talking about the package that Thanos is using?

It uses the packages below.

  • github.com/opentracing/basictracer-go
  • github.com/opentracing/opentracing-go
  • github.com/uber/jaeger-client-go
  • github.com/uber/jaeger-lib

@juan-ramirez-sp
Author

> Hi, I'm having the exact same problem. Did you solve it with the --collector.tags you mentioned? I'm not sure which tag to add.
>
> Thanks.

No, we saw the same behavior a few days after adding tags everywhere.

We decided to downgrade back to 1.20 and slowly upgrade over time to see when it was introduced.

As of writing, we haven't experienced the behavior again, although our collectors are experiencing a different kind of panic. It happens infrequently enough not to cause problems.

I spoke to my team about this and I don't have the bandwidth to work on a solution to this for now.

Hopefully someone from the community picks this up (at least until I find the time).

@HaroonSaid

Can someone provide me with guidance on how to:

  • take a debug base64 message and run it through a debug flow?

@yurishkuro
Member

I put up a PR to sanitize spans with an empty service name or null process. This should protect the pipeline from malformed spans submitted externally. It will not protect against malformed spans written directly to Kafka. We could do that too, but it changes the expectation that Kafka is a public API into the ingester, which would be a new guarantee and would require the ingester to run preprocessing logic similar to the collector's.

@seanw75

seanw75 commented Apr 15, 2022

I also experienced this issue. It occurred twice, and both times were during Kafka maintenance, which caused the collector to buffer and unfortunately OOM. After the Kafka queue was cleared, both times, everything worked. No changes on the client library side.

@yurishkuro
Member

@seanw75 thanks for the update. Did the Kafka maintenance cause data corruption in the queue? It seems that the collector OOM-ing should only result in lost data, not corrupted data in Kafka. Also, you're probably running with too high a setting for the collector's internal queue capacity; you may want to tune it so that the collector starts dropping spans prior to enqueueing them rather than causing OOM.

@seanw75

seanw75 commented Apr 15, 2022

@yurishkuro - Yeah, I had everything at the default from the operator for the collector regarding memory and the internal queue, which obviously wasn't enough. I've already taken steps to avoid OOM. As for whether it was Kafka data corruption or the collector writing a corrupt record: hard to tell. What I can say is that the Kafka server has many other topics on it, and this was the only one that experienced any type of corruption from what I could see.

@yurishkuro
Member

The processing in the ingester is really barebones:

func (s KafkaSpanProcessor) Process(message Message) error {
	span, err := s.unmarshaller.Unmarshal(message.Value())
	if err != nil {
		return fmt.Errorf("cannot unmarshall byte array into span: %w", err)
	}
	// TODO context should be propagated from upstream components
	return s.writer.WriteSpan(context.TODO(), span)
}

It feels weird to me to be adding data format validation at this point, especially if the root cause is data corruption, which can affect any other aspect of the span message, not just the null process that came up in the exception.

@HaroonSaid

The collector seems to be writing null messages to the Kafka topic, which seems to cause the panic in the ingester:

{"level":"debug","ts":1650237044.222346,"caller":"consumer/consumer.go:138","msg":"Got msg","msg":{"Headers":null,"Timestamp":"0001-01-01T00:00:00Z","BlockTimestamp":"0001-01-01T00:00:00Z","Key":null,"Value":null,"Topic":"tools_tools_us_prod_jaeger_spans","Partition":2,"Offset":39319}}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x198f046]

goroutine 148 [running]:
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:123
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertSpanEmbedProcess({0x40, 0xc000518d80, {0x2094088, 0x1b1eaf2}}, 0xc0003b85a0)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:64 +0x126
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.FromDomainEmbedProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:43
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore.(*SpanWriter).WriteSpan(0xc0005205a0, {0x0, 0x0}, 0xc0003b85a0)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/writer.go:152 +0x7a
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.KafkaSpanProcessor.Process({{0x20a8120, 0x29ebfc0}, {0x20a7fe0, 0xc0005205a0}, {0x0, 0x0}}, {0x20aa520, 0xc00028caa0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/span_processor.go:67 +0xd3
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator.(*retryDecorator).Process(0xc00019c230, {0x20aa520, 0xc00028caa0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator/retry.go:110 +0x37
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*comittingProcessor).Process(0xc0005d1bc0, {0x20aa520, 0xc00028caa0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/committing_processor.go:44 +0x5e
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*metricsDecorator).Process(0xc00041c340, {0x20aa520, 0xc00028caa0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/metrics_decorator.go:44 +0x5b
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start.func1()
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:57 +0x42
created by github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:53 +0x10c

@HaroonSaid

> The processing in the ingester is really barebones:
>
> func (s KafkaSpanProcessor) Process(message Message) error {
> 	span, err := s.unmarshaller.Unmarshal(message.Value())
> 	if err != nil {
> 		return fmt.Errorf("cannot unmarshall byte array into span: %w", err)
> 	}
> 	// TODO context should be propagated from upstream components
> 	return s.writer.WriteSpan(context.TODO(), span)
> }
>
> It feels weird to me to be adding data format validation at this point, especially if the root cause is data corruption, which can affect any other aspect of the span message, not just the null process that came up in the exception.

The ingester, if seeing an invalid message, should just log and skip.
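In sketch form, that log-and-skip check would sit between Unmarshal and WriteSpan in the Process method quoted above; a simplified, self-contained illustration (stand-in types, not the actual ingester code):

package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for model.Process / model.Span.
type Process struct{ ServiceName string }
type Span struct{ Process *Process }

// validateSpan is the check that would run after unmarshalling and before
// WriteSpan, so a malformed message can be logged and skipped instead of
// panicking deep inside the Elasticsearch writer.
func validateSpan(span *Span) error {
	if span == nil || span.Process == nil || span.Process.ServiceName == "" {
		return errors.New("malformed span: missing Process or ServiceName")
	}
	return nil
}

func main() {
	fmt.Println(validateSpan(&Span{Process: &Process{ServiceName: "frontend"}})) // <nil>
	fmt.Println(validateSpan(&Span{}))                                           // malformed span: missing Process or ServiceName
}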

@HaroonSaid

We are rolling out tracing to our production environment and seeing more and more panics in the Jaeger ingester (current backlog is ~25M messages).

How can I help to fix this bug?

@seanw75

seanw75 commented Apr 21, 2022

@HaroonSaid - the only solution I know of is to lower the TTL on your Kafka topic and have the corrupt Kafka messages removed, until there is a fix to the ingester to do some validation. In my own experience, we only got this issue when we were doing maintenance on our Kafka servers, which then caused the collector to OOM. The segfault continued until we cleared out the Kafka queue, and then things stabilized.

That being said, something that may help Yuri is:

  • Are your kafka servers stable?
  • Is the connection between the collector and kafka stable?
  • Are your collectors stable (i.e. not OOMing)?
  • What version of collector/ingester are you using?

@yurishkuro
Member

At this point I don't know what is wrong with the messages that cause the panic. We already skip messages that cannot be deserialized. I have a fix for the null Process that you can try, but if it's random corruption, it won't help. Being isolated to the Process field does not sound random to me.

@HaroonSaid

Is there a way for me to deserialize the base64 message that looks like it might be a problem? I typically see a null message, followed by another message in the debug trace.

My current workaround is to change the offset (increase by 1), re-run in debug mode, find the next bad message, skip it, and rerun until all bad messages have been skipped.

@yurishkuro
Member

Why is it base64?

We could certainly add a util that would read a message from a file, try to parse it into the data model, and then re-serialize that data model into JSON for inspection.
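A minimal sketch of what such a util could look like, assuming the topic carries the binary protobuf encoding of model.Span and using the gogo proto/jsonpb packages (this is not an existing Jaeger tool; it expects a file containing the base64 value printed by the ingester's debug log):

package main

import (
	"encoding/base64"
	"fmt"
	"os"
	"strings"

	"github.com/gogo/protobuf/jsonpb"
	"github.com/gogo/protobuf/proto"
	"github.com/jaegertracing/jaeger/model"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: span2json <file-with-base64-value>")
		os.Exit(1)
	}
	// Read the base64-encoded Kafka message value from the given file.
	raw, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	payload, err := base64.StdEncoding.DecodeString(strings.TrimSpace(string(raw)))
	if err != nil {
		panic(err)
	}
	// Try to parse it into the data model...
	span := &model.Span{}
	if err := proto.Unmarshal(payload, span); err != nil {
		panic(fmt.Errorf("cannot unmarshal bytes into span: %w", err))
	}
	// ...and re-serialize it as JSON for inspection.
	m := jsonpb.Marshaler{Indent: "  "}
	if err := m.Marshal(os.Stdout, span); err != nil {
		panic(err)
	}
}

Run against a saved value, this would either print the span as indented JSON or fail with the unmarshal error, which at least narrows down whether the payload is a valid protobuf span.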

@HaroonSaid

> Why is it base64?
>
> We could certainly add a util that would read a message from a file, try to parse it into the data model, and then re-serialize that data model into JSON for inspection.

When jaeger-ingester is run in debug mode (--log-level debug), it prints the incoming message. The message value and key are encoded in base64. I don't have a program to convert the base64 into a span protobuf message.

I guess my question is: if I have an input message, is there any way to feed it into the ingester to find out what it's complaining about? Did any developer create a simple program for debugging purposes?

If not, please provide some hints so that I can write it, call the appropriate library, find the offending message, and write a workaround.

@yurishkuro
Member

I am not aware of such a single-message testing setup (sounds like a good idea), but you can always reproduce it by running Kafka locally and using the Kafka CLI to write a message to a topic, letting the ingester process it:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test 

@yurishkuro
Member

would be good to have a docker-compose file for that ^

@HaroonSaid

HaroonSaid commented Apr 22, 2022

> I am not aware of such a single-message testing setup (sounds like a good idea), but you can always reproduce it by running Kafka locally and using the Kafka CLI to write a message to a topic, letting the ingester process it:
>
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

For me to run the ingester locally, I still need to create a bad message. The only bad message(s) that I have are in the debug trace. I still need a way to convert the debug trace message into a real message and put it into Kafka to reproduce the issue.

Just want to make sure we are on the same page: no one has written a converter from a debug trace (JSON) message to a protobuf message?

I am sure I can write a program to do that and debug the issue.

@yurishkuro
Member

I thought you said you have a base64-encoded message, which I assumed would be binary protobuf (otherwise, why base64).

We don't have any standalone conversion utils.

@HaroonSaid

HaroonSaid commented May 19, 2022

Tracing was going great in our production environment; we were processing several billion spans a day, and suddenly we started seeing panics again.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xd93886]
goroutine 95 [running]:
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertProcess(...)
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:123
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertSpanEmbedProcess({0xc0, 0xc0000c4060, {0x1499478, 0xf23312}}, 0xc0003c8780)
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:64 +0x126
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.FromDomainEmbedProcess(...)
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:43
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore.(*SpanWriter).WriteSpan(0xc000288ea0, {0x0, 0x0}, 0xc0003c8780)
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/writer.go:152 +0x7a
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.KafkaSpanProcessor.Process({{0x14ad320, 0x1def2a8}, {0x14ad1e0, 0xc0004cea20}, {0x0, 0x0}}, {0x14af600, 0xc0000ce320})
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.KafkaSpanProcessor.Process({{0x14ad320, 0x1def2a8}, {0x14ad1e0, 0xc000288ea0}, {0x0, 0x0}}, {0x14af600, 0xc000130fa0})
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/span_processor.go:67 +0xd3
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/span_processor.go:67 +0xd3
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator.(*retryDecorator).Process(0xc0001daa80, {0x14af600, 0xc0000ce320})
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator.(*retryDecorator).Process(0xc00017f2d0, {0x14af600, 0xc000130fa0})
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator/retry.go:110 +0x37
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator/retry.go:110 +0x37
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*comittingProcessor).Process(0xc000358360, {0x14af600, 0xc0000ce320})
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*comittingProcessor).Process(0xc0002960f0, {0x14af600, 0xc000130fa0})
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/committing_processor.go:44 +0x5e
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/committing_processor.go:44 +0x5e
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*metricsDecorator).Process(0xc000452780, {0x14af600, 0xc0000ce320})
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*metricsDecorator).Process(0xc00008d040, {0x14af600, 0xc000130fa0})
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/metrics_decorator.go:44 +0x5b
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/metrics_decorator.go:44 +0x5b
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start.func1()
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start.func1()
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:57 +0x42
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:57 +0x42
created by github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start
created by github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:53 +0x10c
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:53 +0x10c
{"level":"info","ts":1652926531.4358284,"caller":"consumer/consumer.go:167","msg":"Starting error handler","partition":11}
{"level":"info","ts":1652926531.4358554,"caller":"consumer/consumer.go:110","msg":"Starting message handler","partition":11}
{"level":"info","ts":1652926531.4390614,"caller":"consumer/consumer.go:167","msg":"Starting error handler","partition":12}
{"level":"info","ts":1652926531.43911,"caller":"consumer/consumer.go:110","msg":"Starting message handler","partition":12}
{"level":"info","ts":1652926531.4402316,"caller":"consumer/consumer.go:167","msg":"Starting error handler","partition":10}
{"level":"info","ts":1652926531.4402878,"caller":"consumer/consumer.go:110","msg":"Starting message handler","partition":10}
{"level":"info","ts":1652926531.449849,"caller":"consumer/processor_factory.go:65","msg":"Creating new processors","partition":10}
{"level":"info","ts":1652926531.4554677,"caller":"consumer/processor_factory.go:65","msg":"Creating new processors","partition":11}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xd93886]

@yurishkuro
Member

there was a fix for this specific NPE in 1.34

@juan-ramirez-sp
Author

We have deployed 1.35.2 to all of our environments and are seeing no issues so far.

Thanks again for implementing the fix @yurishkuro!

@yurishkuro
Member

@juan-ramirez-sp Glad to hear it, but to be clear, the fix just prevents the crash by replacing an empty Process with Process{ServiceName: "unknown"}, so that the spans can still be saved and inspected later for their origin. We have not found the root cause of this issue.
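In sketch form, the described replacement amounts to the following (simplified stand-in types, not the actual sanitizer code):

package main

import "fmt"

type Process struct{ ServiceName string }
type Span struct{ Process *Process }

// sanitize mirrors the behavior described above: a missing Process or empty
// service name is filled in so the storage layer never dereferences nil.
func sanitize(span *Span) *Span {
	if span.Process == nil {
		span.Process = &Process{}
	}
	if span.Process.ServiceName == "" {
		span.Process.ServiceName = "unknown"
	}
	return span
}

func main() {
	fmt.Println(sanitize(&Span{}).Process.ServiceName) // unknown
}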

@yurishkuro
Member

Also, people still seem to be getting panics with 1.35 - #3765

@juan-ramirez-sp
Author

I've confirmed we are getting the same error again.
All components are running 1.35.2, and we performed a direct upgrade.

github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:123
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertSpanEmbedProcess({0x0?, 0xc000222ae0?, {0x15a7e88?, 0x1?}}, 0xc0000faff0)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:64 +0x126
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.FromDomainEmbedProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:43
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore.(*SpanWriter).WriteSpan(0xc0001d3d40, {0x0?, 0x0?}, 0xc0000faff0)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/writer.go:152 +0x7a
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.KafkaSpanProcessor.Process({{0x15ac040, 0x1e84f90}, {0x15abec0, 0xc0001d3d40}, {0x0, 0x0}}, {0x15aeac0?, 0xc0003f8d20?})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/span_processor.go:67 +0xd3
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator.(*retryDecorator).Process(0xc000089b20, {0x15aeac0, 0xc0003f8d20})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator/retry.go:110 +0x37
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*comittingProcessor).Process(0xc000375410, {0x15aeac0, 0xc0003f8d20})
	github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/committing_processor.go:44 +0x5e
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*metricsDecorator).Process(0xc000128180, {0x15aeac0, 0xc0003f8d20})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/metrics_decorator.go:44 +0x5b
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start.func1()
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:57 +0x42
created by github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:53 +0xf5

@locmai
Contributor

locmai commented Jul 20, 2022

We are having the same issue with an Elasticsearch backend after upgrading it from 7.17.2 (Jaeger 1.30.0 works great with this) to 7.17.5.

It seems like either process or process.Tags was null. Adding a safeguard there might solve this?

Update: my naive attempt is locmai@32a8e31.

Another approach could also work:

Instead of using:

tags, tagsMap := fd.convertKeyValuesString(process.Tags)

We could:

tags, tagsMap := fd.convertKeyValuesString(process.GetTags())

process.GetTags() has a nil check that we could reuse.
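The difference is that the generated getters are nil-receiver safe; here is a standalone sketch of the same pattern (stand-in types, not the generated model code):

package main

import "fmt"

// KeyValue and Process stand in for the generated model types.
type KeyValue struct{ Key, Value string }

type Process struct {
	Tags []KeyValue
}

// GetTags mirrors the generated getter: it can be called on a nil receiver.
func (p *Process) GetTags() []KeyValue {
	if p == nil {
		return nil
	}
	return p.Tags
}

func main() {
	var p *Process                // span.Process was never set
	fmt.Println(len(p.GetTags())) // 0 (no panic)
	// Accessing p.Tags directly here would panic with a nil pointer dereference.
}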

@huahuayu

huahuayu commented Jul 26, 2022

> there was a fix for this specific NPE in 1.34

But in 1.35 & 1.36 it still occurs; I get the same error: #3829

I can ensure my spans in Kafka are good, but it's hard to tell which span causes the issue because of the lack of trace info. Can you please catch and print it out?

@juan-ramirez-sp
Author

We've deployed 1.37 to our clusters and confirmed it catches the bad message correctly.

Thanks again for working on this!

@BlackDex

@juan-ramirez-sp Does catching mean that it also clears those bad messages from Kafka, or are those kept in the Kafka cluster?

@juan-ramirez-sp
Author

@BlackDex The linked PR could answer it better than I can, #3819

I didn't notice multiple error messages, so I assume it did clear them from Kafka.
