Jaeger ingester panics on convertProcess function #3578
Comments
It's certainly possible to add a defensive check, but I would rather not do that because process=null is malformed span data that will cause problems later anyway, since all processing in Jaeger expects process.serviceName to be defined.
There are a few places that use the parent function (FromDomainEmbedProcess). Couldn't we just add a second return value of "error" and bubble it up to the function that is trying to write the span? The goal wouldn't be to pass the bad data along or even transform it. We just want to inform the logs that a bad span made it to the Ingester and have the original caller of FromDomainEmbedProcess skip the write.
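The suggestion above can be sketched roughly as follows. This is a minimal illustration, not Jaeger's actual code: the `Span`/`Process` types here are simplified stand-ins, and `fromDomainEmbedProcess` only mimics the shape of the real function.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the domain model; the real Jaeger types differ.
type Process struct{ ServiceName string }
type Span struct{ Process *Process }

var errNilProcess = errors.New("span has nil Process")

// fromDomainEmbedProcess sketches the suggested change: return an error
// instead of panicking when the span carries no process, so the caller
// can log the bad span and skip the write.
func fromDomainEmbedProcess(s *Span) (string, error) {
	if s.Process == nil {
		return "", errNilProcess
	}
	return s.Process.ServiceName, nil
}

func main() {
	// The caller logs and skips instead of crashing the ingester.
	if _, err := fromDomainEmbedProcess(&Span{}); err != nil {
		fmt.Println("skipping bad span:", err)
	}
}
```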
+1
Hi, thanks.
Please describe your setup - which SDK (lang, version) you're using to emit traces, and what the pipeline looks like (OTEL collector? Jaeger collector? Jaeger ingester? which protocol between the SDK and the backend?)
We are using Jaeger for Thanos tracing. The current structure is thanos -> jaeger agent -> jaeger collector -> AWS MSK -> jaeger ingester -> elasticsearch. When the ingester pod failed, the logs showed that the key and value were null.
What tracing SDK does Thanos code use?
Are you talking about the packages that Thanos is using? It uses the packages below.
|
No, we saw the same behavior a few days after adding tags everywhere. We decided to downgrade back to 1.20 and slowly upgrade over time to see when it was introduced. As of writing we haven't experienced the behavior again, although our collectors are experiencing a different kind of panic. It happens infrequently enough not to cause problems. I spoke to my team about this and I don't have the bandwidth to work on a solution for now. Hopefully someone from the community picks this up (at least until I find the time).
Can someone provide me with guidance?
|
I put up a PR to sanitize spans with an empty service name or null process. This should protect the pipeline from malformed spans submitted externally. It will not protect against malformed spans written directly to Kafka - we could do that too, but it changes the expectation that Kafka is a public API into the ingester, which would be a new guarantee and would require the ingester to run preprocessing logic similar to the collector's.
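The sanitization idea can be sketched like this. The types and the `missing-service-name` placeholder are assumptions for illustration, not the values the actual PR uses:

```go
package main

import "fmt"

// Simplified span model for illustration; the real Jaeger domain types differ.
type Process struct{ ServiceName string }
type Span struct{ Process *Process }

// Placeholder value; the name used by the actual fix is an assumption here.
const missingServiceName = "missing-service-name"

// sanitize replaces a nil Process or empty service name so that
// downstream code assuming process.ServiceName is set cannot panic.
func sanitize(s *Span) *Span {
	if s.Process == nil {
		s.Process = &Process{ServiceName: missingServiceName}
	} else if s.Process.ServiceName == "" {
		s.Process.ServiceName = missingServiceName
	}
	return s
}

func main() {
	fmt.Println(sanitize(&Span{}).Process.ServiceName)
}
```

Running such a sanitizer in the collector protects externally submitted spans, but, as noted above, not spans written straight into Kafka.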
I also experienced this issue. It occurred twice, and both times were during Kafka maintenance which caused the collector to buffer and unfortunately OOM. After the Kafka queue was cleared, both times, everything worked. No changes on the client library side.
@seanw75 thanks for the update. Did the Kafka maintenance cause data corruption in the queue? It seems that the collector OOM-ing should only result in lost data, not corrupted data in Kafka. Also, you're probably running with too high a setting for the collector's internal queue capacity; you may want to tune it so that the collector starts dropping spans prior to enqueueing them rather than causing an OOM.
@yurishkuro - Yeah, I had everything at the default from the operator for the collector regarding memory and internal queue - which obviously wasn't enough. Already took steps to avoid OOM. As for whether it was Kafka data corruption or the collector writing a corrupt record - hard to tell. What I can say is that the Kafka server has many other topics on it, and this was the only one that experienced any type of corruption from what I could see.
The processing in the ingester is really barebones (see jaeger/cmd/ingester/app/processor/span_processor.go, lines 61 to 68 at cdbdb92).
It feels weird to me to be adding data format validation at this point, especially if the root cause is data corruption, which can affect any other aspect of the span message, not just the null process that came up in the exception. |
The collector seems to be writing a null message to the Kafka topic, and that seems to cause the panic in the ingester.
|
The ingester, if it sees an invalid message, should just log it and skip.
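The "log and skip" behavior could look roughly like this. A minimal sketch with simplified stand-in types, not the actual span processor:

```go
package main

import (
	"errors"
	"log"
)

// Simplified stand-ins; the real ingester types differ.
type Process struct{ ServiceName string }
type Span struct{ Process *Process }

// processMessage validates the deserialized span before handing it to
// the storage writer, so a nil span or nil Process is logged and
// skipped instead of crashing the consumer loop.
func processMessage(s *Span, write func(*Span) error) error {
	if s == nil || s.Process == nil {
		log.Println("skipping invalid span: nil span or nil Process")
		return nil // skipped; not treated as a consumer error
	}
	return write(s)
}

func main() {
	// An invalid span is skipped; the writer is never called.
	_ = processMessage(&Span{}, func(*Span) error { return errors.New("unused") })
}
```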
We are rolling out tracing to our production environment and seeing more and more panics in the jaeger ingester (current backlog is ~25M messages). How can I help to fix this bug?
@HaroonSaid - the only solution I know about is to lower the TTL on your Kafka topic and have the corrupt Kafka messages removed until there is a fix for the ingester to do some validation. In my own experience, we only got this issue when we were doing maintenance on our Kafka servers, which then caused the collector to OOM. The segfault continued until we cleared out the Kafka queue, and then things stabilized. That being said, something that may help Yuri is:
|
At this point I don't know what is wrong with the messages that cause the panic. We already skip messages that cannot be deserialized. I have a fix for the null Process that you can try, but if it's random corruption, it won't help. Being isolated to the Process field does not sound random to me.
Is there a way for me to deserialize the base64 message that looks like it might be a problem? My current workaround is to change the offset (increase by 1), re-run in debug mode, find the next bad message, skip it, and rerun until all bad messages have been skipped.
Why is it base64? We could certainly add a util that would read a message from a file, try to parse it into the data model, and then re-serialize that data model into JSON for inspection.
I guess my question is: if I have an input message, is there any way to feed it into the ingester to find out what it's complaining about? Did any developer create a simple program for debugging purposes? If not, please provide some hints so that I can write it, call the appropriate library, find the offending message, and write a workaround.
I am not aware of such a single-message testing setup (sounds like a good idea), but you can always reproduce it by running Kafka locally and using the Kafka CLI to write a message to a topic and letting the ingester process it: `bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test`
would be good to have a docker-compose file for that ^ |
For me to run the ingester locally, I still need to create a bad message. The only bad message(s) that I have are from the debug trace. I still need a way to convert the debug trace message into a real message and stick it into Kafka to reproduce. Just want to make sure we are on the same page: no one has written a debug trace (JSON) message to protobuf message converter? I am sure I can write a program to do that and debug the issue.
I thought you said you had a base64-encoded message, which I assumed would be binary protobuf (otherwise why base64?). We don't have any standalone conversion utils.
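A starting point for such a util could be as simple as the following sketch. It only decodes the base64 capture back to raw bytes; unmarshaling those bytes with Jaeger's protobuf model and re-serializing to JSON would be the next step, omitted here to keep the sketch dependency-free.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeMessage decodes a base64-captured Kafka message back to raw bytes.
// A real debugging tool would then unmarshal these bytes with the Jaeger
// model (protobuf) and print them as JSON for inspection.
func decodeMessage(b64 string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(b64)
}

func main() {
	raw, err := decodeMessage("aGVsbG8=") // example input, decodes to "hello"
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: % x\n", len(raw), raw)
}
```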
Tracing was going great in our production environment; we were processing several billion spans a day, and suddenly we started seeing:
|
There was a fix for this specific NPE in 1.34.
We have deployed 1.35.2 to all of our environments and are seeing no issues so far. Thanks again for implementing the fix @yurishkuro!
@juan-ramirez-sp Glad to hear, but to be clear, the fix is just preventing the crash by replacing empty Process with |
Also, people still seem to be getting panics with 1.35 - #3765 |
I've confirmed we are getting the same error again.
|
We are having the same issue with the ElasticSearch backend setup after upgrading it from 7.17.2 (Jaeger 1.30.0 works great with this) to 7.17.5. It seems like either the process or process.Tags was null. Adding a safeguard there might solve this? Update: my naive attempt: locmai@32a8e31. Another approach could also be good. Instead of using:
We could:
The |
But in 1.35 & 1.36 it still occurs; I get the same error (#3829). I can ensure my spans in Kafka are good, but it's hard to tell which span causes the issue because of the lack of trace info. Can you please catch it and print it out?
We've deployed 1.37 to our clusters and confirmed it catches the bad message correctly. Thanks again for working on this! |
@juan-ramirez-sp Does catching mean that it also clears those bad messages from Kafka, or are those kept in the Kafka cluster? |
Describe the bug
The Jaeger ingester is panicking when taking data from kafka and attempting to write it into elasticsearch.
I believe the following code is the reason for the panic: jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go, line 123 at 4f55a70.
There seem to be no safeguards for a null process or null tags.
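The missing safeguard could look something like the sketch below. The types and the `unknown-service` placeholder are assumptions for illustration; the real dbmodel conversion code differs.

```go
package main

import "fmt"

// Simplified stand-ins; the real dbmodel types differ.
type KeyValue struct{ Key, Value string }
type Process struct {
	ServiceName string
	Tags        []KeyValue
}

// convertProcess handles a nil process or nil tag slice instead of
// dereferencing unconditionally, which is what causes the panic.
func convertProcess(p *Process) (string, []KeyValue) {
	if p == nil {
		// Placeholder service name; the actual value is an assumption.
		return "unknown-service", []KeyValue{}
	}
	tags := p.Tags
	if tags == nil {
		tags = []KeyValue{}
	}
	return p.ServiceName, tags
}

func main() {
	svc, tags := convertProcess(nil)
	fmt.Println(svc, len(tags))
}
```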
This is the resulting stack trace:
To Reproduce
We aren't sure how to reproduce this as the only component panicking is the ingester.
We also haven't seen what set of traces are causing this.
The best way to attempt to reproduce this could be to send a trace to the collector without process tags. We just haven't tried that yet.
Expected behavior
We expect the Jaeger ingester to gracefully handle any "bad" data and give us the configuration to either panic or discard the data.
Screenshots
N/A
Version (please complete the following information):
What troubleshooting steps did you try?
We tried downgrading the ingesters back to 1.20.0 and saw the same issue.
We recreated the Jaeger deployment and still saw the issue.
The only temporary fix was clearing the kafka queue, which implies bad data is making it to the queue.
Additional context
We are going to add collector tags to everything which should force process tags if I am reading the doc correctly.
https://www.jaegertracing.io/docs/1.32/cli/#jaeger-collector-kafka