Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

chore/otel-collector: upgrade to v0.103.0, remove jaegerexporter #63171

Merged
merged 31 commits into from
Jul 10, 2024

Conversation

bobheadxi
Copy link
Member

@bobheadxi bobheadxi commented Jun 7, 2024

Thread: https://sourcegraph.slack.com/archives/C1JH2BEHZ/p1717797870638299

One problem caused by this upgrade is that the deprecated jaegerexporter no longer builds at all with the last published version, so for the upgrade to go through it must be removed. I've updated local sg start to work with this change, but some Release team support is needed for deployment configuration + customer-facing docs changes: https://sourcegraph.slack.com/archives/C1JH2BEHZ/p1718143249191349?thread_ts=1717797870.638299&cid=C1JH2BEHZ, since current guidance asks customers to configure jaegerexporter.

Part of https://linear.app/sourcegraph/issue/SEC-1680

Closes https://linear.app/sourcegraph/issue/CORE-177

Test plan

Followed steps shared in https://sourcegraph.slack.com/archives/C04MYFW01NV/p1718136211292469 to run locally, since sg run jaeger otel-collector alone is insufficient to get updated images:

diff --git a/wolfi-images/opentelemetry-collector.yaml b/wolfi-images/opentelemetry-collector.yaml
index b995a3d531b..76de80d4f84 100644
--- a/wolfi-images/opentelemetry-collector.yaml
+++ b/wolfi-images/opentelemetry-collector.yaml
@@ -7,7 +7,11 @@ contents:
     - mailcap
 
     ## opentelemetry-collector packages
-    - opentelemetry-collector@sourcegraph
+    - opentelemetry-collector@branch
+  keyring:
+    - https://packages.sgdev.org/sourcegraph-melange-dev.rsa.pub
+  repositories:
+    - '@branch https://packages.sgdev.org/branches/otel-collector-upgrade'
 
 paths:
   - path: /otel-collector

plus sg wolfi lock opentelemetry-collector will make sg run otel-collector use the correct image.

The above diffs SHOULD NOT be committed. The lock should happen post-merge.

Spot-checked some traces locally with:

sg run jaeger otel-collector
sg start

image

image

Changelog

  • OpenTelemetry Collector: Upgraded OpenTelemetry Collector to v0.103.0
  • OpenTelemetry Collector: The deprecated jaegerexporter has been removed. Users of exporter: { jaeger: ... } in the OpenTelemetry Collector should use exporter: { otlp/jaeger: ... } to send traces directly to Jaeger via its OTLP receiver.
    • Users of the default Jaeger configurations now need to provide JAEGER_OTLP_GRPC_PORT as well as the existing JAEGER_HOST configuration.

@cla-bot cla-bot bot added the cla-signed label Jun 7, 2024
Copy link
Contributor

github-actions bot commented Jun 7, 2024

Caution

License checking failed, please read: how to deal with third parties licensing.

1 similar comment
Copy link
Contributor

github-actions bot commented Jun 7, 2024

Caution

License checking failed, please read: how to deal with third parties licensing.

@bobheadxi bobheadxi requested review from a team June 7, 2024 22:26
Copy link
Contributor

@jhchabran jhchabran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR updates GRPC from 1.61 to 1.64 which deprecates the .Dial* functions. This is caught by the CI, and we most likely want to update it.

Looking at jaegertracing/jaeger#5330 it seems that there are a few little semantics differences to take in account, making this potentially not trivial.

I'll ping the source team as @ggilmore has been leading this for a while.

@eseliger
Copy link
Member

I think I am fine with keeping the deprecated method for now, given the complexity laid out in the thread JH linked.
I ran the stack locally and traces seem to be properly attached to parent spans across services, and gRPC doesn't seem to fail.

The release notes say:

These will continue to be supported throughout 1.x,

So LGTM.

- MPL-2.0
- :who:
:why:
:versions: []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be true for all things in here, and is unrelated to the PR, but I am confused why versions is empty :D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's optional fields you can record when tracking a dependency change, e.g. "this code was MPL for vX to vY"

go.mod Outdated
@@ -40,8 +40,6 @@ replace (
github.com/gomodule/redigo => github.com/gomodule/redigo v1.8.9
// Pending: Renamed to github.com/google/gnostic. Transitive deps still use the old name (kubernetes/kubernetes).
github.com/googleapis/gnostic => github.com/googleapis/gnostic v0.5.5
// Pending a release cut of https://github.com/prometheus/alertmanager/pull/3010
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes!!

go.mod Outdated
Comment on lines 57 to 64
cloud.google.com/go/bigquery v1.61.0
cloud.google.com/go/kms v1.17.1
cloud.google.com/go/monitoring v1.19.0
cloud.google.com/go/profiler v0.4.0
cloud.google.com/go/pubsub v1.37.0
cloud.google.com/go/secretmanager v1.11.5
cloud.google.com/go/storage v1.38.0
github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/metric v0.45.0
github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace v1.21.0
cloud.google.com/go/pubsub v1.38.0
cloud.google.com/go/secretmanager v1.13.1
cloud.google.com/go/storage v1.40.0
github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/metric v0.48.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sad that these all had to upgrade for this otel bump, makes it harder to track down regressions :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the dependency tree is pretty wild. I did upgrade aggressively but we've ran into more issues with not upgrading in lockstep, than we have with upgrading everything OTEL depends on in one go

@bobheadxi bobheadxi requested a review from jhchabran June 11, 2024 18:58
Comment on lines +833 to +835
COLLECTOR_OTLP_ENABLED: 'true'
JAEGER_OTLP_GRPC_PORT: 4320
JAEGER_OTLP_HTTP_PORT: 4321
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to enable this on deploy-*

@bobheadxi bobheadxi changed the title chore/otel-collector: upgrade to v0.102.0, remove jaegerexporter chore/otel-collector: upgrade to v0.103.0, remove jaegerexporter Jun 26, 2024
@bobheadxi bobheadxi marked this pull request as ready for review June 26, 2024 20:23
@@ -47,15 +47,15 @@ func TestNewExhaustive(t *testing.T) {
(useFullDeadline . true)
(patternInfo . TextPatternInfo{"content",case,nopath,filematchlimit:1000000,lang:cpp,F:"(?i)(?:\\.cpp$)|(?:\\.c\\+\\+$)|(?:\\.cc$)|(?:\\.cp$)|(?:\\.cppm$)|(?:\\.cxx$)|(?:\\.h$)|(?:\\.h\\+\\+$)|(?:\\.hh$)|(?:\\.hpp$)|(?:\\.hxx$)|(?:\\.inc$)|(?:\\.inl$)|(?:\\.ino$)|(?:\\.ipp$)|(?:\\.ixx$)|(?:\\.re$)|(?:\\.tcc$)|(?:\\.tpp$)|(?:\\.txx$)"})
(numRepos . 0)
(pathRegexps . [(?i)(?:\.cpp$)|(?:\.c\+\+$)|(?:\.cc$)|(?:\.cp$)|(?:\.cppm$)|(?:\.cxx$)|(?:\.h$)|(?:\.h\+\+$)|(?:\.hh$)|(?:\.hpp$)|(?:\.hxx$)|(?:\.inc$)|(?:\.inl$)|(?:\.ino$)|(?:\.ipp$)|(?:\.ixx$)|(?:\.re$)|(?:\.tcc$)|(?:\.tpp$)|(?:\.txx$)])
(pathRegexps . ["(?i)(?:\\.cpp$)|(?:\\.c\\+\\+$)|(?:\\.cc$)|(?:\\.cp$)|(?:\\.cppm$)|(?:\\.cxx$)|(?:\\.h$)|(?:\\.h\\+\\+$)|(?:\\.hh$)|(?:\\.hpp$)|(?:\\.hxx$)|(?:\\.inc$)|(?:\\.inl$)|(?:\\.ino$)|(?:\\.ipp$)|(?:\\.ixx$)|(?:\\.re$)|(?:\\.tcc$)|(?:\\.tpp$)|(?:\\.txx$)"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know where this is coming from? cc @varungandhi-src perhaps?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhchabran sorry, I don't know about this. I've flagged this in the search platform team channel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently we use go.opentelemetry.io/otel to encode values in the s-exp printer for the tests (see here).

Without looking at the source of go.opentelemetry.io/otel, My best bet is they changed the way they process arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@jhchabran jhchabran Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stefanhengl. Do you think this is creating a problem here? (I don't have context on this, was flagged as a reviewer and spotted this, which prompted me to reach out as this could break something)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. As far as I can see it is only used in tests.

@Chickensoupwithrice Chickensoupwithrice merged commit 7d91894 into main Jul 10, 2024
9 checks passed
@Chickensoupwithrice Chickensoupwithrice deleted the otel-collector-upgrade branch July 10, 2024 16:01
Chickensoupwithrice added a commit that referenced this pull request Jul 10, 2024
)

Thread:
https://sourcegraph.slack.com/archives/C1JH2BEHZ/p1717797870638299

One problem caused by this upgrade is that the [deprecated
`jaegerexporter`](open-telemetry/opentelemetry-specification#2858)
no longer builds at all with the last published version, so for the
upgrade to go through it must be removed. I've updated local `sg start`
to work with this change, but some Release team support is needed for
deployment configuration + customer-facing docs changes:
https://sourcegraph.slack.com/archives/C1JH2BEHZ/p1718143249191349?thread_ts=1717797870.638299&cid=C1JH2BEHZ,
since current guidance asks customers to configure `jaegerexporter`.

Part of https://linear.app/sourcegraph/issue/SEC-1680

Closes https://linear.app/sourcegraph/issue/CORE-177

## Test plan

Followed steps shared in
https://sourcegraph.slack.com/archives/C04MYFW01NV/p1718136211292469 to
run locally, since `sg run jaeger otel-collector` alone is insufficient
to get updated images:

```diff
diff --git a/wolfi-images/opentelemetry-collector.yaml b/wolfi-images/opentelemetry-collector.yaml
index b995a3d531b..76de80d4f84 100644
--- a/wolfi-images/opentelemetry-collector.yaml
+++ b/wolfi-images/opentelemetry-collector.yaml
@@ -7,7 +7,11 @@ contents:
     - mailcap
 
     ## opentelemetry-collector packages
-    - opentelemetry-collector@sourcegraph
+    - opentelemetry-collector@branch
+  keyring:
+    - https://packages.sgdev.org/sourcegraph-melange-dev.rsa.pub
+  repositories:
+    - '@Branch https://packages.sgdev.org/branches/otel-collector-upgrade'
 
 paths:
   - path: /otel-collector
```

plus `sg wolfi lock opentelemetry-collector` will make `sg run
otel-collector` use the correct image.

**The above diffs SHOULD NOT be committed. The lock should happen
post-merge.**

Spot-checked some traces locally with:

```
sg run jaeger otel-collector
sg start
```


![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/dff88d5a-db9a-4039-b7ba-682e120bdc41)


![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/7e553894-0026-47de-ae38-ba5502b94c63)


## Changelog 

- OpenTelemetry Collector: Upgraded OpenTelemetry Collector to v0.103.0
- OpenTelemetry Collector: The [deprecated `jaegerexporter` has been
removed](open-telemetry/opentelemetry-specification#2858).
Users of `exporter: { jaeger: ... }` in the OpenTelemetry Collector
should use `exporter: { otlp/jaeger: ... }` to send traces directly to
Jaeger via its OTLP receiver.
- Users of the default Jaeger configurations now need to provide
`JAEGER_OTLP_GRPC_PORT` as well as the existing `JAEGER_HOST`
configuration.

---------

Co-authored-by: Anish Lakhwara <anish+github@lakhwara.com>
Co-authored-by: Anish Lakhwara <anish+git@lakhwara.com>
Chickensoupwithrice referenced this pull request Jul 10, 2024
<!-- PR description tips:
https://www.notion.so/sourcegraph/Write-a-good-pull-request-description-610a7fd3e613496eb76f450db5a49b6e
-->
We need to update the wolfi lock image for
https://github.com/sourcegraph/sourcegraph/pull/63171 in order for `sg
run` to work

We've made all the changes to the deployment repos for this to be pushed
out in the release today.

## Test plan

<!-- REQUIRED; info at
https://docs-legacy.sourcegraph.com/dev/background-information/testing_principles
-->
Manually tested

## Changelog

<!-- OPTIONAL; info at
https://www.notion.so/sourcegraph/Writing-a-changelog-entry-dd997f411d524caabf0d8d38a24a878c
-->
- fix(build): update wolfi lock for otel-collector
bobheadxi referenced this pull request Jul 11, 2024
…#63790)

The OTEL upgrade https://github.com/sourcegraph/sourcegraph/pull/63171
bumps the `prometheus/common` package too far via transitive deps,
causing us to generate configuration for alertmanager that altertmanager
doesn't accept, at least until the alertmanager project cuts a new
release with a newer version of `promethues/common`.

For now we forcibly downgrade with a replace. Everything still builds,
so we should be good to go.

## Test plan
`sg start` and `sg run prometheus`. On `main`, editing
`observability.alerts` will cause Alertmanager to refuse to accept the
generated configuration. With this patch, all is well it seems - config
changes go through as expected. This is a similar test plan for
https://github.com/sourcegraph/sourcegraph/pull/63329

## Changelog

- Fix Prometheus Alertmanager configuration failing to apply
`observability.alerts` from site config
sourcegraph-release-bot referenced this pull request Jul 11, 2024
…#63790)

The OTEL upgrade https://github.com/sourcegraph/sourcegraph/pull/63171
bumps the `prometheus/common` package too far via transitive deps,
causing us to generate configuration for alertmanager that altertmanager
doesn't accept, at least until the alertmanager project cuts a new
release with a newer version of `promethues/common`.

For now we forcibly downgrade with a replace. Everything still builds,
so we should be good to go.

## Test plan
`sg start` and `sg run prometheus`. On `main`, editing
`observability.alerts` will cause Alertmanager to refuse to accept the
generated configuration. With this patch, all is well it seems - config
changes go through as expected. This is a similar test plan for
https://github.com/sourcegraph/sourcegraph/pull/63329

## Changelog

- Fix Prometheus Alertmanager configuration failing to apply
`observability.alerts` from site config

(cherry picked from commit ffa873f)
jdpleiness referenced this pull request Jul 11, 2024
… generated config (#63793)

The OTEL upgrade https://github.com/sourcegraph/sourcegraph/pull/63171
bumps the `prometheus/common` package too far via transitive deps,
causing us to generate configuration for alertmanager that altertmanager
doesn&#39;t accept, at least until the alertmanager project cuts a new
release with a newer version of `promethues/common`.

For now we forcibly downgrade with a replace. Everything still builds,
so we should be good to go.

## Test plan
`sg start` and `sg run prometheus`. On `main`, editing
`observability.alerts` will cause Alertmanager to refuse to accept the
generated configuration. With this patch, all is well it seems - config
changes go through as expected. This is a similar test plan for
https://github.com/sourcegraph/sourcegraph/pull/63329

## Changelog

- Fix Prometheus Alertmanager configuration failing to apply
`observability.alerts` from site config <br> Backport
ffa873f from #63790

Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants