feature gate -The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. #34394

Closed
sairamsadanala opened this issue Aug 2, 2024 · 14 comments
Labels
extension/healthcheck Health Check Extension question Further information is requested

Comments

@sairamsadanala

Component(s)

extension/healthcheck

What happened?

Description

We have built an abstraction layer with otel-collector-contrib. It is an intermediate layer to which all the OTel Collectors send telemetry, and it exports that telemetry to Splunk and Grafana endpoints. The abstraction layer runs on an AWS ECS cluster that is load balanced via an AWS NLB. This setup is automated using an ADO pipeline with a CloudFormation template.

Steps to Reproduce

Attached the config and

Expected Result

Up to and including v0.100.0, our ECS cluster for the abstraction layer ran healthy and exported telemetry to the exporter endpoints.

Actual Result

With v0.106.1, the AWS NLB target group health checks fail on port 13133 and the CloudFormation template is rolled back. It works as expected with v0.100.0.
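
For readers unfamiliar with this kind of setup, an NLB target group that health-checks the collector on port 13133 might look roughly like the CloudFormation fragment below. This is only an illustrative sketch, not the template from this report: the resource name, the `VpcId` parameter, the traffic port, and the target type are all assumptions. The health check path `/` matches the extension's default path.

```yaml
# Hypothetical CloudFormation fragment; resource and parameter names are assumptions.
Resources:
  CollectorTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Protocol: TCP             # NLB listener traffic
      Port: 4318                # assumed traffic port (OTLP/HTTP)
      VpcId: !Ref VpcId         # hypothetical template parameter
      TargetType: instance      # assumption; depends on the ECS network mode
      HealthCheckProtocol: HTTP
      HealthCheckPort: "13133"  # health_check extension port
      HealthCheckPath: /
```

A check like this originates outside the container, so it can only succeed if the health check extension listens on an address reachable from the load balancer, not only on localhost.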

Collector version

v0.106.1

Environment information

Environment

OS: Amazon Linux

OpenTelemetry Collector configuration

extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
  awsecscontainermetrics:
    collection_interval: 30s    
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      load:
        cpu_average: true
  hostmetrics/disk:
    collection_interval: 1m
    scrapers:
      disk:
      filesystem:
  splunk_hec/logs:
    endpoint: 0.0.0.0:8088
    access_token_passthrough: true
  splunk_hec/metrics:
    endpoint: 0.0.0.0:8087
    access_token_passthrough: true
processors:
  batch:
    send_batch_size: 12000
    timeout: 10s
    send_batch_max_size: 14000
  resourcedetection/general:
    detectors: [env,ecs,system,docker]
  attributes:
    actions:
      - action: insert
        key: loki.attribute.labels
        value: log.file.name
  resource:
    attributes:
      - action: insert
        key: loki.resource.labels
        value: cloud.account.id,cloud.availability_zone,cloud.platform,cloud.provider,cloud.region,host.id,host.name,host.type

exporters:
  splunk_hec/logs:
    endpoint: "https://XXXXXXXXX:xxxx/services/collector"
    token: "XXXXXX-XXXXXXXXXXXXXXXXXXxx"
    timeout: 30s
    index: devops
    sending_queue:
      enabled: true
      num_consumers: 60
      queue_size: 100000
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 60s
    tls:
      insecure_skip_verify: false
      ca_file: "XXXX.pem"
      cert_file: "XXXX.pem"
      key_file: "XXX.key"
  splunk_hec/metrics:
    endpoint: "https://XXXXX/services/collector/event"
    token: "XXXXXXXXXX-XXXX"
    timeout: 30s
    index: "tooling_metrics"
    sending_queue:
      enabled: true
      num_consumers: 60
      queue_size: 100000
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 60s
    tls:
      insecure_skip_verify: false
      ca_file: "XXX.pem"
      cert_file: "XXX.pem"
      key_file: "/etc/otel/splunk_dec.key"
  loki:
    endpoint: "https://XXXXX/loki/api/v1/push"
    tls:
        insecure: false
        insecure_skip_verify: true
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXX="
  prometheusremotewrite:
    endpoint: https://XXXXX/mimir/api/v1/push
    tls:
        insecure: false
        insecure_skip_verify: true
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXXXXXXXXXXX"
    external_labels:
        source: otalecsprd
    resource_to_telemetry_conversion: 
      enabled: true
  otlphttp:    
    endpoint: "https://XXXXX/tempo/otlp/"
    traces_endpoint: "https://xxxxxx/tempo/otlp/v1/traces"
    tls:      
        insecure: false
        insecure_skip_verify: true   
    headers:
        "authorization": "Basic XXXXXXXXXXXXXXXXXXXXXX"
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [XXX,XXX]
      processors: [batch,resourcedetection/general]
    metrics:
      receivers: [otlp,splunk_hec/incomingmetrics,awsecscontainermetrics]
      exporters: [splunk_hec/metrics,prometheusremotewrite]
      processors: [batch]
    metrics/internal:
      receivers: [hostmetrics,hostmetrics/disk]
      exporters: [splunk_hec/metrics,prometheusremotewrite]
      processors: [batch,resourcedetection/general]
    logs:
      receivers: [otlp]
      exporters: [splunk_hec/logs,loki]
      processors: [batch,attributes,resource]
    logs/splunk_hec:
      receivers: [splunk_hec/incominglogs]
      exporters: [splunk_hec/logs,loki]
      processors: [batch,attributes,resource]

Log output

XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    zpagesextension@v0.106.1/zpagesextension.go:76  Starting zPages extension       {"kind": "extension", "name": "zpages", "config": {"Endpoint":"0.0.0.0:55679","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0}}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:56     Extension started.      {"kind": "extension", "name": "zpages"}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:39     Extension is starting...        {"kind": "extension", "name": "pprof"}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    pprofextension@v0.106.1/pprofextension.go:60    Starting net/http/pprof server  {"kind": "extension", "name": "pprof", "config": {"TCPAddr":{"Endpoint":"0.0.0.0:1777","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
XXXXXXXXXXXXXXXXXXXX:31:16.639Z        info    extensions/extensions.go:56     Extension started.      {"kind": "extension", "name": "pprof"}
XXXXXXXXXXXXXXXXXXXX:31:16.640Z        info    internal/resourcedetection.go:125       began detecting resource information    {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        warn    internal/resourcedetection.go:130       failed to detect resource       {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces", "error": "failed getting OS type: failed to fetch Docker OS type: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    internal/resourcedetection.go:139       detected resource information   {"kind": "processor", "name": "resourcedetection/general", "pipeline": "traces", "resource": {"aws.ecs.cluster.arn":"XXXXXXXXXXXXXXXXXXXX","aws.ecs.launchtype":"ec2","aws.ecs.task.arn":"arn:aws:ecs:XXXXXXXXXXXXXXXXXXXX","aws.ecs.task.family":"splunkhec-otelTDef","aws.ecs.task.id":"ee4c36a87b124aad9848542844239e77","aws.ecs.task.revision":"13","cloud.account.id":"XXXXXXXXXXXXXXXXXXXX","cloud.availability_zone":"eu-west-1b","cloud.platform":"aws_ecs","cloud.provider":"aws","cloud.region":"eu-west-1","host.name":"XXXXXXXXXXXXXXXXXXXX","os.type":"linux"}}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    otlpreceiver@v0.106.1/otlp.go:102       Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "localhost:4317"}
XXXXXXXXXXXXXXXXXXXX:31:16.650Z        info    otlpreceiver@v0.106.1/otlp.go:152       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    healthcheck/handler.go:132      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    service@v0.106.1/service.go:225 Everything is ready. Begin running and processing data.
XXXXXXXXXXXXXXXXXXXX:31:16.651Z        info    localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.  {"feature gate ID":

Additional context

We would like to understand what this change means:
"localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default. {"feature gate ID": "component.UseLocalHostAsDefaultHost"}"

How do we disable this change of the default to localhost? Any documentation or steps would be highly appreciated.

@sairamsadanala sairamsadanala added bug Something isn't working needs triage New item requiring triage labels Aug 2, 2024
@github-actions github-actions bot added the extension/healthcheck Health Check Extension label Aug 2, 2024
Contributor

github-actions bot commented Aug 2, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

Hello @sairamsadanala, thanks for filing this issue. As the message states, if you prefer to keep the default endpoint as 0.0.0.0, you can disable the component.UseLocalHostAsDefaultHost feature gate. Information on how to disable feature gates can be found here.

For more information on the reasoning and context of this change, changing the default to be localhost instead of 0.0.0.0, please refer to this issue.

The best option is to update your configuration to work with an endpoint other than 0.0.0.0, as pointed out in the linked issue, since binding to 0.0.0.0 is a potential security risk.
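
As a rough illustration of disabling the gate in an ECS setup, the collector's `--feature-gates` flag can be passed through the container command, where a leading `-` disables the named gate. The fragment below is a minimal sketch under assumptions: the resource name, task family, container name, image tag, and config path are hypothetical and not taken from this report, and the flag only helps on collector versions that still ship this gate.

```yaml
# Hypothetical CloudFormation fragment; names, image tag, and paths are assumptions.
Resources:
  CollectorTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: otel-collector
      ContainerDefinitions:
        - Name: otel-collector
          Image: otel/opentelemetry-collector-contrib:0.106.1
          Command:
            - "--config=/etc/otel/config.yaml"
            # A leading "-" disables the gate, restoring 0.0.0.0 as the default host.
            - "--feature-gates=-component.UseLocalHostAsDefaultHost"
          PortMappings:
            - ContainerPort: 13133  # health_check extension
```

This reverts the default for every component at once, so it is best treated as a temporary workaround rather than a permanent setting.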

@crobert-1 crobert-1 added question Further information is requested and removed bug Something isn't working needs triage New item requiring triage labels Aug 2, 2024
@sairamsadanala
Author

sairamsadanala commented Aug 12, 2024 via email

@crobert-1
Member

The feature gate name is component.UseLocalHostAsDefaultHost 👍

@sairamsadanala
Author

sairamsadanala commented Aug 12, 2024 via email

@crobert-1
Member

Right, I believe that should work.

@jpkrohling
Member

The real solution though is to set your health check extension to use 0.0.0.0 (or NodeIP) instead:

  health_check:
    endpoint: 0.0.0.0:13133
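
Applied to the configuration from the original report, this could look like the sketch below. Every component that has to be reachable from outside the container gets an explicit endpoint. In the reported config both `health_check:` and the OTLP `grpc:` protocol were left empty, and the logs show gRPC starting on `localhost:4317` as a result. The `HOST_IP` environment variable in the commented alternative is hypothetical and would have to be provided by the deployment.

```yaml
extensions:
  health_check:
    # Listen on all interfaces so the load balancer health check can connect.
    endpoint: 0.0.0.0:13133
    # Alternative (assumes a HOST_IP environment variable is set for the container):
    # endpoint: ${env:HOST_IP}:13133
receivers:
  otlp:
    protocols:
      grpc:
        # Without an explicit endpoint this falls back to localhost:4317.
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
```

Setting endpoints explicitly keeps the configuration independent of the feature gate, which matters once the gate is no longer available.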

@WamBamBoozle

thanks @jpkrohling -- that was the answer that was eluding me

Except: the real solution is that that be the default, as I was using the default which was leading to this error

@jpkrohling
Member

the real solution is that that be the default

We consciously moved from the default "0.0.0.0" to "localhost".

@Mathiasdm

Mathiasdm commented Sep 3, 2024

Having 'localhost' as a default is sensible security-wise.

What I did not expect was that, even if I explicitly specify '0.0.0.0', it's still changed to localhost. I would expect this to only happen if I did not specify anything (hence 'default').

Example config:

receivers:
    otlp:
        protocols:
            grpc:
                endpoint: 0.0.0.0:4317
            http:
                endpoint: 0.0.0.0:4318

Wouldn't it make more sense to only change the endpoint to localhost in case of:

receivers:
    otlp:
        protocols:
            grpc:
            http:

@jpkrohling
Member

I agree with you, and I just tested on v0.108.0 and it works as expected:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

Logs:

2024-09-06T15:06:21.023+0200    info    otlpreceiver@v0.108.1/otlp.go:153       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}

@Mathiasdm

Mathiasdm commented Sep 9, 2024

Well, that's surprising, my previous test last week didn't seem to work, but it does work without adapting the feature gate now.
I must have made a mistake last time.

I was also doing my tests on 0.108.0.

Please ignore my previous message.

@jpkrohling
Member

I'm closing this issue for now, but please reopen it if we are still missing something.

@TRAD-Anthony-CKO

TRAD-Anthony-CKO commented Sep 14, 2024

@jpkrohling I think this indeed works for the OTLP endpoints, but not for the health check extension (without disabling the feature gate). See the example below, tested on 109.0:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp:
    endpoint: "${COLLECTOR_GATEWAY_ENDPOINT}"
    tls:
      insecure: true

processors:

extensions:
  health_check:
    endpoint: "0.0.0.0:13133"

service:
  extensions: [health_check]
  telemetry:
    logs:
      level: "debug"
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      exporters: [otlp]

Collector logs:

2024-09-14T08:11:23.913Z        info    healthcheckextension@v0.106.1/healthcheckextension.go:32        Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-09-14T08:11:23.914Z        info    extensions/extensions.go:56     Extension started.      {"kind": "extension", "name": "health_check"}
2024-09-14T08:11:23.914Z        info    zapgrpc/zapgrpc.go:176  [core] [Server #1]Server created        {"grpc_log": true}
2024-09-14T08:11:23.914Z        info    otlpreceiver@v0.106.1/otlp.go:102       Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "logs", "endpoint": "0.0.0.0:55680"}
2024-09-14T08:11:23.914Z        info    otlpreceiver@v0.106.1/otlp.go:152       Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "logs", "endpoint": "0.0.0.0:55681"}
2024-09-14T08:11:23.914Z        info    healthcheck/handler.go:132      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
2024-09-14T08:11:23.914Z        info    service@v0.106.1/service.go:225 Everything is ready. Begin running and processing data.
2024-09-14T08:11:23.914Z        info    localhostgate/featuregate.go:63 The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.     {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-09-14T08:11:23.914Z        info    zapgrpc/zapgrpc.go:176  [core] [Server #1 ListenSocket #2]ListenSocket created  {"grpc_log": true}

Since the feature gate is planned to be removed in future releases, I'm looking for a longer-term solution here.
Edit: Raised a new issue in case that behavior is new.

6 participants