
Error sending alert: bad response status 422 Unprocessable Entity #6053

Open
mousimin opened this issue Jul 2, 2024 · 4 comments

Comments

mousimin commented Jul 2, 2024

Describe the bug
We are running Cortex in microservices mode. With Cortex v1.16.0 we used the v1 Alertmanager API by setting the flag -ruler.alertmanager-use-v2=false. After upgrading to Cortex v1.17.1, the logs show the ruler using the v2 Alertmanager API. When I create alert rules, the alerts fire, but we receive no email notifications, and errors like the following appear:
caller=notifier.go:544 level=error user=Test alertmanager=https://cortex-alertmanager.org/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 422 Unprocessable Entity"

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex (SHA or version): v1.17.1, microservices mode
  2. Perform Operations (Read/Write/Others): create an alert rule and observe the ruler logs

Expected behavior
We should receive the email notifications, and no error should be logged.

Environment:

  • Infrastructure: bare-metal
  • Deployment tool: Ansible, deploying the Cortex microservices as systemd services

Additional Context
systemd unit (ExecStart) for the Cortex ruler:

ExecStart=/usr/sbin/cortex-1.17.1 \
  -auth.enabled=true \
  -log.level=info \
  -config.file=/etc/cortex-ruler/cortex-ruler.yaml \
  -runtime-config.file=/etc/cortex-shared/cortex-runtime.yaml \
  -server.http-listen-port=8061 \
  -server.grpc-listen-port=9061 \
  -server.grpc-max-recv-msg-size-bytes=104857600 \
  -server.grpc-max-send-msg-size-bytes=104857600 \
  -server.grpc-max-concurrent-streams=1000 \
  \
  -distributor.sharding-strategy=shuffle-sharding \
  -distributor.ingestion-tenant-shard-size=12 \
  -distributor.replication-factor=2 \
  -distributor.shard-by-all-labels=true \
  -distributor.zone-awareness-enabled=true \
  \
  -store.engine=blocks \
  -blocks-storage.backend=s3 \
  -blocks-storage.s3.endpoint=s3.org:10444 \
  -blocks-storage.s3.bucket-name=staging-metrics \
  -blocks-storage.s3.insecure=false \
  \
  -blocks-storage.bucket-store.sync-dir=/local/cortex-ruler/tsdb-sync \
  -blocks-storage.bucket-store.metadata-cache.backend=memcached \
  -blocks-storage.bucket-store.metadata-cache.memcached.addresses=100.76.51.1:11211,100.76.51.2:11211,100.76.51.3:11211 \
  \
  -querier.active-query-tracker-dir=/local/cortex-ruler/active-query-tracker \
  -querier.ingester-streaming=true \
  -querier.query-store-after=23h \
  -querier.query-ingesters-within=24h \
  -querier.shuffle-sharding-ingesters-lookback-period=25h \
  \
  -store-gateway.sharding-enabled=true \
  -store-gateway.sharding-strategy=shuffle-sharding \
  -store-gateway.tenant-shard-size=6 \
  -store-gateway.sharding-ring.store=etcd \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.1:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.2:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.3:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.4:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.5:2379 \
  -store-gateway.sharding-ring.prefix=cortex-store-gateways/ \
  -store-gateway.sharding-ring.replication-factor=2 \
  -store-gateway.sharding-ring.zone-awareness-enabled=true \
  -store-gateway.sharding-ring.instance-availability-zone=t1 \
  -store-gateway.sharding-ring.wait-stability-min-duration=1m \
  -store-gateway.sharding-ring.wait-stability-max-duration=5m \
  -store-gateway.sharding-ring.instance-addr=100.76.75.1 \
  -store-gateway.sharding-ring.instance-id=s_8061 \
  -store-gateway.sharding-ring.heartbeat-period=15s \
  -store-gateway.sharding-ring.heartbeat-timeout=1m \
  \
  -ring.store=etcd \
  -ring.prefix=cortex-ingesters/ \
  -ring.heartbeat-timeout=1m \
  -etcd.endpoints=10.120.119.1:2379 \
  -etcd.endpoints=10.120.119.2:2379 \
  -etcd.endpoints=10.120.119.3:2379 \
  -etcd.endpoints=10.120.119.4:2379 \
  -etcd.endpoints=10.120.119.5:2379 \
  \
  -ruler.enable-sharding=true \
  -ruler.sharding-strategy=shuffle-sharding \
  -ruler.tenant-shard-size=2 \
  -ruler.ring.store=etcd \
  -ruler.ring.prefix=cortex-rulers/ \
  -ruler.ring.num-tokens=32 \
  -ruler.ring.heartbeat-period=15s \
  -ruler.ring.heartbeat-timeout=1m \
  -ruler.ring.etcd.endpoints=10.120.119.1:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.2:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.3:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.4:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.5:2379 \
  -ruler.ring.instance-id=s_8061 \
  -ruler.ring.instance-interface-names=e1 \
  \
  -ruler.max-rules-per-rule-group=500 \
  -ruler.max-rule-groups-per-tenant=5000 \
  \
  -ruler.external.url=staging-cortex-ruler.org \
  -ruler.client.grpc-max-recv-msg-size=104857600 \
  -ruler.client.grpc-max-send-msg-size=16777216 \
  -ruler.client.grpc-compression= \
  -ruler.client.grpc-client-rate-limit=0 \
  -ruler.client.grpc-client-rate-limit-burst=0 \
  -ruler.client.backoff-on-ratelimits=false \
  -ruler.client.backoff-min-period=500ms \
  -ruler.client.backoff-max-period=10s \
  -ruler.client.backoff-retries=5 \
  -ruler.evaluation-interval=15s \
  -ruler.poll-interval=15s \
  -ruler.rule-path=/local/cortex-ruler/rules \
  -ruler.alertmanager-url=https://staging-cortex-alertmanager.org/alertmanager \
  -ruler.alertmanager-discovery=false \
  -ruler.alertmanager-refresh-interval=1m \
  -ruler.notification-queue-capacity=10000 \
  -ruler.notification-timeout=10s \
  -ruler.flush-period=1m \
  -experimental.ruler.enable-api=true \
  \
  -ruler-storage.backend=s3 \
  -ruler-storage.s3.endpoint=s3.org:10444 \
  -ruler-storage.s3.bucket-name=staging-rules \
  -ruler-storage.s3.insecure=false \
  \
  -target=ruler

systemd unit (ExecStart) for the Cortex alertmanager:

ExecStart=/usr/sbin/cortex-1.17.1 \
  -auth.enabled=true \
  -log.level=info \
  -config.file=/etc/cortex-alertmanager-8071/cortex-alertmanager.yaml \
  -runtime-config.file=/etc/cortex-shared/cortex-runtime.yaml \
  -server.http-listen-port=8071 \
  -server.grpc-listen-port=9071 \
  -server.grpc-max-recv-msg-size-bytes=104857600 \
  -server.grpc-max-send-msg-size-bytes=104857600 \
  -server.grpc-max-concurrent-streams=1000 \
  \
  -alertmanager.storage.path=/local/cortex-alertmanager-8071/data \
  -alertmanager.storage.retention=120h \
  -alertmanager.web.external-url=https://staging-cortex-alertmanager.org/alertmanager \
  -alertmanager.configs.poll-interval=1m \
  -experimental.alertmanager.enable-api=true \
  \
  -alertmanager.sharding-enabled=true \
  -alertmanager.sharding-ring.store=etcd \
  -alertmanager.sharding-ring.prefix=cortex-alertmanagers/ \
  -alertmanager.sharding-ring.heartbeat-period=15s \
  -alertmanager.sharding-ring.heartbeat-timeout=1m \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.1:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.2:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.3:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.4:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.5:2379 \
  -alertmanager.sharding-ring.instance-id=b_8071 \
  -alertmanager.sharding-ring.instance-interface-names=e1 \
  -alertmanager.sharding-ring.replication-factor=2 \
  -alertmanager.sharding-ring.zone-awareness-enabled=true \
  -alertmanager.sharding-ring.instance-availability-zone=t1 \
  \
  -alertmanager-storage.backend=s3 \
  -alertmanager-storage.s3.endpoint=s3.org:10444 \
  -alertmanager-storage.s3.bucket-name=staging-alerts \
  -alertmanager-storage.s3.insecure=false \
  \
  -alertmanager.receivers-firewall-block-cidr-networks=10.163.131.164/28,10.163.131.180/28 \
  -alertmanager.receivers-firewall-block-private-addresses=true \
  -alertmanager.notification-rate-limit=0 \
  -alertmanager.max-config-size-bytes=0 \
  -alertmanager.max-templates-count=0 \
  -alertmanager.max-template-size-bytes=0 \
  \
  -target=alertmanager

the Alertmanager configuration:

template_files:
  default_template: |
    {{ define "__alertmanager" }}AlertManager{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }}
alertmanager_config: |
  global:
    smtp_smarthost: 'yourmailhost'
    smtp_from: 'youraddress'
    smtp_require_tls: false
  templates:
    - 'default_template'
  route:
    receiver: example-email
  receivers:
    - name: example-email
      email_configs:
      - to: 'youraddress'
mousimin (Author) commented:

Hi @friedrichg & @yeya24,
I guess the error message "bad response status 422 Unprocessable Entity" came from Alertmanager, right?
But I couldn't find any error in the Alertmanager logs, even with the debug log level. Any suggestions from you would be appreciated!

mousimin commented Jul 19, 2024

I want to answer my own question so that others can refer to it.
I manually sent the HTTP request using curl and got the detailed response from Alertmanager:
maxFailure (quorum) on a given error family, rpc error: code = Code(422) desc = addr=10.120.131.81:9071 state=ACTIVE zone=z1, rpc error: code = Code(422) desc = {"code":601,"message":"0.generatorURL in body must be of type uri: \"staging-cortex-ruler.org/graph?g0.expr=up%7Bapp%3D%22cert-manager%22%7D+%3E+0\u0026g0.tab=1\""}
So I added the scheme "https://" at the beginning of the -ruler.external.url value, and then it worked.
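The 422 above comes from generatorURL failing URI validation on the Alertmanager side: without a scheme, the value is not a valid absolute URI. A minimal Go check illustrating the distinction (this is just an illustration using net/url; it is not Alertmanager's actual validation code):

```go
package main

import (
	"fmt"
	"net/url"
)

// hasScheme reports whether s parses as a URL with an explicit scheme.
// Values without one, such as a bare host and path, are the kind of
// generatorURL that the Alertmanager v2 API rejects.
func hasScheme(s string) bool {
	u, err := url.Parse(s)
	return err == nil && u.Scheme != ""
}

func main() {
	// The value originally derived from -ruler.external.url: no scheme.
	fmt.Println(hasScheme("staging-cortex-ruler.org/graph?g0.tab=1")) // false
	// With the scheme prepended, the value is a valid absolute URI.
	fmt.Println(hasScheme("https://staging-cortex-ruler.org/graph?g0.tab=1")) // true
}
```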

Mapping this to the code:

func (n *Manager) sendOne(ctx context.Context, c *http.Client, url string, b []byte) error {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(b))
	if err != nil {
		return err
	}
	req.Header.Set("User-Agent", userAgent)
	req.Header.Set("Content-Type", contentTypeJSON)
	resp, err := n.opts.Do(ctx, c, req)
	if err != nil {
		return err
	}
	defer func() {
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}()

	// Any HTTP status 2xx is OK.
	//nolint:usestdlibvars
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("bad response status %s", resp.Status)
	}

	return nil
}

Maybe we should include the response body in the error message as well? Currently we only include the status, which makes debugging difficult.

rapphil commented Jul 25, 2024

@friedrichg @yeya24 should we go ahead and start logging the body of the response? It makes sense IMHO.

yeya24 commented Jul 25, 2024

@rapphil Agree. Would you like to work on it? I just want to make sure AM doesn't send something crazy in the response body. Maybe we can truncate the message with a limit.
