
[receiver/windowsperfcounters] When collecting instances with multiple matches, data is lost #32319

Closed
alxbl opened this issue Apr 11, 2024 · 3 comments · Fixed by #32321
Labels
bug (Something isn't working) · needs triage (New item requiring triage) · receiver/windowsperfcounters

Comments

alxbl (Member) commented on Apr 11, 2024

Component(s)

receiver/windowsperfcounters

What happened?

Description

Whenever a multi-instance counter is scraped and there are multiple instances with the same name (e.g. Process\ID Process for notepad.exe), the receiver scrapes all instances but puts the exact same label value in the instance attribute. This is incompatible with most backends, because the data points are treated as a single time series and are either aggregated or overwritten so that only the last data point is kept.

This behavior also does not match what Performance Monitor shows, which would be notepad and notepad#1 in the example above.
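
For illustration, here is a minimal sketch of how the scraped data looks in the collector's pdata model (go.opentelemetry.io/collector/pdata). The program below is hypothetical, not receiver code; it only shows that the two data points end up with an identical attribute set, which is why a backend keyed on metric name plus attributes collapses them into one series:

package main

import (
	"fmt"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	// Build the gauge the way the receiver effectively does today:
	// both notepad.exe processes get instance=Notepad.
	md := pmetric.NewMetrics()
	m := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty().Metrics().AppendEmpty()
	m.SetName("process.pid")
	gauge := m.SetEmptyGauge()

	ts := pcommon.NewTimestampFromTime(time.Now())
	for _, pid := range []float64{16660, 21988} { // PIDs from the log output below
		dp := gauge.DataPoints().AppendEmpty()
		dp.Attributes().PutStr("instance", "Notepad") // identical label value on both points
		dp.SetTimestamp(ts)
		dp.SetDoubleValue(pid)
	}

	// Two data points, but only one distinguishable series.
	fmt.Println("data points:", md.DataPointCount())
}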

Steps to Reproduce

  • Start Notepad.exe as your normal user
  • Start Notepad.exe as an administrator (to ensure you have 2 different Notepad.exe PIDs on Windows 11)
  • Use the provided configuration file (modify as needed)
    • Mimir is optional; I was testing another issue with prometheusremotewrite

Expected Result

Windows Performance Monitor handles this by concatenating the instance name with its index when there are multiple occurrences of the same instance (usually when multiple instances of a process are running):

[Screenshot: Windows Performance Monitor showing separate notepad and notepad#1 instances of the ID Process counter]

  • Metrics for instances notepad and notepad_1 as shown in Windows Performance Monitor
  • Two time series with each PID value
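
A minimal sketch of the Performance Monitor-style disambiguation described above: append an index to the second and later occurrences of an instance name. The function below is hypothetical and not taken from the receiver or from PR #32321; whether the separator is "#" (what Performance Monitor displays) or "_" (as in the bullet above) is an implementation choice.

package main

import "fmt"

// disambiguateInstances appends "#<n>" to repeated instance names, mirroring
// the Performance Monitor convention (Notepad, Notepad#1, Notepad#2, ...).
func disambiguateInstances(names []string) []string {
	seen := make(map[string]int, len(names))
	out := make([]string, 0, len(names))
	for _, name := range names {
		if n := seen[name]; n > 0 {
			out = append(out, fmt.Sprintf("%s#%d", name, n))
		} else {
			out = append(out, name)
		}
		seen[name]++
	}
	return out
}

func main() {
	// Two notepad.exe processes report the same raw instance name.
	fmt.Println(disambiguateInstances([]string{"Notepad", "Notepad"}))
	// Prints: [Notepad Notepad#1]
}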

Actual Result

  • Two data points for instance notepad, combined into the same time series (see the log output below)

Collector version

0.97

Environment information

Environment

Windows 11 (target); collector cross-compiled with Go 1.22 on Ubuntu 22.04 (GOOS=windows)

OpenTelemetry Collector configuration

receivers:
  windowsperfcounters:
    metrics:
      process.pid:
        gauge:
    collection_interval: 5s
    perfcounters:
      - object: Process
        instances: "note*"
        counters:
          - name: "ID Process"
            metric: process.pid
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 500
    spike_limit_mib: 100
extensions:
exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    tls:
      insecure: true

  debug:
    verbosity: detailed

service:
  extensions: []
  pipelines:
    metrics:
      receivers: [windowsperfcounters]
      processors: []
      exporters: [debug, prometheusremotewrite]

Log output

2024-04-11T07:05:15.889-0400	info	MetricsExporter	{"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 2} // <------ Two data points
2024-04-11T07:05:15.889-0400	info	ResourceMetrics #0
Resource SchemaURL: 
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope  
Metric #0
Descriptor:
     -> Name: process.pid
     -> Description: 
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> instance: Str(Notepad) // <--------- 
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-04-11 11:03:44.8430194 +0000 UTC
Value: 16660.000000
NumberDataPoints #1
Data point attributes:
     -> instance: Str(Notepad) // <-------
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-04-11 11:03:44.8430194 +0000 UTC
Value: 21988.000000
	{"kind": "exporter", "data_type": "metrics", "name": "debug"}

Additional context

I already have a PR that I can submit for this. I understand that this might be a problem in terms of cardinality, so I am open to gating the behavior behind a config option for the receiver.

alxbl added the bug and needs triage labels on Apr 11, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jun 11, 2024
alxbl (Member, Author) commented on Jun 12, 2024

/label -Stale

Still ongoing; the related PR (#32321) is awaiting review/merge.
