
[MongoDB Atlas] Process data stream #9552

Merged

Conversation

niraj-elastic (Contributor) commented Apr 9, 2024

  • Enhancement

What does this PR do?

  • Added one data stream (Process Metrics).
  • Added data collection logic for the data stream.
  • Added the ingest pipeline for the data stream.
  • Mapped fields according to the ECS schema and added field metadata in the appropriate YAML files.
  • Added dashboards and visualizations.
  • Added system test cases for the data stream.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

How to test this PR locally

  • Clone the integrations repo.
  • Install elastic-package locally.
  • Start the Elastic Stack using elastic-package.
  • Move to the integrations/packages/mongodb_atlas directory.
  • Run `elastic-package test` to run the tests.

Screenshots

mongodb_atlas-process-dashboard

Related issue

multi: false
required: false
show_user: false
description: The request tracer logs requests and responses to the agent's local file-system for debugging configurations. Enabling this request tracing compromises security and should only be used for debugging. See [documentation](https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-httpjson.html#_request_tracer_filename) for details.
Contributor

Please put references to the agent one

Contributor Author

We don't have any documentation for request_tracer in Fleet. I checked other integrations, and it seems the configuration link for the individual input/module is used here. Here is one reference.

required: false
show_user: false
description: >-
Processors are used to reduce the number of fields in the exported event or to enhance the event with metadata. This executes in the agent before the logs are parsed. See [Processors](https://www.elastic.co/guide/en/beats/filebeat/current/filtering-and-enhancing-data.html) for details.
Contributor

Please put references to the agent one

- name: period
type: text
title: Period
description: Period of fetching metrics.Value of Granularity and Period will be same. Supported units for this parameter are h/m/s.
Contributor

What's "metrics.Value"?
It might not be clear to a first-time user.

Contributor Author (niraj-elastic), Apr 10, 2024

Can you elaborate more on metrics? The description "Period of fetching metrics" is the same across integrations for period; let me know what I can change here.
For the value of Granularity and Period, I will add details in the readme and put a reference here.

- name: groupId
type: text
title: GroupId
description: Identifier string that identifies your project. Groups and projects are synonymous terms. Your group id is the same as your project id. Ex. 32b6e34b3d91647abb20e7b8
Contributor

Suggested change
description: Identifier string that identifies your project. Groups and projects are synonymous terms. Your group id is the same as your project id. Ex. 32b6e34b3d91647abb20e7b8
description: Identifier string that identifies your project. Groups and projects are synonymous terms. Group id is identical to project id. Ex. 32b6e34b3d91647abb20e7b8

multi: false
required: true
show_user: true
secret: false
Contributor

Since it's a private key, should we make it secret: true?

Contributor Author

Milan has one open PR where secret is true for the private key, but it seems that CI is failing because of this.

Contributor

The reason for the failure doesn't look to be this.

Contributor Author

Sorry, my bad; the system test is failing when secret is kept true. @milan-elastic can elaborate more here.

Contributor

Any update on this?

Contributor

Yes, we need to keep secret false as of now! Making it true leads to a system test failure!

Contributor

A system test failure from making it true in the manifest?

"CACHE_BYTES_READ_INTO": 0,
"CACHE_USED_BYTES": 0,
"CACHE_BYTES_WRITTEN_FROM": 0,
"CONNECTIONS": 0,
Contributor

Let's have some values populated.

? state
: state.with(request("GET", state.url + "/api/atlas/v2/groups/" + state.groupId + "/processes?pageNum=" + string(state.page_num) + "&itemsPerPage=100").with({
Contributor

@efd6 : Any suggestions for improvements in this program here?

Contributor

This is a cleanup (correct indentation and whitespace makes CEL code much easier to debug). It is untested.

state:
  group_id: {{groupId}}
  want_more: false
  page_num: 1
  query: /measurements?granularity=PT{{period}}&period=PT{{period}}
redact:
  fields: ~
program: |
      (
        has(state.hostlist) && size(state.hostlist) > 0 ?
          state
        : 
          state.with(request("GET", state.url + "/api/atlas/v2/groups/" + state.group_id + "/processes?pageNum=" + string(state.page_num) + "&itemsPerPage=100").with({
            "Header": {
              "Accept": ["application/vnd.atlas." + string(now.getFullYear()) + "-01-01+json"]
            }
          }).do_request().as(resp, bytes(resp.Body).decode_json().as(body, {
            "hostlist": body.results.map(e, state.url + "/api/atlas/v2/groups/" + state.group_id + "/processes/" + e.id + state.query),
            "next": 0,
            "page_num": body.links.exists_one(res, res.rel == "next") ? int(state.page_num)+1 : 1
          })))
      ).as(state, state.next >= size(state.hostlist) ? {} :
        request("GET", string(state.hostlist[state.next])).with({
          "Header": {
            "Accept": ["application/vnd.atlas." + string(timestamp(now).getFullYear()) + "-01-01+json"]
          }
        }).do_request().as(res, {
          "events": bytes(res.Body).decode_json().as(f, f.with({"response": zip(
            f.measurements.map(m, m.name),
            f.measurements.map(m, m.dataPoints.map(d, d.value).as(v, size(v) == 0 ? 0 : v[0]))
          )}).drop(["measurements", "links"])),
          ?"hostlist": (int(state.next)+1) < size(state.hostlist) ? state.hostlist : optional.none(),
          "next": (int(state.next)+1) < size(state.hostlist) ? (int(state.next)+1) : 0,
          "want_more": (int(state.next)+1) < size(state.hostlist) || state.page_num != 1,
          "page_num": state.page_num,
          "group_id": state.group_id,
          "query": state.query,
        })
      )
  • I've used optional types to replace the use of drop_empty; I assume that that is the only reason that it is being used.
  • I've aggregated the two drop calls into one to avoid multiple cross-language calls.
  • The aggregation for events probably wants a comment explaining what it is doing; I take it that it's assigning a vector to each of the names in measurements.
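The aggregation described in that last bullet can be sketched in Python to make the intent concrete: each measurement name is paired with the first value in its data-point series, falling back to 0 for an empty series. The sample data below is hypothetical; the real logic runs as CEL inside the agent.

```python
# Sketch of the CEL zip(...) aggregation in the events construction:
# map each measurement name to its first data-point value, or 0 when
# the series is empty. Input shape mirrors the Atlas measurements API.

def summarize(measurements):
    names = [m["name"] for m in measurements]
    values = [
        m["dataPoints"][0]["value"] if m["dataPoints"] else 0
        for m in measurements
    ]
    # dict(zip(...)) plays the role of CEL's zip(names, values)
    return dict(zip(names, values))

measurements = [
    {"name": "CONNECTIONS", "dataPoints": [{"value": 12.0}]},
    {"name": "CACHE_USED_BYTES", "dataPoints": []},  # empty series -> 0
]
print(summarize(measurements))  # {'CONNECTIONS': 12.0, 'CACHE_USED_BYTES': 0}
```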

Contributor

@niraj-elastic : Please address this

Contributor Author

@efd6 I have updated the code according to your suggestions. I have removed drop_empty since the code works fine without it. Let me know if any other changes are required.

processors:
- set:
field: ecs.version
value: 8.11.0
Contributor

Is this required since we are now using ecs@mappings?

Contributor Author

I will need to check this in detail; ecs.version got set to 8.0.0 automatically when I removed it.

Contributor (ishleenk17), Apr 15, 2024

Is this just a tag?
This value being auto-set can create confusion in the user's mind.

Member

Our datasets are in ECS. Setting the version is a nice indicator that the data users are looking at is fully ECS, and we also use this sometimes in telemetry.

For datasets/integrations: can it be assumed that each time an integration is released, it is on the most recent version that this version of elastic-package supports until set otherwise? Or, going even further, can it be assumed that the ECS version used by the integration is the version currently shipped via Fleet?

Contributor

IMO the ECS version used by the integration should be the one shipped via Fleet.
There is now an issue created for it.

type: group
fields:
- name: msg
description: The average rate of message asserts per second over the selected sample period. These internal server errors have a well-defined text string. Atlas logs stack traces for these.
Contributor

"These internal server errors"

We are using asserts first and then errors later. Lets use the same term.

metric_type: gauge
unit: byte
- name: write.bytes
description: The maximum Disk read Latency value over the time period specified by the metric granularity.
Contributor

Suggested change
description: The maximum Disk read Latency value over the time period specified by the metric granularity.
description: The maximum disk read latency value over the period specified by the metric granularity.

@niraj-elastic niraj-elastic changed the title [WIP] [MongoDB Atlas] Process data stream [MongoDB Atlas] Process data stream Apr 11, 2024
@niraj-elastic niraj-elastic marked this pull request as ready for review April 11, 2024 17:19
@niraj-elastic niraj-elastic self-assigned this Apr 11, 2024
@elasticmachine

🚀 Benchmarks report

To see the full report, comment with /test benchmark fullreport.

- Set up alerts to minimize Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) by quickly referencing relevant logs during troubleshooting.

## Data streams

Contributor

I hope that when we merge both logs and metrics, we will merge the documentation appropriately, as this currently only covers metrics.

Contributor Author

Yes, once any PR is merged we will sync the readme.

? state
: state.with(request("GET", state.url + "/api/atlas/v2/groups/" + state.groupId + "/processes?pageNum=" + string(state.page_num) + "&itemsPerPage=100").with({
Contributor

@niraj-elastic : Please address this


Data streams:

- `process` : This data stream Collects host Metrics per process for all the hosts of the specified group. Metrics like Measurements for the host, such as CPU usage, number of I/O operations and memory are available on this data stream.
Contributor

Suggested change
- `process` : This data stream Collects host Metrics per process for all the hosts of the specified group. Metrics like Measurements for the host, such as CPU usage, number of I/O operations and memory are available on this data stream.
- `process` : This data stream collects host metrics per process for all the hosts of the specified group. Metrics like measurements for the host, such as CPU usage, number of I/O operations and memory are available on this data stream.

type: group
fields:
- name: max.pct
description: MAX Children User CPU usage scaled to a range of 0% to 100% by dividing by the number of CPU cores.
Contributor

Suggested change
description: MAX Children User CPU usage scaled to a range of 0% to 100% by dividing by the number of CPU cores.
description: MAX Children user CPU usage scaled to a range of 0% to 100% by dividing by the number of CPU cores.

Can we update the "by dividing by the number of CPU cores." part of this description to make it more readable?
This applies wherever we add this statement.

type: group
fields:
- name: deleted
description: Displays the documents per second deleted.
Contributor

Suggested change
description: Displays the documents per second deleted.
description: Displays the documents deleted per second.

Applies to all the below document related metrics.

type: group
fields:
- name: kernel.pct
description: The amount of CPU time spent by the Full-Text Search process in kernel space.
Contributor

Suggested change
description: The amount of CPU time spent by the Full-Text Search process in kernel space.
description: The amount of CPU time spent by the Full-Text search process in kernel space.

Applicable wherever we are using this description.

description: Amount of COMPUTED process memory, in mebibytes (MiB).
type: double
metric_type: gauge
- name: mapped.mb
Contributor

While looking into the MongoDB metrics document, I am curious why the computed memory is represented in MiB while the rest are in MB. Can you please verify the difference in the units here?

Contributor Author

Likely it is MB too; I have updated this field.
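For reference, the two units differ by about 4.9%, so mixing them silently skews memory metrics. A quick sketch of the conversion (the constants are standard definitions, not values from this integration):

```python
# MiB (mebibyte) is 2**20 bytes; MB (megabyte) is 10**6 bytes.
MIB = 2**20  # 1,048,576 bytes
MB = 10**6   # 1,000,000 bytes

def mib_to_mb(mib: float) -> float:
    """Convert a value labelled in MiB to the equivalent in MB."""
    return mib * MIB / MB

print(mib_to_mb(512))  # 536.870912 -- a ~4.9% difference from 512
```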

description: Timestamp of the latest operation entry in the oplog on the primary node.
type: double
metric_type: gauge
unit: s
Contributor

If the value is a timestamp as per the description, then unit: s has to be removed and the field should be renamed; if the value is not a timestamp, then we need to update the description.

Contributor Author

I have updated the description.

type: group
fields:
- name: kb
description: Physical memory used in kilobytes.
Contributor

nit: Can we have consistency across the descriptions? Please remove the extra comma added to the other field descriptions.

type: group
fields:
- name: max.pct
description: MAX User CPU usage of processes on the host scaled to a range of 0 to 100% by dividing by the number of CPU cores.
Contributor

Suggested change
description: MAX User CPU usage of processes on the host scaled to a range of 0 to 100% by dividing by the number of CPU cores.
description: MAX user CPU usage of processes on the host scaled to a range of 0 to 100% by dividing by the number of CPU cores.

version: 0.0.1
source:
license: "Elastic-2.0"
description: This Elastic integration collects metrics from MongoDB Atlas integration.
Contributor

Suggested change
description: This Elastic integration collects metrics from MongoDB Atlas integration.
description: This Elastic integration collects metrics from MongoDB Atlas database.

policy_templates:
- name: mongodb_atlas
title: MongoDB Atlas metrics
description: Collect MongoDB Atlas and metrics.
Contributor

Can you please check the description here?

@ishleenk17 (Contributor)

@muthu-mps : Do you have any other comments?
I have left a small comment, otherwise looks good!

@muthu-mps (Contributor)

@niraj-elastic - While checking the process metrics, the REPLICATION_STATUS_HEALTH and REPLICATION_STATUS_STATE metrics are not available in the fields. These help in identifying the health of the running replica servers. Did we encounter these metrics during the analysis phase?

"group_id": state.group_id,
"query": state.query,
})
)
Contributor

Please add a final newline.

f.measurements.map(m, m.name),
f.measurements.map(m, m.dataPoints.map(d, d.value).as(v, size(v) == 0 ? 0 : v[0]))
)}).drop(["measurements", "links"])),
"hostlist": (int(state.next)+1) < size(state.hostlist) ? state.hostlist : [],
Contributor

Why are you not using the optional field here? Is this to avoid complication on the want_more iteration?

Contributor Author (niraj-elastic), Apr 16, 2024

As discussed offline, this change would bring more complication to our existing CEL script, so I am keeping it as-is for now. If we decide to change this, I will raise another PR in the future.
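The cursor behavior under discussion can be sketched in Python: the program walks the fetched host list one entry per poll, signalling want_more until the list is exhausted, and (in the variant kept here) resets hostlist to an empty list rather than dropping it via an optional field. This is a simplified model under stated assumptions: it ignores the page_num condition, and the host names are hypothetical.

```python
# Minimal model of the CEL cursor state: hostlist of pending measurement
# URLs, next index into it, and want_more to request another poll.

def step(state):
    nxt = state["next"] + 1
    done = nxt >= len(state["hostlist"])
    return {
        # empty list instead of an optional field, mirroring this PR
        "hostlist": [] if done else state["hostlist"],
        "next": 0 if done else nxt,
        "want_more": not done,
    }

state = {"hostlist": ["host-a", "host-b"], "next": 0}
state = step(state)
print(state["want_more"])  # True: host-b is still pending
state = step(state)
print(state["want_more"])  # False: list exhausted, hostlist cleared
```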

).as(state, state.next >= size(state.hostlist) ? {} :
request("GET", string(state.hostlist[state.next])).with({
"Header": {
"Accept": ["application/vnd.atlas." + string(timestamp(now).getFullYear()) + "-01-01+json"]
Contributor

Sorry, I missed this

Suggested change
"Accept": ["application/vnd.atlas." + string(timestamp(now).getFullYear()) + "-01-01+json"]
"Accept": ["application/vnd.atlas." + string(now.getFullYear()) + "-01-01+json"]

now is already a timestamp, so no conversion is required.

niraj-elastic (Contributor Author) commented Apr 16, 2024

@niraj-elastic - While checking the process metrics, the REPLICATION_STATUS_HEALTH and REPLICATION_STATUS_STATE metrics are not available in the fields. These help in identifying the health of the running replica servers. Did we encounter these metrics during the analysis phase?

@muthu-mps I checked the API documentation for all the metrics, but I cannot find a reference to REPLICATION_STATUS_HEALTH or REPLICATION_STATUS_STATE. So it seems these metrics cannot be included as of now.

@muthu-mps (Contributor)

@niraj-elastic - While checking the process metrics, the REPLICATION_STATUS_HEALTH and REPLICATION_STATUS_STATE metrics are not available in the fields. These help in identifying the health of the running replica servers. Did we encounter these metrics during the analysis phase?

@muthu-mps I checked the API documentation for all the metrics, but I cannot find a reference to REPLICATION_STATUS_HEALTH or REPLICATION_STATUS_STATE. So it seems these metrics cannot be included as of now.

Fine, let's figure out this metric later.

- name: input.type
type: keyword
description: Type of Filebeat input.
- name: tags
Contributor

Can we leverage the ECS field for tags?

Contributor Author

The tags field falls under the base fields group, hence we have kept it in the base-fields.yml file. Let me know if we have to change it.

Contributor

  • Those are not the current specifications we follow for base-fields. The data stream fields are not part of the base fields group. Are we removing those fields from the file? No. One of the requirements when dynamic mapping was enabled is to retain the base-fields mentioned here.
  • I would recommend removing the tags field from base-fields.yml.

@SubhrataK

Dashboard looks good to me

@elasticmachine

💚 Build Succeeded

History

cc @niraj-elastic

Contributor

@muthu-mps muthu-mps left a comment


LGTM!

ishleenk17 (Contributor) left a comment

Looks good!

@niraj-elastic niraj-elastic merged commit 5f6cb12 into elastic:main Apr 16, 2024
5 checks passed
@elasticmachine

Package mongodb_atlas - 0.0.1 containing this change is available at https://epr.elastic.co/search?package=mongodb_atlas

9 participants