Change stream.* fields to dataset.* fields #482

ruflin · 2020-05-27T06:24:19Z

For the new indexing strategy, the fields currently used are stream.type, stream.dataset, and stream.namespace. Over the last few weeks it has become clear that these fields might not be optimal, so the proposal is to change them to dataset.type, dataset.name, and dataset.namespace.
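As a rough illustration of the proposed rename, the sketch below maps the three stream.* fields onto their dataset.* counterparts. It assumes documents are represented as flat dictionaries with dotted keys; the helper name `rename_stream_fields` and the sample document are hypothetical and not taken from any Elastic code base.

```python
# Mapping from the current field names to the proposed ones (from the proposal text).
FIELD_RENAMES = {
    "stream.type": "dataset.type",
    "stream.dataset": "dataset.name",
    "stream.namespace": "dataset.namespace",
}


def rename_stream_fields(doc: dict) -> dict:
    """Return a copy of doc with stream.* keys renamed to dataset.* keys.

    Hypothetical helper: assumes a flat dict with dotted keys.
    Keys that are not part of the rename are copied unchanged.
    """
    return {FIELD_RENAMES.get(key, key): value for key, value in doc.items()}


# Sample document (illustrative values only).
old_doc = {
    "stream.type": "logs",
    "stream.dataset": "nginx.access",
    "stream.namespace": "default",
    "message": "GET / 200",
}
new_doc = rename_stream_fields(old_doc)
print(new_doc)
```

Note that stream.dataset becomes dataset.name, so the rename is not a pure prefix swap.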

Note: This issue is filed in the package registry because the registry currently enforces these fields and is public, but many other places will need updating if we move forward with this.

What is the problem with stream.* fields?

stream is Agent specific: The name stream.* originally came from building the Elastic Agent configuration, where we have inputs with streams and each stream goes to a single dataset. But anyone can use the new indexing strategy, so it should not be tied to a specific technology.
It is more than a stream: Proposed values for stream.type can also be content, which is not necessarily a stream. See also [Meta] Add ECS Dataset fields ecs#845
Talking about dataset as a whole: When talking about the indexing strategy I realised I often talk about the dataset name and all of it as one dataset. A dataset is a set of data which belongs together. It is uniquely defined by the type, name and namespace. For example, logs-nginx.access-default and logs-nginx.access-prod are two different datasets.
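The idea that a dataset is uniquely defined by type, name and namespace can be sketched as a small value type. This is an illustration only: the `Dataset` class and its `identifier` method are hypothetical, and the `{type}-{name}-{namespace}` pattern is inferred from the two examples above rather than from a specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A dataset identified by the (type, name, namespace) triple."""

    type: str
    name: str
    namespace: str

    def identifier(self) -> str:
        # Assumed pattern, inferred from "logs-nginx.access-default".
        return f"{self.type}-{self.name}-{self.namespace}"


default = Dataset(type="logs", name="nginx.access", namespace="default")
prod = Dataset(type="logs", name="nginx.access", namespace="prod")

print(default.identifier())  # logs-nginx.access-default
print(prod.identifier())     # logs-nginx.access-prod
print(default == prod)       # False: same type and name, different namespace
```

Making the dataclass frozen gives value equality and hashing over all three parts, which matches the statement that changing only the namespace yields a different dataset.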

Based on the above I came to the conclusion that dataset should be an object and used for the indexing strategy fields.

One alternative that was discussed is using datastream instead, as each dataset is stored in a datastream. But not every datastream is a dataset per this definition, and it would again tie the fields to a specific technology implementation.

The other alternative discussed was using existing ECS fields like event.kind and event.dataset. This does not work because the field types are different (constant_keyword), and we will be even stricter on names than these fields currently are. The idea is still that these fields will be closely linked in their possible values.

Benefits of dataset.*

Using dataset.* also solves some existing problems:

stream.* conflicts with an existing Docker input field in Filebeat, which is a keyword.
Decoupling of input.type in the Elastic Agent config from dataset.type. Even if the input.type is log, the dataset.type could be metrics if the log file contains metrics.
Removes confusion inside the agent config between stream and streams.

Changes needed

Places to change current stream.* implementation:

Elastic Agent field enrichment
Endpoint binary field enrichment
Package registry field validation
Package registry base package with templates
Integrations repository field addition
Integrations repository modules export script for dashboard filters
Indexing strategy docs

This change will likely have no impact on the UI side.

andresrc · 2020-05-27T07:44:40Z

👍

webmat · 2020-05-27T15:31:34Z

Yeah I like this new proposal better. dataset.* is a bit more agnostic in that it works just as well for time series events, and for other types of documents 👍

neptunian · 2020-05-27T16:39:24Z

I believe there are at least 2 queries in Ingest Manager that currently use the stream fields.

ruflin · 2020-05-28T11:22:32Z

I'll leave this open until Monday 2020-06-01 and if by then no objections are raised, will proceed with the implementation.

mostlyjason · 2020-05-28T19:02:16Z

@ruflin one key question is whether dataset.type will have separate values for logs and events? I think in ECS there is event.kind with a value of event but not log. If we align the values with event.kind as you mentioned, it might make sense to also open a PR to add log to event.kind.

webmat · 2020-05-28T19:13:46Z

There's currently no plan to add the value "log" to event.kind in ECS. "event" is a category that's meant to include logs.

ruflin · 2020-06-02T06:39:19Z

@mostlyjason Good point, I missed that event.kind lacks a log value :-( Perhaps I need to loosen my statement that it is a strict subset. On our end we definitely need logs.

mostlyjason · 2020-06-02T15:39:26Z

@ruflin what determines the allowed values for dataset.type? Are package creators allowed to put any value or do we have constraints on allowed ones? If it's constrained, we should publish a list of allowed values.

ph · 2020-06-02T17:15:42Z

@mostlyjason @ruflin I don't think we should allow everything under that field; we could expand it later. I would expect the allowed types to be listed as part of the indexing strategy document?

ruflin · 2020-06-02T19:05:17Z

We will enforce it in the validation code of the package-registry. So if a package creator uses a value not allowed, the package is invalid and cannot be published.

++ on publishing it. We will figure out where, potentially ECS ;-)
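The enforcement described in this comment could look roughly like the sketch below. It is not the package-registry's actual validation code, and the allowed set is an assumption: only logs and metrics are named as dataset.type values in this thread, and the real list had not been published at the time of this discussion. The function name `validate_dataset_type` is hypothetical.

```python
# Assumption: only the two values mentioned in this thread.
ALLOWED_DATASET_TYPES = frozenset({"logs", "metrics"})


def validate_dataset_type(value: str) -> None:
    """Raise ValueError if value is not an allowed dataset.type.

    Hypothetical check: a package using a disallowed value would be
    treated as invalid and could not be published.
    """
    if value not in ALLOWED_DATASET_TYPES:
        allowed = ", ".join(sorted(ALLOWED_DATASET_TYPES))
        raise ValueError(
            f"invalid dataset.type {value!r}; allowed values: {allowed}"
        )


validate_dataset_type("logs")  # passes silently

try:
    validate_dataset_type("not-a-type")
except ValueError as err:
    print(err)
```

Keeping the allowed values in one constant means the published list and the enforced list can be generated from the same source.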

ruflin · 2020-06-03T10:49:41Z

Closing this issue as I think we should move forward here. Follow up implementation issue can be found here: #491

webmat · 2020-06-10T18:25:16Z

@ruflin Did you forget to actually close the issue? ;-)

ruflin self-assigned this May 27, 2020

ruflin mentioned this issue May 27, 2020

[Elastic Agent] Proposal: Change structure of Elastic Agent config elastic/beats#18758

Closed

ph added the Team:Ingest Management label May 27, 2020

webmat mentioned this issue May 28, 2020

[Meta] Add ECS Dataset fields elastic/ecs#845

Closed

This was referenced May 28, 2020

[Ingest Manager] Migrate from using stream.* fields to dataset.* fields elastic/kibana#67672

Closed

[Elastic Agent] Migrate from using stream.* fields to dataset.* fields elastic/beats#18826

Closed

ruflin mentioned this issue Jun 3, 2020

[Indexing Strategy] Change stream.* to dataset.* fields #491

Closed

8 tasks

This was referenced Jun 9, 2020

[Elastic Agent] Change structure of Elastic Agent across the Stack elastic/beats#19082

Closed

Change usage of categories #478

Closed

ph closed this as completed Jun 10, 2020
