Change stream.* fields to dataset.* fields #482

Closed
ruflin opened this issue May 27, 2020 · 12 comments

@ruflin
Member

ruflin commented May 27, 2020

The new indexing strategy currently uses the fields stream.type, stream.dataset, and stream.namespace. Over the last weeks it became clear that these fields might not be optimal, so the proposal is to change them to dataset.type, dataset.name, and dataset.namespace.
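
To make the proposed rename concrete, here is a minimal Python sketch of the mapping applied to a single document. The helper name `rename_stream_fields` and the flat, dotted-key document layout are assumptions for illustration only; they are not taken from any existing implementation.

```python
# Proposed rename: old stream.* field -> new dataset.* field.
FIELD_RENAMES = {
    "stream.type": "dataset.type",
    "stream.dataset": "dataset.name",
    "stream.namespace": "dataset.namespace",
}


def rename_stream_fields(doc: dict) -> dict:
    """Return a copy of doc with stream.* keys renamed to dataset.* keys.

    Assumes a flat document with dotted keys (hypothetical layout).
    Keys that are not part of the rename are passed through unchanged.
    """
    return {FIELD_RENAMES.get(key, key): value for key, value in doc.items()}


old_doc = {
    "stream.type": "logs",
    "stream.dataset": "nginx.access",
    "stream.namespace": "default",
    "message": "GET /index.html",
}
new_doc = rename_stream_fields(old_doc)
print(new_doc)
```

Note that only stream.dataset changes its leaf name (to dataset.name); type and namespace keep theirs.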

Note: This issue is filed in the package registry because at the moment the registry enforces these fields and is public, but many other places will need an update if we move forward with this.

What is the problem with stream.* fields?

  • stream is Agent specific: The name stream.* originally came out of building the Elastic Agent configuration, where we have inputs with streams, and each stream goes to a single dataset. But anyone can use the new indexing strategy, so it should not be tied to a specific technology.
  • It is more than a stream: Proposed values for stream.type can also be content, which is not necessarily a stream. See also [Meta] Add ECS Dataset fields ecs#845
  • Talking about dataset as a whole: When talking about the indexing strategy I realised I often talk about the dataset name and all of it as one dataset. A dataset is a set of data which belongs together. It is uniquely defined by the type, name and namespace. For example, logs-nginx.access-default and logs-nginx.access-prod are two different datasets.
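
The idea that a dataset is identified by the triple of type, name and namespace can be sketched in Python as follows. The `Dataset` class and its `full_name` method are hypothetical names chosen for illustration; the `type-name-namespace` format is inferred from the `logs-nginx.access-default` example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A dataset is uniquely defined by its type, name and namespace."""

    type: str
    name: str
    namespace: str

    def full_name(self) -> str:
        # Format inferred from the example logs-nginx.access-default.
        return f"{self.type}-{self.name}-{self.namespace}"


default_access = Dataset(type="logs", name="nginx.access", namespace="default")
prod_access = Dataset(type="logs", name="nginx.access", namespace="prod")

print(default_access.full_name())
print(prod_access.full_name())
print(default_access == prod_access)
```

Because the dataclass is frozen, two instances compare equal only when all three fields match, which mirrors the statement that changing the namespace alone yields a different dataset.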

Based on the above I came to the conclusion that dataset should be an object and used for the indexing strategy fields.

One alternative that was discussed is using datastream instead, as each dataset is stored in a datastream. But not every datastream is a dataset per this definition, and it would again tie the fields to a specific technology implementation.

The other alternative discussed was using existing ECS fields like event.kind and event.dataset. This does not work because the types are different (constant_keyword), and we will be even stricter on names than these fields currently are. The idea is still that these fields will be closely linked in their possible values.

Benefits of dataset.*

Using dataset.* also solves some existing problems:

  • stream.* conflicts with an existing docker input field in Filebeat which is a keyword
  • Decoupling of input.type in the Elastic Agent config from dataset.type. Even if the input.type is log, the dataset.type could be metrics if the log file contains metrics.
  • Removes confusion inside the agent config between stream and streams.

Changes needed

Places to change current stream.* implementation:

  • Elastic Agent field enrichment
  • Endpoint binary field enrichment
  • Package registry field validation
  • Package registry base package with templates
  • Integrations repository field addition
  • Integrations repository modules export script for dashboard filters
  • Indexing strategy docs

This change will likely have no impact on the UI side.

@andresrc

👍

@webmat

webmat commented May 27, 2020

Yeah I like this new proposal better. dataset.* is a bit more agnostic in that it works just as well for time series events, and for other types of documents 👍

@neptunian
Contributor

neptunian commented May 27, 2020

I believe there are at least 2 queries in Ingest Manager that currently use the stream fields.

@ruflin
Member Author

ruflin commented May 28, 2020

I'll leave this open until Monday 2020-06-01 and if by then no objections are raised, will proceed with the implementation.

@mostlyjason

@ruflin one key question is whether dataset.type will have separate values for logs and events? I think in ECS there is event.kind with a value of event but not log. If we align the values with event.kind as you mentioned, it might make sense to also open a PR to add log to event.kind.

@webmat

webmat commented May 28, 2020

There's currently no plan to add the value "log" to event.kind in ECS. "event" is a category that's meant to include logs.

@ruflin
Member Author

ruflin commented Jun 2, 2020

@mostlyjason Good point, I missed that event.kind lacks logs :-( Perhaps I need to loosen up my statement that it is a strict subset. On our end we definitely need logs.

@mostlyjason

@ruflin what determines the allowed values for dataset.type? Are package creators allowed to put any value or do we have constraints on allowed ones? If it's constrained, we should publish a list of allowed values.

@ph

ph commented Jun 2, 2020

@mostlyjason @ruflin I don't think we should allow everything under that field; we could expand it later. I would expect the allowed types to be listed as part of the indexing strategy document?

@ruflin
Member Author

ruflin commented Jun 2, 2020

We will enforce it in the validation code of the package-registry. So if a package creator uses a value not allowed, the package is invalid and cannot be published.
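
As a rough illustration of what such an allow-list check could look like, here is a Python sketch. It is not the package-registry's actual validation code. The allow-list below is an assumption: it contains only `logs` and `metrics`, the two values mentioned in this thread, and the function name `validate_dataset_type` is hypothetical.

```python
# Assumed allow-list: only the values mentioned in this thread.
ALLOWED_DATASET_TYPES = frozenset({"logs", "metrics"})


def validate_dataset_type(value: str) -> None:
    """Raise ValueError if value is not an allowed dataset.type."""
    if value not in ALLOWED_DATASET_TYPES:
        allowed = ", ".join(sorted(ALLOWED_DATASET_TYPES))
        raise ValueError(
            f"invalid dataset.type {value!r}; allowed values: {allowed}"
        )


validate_dataset_type("logs")  # passes silently

try:
    validate_dataset_type("foo")
except ValueError as err:
    print(f"package rejected: {err}")
```

Keeping the allowed values in a single constant means the published list and the enforced list can be generated from the same source.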

++ on publishing it. We will figure out where, potentially ECS ;-)

@ruflin
Member Author

ruflin commented Jun 3, 2020

Closing this issue as I think we should move forward here. The follow-up implementation issue can be found here: #491

@webmat

webmat commented Jun 10, 2020

@ruflin Did you forget to actually close the issue? ;-)

@ph ph closed this as completed Jun 10, 2020