-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[POC][Idea] Create field mappings on the fly instead of using template #7972
Conversation
Currently Beats loads in advance a template with all fields inside and each index will rely on the template. This PR turns this concept around by having a strict template without any fields inside. This means when a Beat tries to ingest data which does not exist yet the requested will be rejected. If it's a mapping rejection the Beat will check all the events which failed to send and get their fields and the ready the mappings for it. Out of these an index mapping is created and pushed to the index before the events are sent again. This has the following advantages * The mappings existing in an index are only for the fields that exist in an index * If someone uses a different index pattern, the mapping is still applied even if the template would not match * If we use an index per module, it just works and not all libbeat fields for the processors are added if they are not used (for example k8s) * Every time a new index is created the mapping is recreated so if a module is used only once it will not be in all the following indices * Support for adding new fields on the fly: With append_fields the user can add new fields to mapping without having to restart and it will work with reloading. There are also disadvantages * Every time a new index is created all mappings have to be sent again. So instead of having one request there are 3. * Assuming 1000 Beats send data to one index, all the Beats will try to write the mappings again at the same time -> peak load? Instead of trying to apply this change to the existing ingest path I could see this becoming interesting in combination with ILM as it would allow us there to use an index pattern which does not match the current template. Questions * How do we handle dynamic mapping fields? Should we add logic to Beats to also create them? * What about fields which have no mapping assigned -> keyword or delete entry? * Should we get rid of the template completely and always manually create an index? Test implementation In the current implementation most things are hard coded, global look ups and short cuts were taken as this is only meant as a POC. To test it first the template must be created: ``` PUT _template/metricbeat-7.0.0-alpha1 { "index_patterns": ["metricbeat-*"], "settings": { "number_of_shards": 1 }, "mappings": { "doc": { "dynamic": "strict" } } } ``` As the code does not create indices automatically yet, the following index has to be added (adjust the date): ``` PUT metricbeat-7.0.0-alpha1-2018.08.15 ``` The same index name must be hardcoded in the client.go file around 596.
@@ -39,6 +39,8 @@ var ( | |||
dynamicTemplates []common.MapStr | |||
|
|||
defaultFields []string | |||
|
|||
Fields common.Fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported var Fields should have comment or be unexported
return fields | ||
} | ||
|
||
func (f Fields) Flatten() map[string]Field { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported method Fields.Flatten should have comment or be unexported
@@ -216,3 +251,44 @@ func (f Fields) getKeys(namespace string) []string { | |||
|
|||
return keys | |||
} | |||
|
|||
// GetKeys returns a flat list of keys this Fields contains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on exported method Fields.GetFlatFields should be of the form "GetFlatFields ..."
@@ -119,6 +119,15 @@ func (f Fields) HasKey(key string) bool { | |||
return f.hasKey(keys) | |||
} | |||
|
|||
// HasKey checks if inside fields the given key exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on exported method Fields.GetKey should be of the form "GetKey ..."
@@ -39,6 +39,8 @@ var ( | |||
dynamicTemplates []common.MapStr | |||
|
|||
defaultFields []string | |||
|
|||
Fields common.Fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported var Fields should have comment or be unexported
return fields | ||
} | ||
|
||
func (f Fields) Flatten() map[string]Field { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported method Fields.Flatten should have comment or be unexported
@@ -216,3 +251,44 @@ func (f Fields) getKeys(namespace string) []string { | |||
|
|||
return keys | |||
} | |||
|
|||
// GetKeys returns a flat list of keys this Fields contains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on exported method Fields.GetFlatFields should be of the form "GetFlatFields ..."
@@ -119,6 +119,15 @@ func (f Fields) HasKey(key string) bool { | |||
return f.hasKey(keys) | |||
} | |||
|
|||
// HasKey checks if inside fields the given key exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on exported method Fields.GetKey should be of the form "GetKey ..."
Question, what's the user pain we're trying to solve here? Is it that dealing with a lot of mappings that aren't in use is a pain? Is there a performance issue? This is quite nifty, but it'd be helpful in evaluating this proposal to better understand that. |
I'm worried about the potential downsides you're listing, for one. But I'm also worried about the fact that now it becomes really difficult to take a "minimum privilege" approach. When running a Beats' |
For the pain points:
I would not expect this feature to replace the template approach but to be an addition. @webmat Interesting point about the security aspect. I just checked what the |
I think those goals make sense, and I like what this does in many ways. Some thoughts:
|
I think this approach because it interferes with data ingestion too much. Right now users expect that once data flow begins, it won't be rejected due to missing objects in ES (or anywhere else). That's why This is also problematic if someone does not want Beats to interact with ES. The solution requires connection between Beats and ES. I would rather make our current solution a bit "smarter". However, I am a bit worried this idea would be too hard to implement compared to the advantages it could provide. A more advanced solution could be generating the index template based on the configuration like it's done in pipeline loading or machine learning. (Right now only pipelines of enabled Filebeat modules and Machine Learning jobs are installed/uploaded when Thanks for pushing this issue forward! This can be painful to users. :( |
I like the minimal template idea @kvch |
I think that something like this is specially interesting for modules whose field mapping is unknown in For "normal" modules I like @kvch idea of loading only the fields of the enabled ones. |
To recap a short discussion I had with Nic, one of the pain points this addresses is that currently the index template is enormous. This is important because it's a big (and growing) object that's part of the cluster state. Not 100% sure if there' other specific pain points, however. But I agree with @andrewvc that this introduces a lot of magic, which I'm not a big fan of. Especially if this magic means lots of Filebeats instances suddenly try to update the mapping at the same time (either after an automation run, or at midnight UTC if we get rid of the index template altogether). The current way the template is generated is helpful but has problems too, I agree. If I understand @kvch correctly, you're saying only the parts of the template that are used are uploaded? That's not what I'm seeing. After running Some of this will be mitigated as we slowly firm up and start adopting ECS across the board. But each source/module will always have their own specific fields. I think the simplest way to address this would be to go to one index per module. If we do this, it becomes trivial to support the use case where we only upload the templates that make sense according to the And we can also structure our index templates and index names in such a way to reduce duplication, for example by defining ECS fields only once. Index template examples (including their order parameter):
And so on for each module... Here are some advantages I see in this approach:
|
cc @elastic/apm-server |
APM Server uses the same template creation and uploading logic as beats, but does not have a module structure. By default, there is a different index per event type (transaction, span, error, metric, sourcemap). Currently one template is created and applied for all indices. |
Thanks everyonf for chiming into this discussion which I was hoping for with this POC. As a bit of background: This POC comes out of my initial approach of having 1 index per module which I realised is not necessarly as great as I hoped for. One index per module has still a few downsides:
For the comments above:
One interesting thing that all template solutions we discussed so far do not offer is for example having one index for the Elasticsearch the Elasticsearch module of Filebeat and Metricbeat as the fields heavily overlap. Also it allows our users to select any index pattern and they will always have the right template which at the moment is tricky to configure correctly. As said before: I think our template approach works pretty well most of the time, so something like above should not replace it but be an addition. |
Closing this PR for now as I currently don't have the time to push this forward. @urso @monicasarbu I still think it is an idea that could be of value when we change our indexing strategy. |
Currently Beats loads in advance a template with all fields inside and each index will rely on the template. This PR turns this concept around by having a strict template without any fields inside. This means when a Beat tries to ingest data which does not exist yet the requested will be rejected. If it's a mapping rejection the Beat will check all the events which failed to send and get their fields and the ready the mappings for it. Out of these an index mapping is created and pushed to the index before the events are sent again.
This has the following advantages
There are also disadvantages
Instead of trying to apply this change to the existing ingest path I could see this becoming interesting in combination with ILM as it would allow us there to use an index pattern which does not match the current template.
Questions
Test implementation
In the current implementation most things are hard coded, global look ups and short cuts were taken as this is only meant as a POC. To test it first the template must be created:
As the code does not create indices automatically yet, the following index has to be added (adjust the date):
The same index name must be hardcoded in the client.go file around 596.