Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[POC][Idea] Create field mappings on the fly instead of using template #7972

Closed
wants to merge 1 commit into from

Conversation

ruflin
Copy link
Member

@ruflin ruflin commented Aug 15, 2018

Currently Beats loads in advance a template with all fields inside and each index will rely on the template. This PR turns this concept around by having a strict template without any fields inside. This means when a Beat tries to ingest data which does not exist yet the requested will be rejected. If it's a mapping rejection the Beat will check all the events which failed to send and get their fields and the ready the mappings for it. Out of these an index mapping is created and pushed to the index before the events are sent again.

This has the following advantages

  • The mappings existing in an index are only for the fields that exist in an index
  • If someone uses a different index pattern, the mapping is still applied even if the template would not match
  • If we use an index per module, it just works and not all libbeat fields for the processors are added if they are not used (for example k8s)
  • Every time a new index is created the mapping is recreated so if a module is used only once it will not be in all the following indices
  • Support for adding new fields on the fly: With append_fields the user can add new fields to mapping without having to restart and it will work with reloading.

There are also disadvantages

  • Every time a new index is created all mappings have to be sent again. So instead of having one request there are 3.
  • Assuming 1000 Beats send data to one index, all the Beats will try to write the mappings again at the same time -> peak load?

Instead of trying to apply this change to the existing ingest path I could see this becoming interesting in combination with ILM as it would allow us there to use an index pattern which does not match the current template.

Questions

  • How do we handle dynamic mapping fields? Should we add logic to Beats to also create them?
  • What about fields which have no mapping assigned -> keyword or delete entry?
  • Should we get rid of the template completely and always manually create an index?

Test implementation

In the current implementation most things are hard coded, global look ups and short cuts were taken as this is only meant as a POC. To test it first the template must be created:

PUT _template/metricbeat-7.0.0-alpha1
{
  "index_patterns": ["metricbeat-*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc": {
      "dynamic": "strict"
    }
  }
}

As the code does not create indices automatically yet, the following index has to be added (adjust the date):

PUT metricbeat-7.0.0-alpha1-2018.08.15

The same index name must be hardcoded in the client.go file around 596.

Currently Beats loads in advance a template with all fields inside and each index will rely on the template. This PR turns this concept around by having a strict template without any fields inside. This means when a Beat tries to ingest data which does not exist yet the requested will be rejected. If it's a mapping rejection the Beat will check all the events which failed to send and get their fields and the ready the mappings for it. Out of these an index mapping is created and pushed to the index before the events are sent again.

This has the following advantages

* The mappings existing in an index are only for the fields that exist in an index
* If someone uses a different index pattern, the mapping is still applied even if the template would not match
* If we use an index per module, it just works and not all libbeat fields for the processors are added if they are not used (for example k8s)
* Every time a new index is created the mapping is recreated so if a module is used only once it will not be in all the following indices
* Support for adding new fields on the fly: With append_fields the user can add new fields to mapping without having to restart and it will work with reloading.

There are also disadvantages

* Every time a new index is created all mappings have to be sent again. So instead of having one request there are 3.
* Assuming 1000 Beats send data to one index, all the Beats will try to write the mappings again at the same time -> peak load?

Instead of trying to apply this change to the existing ingest path I could see this becoming interesting in combination with ILM as it would allow us there to use an index pattern which does not match the current template.

Questions

* How do we handle dynamic mapping fields? Should we add logic to Beats to also create them?
* What about fields which have no mapping assigned -> keyword or delete entry?
* Should we get rid of the template completely and always manually create an index?

Test implementation

In the current implementation most things are hard coded, global look ups and short cuts were taken as this is only meant as a POC. To test it first the template must be created:

```
PUT _template/metricbeat-7.0.0-alpha1
{
  "index_patterns": ["metricbeat-*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc": {
      "dynamic": "strict"
    }
  }
}
```

As the code does not create indices automatically yet, the following index has to be added (adjust the date):

```
PUT metricbeat-7.0.0-alpha1-2018.08.15
```

The same index name must be hardcoded in the client.go file around 596.
@ruflin ruflin added the discuss Issue needs further discussion. label Aug 15, 2018
@@ -39,6 +39,8 @@ var (
dynamicTemplates []common.MapStr

defaultFields []string

Fields common.Fields

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported var Fields should have comment or be unexported

return fields
}

func (f Fields) Flatten() map[string]Field {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported method Fields.Flatten should have comment or be unexported

@@ -216,3 +251,44 @@ func (f Fields) getKeys(namespace string) []string {

return keys
}

// GetKeys returns a flat list of keys this Fields contains

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported method Fields.GetFlatFields should be of the form "GetFlatFields ..."

@@ -119,6 +119,15 @@ func (f Fields) HasKey(key string) bool {
return f.hasKey(keys)
}

// HasKey checks if inside fields the given key exists

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported method Fields.GetKey should be of the form "GetKey ..."

@@ -39,6 +39,8 @@ var (
dynamicTemplates []common.MapStr

defaultFields []string

Fields common.Fields

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported var Fields should have comment or be unexported

return fields
}

func (f Fields) Flatten() map[string]Field {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported method Fields.Flatten should have comment or be unexported

@@ -216,3 +251,44 @@ func (f Fields) getKeys(namespace string) []string {

return keys
}

// GetKeys returns a flat list of keys this Fields contains

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported method Fields.GetFlatFields should be of the form "GetFlatFields ..."

@@ -119,6 +119,15 @@ func (f Fields) HasKey(key string) bool {
return f.hasKey(keys)
}

// HasKey checks if inside fields the given key exists

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported method Fields.GetKey should be of the form "GetKey ..."

@andrewvc
Copy link
Contributor

Question, what's the user pain we're trying to solve here? Is it that dealing with a lot of mappings that aren't in use is a pain? Is there a performance issue?

This is quite nifty, but it'd be helpful in evaluating this proposal to better understand that.

@webmat
Copy link
Contributor

webmat commented Aug 16, 2018

I'm worried about the potential downsides you're listing, for one. But I'm also worried about the fact that now it becomes really difficult to take a "minimum privilege" approach.

When running a Beats' --setup, an operator (or automation) can do so with a user that has the rights to modify index templates. When the ingestion can dynamically at any time modify the template, this means a fleet of potentially thousands of servers must have this capability.

@ruflin
Copy link
Member Author

ruflin commented Aug 17, 2018

For the pain points:

  • Large cluster states: As the Metricbeat mapping contains lots of unused fields and the mapping is stored for each index / shard, this has quite an effect on the cluster state. We wanted to improve this with an index per module but as mentioned in the PR description, it does not fully solve the issue.
  • No index template: It's a common problem that users change the name of the index but don't adjust the template correctly. So they will send data to an index without template and at one point ingestion stops as they have a type conflict (0 was converted to long instead of float as an example

I would not expect this feature to replace the template approach but to be an addition.

@webmat Interesting point about the security aspect. I just checked what the create and write permissions allow. It seems the write permission also grants access to the update mapping action. As we will need the write permission in any case it should not make a difference.

@andrewvc
Copy link
Contributor

andrewvc commented Aug 17, 2018

I think those goals make sense, and I like what this does in many ways.

Some thoughts:

  1. @webmat made a great point about security. Configuring security for this would be awkward IMHO. I suppose we could hide that somehow
  2. Field explosion is irritating to deal with now, and will only get worse. The problems you cite are real and should be addressed.
  3. This is a lot of magic. I wonder if by solving a problem some users face we are adding complexity to understanding the system as a whole.
  4. I can't shake the feeling that the solution here should be an ES feature in the templating system. This is very similar to dynamic templates. If I had to name this as an ES feature, I'd call it "Lazy Templates", or similar.

@kvch
Copy link
Contributor

kvch commented Aug 17, 2018

I think this approach because it interferes with data ingestion too much. Right now users expect that once data flow begins, it won't be rejected due to missing objects in ES (or anywhere else). That's why setup command is valuable. It makes it possible to setup everything in advance.

This is also problematic if someone does not want Beats to interact with ES. The solution requires connection between Beats and ES.

I would rather make our current solution a bit "smarter". However, I am a bit worried this idea would be too hard to implement compared to the advantages it could provide. A more advanced solution could be generating the index template based on the configuration like it's done in pipeline loading or machine learning. (Right now only pipelines of enabled Filebeat modules and Machine Learning jobs are installed/uploaded when setup is called.) To include only fields of enabled modules and processors we could utilize the existing fields.yml collector functions we added recently. WDYT?

Thanks for pushing this issue forward! This can be painful to users. :(

@andrewvc
Copy link
Contributor

I like the minimal template idea @kvch

@jsoriano
Copy link
Member

I think that something like this is specially interesting for modules whose field mapping is unknown in setup time, for example the jolokia module doesn't contain any template, but during runtime it can request also the types of all the fields it collects, it'd be interesting if it could send the mapping with its first event or something like this. Similar things could be also done with other modules like munin or http.

For "normal" modules I like @kvch idea of loading only the fields of the enabled ones.

@webmat
Copy link
Contributor

webmat commented Aug 17, 2018

To recap a short discussion I had with Nic, one of the pain points this addresses is that currently the index template is enormous. This is important because it's a big (and growing) object that's part of the cluster state. Not 100% sure if there' other specific pain points, however.

But I agree with @andrewvc that this introduces a lot of magic, which I'm not a big fan of. Especially if this magic means lots of Filebeats instances suddenly try to update the mapping at the same time (either after an automation run, or at midnight UTC if we get rid of the index template altogether).

The current way the template is generated is helpful but has problems too, I agree. If I understand @kvch correctly, you're saying only the parts of the template that are used are uploaded? That's not what I'm seeing. After running ./filebeats --setup --modules=suricata, I have an index template ready to receive logs from all of the Filebeats modules. Not sure if I missed anything. But I could see why this is the case: we're generating one template, and if a user runs setup with --modules=nginx and later realizes he should also add --modules=mysql, we can't overwrite the whole template to only cover mysql.

Some of this will be mitigated as we slowly firm up and start adopting ECS across the board. But each source/module will always have their own specific fields.

I think the simplest way to address this would be to go to one index per module. If we do this, it becomes trivial to support the use case where we only upload the templates that make sense according to the --setup -modules=X commands that get run over time.

And we can also structure our index templates and index names in such a way to reduce duplication, for example by defining ECS fields only once.

Index template examples (including their order parameter):

IT name IT matcher order IT content
ecs "*-ecs-*" 10 All generic ECS fields
filebeat-7.0.0-ecs-nginx "filebeat-7.0.0-*-nginx-*" 5 All fields specific to the nginx module

And so on for each module...

Here are some advantages I see in this approach:

  • This has the advantage of being really straightforward: there's an index template for each module the user decides to use, and it's not polluted by other module's fields.
  • Usage of order allows customers or "future us" to override things if ever needed.
  • We're no longer encouraging the creation of sparse indices. Each index will contain a few types of events at most. I would for example have nginx access and error logs in the same index, but that's it.
  • Cluster state doesn't need to contain an index template that's ready to support an ever growing list of technologies that are largely not used by the customer. They only create the index templates that necessary for the technologies & modules they actually intend to use.
  • This isolates each module from one another, making the occasional field name clashes less painful. Clashes will cause problems at visualization time when pulling from filebeats-*, but not at ingestion time, which is a big improvement.

@simitt
Copy link
Contributor

simitt commented Aug 20, 2018

cc @elastic/apm-server

@simitt
Copy link
Contributor

simitt commented Aug 20, 2018

APM Server uses the same template creation and uploading logic as beats, but does not have a module structure. By default, there is a different index per event type (transaction, span, error, metric, sourcemap). Currently one template is created and applied for all indices.
Allowing for a separate template per index, would make a lot of sense for APM from my perspective. Not sure how we could support this with the current setup.template config options from the beat.yml though.
I am also not a huge fan of having quite some magic going on during data ingestion.

@ruflin
Copy link
Member Author

ruflin commented Aug 20, 2018

Thanks everyonf for chiming into this discussion which I was hoping for with this POC. As a bit of background: This POC comes out of my initial approach of having 1 index per module which I realised is not necessarly as great as I hoped for. One index per module has still a few downsides:

  • Indices are still rather sparse as each metricset/fileset will have a lot of different fields. So to have really dense index we would need an index per metricset/fileset.
  • We have more and more processors like add_kubernetes_metata. All these fields would be also in each index which makes it sparser. Loading processor templates on demand would be an idea but also adds complexity.
  • Module fields can overlap, so in the case of Kibana it has Elasticsearch fields which makes 1 index per module tricky. The shared fields can all be global but will again increase the number of fields in a template.
  • Having said all the above, I don't think having sparse indices is really an issue in Elasticsearch 6.0 and I'm more worried about template size / cluster state.

For the comments above:

  • Security: It seems same user rights are needed for write and adding mapping, so security should not be an issue.
  • Lazy loading: We actually discussed this in the past and I remember @pickypg even brought it up with the Elasticsearch team but I think it was not feasible back then. Perhaps worth discussing again.
  • Rejection of data: It's not so different from what happens now. If a wrong template is used or a dynamic field is used and an other Beat has a different typed value, the document gets rejected. It's basically moving the dynamic field creation to our side but the outcome should be exactly the same. If users modify the template they will have the same affect as before.
  • For making the existing solution smarter: One way here could be having lots of small templates for each processor, module etc. That would mean we add complexity to the template loading to decide what template should be loaded based on config options I assume. For this I think the above is a more generic solution.
  • Overhead: It's something I'm worried over but not too much as if 1000 Beats are sending data to a ES cluster I think a short peak of 3x the request should be easy to handle. I had a discussion with @jpountz about potentially reduce it to 2 request if needed, meaning sending the new mapping as part of a bulk request.

One interesting thing that all template solutions we discussed so far do not offer is for example having one index for the Elasticsearch the Elasticsearch module of Filebeat and Metricbeat as the fields heavily overlap. Also it allows our users to select any index pattern and they will always have the right template which at the moment is tricky to configure correctly.

As said before: I think our template approach works pretty well most of the time, so something like above should not replace it but be an addition.

@ruflin
Copy link
Member Author

ruflin commented Dec 14, 2018

Closing this PR for now as I currently don't have the time to push this forward. @urso @monicasarbu I still think it is an idea that could be of value when we change our indexing strategy.

@ruflin ruflin closed this Dec 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Issue needs further discussion. Team:Integrations Label for the Integrations team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants