Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better out-of-the-box mappings for logs, metrics and synthetics #64978

Merged
merged 24 commits into from
May 4, 2021

Conversation

ruflin
Copy link
Member

@ruflin ruflin commented Nov 12, 2020

One of the problems we have today with the default templates is that ip addresses and message fields are not mapped correct. Auto detection of ip addresses would be great: #64400 But in the meantime, we could also match on the naming convention that all *.ip fields are of type ip address.

During implementation I was not sure if I should use match: ip for path_match: *.ip. The main difference I would expect that the top level is not matched. Are there others?

Do we have fields which are named ip or message which do not adhere to these rules?

This PR is not complete but is to share the idea, also it was not tested. If we decide to move forward, also the other base templates need updating and it might be useful to add the same logic to all datasets created by Kibana.

One of the problems we have today with the default templates is that ip addresses and message fields are not mapped correct. Auto detection of ip addresses would be great: elastic#64400 But in the meantime, we could also match on the naming convention that all `*.ip` fields are of type ip address.

During implementation I was not sure if I should use `match: ip` for `path_match: *.ip`. The main difference I would expect that the top level is not matched. Are there others?

Do we have fields which are named `ip` or `message` which do not adhere to these rules?

This PR is not complete but is to share the idea. If we decide to move forward, also the other base templates need updating and it might be useful to add the same logic to all datasets created by Kibana.
@ruflin
Copy link
Member Author

ruflin commented Nov 12, 2020

@dakrone @exekias @axw @jpountz @jalvz Would be great to get your thoughts on the above. @webmat This might also be an interesting option for ECS.

@ruflin ruflin self-assigned this Nov 12, 2020
@jpountz
Copy link
Contributor

jpountz commented Nov 12, 2020

The main difference I would expect that the top level is not matched. Are there others?

This is the only difference indeed. So maybe we should use match rather than path_match then? And with your change we could also remove the message field from the mappings, which could be nice for the logging datasets that don't have a message field?

Co-authored-by: Adrien Grand <jpountz@gmail.com>
@webmat
Copy link

webmat commented Nov 12, 2020

In ECS all of the .ip fields are indeed of type ip. We also strive for consistency, so I don't expect any .ip leaf fields getting into ECS that would not be of type ip. Whenever there's a doubt (could be a socket / hostname or IP), we have the dual .address => .ip fields. So this will be fine from the ECS fields POV.

If you want to catch 100% of the ECS ip fields, you'd need to also catch network.forwarded_ip and dns.resolved_ip, but this may be out of scope for the intent here.

So I'm good with this approach as is.

@axw
Copy link
Member

axw commented Nov 16, 2020

These look fine for the base logs template, but I don't think we could extend this to the datasets created for APM.

In APM we have some object fields called message related to messaging (message queue/bus) events. The specific fields we index at the moment are span.message.queue.name and span.message.age.ms.

@jpountz
Copy link
Contributor

jpountz commented Nov 16, 2020

@axw The proposal uses a combination of match: message and match_mapping_type: string so it would only be considered for fields that both have message as a name and are represented as strings in the JSON document. So it wouldn't be an issue for objects whose name is message.

@axw
Copy link
Member

axw commented Nov 17, 2020

Thanks @jpountz, I missed that bit.

This would be fine to include for all templates for APM then, and I think it's reasonable to assume all string-type fields called *.message should be mapped as text.

@ruflin
Copy link
Member Author

ruflin commented Nov 17, 2020

@jpountz I'm torn if we should remove message or not by just using match. From a pure technical reason it seems the final mapping would be identical. But if a user looks at the template / mapping the user always see message and can think about how to use it. At the same time, the simpler the template the better ...

@webmat I'm wondering if the above is also a convention we should put into ECS itself. The same field name MUST have always the same type no matter where it is in the hierarchy. message and ip are the two obvious ones here, but it might also apply to id for example and could be used for "groups" that are usable in different places like user.

@exekias
Copy link

exekias commented Nov 17, 2020

I have checked beats codebase and we have a few instances where *.ip is not of type ip. A few of them are mapped as keywords, but I would consider this a bug, others are alias, which may be more legit:

https://github.com/elastic/beats/blob/af4007eaef6e142bb0d46370918db4ce7135f316/x-pack/filebeat/module/suricata/eve/_meta/fields.yml#L92-L94
https://github.com/elastic/beats/blob/af4007eaef6e142bb0d46370918db4ce7135f316/filebeat/module/system/auth/_meta/fields.yml#L50-L53
https://github.com/elastic/beats/blob/af4007eaef6e142bb0d46370918db4ce7135f316/filebeat/module/haproxy/_meta/fields.yml#L90-L97

Are you trying to solve getting host.ip mapped correctly or do you have some other field in mind?

I wonder if we can make this decision for the user based on field names, at the very least we would need to do a couple of changes in the modules to make things consistent.

To some extent I wonder if we want to unify how type hinting should be done. It seems things are moving towards #61939, where we use path matching as a workaround.

@jpountz
Copy link
Contributor

jpountz commented Nov 17, 2020

@jpountz I'm torn if we should remove message or not by just using match. From a pure technical reason it seems the final mapping would be identical. But if a user looks at the template / mapping the user always see message and can think about how to use it. At the same time, the simpler the template the better ...

It looks like we have some logging datasets that are completely structured and don't have a message field, so maybe we should just have it in the dynamic template so that it would only be introduced when relevant?

@ruflin
Copy link
Member Author

ruflin commented Nov 17, 2020

I'm not trying to fix the host.ip problem here as even if we have this mapping in, someone could still use host as a keyword and break things. But much more try to solve the problem around all the other ip addresses which for example show up in the add_*_metadata processors so we don't have to map everything.

@exekias Thanks for checking the Beats fields on this. The alias fields should not be an issue as these do not exist as actual data. Also my assumption is, an actual mapping would overwrite the dynamic mapping in case both match.

For #61939 I think these two are complementary. #61939 is used for any ingestion where the fields are actually known and probably in most cases goes to a specific dataset. This change here is to get the base mappings correct in most cases without complicating the base template too much.

@webmat
Copy link

webmat commented Nov 18, 2020

a convention we should put into ECS itself

@ruflin It's a convention we've implicitly been following indeed. It could make sense to capture this guidance explicitly as well.

@ruflin
Copy link
Member Author

ruflin commented Nov 20, 2020

I updated the PR with all the above conversations:

  • disable norms on message:
  • Use match and remove message

Should we move forward with this? If yes, how are these things tested? And we should also add it to the other templates? Would it be possible that the ES team takes over from here @dakrone ?

@webmat
Copy link

webmat commented Nov 20, 2020

Should we add this to ECS too?

Yes I think it's a good idea. It's in line with how we try to keep field names consistent. Identifying those that come with an obvious associated type and extending the consistency this way makes a lot of sense both for ECS fields, and to help users following these conventions in custom fields.

I've opened this issue to track this elastic/ecs#1144

@webmat
Copy link

webmat commented Nov 20, 2020

With the advent of wildcard, I wonder if we shouldn't think about moving the message fields to wildcard instead of text as the main data type, however.

Some message types like the multi-paragraph Windows event messages obviously work well with text. But there's also so many message types that are short and/or structured and would benefit more from wildcard. If we had the courage make this breaking change at the next major, then we'd have 100% string fields in ECS be aggregatable & offer fast exact matches when filtering. And we'd have all fields that benefit from an analyzer have a .text multi-field.

Right now ECS has many keyword fields with a .text multi-field. But message and error.message are the only two who don't behave like the rest of the string fields, and use type text on the canonical field instead of in a multi-field.

I definitely see why we might want to avoid this breaking change as well, so I could see this going both ways. But this issue seemed like a good time to bring this up 😄

@ruflin
Copy link
Member Author

ruflin commented Nov 23, 2020

@webmat Thanks for the ECS follow up. I subscribed where it is head. For the message part, lets open a separate thread. As the change here should not be breaking one, we can decouple the two.

@jpountz
Copy link
Contributor

jpountz commented Nov 24, 2020

Some message types like the multi-paragraph Windows event messages obviously work well with text. But there's also so many message types that are short and/or structured and would benefit more from wildcard.

Yes, the more I'm thinking of this, the more I believe that we won't be able to come up with a good default for all datasets, some datasets will benefit from having the message field mapped as wildcard but other datasets like many application logs should probably keep using text.

If we had the courage make this breaking change at the next major

For the aspect about changing the semantics of the message field, we've had a couple discussions about having a variant of wildcard that behaves like text (similarly to how wildcard behaves exactly like keyword today). This is something that could help going through the backward break I believe?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to move forward with this

},
"match_ip": {
"match_mapping_type": "string",
"match": "ip",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep a single space after the colon?

},
"match_message": {
"match_mapping_type": "string",
"match": "message",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep a single space after the colon?

@ruflin
Copy link
Member Author

ruflin commented Nov 25, 2020

@jpountz Is there a good way to automatically test these changes?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Github actions on this PR, we have tests that run automatically on all PRs. I just made suggestions that should hopefully address the test failures we are seeing.

x-pack/plugin/core/src/main/resources/logs-mappings.json Outdated Show resolved Hide resolved
x-pack/plugin/core/src/main/resources/logs-mappings.json Outdated Show resolved Hide resolved
ruflin and others added 2 commits November 26, 2020 08:20
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
@ruflin
Copy link
Member Author

ruflin commented Nov 26, 2020

The tests I wanted to are something like:

  • Set template
  • Add doc with field foo.ip
  • Check if mapping is correct

The same for message to confirm the change works as expected. I found https://github.com/elastic/elasticsearch/pull/57629/files#diff-1e7ab3525fa67e621593216ed026d39eb002b44fe440e0cdb33b577f40631b00, might this be the right place to add something like this?

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for shepherding this Adrien. I left a few more comments about the application order. I think it's best to go from most general -> most specific in terms of mappings so that when a user changes a more specific mapping it doesn't get accidentally bypassed by a general mapping.

Comment on lines 6 to 8
"logs-mappings",
"data-streams-mappings",
"logs-settings"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should apply the data-streams-mappings settings first, so that any changes made to the logs-mappings component template always take precedence over the generic data stream mappings.

Suggested change
"logs-mappings",
"data-streams-mappings",
"logs-settings"
"data-streams-mappings",
"logs-mappings",
"logs-settings"

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this would break tests because then the dynamic template that maps strings as keywords would take precedence over the dynamic template that maps message fields as match_only_text. In order to change the order, we would also need to configure unmatch:message on the default dynamic template that maps strings as keywords. What is your preference?

Comment on lines 6 to +7
"metrics-mappings",
"data-streams-mappings",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with:

Suggested change
"metrics-mappings",
"data-streams-mappings",
"data-streams-mappings",
"metrics-mappings",

Comment on lines 6 to +7
"synthetics-mappings",
"data-streams-mappings",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And same here with:

Suggested change
"synthetics-mappings",
"data-streams-mappings",
"data-streams-mappings",
"synthetics-mappings",

Copy link
Member Author

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite like how you use components here to simplify the templates.

For the message field, I'm worried that if a metrics event will contain a message, it will be mapped as keyword but should be text. This field is so basic, I would keep it the default template. If not used, does this cause lots of overhead?

"host": {
"type": "object"
},
"observer": {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could leave this one out.

"type": "keyword"
},
"match_mapping_type": "string"
"type": "match_only_text"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a breaking change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What aspect are you concerned with? I expect this to be transparent for the vast majority of our users, and I started discussions with the ECS team to make similar changes on ECS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read up on it by now a bit more. What happens if a user runs a non "supported" query across logs-* and some message fields are text and some are match_only_text. Is an error returned or the non supported fields are just skipped?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only unsupported queries are span queries. If a span query is run, then all shards that have the field mapped as match_only_text will be ignored from the search response and reported as "failed".

@jpountz
Copy link
Contributor

jpountz commented Apr 24, 2021

so that when a user changes a more specific mapping it doesn't get accidentally bypassed by a general mapping.

@dakrone Can you give more details about the use-case you have in mind? How could a user be affected by this? My understanding was that users couldn't inherit from these templates and would need to copy them anyway if they wanted to make changes?

For the message field, I'm worried that if a metrics event will contain a message, it will be mapped as keyword but should be text. This field is so basic, I would keep it the default template. If not used, does this cause lots of overhead?

@ruflin At this stage it doesn't matter much in terms of overhead, but if we put message in the base template I worry that it would set a precedent that things should be folded into the base template if possible while I'd rather like to make sure that everything that ends up in a template is actually relevant to the considered index pattern. When I talked with @exekias, he seemed confident that it would be a bug if there was a message field in a metrics dataset.

@ruflin
Copy link
Member Author

ruflin commented Apr 26, 2021

@exekias I remember we used error.message or similar in the past. Maybe an option here is to do the same trick as we do with ip. Everything that is called message must be text.

@jpountz Don't block on this, we can always follow up.

@exekias
Copy link

exekias commented Apr 26, 2021

Yes, error.message is a thing, I think it's fine to have a rule to map *message as text. I wonder how expected will that be to our users, but at least for our datasets I don't think that would cause any issue.

@dakrone
Copy link
Member

dakrone commented Apr 26, 2021

so that when a user changes a more specific mapping it doesn't get accidentally bypassed by a general mapping.

@dakrone Can you give more details about the use-case you have in mind? How could a user be affected by this? My understanding was that users couldn't inherit from these templates and would need to copy them anyway if they wanted to make changes?

It is certainly recommended that a user copy them rather than make changes directly.

That being said, I think in terms of our defaults, we should always go from most general to most specific. First, because we should set a good example of a recommended pattern for component template usage, and second, so that in the case that a user does edit the template directly, we are closer to the desired and expected effect.

Do you agree? Do you have a different reason for giving the more generic template higher precedence?

@jpountz
Copy link
Contributor

jpountz commented Apr 27, 2021

Do you agree? Do you have a different reason for giving the more generic template higher precedence?

This is a good question.

One problem we have is that the first component template wins when it comes to dynamic templates (because dynamic templates are appended and the first one that matches wins), but the last component template wins when it comes to properties (because properties get overridden).

This makes me wonder if we should split properties and dynamic templates into different component templates so that we could order them that way:

  • specialized dynamic templates
  • generic dynamic templates
  • generic properties
  • specialized properties

What do you think?

@dakrone
Copy link
Member

dakrone commented Apr 30, 2021

Sorry for the delay on this,

This makes me wonder if we should split properties and dynamic templates into different component templates so that we could order them that way ... What do you think?

This sounds reasonable to me, though I do think we start growing in complexity the more we add. In that regard, maybe it's okay to leave it the way it is, with two component templates rather than four.

I think I'm okay either way, regardless, this still LGTM.

@jpountz
Copy link
Contributor

jpountz commented May 3, 2021

Right, this complexity is the reason why I didn't jump on this solution. My gut feeling is that we won't often need to override properties given how they are supposed to be more-or-less standardized via ECS, so a simpler path forward might be to put the most specific templates first and avoid defining the same field across multiple templates.

@jpountz
Copy link
Contributor

jpountz commented May 3, 2021

I kept templates from the most specific to the most generic for now and applied remaining feedback:

  • observer object field removed,
  • moved message to the base template due to the error.message field.

@jpountz
Copy link
Contributor

jpountz commented May 4, 2021

@elasticmachine run elasticsearch-ci/2

@ruflin
Copy link
Member Author

ruflin commented May 4, 2021

@jpountz Thanks for pushing this over the line.

dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jun 1, 2021
This template was added in elastic#64978, however, there can be some test failures if we try to remove
built-in templates. It was missing from the list and now needs to be added back.
dakrone added a commit that referenced this pull request Jun 1, 2021
This template was added in #64978, however, there can be some test failures if we try to remove
built-in templates. It was missing from the list and now needs to be added back.
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jun 2, 2021
This template was added in elastic#64978, however, there can be some test failures if we try to remove
built-in templates. It was missing from the list and now needs to be added back.
dakrone added a commit that referenced this pull request Jun 2, 2021
…73672)

This template was added in #64978, however, there can be some test failures if we try to remove
built-in templates. It was missing from the list and now needs to be added back.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants