From b997c42b16d7c8b995f3da0bc7fe9c754855c89e Mon Sep 17 00:00:00 2001 From: Tigran Najaryan Date: Mon, 6 May 2024 15:44:30 -0400 Subject: [PATCH] Introduce Entities Data Model, Part 1 This is a proposal of a data model to represent entities. The purpose of the data model is to have a common understanding of what an entity is, what data needs to be recorded, transferred, stored and interpreted by an entity observability system. This data model sets the foundation for adding entities to OpenTelemetry. The data model is largely borrowed from [the initial proposal](https://docs.google.com/document/d/1VUdBRInLEhO_0ABAoiLEssB1CQO_IcD5zDnaMEha42w/edit) that was accepted for entities SIG formation. This OTEP is step 1 in introducing the entities data model. Follow up OTEPs will add further data model definitions, including the linking of Resource information to entities. --- text/entities/0256-entities-data-model.md | 765 ++++++++++++++++++++++ 1 file changed, 765 insertions(+) create mode 100644 text/entities/0256-entities-data-model.md diff --git a/text/entities/0256-entities-data-model.md b/text/entities/0256-entities-data-model.md new file mode 100644 index 000000000..ffc861e38 --- /dev/null +++ b/text/entities/0256-entities-data-model.md @@ -0,0 +1,765 @@ +# Entities Data Model, Part 1 + +This is a proposal of a data model to represent entities. The purpose of the data model +is to have a common understanding of what an entity is, what data needs to be recorded, +transferred, stored and interpreted by an entity observability system. 
+ + + +- [Motivation](#motivation) +- [Design Principles](#design-principles) +- [Data Model](#data-model) + * [Minimally Sufficient Id](#minimally-sufficient-id) + * [Examples of Entities](#examples-of-entities) +- [Entity Events](#entity-events) + * [EntityState Event](#entitystate-event) + * [EntityDelete Event](#entitydelete-event) +- [Entity Identification](#entity-identification) + * [LID, GID and IDCONTEXT](#lid-gid-and-idcontext) + * [Semantic Conventions](#semantic-conventions) + * [Examples](#examples) + + [Process in a Host](#process-in-a-host) + + [Process in Kubernetes](#process-in-kubernetes) + + [Host in Cloud Account](#host-in-cloud-account) +- [Prototypes](#prototypes) +- [Prior Art](#prior-art) +- [Alternatives](#alternatives) + * [Different ID Structure](#different-id-structure) + * [No Entity Events](#no-entity-events) + * [Merge Entity Events data into Resource](#merge-entity-events-data-into-resource) + * [Hierarchical ID Field](#hierarchical-id-field) +- [Open questions](#open-questions) + * [Attribute Data Type](#attribute-data-type) + * [Classes of Entity Types](#classes-of-entity-types) + * [Multiple Observers](#multiple-observers) + * [Is Type part of Entity's identity?](#is-type-part-of-entitys-identity) +- [Future Work](#future-work) +- [References](#references) + + + +## Motivation + +This data model sets the foundation for adding entities to OpenTelemetry. The data model +is largely borrowed from +[the initial proposal](https://docs.google.com/document/d/1VUdBRInLEhO_0ABAoiLEssB1CQO_IcD5zDnaMEha42w/edit) +that was accepted for entities SIG formation. + +This OTEP is step 1 in introducing the entities data model. Follow up OTEPs will add +further data model definitions, including the linking of Resource information +to entities. + +## Design Principles + +- Consistency with the rest of OpenTelemetry is important. We heavily favor solutions + that look and feel like other OpenTelemetry data models. 
+ +- Meaningful (especially human-readable) IDs are more valuable than random-generated IDs. + Long-lived IDs that survive state changes (e.g. entity restarts) are more valuable than + short-lived, ephemeral IDs. + See [the need for navigation](https://docs.google.com/document/d/1Xd1JP7eNhRpdz1RIBLeA1_4UYPRJaouloAYqldCeNSc/edit#heading=h.fut2c2pec5wa). + +- We cannot make an assumption that the entirety of information that is necessary for + global identification of an entity is available at once, in one place. This knowledge + may be distributed across multiple participants and needs to be combined to form a + globally unique identifier. + +- Semantic conventions must bring as much order as possible to telemetry, however they + cannot be too rigid and prevent real-world use cases. + +## Data Model + +We propose a new concept of Entity. + +Entity represents an object of interest associated with produced telemetry: +traces, metrics or logs. + +The Entity that produces the telemetry is called the Producing Entity. For example +telemetry produced using OpenTelemetry SDK is normally associated with a Service. +The Service is the Producing Entity. Similarly, OpenTelemetry defines system metrics +for a host. The Host is the Producing Entity in this case. + +Entities may be also associated with produced telemetry indirectly, typically by +virtue of being related to the Producing Entity. For example a Service that produces +telemetry is also related with a process in which the Service runs, so we say that +the Service entity is related to the Process entity. The process normally also runs +on a host, so we say that the Process entity is related to the Host entity. + +Note: subsequent OTEPs will define how the Producing Entity is associated with +traces, metrics and logs and how relations between entities will be specified. +See [Future Work](#future-work). 
+
+The data model below defines a logical model for an entity (irrespective of the physical
+format and encoding of how entity data is recorded).
+
+| Field | Type | Description |
+|---|---|---|
+| Type | string | Defines the type of the Entity. MUST NOT change during the lifetime of the entity. For example: "service" or "host". This field is required and MUST NOT be empty for valid entities. |
+| Id | key/value pair list | A set of attributes that identifies the Entity. MUST NOT change during the lifetime of the Entity. The Id must contain at least one attribute. Follows OpenTelemetry common attribute definition. SHOULD follow OpenTelemetry semantic conventions for attributes. |
+| Attributes | key/value pair list | A set of descriptive (non-identifying) attributes of the Entity. MAY change over the lifetime of the entity. MAY be empty. These attributes are not part of Entity's identity. “value” follows any value definition in the OpenTelemetry spec - it can be a scalar value, byte array, an array or map of values. Arbitrarily deep nesting of values for arrays and maps is allowed. SHOULD follow OpenTelemetry semantic conventions for attributes. |
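As an illustration only, the logical model in the table above could be sketched in Python as follows. The `Entity` class, its field names and the `validate` helper are hypothetical and are not part of this proposal, and the `host.name` value is made up. The sketch merely encodes the rules from the table: a non-empty Type, an Id with at least one attribute, and descriptive attributes that may be empty.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class Entity:
    """Hypothetical in-memory form of the logical Entity model."""

    type: str                     # e.g. "service" or "host"
    id: Mapping[str, Any]         # identifying attributes
    attributes: Mapping[str, Any] = field(default_factory=dict)  # descriptive attributes

    def validate(self) -> None:
        # Type is required and must not be empty.
        if not self.type:
            raise ValueError("Type is required and must not be empty")
        # Id must contain at least one attribute.
        if not self.id:
            raise ValueError("Id must contain at least one attribute")


# Example values are illustrative only.
host = Entity(
    type="host",
    id={"host.id": "fdbf79e8af94cb7f9e8df36789187052"},
    attributes={"host.name": "example-host"},
)
host.validate()
```

The dataclass is frozen to mirror the rule that Type and Id do not change during the lifetime of the entity; a changed set of descriptive attributes would be represented by a new value rather than by mutation.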

+
+### Minimally Sufficient Id
+
+Often a number of attributes of an entity are readily available for the telemetry
+producer to compose an Id from. Of the available attributes the Entity Id should
+include the minimal set of attributes that is sufficient for uniquely identifying
+that entity. No superfluous attributes should be included in the Id set. For example
+a Process on a host can be uniquely identified by the `process.pid` and
+`process.start_time` attributes. Adding for example the `process.executable.name`
+attribute to the Id is unnecessary and violates the Minimally Sufficient Id rule.
+
+### Examples of Entities
+
+_This section is non-normative and is present only for the purposes of demonstrating
+the data model._
+
+Here are examples of entities, the typical identifying attributes they
+have and some examples of non-identifying attributes that may be
+associated with the Entity.
+
+| Entity | Entity Type | Identifying Attributes | Non-identifying Attributes |
+|---|---|---|---|
+| Service | "service" | service.name (required), service.instance.id, service.namespace | service.version |
+| Host | "host" | host.id | host.name, host.type, host.image.id, host.image.name |
+| K8s Pod | "k8s.pod" | k8s.pod.uid (required), k8s.cluster.name | Any pod labels |
+| K8s Pod Container | "container" | k8s.pod.uid (required), k8s.cluster.name, container.name | Any container labels |
+
+See more examples showing nuances of ID field composition in the
+[Entity Identification](#entity-identification) section.
+
+## Entity Events
+
+Information about Entities can be produced and communicated using 2
+types of Entity events: EntityState and EntityDelete.
+
+### EntityState Event
+
+The EntityState event stores information about the _state_ of the Entity
+at a particular moment of time. The data model of the EntityState event
+is the same as the Entity Data model with some extra fields:
+
+| Field | Type | Description |
+|---|---|---|
+| Timestamp | nanoseconds | The time since when the entity state is described by this event. The time is measured by the origin clock. The field is required. |
+| Interval | milliseconds | Defines the reporting period, i.e. how frequently the information about this entity is reported via EntityState events even if the entity does not change. The next EntityState event for this entity is expected at (Timestamp+Interval) time. Can be used by receivers to infer that a no longer reported entity is gone, even if the EntityDelete event was not observed. Optional; if missing, the interval is unknown. |
+| Type | | See data model. |
+| Id | | See data model. |
+| Attributes | | See data model. |
+
+We say that an Entity mutates (changes) when one or more of its
+descriptive attributes change. A new descriptive attribute may be
+added, an existing descriptive attribute may be deleted or a value of an
+existing descriptive attribute may be changed. All these changes
+represent valid mutations of an Entity over time. When these mutations
+happen the identity of the Entity does not change.
+
+When the entity's state is changed it is expected that the source will
+emit a new EntityState event with a fresh timestamp and full list of
+values of all other fields.
+
+Entity event producers are recommended to periodically emit events even
+if the Entity does not change. In this case the Type, Id and Attributes
+fields will remain the same, but a fresh Timestamp will be recorded in
+the event. Producing such events allows the system to be resilient to
+event losses. Even if some events are lost, eventually the correct state
+of the Entity is more likely to be delivered to the final destination.
+Periodic sending of EntityState events also serves as a liveness
+indicator (see below how it can be used in lieu of EntityDelete event).
+
+### EntityDelete Event
+
+EntityDelete event indicates that a particular entity is gone:
+
+| Field | Type | Description |
+|---|---|---|
+| Timestamp | nanoseconds | The time when the entity was deleted. The time is measured by the origin clock. The field is required. |
+| Type | | See data model. |
+| Id | | See data model. |
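The two event types defined above, and the use of `Interval` as a hint that a no longer reported entity is gone, can be sketched as follows. This is a minimal illustration and not a proposed API: the class names, the field names, the `is_presumed_gone` helper and its `slack_ms` parameter are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional

NS_PER_MS = 1_000_000


@dataclass(frozen=True)
class EntityState:
    timestamp_ns: int                  # Timestamp, nanoseconds, required
    type: str
    id: Mapping[str, Any]
    attributes: Mapping[str, Any] = field(default_factory=dict)
    interval_ms: Optional[int] = None  # Interval, milliseconds; None means unknown


@dataclass(frozen=True)
class EntityDelete:
    timestamp_ns: int                  # time when the entity was deleted
    type: str
    id: Mapping[str, Any]


def is_presumed_gone(last: EntityState, now_ns: int, slack_ms: int = 0) -> bool:
    """Return True if the next EntityState is overdue by more than the slack."""
    if last.interval_ms is None:
        # Interval is unknown, so nothing can be inferred from silence.
        return False
    deadline_ns = last.timestamp_ns + (last.interval_ms + slack_ms) * NS_PER_MS
    return now_ns > deadline_ns


# Hypothetical event: reported at t = 1 s with a 1000 ms reporting interval.
state = EntityState(
    timestamp_ns=1_000_000_000,
    type="host",
    id={"host.id": "fdbf79e8af94cb7f9e8df36789187052"},
    interval_ms=1000,
)
overdue = is_presumed_gone(state, now_ns=2_000_000_001)
```

How much slack a recipient allows before expiring an entity is a deployment decision; the sketch leaves it as a parameter.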
+
+Note that EntityDelete is optional and is not guaranteed to be sent when
+the entity is gone. Recipients of entity signals should be prepared to
+handle this situation by expiring entities for which EntityState events
+are no longer reported (i.e. treat the presence of EntityState
+events as a liveness indicator).
+
+The expiration mechanism is based on the previously reported `Interval` field of
+EntityState event. The recipient can use this value to compute when to expect the next
+EntityState event and if the event does not arrive in a timely manner (plus some slack)
+it can consider the entity to be gone even if the EntityDelete event was not observed.
+
+## Entity Identification
+
+_This section is a supplementary guideline and is not part of the logical data model._
+
+The data model defines the structure of the entity ID field. This section explains
+how the ID field is computed.
+
+### LID, GID and IDCONTEXT
+
+All entities have a local ID (LID) and a global ID (GID).
+
+The LID is unique in a particular identification context, but is not necessarily globally
+unique. For example a process entity's LID is its PID number and process start time.
+The (PID,StartTime) pair is unique only in the context of a host where the process runs
+(and the host in this case is the identification context).
+
+The GID of an entity is globally unique, in the sense that for the entire set of entities
+in a particular telemetry store no 2 entities exist that have the same GID value.
+
+The GID of an entity E is defined as:
+
+`GID(E) = { LID(E), GID(IDCONTEXT(E)) }`
+
+Where `IDCONTEXT(E)` is the identification context in which the LID of entity E is unique.
+The value of `IDCONTEXT(E)` is an Entity itself, and thus we can compute the GID value of it too.
+
+In other words, the GID of an entity is a union of its LID and the GID of its
+identification context. Note: GID(E) is a flat set of key/value attributes.
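The recursive definition above can be sketched in code. In the following illustration the `IdNode` type and the `gid` function are hypothetical names introduced for the example: each node carries its own LID and an optional reference to its identification context, and the GID is the flat union of the LIDs along that chain.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class IdNode:
    """An entity reduced to what matters for identification (hypothetical)."""

    lid: Dict[str, Any]
    idcontext: Optional["IdNode"] = None  # None for a root entity


def gid(e: IdNode) -> Dict[str, Any]:
    """GID(E) = { LID(E), GID(IDCONTEXT(E)) }, returned as a flat attribute set."""
    result = dict(e.lid)
    if e.idcontext is not None:
        # LIDs of different entity types are assumed to use disjoint keys.
        result.update(gid(e.idcontext))
    return result


# A process whose identification context is a host that is itself a root entity.
host = IdNode(lid={"host.id": "fdbf79e8af94cb7f9e8df36789187052"})
process = IdNode(
    lid={"process.pid": 12345, "process.start_time": 1714491491},
    idcontext=host,
)
process_gid = gid(process)
```

Because the result is a flat dictionary, the chain of identification contexts that produced it is not recoverable from the GID alone.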
+ +The enrichment process often is responsible for determining the value of `IDCONTEXT(E)` +and for computing the GID according to the formula defined above, although the GID may +also be produced at once by the telemetry source (e.g. by Otel SDK) without requiring +any additional enrichment. + +### Semantic Conventions + +OpenTelemetry semantic conventions will be enhanced to include entity definitions for +well-known entities such as Service, Process, Host, etc. + +For well-known entity types LID(E) is defined in Otel semantic conventions per +entity type. The value of LID is a flat set of key/value attributes. For example, +for entity of type "process" the semantic conventions define LID as 2 attributes: + +```json5 +{ + "process.pid": $pid, + "process.start_time": $starttime +} +``` + +For custom entity types (not defined in Otel semantic conventions) the end-user is +responsible for defining their custom semantic conventions in a similar way. + +The entity information producer is responsible for determining the identification +context of each entity it is producing information about. + +In certain cases, where only one possible IDCONTEXT definition is meaningful, the +IDCONTEXT MAY be defined in the semantic conventions. For example Kubernetes nodes +always exist in the identifying context of a Kubernetes cluster. The semantic convention +for "k8s.node" and "k8s.cluster" can prescribe that the IDCONTEXT of entity of type +"k8s.node" is always an entity of type "k8s.cluster". + +Important: semantic conventions can NEVER prescribe the complete GID composition. +Semantic conventions SHOULD prescribe LID and MAY prescribe IDCONTEXT, but GID +composition, generally speaking, cannot be known statically. + +For example: a host's LID should be a `host.id` attribute. A host running on a cloud +should have an IDCONTEXT of "cloud.account" and the LID of "cloud.account" entity +is (`cloud.provider`, `cloud.account.id`). 
However semantic conventions cannot prescribe +that the GID of a host is (`host.id`, `cloud.provider`, `cloud.account.id`) because not all +hosts run on cloud. A host that runs on prem in a single data center may have a GID +of just (`host.id`) or if a customer has multiple on prem data centers they may use +data.center.id as its identifier and use (`host.id`, `data.center.id`) as GID of the host. + +### Examples + +#### Process in a Host + +A locally running host agent (an Otel Collector) that produces +information about "process" entities has the knowledge that the +processes run in the particular host and thus the "host" is the +identification context for the processes that the agent observes. The +LID of a process can look like this: + +```json5 +{ + "process.pid": 12345, + "process.start_time": 1714491491 +} +``` + +and Collector will use "host" as the IDCONTEXT and add host's LID to it: + +```json5 +{ + // Process LID, unique per host. + "process.pid": 12345, + "process.start_time": 1714491491, + + + // Host LID + "host.id": "fdbf79e8af94cb7f9e8df36789187052" +} +``` + +If we assume that we have only one data center and host ids are globally +unique then the above id is globally unique and is the GID of the +process. If this assumption is not valid in our situation we would +continue applying additional IDCONTEXT's until the GID is globally +unique. See for example the +[Host in Cloud Account](#host-in-cloud-account) example below. + +#### Process in Kubernetes + +A Kubernetes Collector that produces information about process entities +has the knowledge that the processes run in the particular containers in +the particular pod and thus the container is the identification context +for the process, and the pod is the identification context for the +container. 
If we begin with the same process LID:
+
+```json5
+{
+  "process.pid": 12345,
+  "process.start_time": 1714491491
+}
+```
+
+the Kubernetes Collector will then add the IDCONTEXT of container and of
+pod to this, resulting in:
+
+```json5
+{
+  // Process LID, unique per container.
+  "process.pid": 12345,
+  "process.start_time": 1714491491,
+
+  // Container LID, unique per pod.
+  "k8s.container.name": "redis",
+
+  // Pod LID has 2 attributes.
+  "k8s.pod.uid": "0c4cbbf8-d4b4-4e84-bc8b-b95f0d537fc7",
+  "k8s.cluster.name": "dev"
+}
+```
+
+Note that we used 3 different LIDs above to compose the GID. The
+attributes that are part of each LID are defined in Otel semantic
+conventions.
+
+In this example we assume this to be a valid GID because Pod is the root
+IDCONTEXT, since Pod's LID includes the cluster name, which is expected
+to be globally unique. If this assumption about global uniqueness of
+cluster names is wrong then another containing IDCONTEXT within which
+cluster names are unique will need to be applied and so forth.
+
+Note also how we used a pair (`k8s.pod.uid`, `k8s.cluster.name`).
+Alternatively, we could say that Kubernetes Cluster is a separate entity
+we care about. This would mean the Pod's IDCONTEXT is the cluster. The
+net result for process's GID would be exactly the same, but we would
+arrive at it in a different way:
+
+```json5
+{
+  // Process LID, unique per container.
+  "process.pid": 12345,
+  "process.start_time": 1714491491,
+
+  // Container LID, unique per pod.
+  "k8s.container.name": "redis",
+
+  // Pod LID, unique per cluster.
+  "k8s.pod.uid": "0c4cbbf8-d4b4-4e84-bc8b-b95f0d537fc7",
+
+  // Cluster LID, also globally unique since cluster is root entity.
+  "k8s.cluster.name": "dev"
+}
+```
+
+#### Host in Cloud Account
+
+A host running in a cloud account (e.g. AWS) will have a LID that uses
+the host instance id, unique within a single cloud account, e.g.:
+
+```json5
+{
+  // Host LID, unique per cloud account.
+  "host.id": "fdbf79e8af94cb7f9e8df36789187052"
+}
+```
+
+Otel Collector with
+[resourcedetection](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/resourcedetectionprocessor)
+processor with "aws" detector enabled will add the IDCONTEXT of the
+cloud account like this:
+
+```json5
+{
+  // Host LID, unique per cloud account.
+  "host.id": "fdbf79e8af94cb7f9e8df36789187052",
+
+  // Cloud account LID has 2 attributes:
+  "cloud.provider": "aws",
+  "cloud.account.id": "1234567890"
+}
+```
+
+## Prototypes
+
+A set of prototypes that demonstrate this data model has been implemented:
+
+- [Go SDK Prototype](https://github.com/tigrannajaryan/opentelemetry-go/pull/244)
+- [Collector Prototype](https://github.com/tigrannajaryan/opentelemetry-collector/pull/4)
+- [Collector Contrib Prototype](https://github.com/tigrannajaryan/opentelemetry-collector-contrib/pull/1/files)
+- [OTLP Protocol Buffer changes](https://github.com/tigrannajaryan/opentelemetry-proto/pull/2/files)
+
+## Prior Art
+
+An experimental entity data model was implemented in OpenTelemetry Collector as described
+in [this document](https://docs.google.com/document/d/1Tg18sIck3Nakxtd3TFFcIjrmRO_0GLMdHXylVqBQmJA/edit).
+The Collector's design uses LogRecord as the carrier of Entity events, with logical structure
+virtually identical to what this OTEP proposes.
+
+There is also an implementation of this design in the Collector, see
+[completed issue to add entity events](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23565)
+and [the PR](https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/24419)
+that implements entity event emitting for k8scluster receiver in the Collector.
+
+## Alternatives
+
+### Different ID Structure
+
+Alternative proposals were made [here](https://docs.google.com/document/d/1PLPSAnWvFWCsm6meAj6OIVDBvxsk983V51WugF0NgVo/edit) and
+[here](https://docs.google.com/document/d/1bLmkQSv35Fi6Wbe4bAqQ-_JS7XWIXWbvuirVmAkz4a4/edit)
+to use a different structure for the entity ID field.
+
+We rejected these proposals in favour of the ID field proposed in this OTEP for the
+following reasons:
+
+- The flat set of key/value attributes is widely used elsewhere in OpenTelemetry as
+  Resource attributes, as Scope attributes, as Metric datapoint attributes, etc. so
+  it is conceptually consistent with the rest of Otel.
+
+- We already have a lot of machinery that works well with this definition of attributes,
+  for example OTTL language has syntax for working with attributes, or Collector's pdata
+  API or Attribute value types in SDKs. All this code would no longer work as is if we
+  had a different data structure and would need to be re-implemented in a different way.
+
+### No Entity Events
+
+Entity signal allows recording the state of the entities. As the entity's state changes
+events are emitted that describe the new state. In this proposal the entity's state is
+(type,id,attributes) tuple, but we envision that in the future we may also want to add
+more information to the Entity signal, particularly to record the relationships between
+entities (i.e. the fact that a Process runs on a Host).
+
+### Merge Entity Events data into Resource
+
+If we eliminate the Entity signal as a concept and put the entire entity's state into
+the Resource then every time the entity's state changes we must emit one of
+ResourceLogs/ResourceSpans/ResourceMetrics messages that includes the Resource that
+represents the entity's state.
+
+However, what do we do if there are no logs or spans or metrics data points to
+report? Do we emit a ResourceLogs/ResourceSpans/ResourceMetrics OTLP message with empty logs
+or spans or metrics data points?
Which one do we emit: ResourceLogs, ResourceSpans
+or ResourceMetrics?
+
+What do we do when we want to add support for recording entity relationships in the
+future? Do we add all that information to the Resource and bloat the Resource size?
+
+How do we report the EntityDelete event?
+
+All these questions don't have good answers. Attempting to shoehorn the entity
+information into the Resource where it does not naturally fit is likely to result
+in ugly and inefficient solutions.
+
+### Hierarchical ID Field
+
+We had an alternate proposal to retain the information about how the ID was
+[composed from LID and IDCONTEXT](#entity-identification), essentially to record the
+hierarchy of identification contexts in the ID data structure instead of flattening it
+and losing the information about the composition process that resulted in the particular ID.
+
+We rejected this proposal for a couple of reasons:
+
+- The flat ID structure is simpler.
+
+- There are no known use cases that require a hierarchical ID structure. The use case
+  of "record parental relationship between entities" will be handled explicitly via
+  separate relationship data structures (see [Future Work](#future-work)).
+
+## Open questions
+
+### Attribute Data Type
+
+The data model requires the Attributes field to use the extended
+[any](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#type-any)
+attribute values, which allow more complex data types. This is different from the data
+type used by the Id field, which is more restricted in shape.
+
+Are we happy with this discrepancy?
+
+### Classes of Entity Types
+
+Do we need to differentiate between infrastructure entities (e.g. Pod, Host, Process)
+and non-infrastructure entities (logical entities) such as Service? Is this distinction
+important?
+
+### Multiple Observers
+
+The same entity may be observed by different observers simultaneously.
For example the +information about a Host may be reported by the agent that runs on the host. At the same +time more information about that same host may be obtained via cloud provider's API. + +The information obtained by different observers can be complementary, they don't +necessarily have access exactly to the same data. It can be very useful to combine this +information in the backend and make it all available to the user. + +However, it is not possible for multiple observers to simultaneously use EntityState +events as they are defined earlier in this document, since the information in the event +will overwrite information in the previously received event about that same entity. + +A possible way to allow multiple observers to report portions of information about the +same entity simultaneously is to indicate the observer in the EntityState event by adding +an "ObserverId" field. EntityState event will then look like this: + +|Field|Type| +|---|---| +|Timestamp|nanoseconds| +|Interval|milliseconds| +|Type|| +|Id|| +|Attributes|| +|ObserverId|string or bytes| + +ObserverId field can be optional. Attributes from EntityState events that contain +different ObserverId values will be merged in the backend. Attributes from EntityState +events that contain the same ObserverId value will overwrite attributes from the previous +reporting of the EntityState event from that observer. + +### Is Type part of Entity's identity? + +Is the Type field part of the entity's identity together with the Id field? + +For example let's assume we have a Host and an Otel Collector running on the Host. +The Host's Id will contain one attribute: `host.id`, and the Type of the entity will be +"host". The Collector technically speaking can be also identified by one attribute +`host.id` and the Type of the entity will be "otel.collector". This only works if we +consider the Type field to be part of the entity's identity. 
+ +If the Type field is not part of identity then in the above example we require that the +entity that describes the Collector has some other attribute in its Id (for example +`agent.type` attribute [if it gets accepted](https://github.com/open-telemetry/semantic-conventions/pull/950)). + +## Future Work + +This OTEP is step 1 of defining the Entities data model. It will be followed by other +OTEPs that cover the following topics: + +- How the existing Resource concept will be modified to link existing + signals to entities. + +- How relationships between entities are modeled. + +- Representation of entity data over the wire and the transmission + protocol for entities. + +- Add transformations that describe entity semantic convention changes in + OpenTelemetry Schema Files. + +We will possibly also submit additional OTEPs that address the Open Questions. + +## References + +- [OpenTelemetry Proposal: Resources and Entities](https://docs.google.com/document/d/1VUdBRInLEhO_0ABAoiLEssB1CQO_IcD5zDnaMEha42w/edit) +- [OpenTelemetry Entity Data Model](https://docs.google.com/document/d/1FdhTOvB1xhx7Ks7dFW6Ht1Vfw2myU6vyKtEfr_pqZPw/edit) +- [OpenTelemetry Entity Identification](https://docs.google.com/document/d/1hJIAIMsRCgZs-poRsw3lnirP14d3sMfn1LB08C4LCDw/edit) +- [OpenTelemetry Resources - Principles and Characteristics](https://docs.google.com/document/d/1Xd1JP7eNhRpdz1RIBLeA1_4UYPRJaouloAYqldCeNSc/edit)