diff --git a/text/entities/0000-entities-data-model.md b/text/entities/0000-entities-data-model.md new file mode 100644 index 000000000..ffc861e38 --- /dev/null +++ b/text/entities/0000-entities-data-model.md @@ -0,0 +1,765 @@ +# Entities Data Model, Part 1 + +This is a proposal of a data model to represent entities. The purpose of the data model +is to have a common understanding of what an entity is, what data needs to be recorded, +transferred, stored and interpreted by an entity observability system. + + + +- [Motivation](#motivation) +- [Design Principles](#design-principles) +- [Data Model](#data-model) + * [Minimally Sufficient Id](#minimally-sufficient-id) + * [Examples of Entities](#examples-of-entities) +- [Entity Events](#entity-events) + * [EntityState Event](#entitystate-event) + * [EntityDelete Event](#entitydelete-event) +- [Entity Identification](#entity-identification) + * [LID, GID and IDCONTEXT](#lid-gid-and-idcontext) + * [Semantic Conventions](#semantic-conventions) + * [Examples](#examples) + + [Process in a Host](#process-in-a-host) + + [Process in Kubernetes](#process-in-kubernetes) + + [Host in Cloud Account](#host-in-cloud-account) +- [Prototypes](#prototypes) +- [Prior Art](#prior-art) +- [Alternatives](#alternatives) + * [Different ID Structure](#different-id-structure) + * [No Entity Events](#no-entity-events) + * [Merge Entity Events data into Resource](#merge-entity-events-data-into-resource) + * [Hierarchical ID Field](#hierarchical-id-field) +- [Open questions](#open-questions) + * [Attribute Data Type](#attribute-data-type) + * [Classes of Entity Types](#classes-of-entity-types) + * [Multiple Observers](#multiple-observers) + * [Is Type part of Entity's identity?](#is-type-part-of-entitys-identity) +- [Future Work](#future-work) +- [References](#references) + + + +## Motivation + +This data model sets the foundation for adding entities to OpenTelemetry. 
The data model +is largely borrowed from +[the initial proposal](https://docs.google.com/document/d/1VUdBRInLEhO_0ABAoiLEssB1CQO_IcD5zDnaMEha42w/edit) +that was accepted for entities SIG formation. + +This OTEP is step 1 in introducing the entities data model. Follow up OTEPs will add +further data model definitions, including the linking of Resource information +to entities. + +## Design Principles + +- Consistency with the rest of OpenTelemetry is important. We heavily favor solutions + that look and feel like other OpenTelemetry data models. + +- Meaningful (especially human-readable) IDs are more valuable than random-generated IDs. + Long-lived IDs that survive state changes (e.g. entity restarts) are more valuable than + short-lived, ephemeral IDs. + See [the need for navigation](https://docs.google.com/document/d/1Xd1JP7eNhRpdz1RIBLeA1_4UYPRJaouloAYqldCeNSc/edit#heading=h.fut2c2pec5wa). + +- We cannot make an assumption that the entirety of information that is necessary for + global identification of an entity is available at once, in one place. This knowledge + may be distributed across multiple participants and needs to be combined to form a + globally unique identifier. + +- Semantic conventions must bring as much order as possible to telemetry, however they + cannot be too rigid and prevent real-world use cases. + +## Data Model + +We propose a new concept of Entity. + +Entity represents an object of interest associated with produced telemetry: +traces, metrics or logs. + +The Entity that produces the telemetry is called the Producing Entity. For example +telemetry produced using OpenTelemetry SDK is normally associated with a Service. +The Service is the Producing Entity. Similarly, OpenTelemetry defines system metrics +for a host. The Host is the Producing Entity in this case. + +Entities may be also associated with produced telemetry indirectly, typically by +virtue of being related to the Producing Entity. 
For example a Service that produces +telemetry is also related with a process in which the Service runs, so we say that +the Service entity is related to the Process entity. The process normally also runs +on a host, so we say that the Process entity is related to the Host entity. + +Note: subsequent OTEPs will define how the Producing Entity is associated with +traces, metrics and logs and how relations between entities will be specified. +See [Future Work](#future-work). + +The data model below defines a logical model for an entity (irrespective of the physical +format and encoding of how entity data is recorded). + + + + + + + + + + + + + + + + + + + + + + +
Field + Type + Description +
Type + string + Defines the type of the Entity. MUST NOT change during the +lifetime of the entity. For example: "service" or "host". This field is +required and MUST NOT be empty for valid entities. +
Id + key/value pair list + A set of attributes that identifies the Entity. +

+MUST NOT change during the lifetime of the Entity. The Id must contain +at least one attribute. +

+Follows OpenTelemetry common +attribute definition. SHOULD follow OpenTelemetry semantic +conventions for attributes. +

Attributes + key/value pair list + A set of descriptive (non-identifying) attributes of the Entity. +

+MAY change over the lifetime of the entity. MAY be empty. These +attributes are not part of Entity's identity. +

+“value” follows any +value definition in the OpenTelemetry spec - it can be a scalar value, +byte array, an array or map of values. Arbitrary deep nesting of values +for arrays and maps is allowed. +

+SHOULD follow OpenTelemetry semantic +conventions for attributes. +
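
The logical model above can be sketched in code as follows. This is only an illustration under stated assumptions, not a proposed API: the `Entity` class, the `with_attributes` helper and the `host.name` value are hypothetical names chosen for this example. It shows the two validity rules from the table (non-empty Type, at least one Id attribute) and that changing descriptive attributes leaves Type and Id untouched.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Mapping


@dataclass(frozen=True)
class Entity:
    """Hypothetical in-memory form of the logical entity data model."""

    type: str                      # e.g. "service" or "host"; never changes
    id: Mapping[str, Any]          # identifying attributes; never change
    attributes: Mapping[str, Any] = field(default_factory=dict)  # descriptive

    def __post_init__(self) -> None:
        # Type is required and must not be empty.
        if not self.type:
            raise ValueError("Type is required and must not be empty")
        # Id must contain at least one attribute.
        if not self.id:
            raise ValueError("Id must contain at least one attribute")

    def with_attributes(self, attributes: Mapping[str, Any]) -> "Entity":
        """Return a copy with new descriptive attributes; Type and Id are kept."""
        return replace(self, attributes=dict(attributes))


host = Entity(
    type="host",
    id={"host.id": "fdbf79e8af94cb7f9e8df36789187052"},
    attributes={"host.name": "web-1"},  # illustrative value
)
renamed = host.with_attributes({"host.name": "web-2"})
```

Making the class immutable is a design choice for the sketch: a change of descriptive attributes produces a new value, while the identity fields are carried over unchanged.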

+ +### Minimally Sufficient Id + +Often a number of attributes of an entity is readily available for the telemetry +producer to compose an Id from. Of the available attributes the Entity Id should +include the minimal set of attributes that is sufficient for uniquely identifying +that entity. No superfluous attributes should be included in the Id set. For example +a Process on a host can be uniquely identified by `process.id` attribute. Adding for +example `process.executable.name` attribute to the Id is unnecessary and violates the +Minimally Sufficient Id rule. + +### Examples of Entities + +_This section is non-normative and is present only for the purposes of demonstrating +the data model._ + +Here are examples of entities, the typical identifying attributes they +have and some examples of non-identifying attributes that may be +associated with the Entity. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Entity + Entity Type + Identifying Attributes + Non-identifying Attributes +
Service + "service" + service.name (required) +

+service.instance.id +

+service.namespace +

service.version +
Host + "host" + host.id + host.name +

+host.type +

+host.image.id +

+host.image.name +

K8s Pod + "k8s.pod" + k8s.pod.uid (required) +

+k8s.cluster.name +

Any pod labels +
K8s Pod Container + "container" + k8s.pod.uid (required) +

+k8s.cluster.name +

+container.name +

Any container labels +
+ +See more examples showing nuances of ID field composition in the +[Entity Identification](#entity-identification) section. + +## Entity Events + +Information about Entities can be produced and communicated using 2 +types of Entity events: EntityState and EntityDelete. + +### EntityState Event + +The EntityState event stores information about the _state_ of the Entity +at a particular moment of time. The data model of the EntityState event +is the same as the Entity Data model with some extra fields: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Field + Type + Description +
Timestamp + nanoseconds + The time since when the entity state is described by this event. +The time is measured by the origin clock. The field is required. +
Interval + milliseconds + Defines the reporting period, i.e. how frequently the +information about this entity is reported via EntityState events even if +the entity does not change. The next EntityState event for this +entity is expected at (Timestamp+Interval) time. Can be used by +receivers to infer that a no longer reported entity is gone, even if the +EntityDelete event was not observed. Optional; if missing, the interval +is unknown. +
Type + + See data model. + +
Id + + See data model +
Attributes + + See data model +
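
Since the table lists Timestamp in nanoseconds and Interval in milliseconds, a receiver has to convert units before adding them. A minimal sketch of how a receiver might compute the next expected EntityState time and decide that an entity is presumed gone is shown below. The function names and the `slack_ms` parameter are assumptions made for this example; the OTEP does not prescribe how much slack to allow.

```python
from typing import Optional

NANOS_PER_MILLI = 1_000_000


def next_expected_ns(timestamp_ns: int, interval_ms: Optional[int]) -> Optional[int]:
    """Time (ns) at which the next EntityState event is expected.

    Returns None when Interval is missing, i.e. the reporting period is unknown.
    """
    if interval_ms is None:
        return None
    return timestamp_ns + interval_ms * NANOS_PER_MILLI


def is_presumed_gone(
    now_ns: int,
    timestamp_ns: int,
    interval_ms: Optional[int],
    slack_ms: int = 0,  # hypothetical tolerance for late events
) -> bool:
    """True if no EntityState arrived by the expected time plus slack."""
    deadline_ns = next_expected_ns(timestamp_ns, interval_ms)
    if deadline_ns is None:
        # Without an Interval nothing can be inferred from silence.
        return False
    return now_ns > deadline_ns + slack_ms * NANOS_PER_MILLI
```

With an unknown interval the sketch never expires an entity; a real receiver may prefer a configured default instead.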
+ +We say that an Entity mutates (changes) when one or more of its +descriptive attributes change. A new descriptive attribute may be +added, an existing descriptive attribute may be deleted or a value of an +existing descriptive attribute may be changed. All these changes +represent valid mutations of an Entity over time. When these mutations +happen the identity of the Entity does not change. + +When the entity's state is changed it is expected that the source will +emit a new EntityState event with a fresh timestamp and full list of +values of all other fields. + +Entity event producers are recommended to periodically emit events even +if the Entity does not change. In this case the Type, Id and Attributes +fields will remain the same, but a fresh Timestamp will be recorded in +the event. Producing such events allows the system to be resilient to +event losses. Even if some events are lost, the correct state +of the Entity is likely to eventually reach the final destination. +Periodic sending of EntityState events also serves as a liveness +indicator (see below how it can be used in lieu of EntityDelete event). + +### EntityDelete Event + +EntityDelete event indicates that a particular entity is gone: + + + + + + + + + + + + + + + + + + + + + + +
Field + Type + Description +
Timestamp + nanoseconds + The time when the entity was deleted. The time is measured by +the origin clock. The field is required. +
Type + + See data model +
Id + + See data model +
+ +Note that EntityDelete is optional and is not guaranteed to be sent when +the entity is gone. Recipients of entity signals should be prepared to +handle this situation by expiring entities for which EntityState events +are no longer reported (i.e. treat the presence of EntityState +events as a liveness indicator). + +The expiration mechanism is based on the previously reported `Interval` field of +the EntityState event. The recipient can use this value to compute when to expect the next +EntityState event and if the event does not arrive in a timely manner (plus some slack) +it can consider the entity to be gone even if the EntityDelete event was not observed. + +## Entity Identification + +_This section is a supplementary guideline and is not part of the logical data model._ + +The data model defines the structure of the entity ID field. This section explains +how the ID field is computed. + +### LID, GID and IDCONTEXT + +All entities have a local ID (LID) and a global ID (GID). + +The LID is unique in a particular identification context, but is not necessarily globally +unique. For example a process entity's LID is its PID number and process start time. +The (PID,StartTime) pair is unique only in the context of a host where the process runs +(and the host in this case is the identification context). + +The GID of an entity is globally unique, in the sense that for the entire set of entities +in a particular telemetry store no 2 entities exist that have the same GID value. + +The GID of an entity E is defined as: + +`GID(E) = { LID(E), GID(IDCONTEXT(E)) }` + +Where `IDCONTEXT(E)` is the identification context in which the LID of entity E is unique. +The value of `IDCONTEXT(E)` is an Entity itself, and thus we can compute its GID too. + +In other words, the GID of an entity is a union of its LID and the GID of its +identification context. Note: GID(E) is a flat set of key/value attributes. 
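
The recursive definition can be sketched as a small function. This is an illustration only: the `IdentifiedEntity` class and the `gid` function are hypothetical names, and the sketch assumes that a root entity (one with no identification context) has a GID equal to its LID, and that attribute keys of different LIDs do not collide.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class IdentifiedEntity:
    """Hypothetical holder of an entity's LID and its identification context."""

    lid: Dict[str, Any]
    id_context: Optional["IdentifiedEntity"] = None  # None for a root entity


def gid(entity: IdentifiedEntity) -> Dict[str, Any]:
    """GID(E) = LID(E) united with GID(IDCONTEXT(E)), flattened into one map."""
    result = dict(entity.lid)
    if entity.id_context is not None:
        # Assumption: keys are namespaced per entity type, so no key collides.
        result.update(gid(entity.id_context))
    return result


host = IdentifiedEntity(lid={"host.id": "fdbf79e8af94cb7f9e8df36789187052"})
process = IdentifiedEntity(
    lid={"process.pid": 12345, "process.start_time": 1714491491},
    id_context=host,
)
process_gid = gid(process)
```

Here `process_gid` contains the two process attributes plus `host.id`, and it is globally unique only if host ids are.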
+ +The enrichment process often is responsible for determining the value of `IDCONTEXT(E)` +and for computing the GID according to the formula defined above, although the GID may +also be produced at once by the telemetry source (e.g. by Otel SDK) without requiring +any additional enrichment. + +### Semantic Conventions + +OpenTelemetry semantic conventions will be enhanced to include entity definitions for +well-known entities such as Service, Process, Host, etc. + +For well-known entity types LID(E) is defined in Otel semantic conventions per +entity type. The value of LID is a flat set of key/value attributes. For example, +for entity of type "process" the semantic conventions define LID as 2 attributes: + +```json5 +{ + "process.pid": $pid, + "process.start_time": $starttime +} +``` + +For custom entity types (not defined in Otel semantic conventions) the end-user is +responsible for defining their custom semantic conventions in a similar way. + +The entity information producer is responsible for determining the identification +context of each entity it is producing information about. + +In certain cases, where only one possible IDCONTEXT definition is meaningful, the +IDCONTEXT MAY be defined in the semantic conventions. For example Kubernetes nodes +always exist in the identifying context of a Kubernetes cluster. The semantic convention +for "k8s.node" and "k8s.cluster" can prescribe that the IDCONTEXT of entity of type +"k8s.node" is always an entity of type "k8s.cluster". + +Important: semantic conventions can NEVER prescribe the complete GID composition. +Semantic conventions SHOULD prescribe LID and MAY prescribe IDCONTEXT, but GID +composition, generally speaking, cannot be known statically. + +For example: a host's LID should be a `host.id` attribute. A host running on a cloud +should have an IDCONTEXT of "cloud.account" and the LID of "cloud.account" entity +is (`cloud.provider`, `cloud.account.id`). 
However semantic conventions cannot prescribe +that the GID of a host is (`host.id`, `cloud.provider`, `cloud.account.id`) because not all +hosts run on cloud. A host that runs on prem in a single data center may have a GID +of just (`host.id`) or if a customer has multiple on prem data centers they may use +data.center.id as its identifier and use (`host.id`, `data.center.id`) as GID of the host. + +### Examples + +#### Process in a Host + +A locally running host agent (an Otel Collector) that produces +information about "process" entities has the knowledge that the +processes run in the particular host and thus the "host" is the +identification context for the processes that the agent observes. The +LID of a process can look like this: + +```json5 +{ + "process.pid": 12345, + "process.start_time": 1714491491 +} +``` + +and Collector will use "host" as the IDCONTEXT and add host's LID to it: + +```json5 +{ + // Process LID, unique per host. + "process.pid": 12345, + "process.start_time": 1714491491, + + + // Host LID + "host.id": "fdbf79e8af94cb7f9e8df36789187052" +} +``` + +If we assume that we have only one data center and host ids are globally +unique then the above id is globally unique and is the GID of the +process. If this assumption is not valid in our situation we would +continue applying additional IDCONTEXT's until the GID is globally +unique. See for example the +[Host in Cloud Account](#host-in-cloud-account) example below. + +#### Process in Kubernetes + +A Kubernetes Collector that produces information about process entities +has the knowledge that the processes run in the particular containers in +the particular pod and thus the container is the identification context +for the process, and the pod is the identification context for the +container. 
If we begin with the same process LID: + +```json5 +{ + "process.pid": 12345, + "process.start_time": 1714491491 +} +``` + +the Kubernetes Collector will then add the IDCONTEXT of container and of +pod to this, resulting in: + +```json5 +{ + // Process LID, unique per container. + "process.pid": 12345, + "process.start_time": 1714491491, + + // Container LID, unique per pod. + "k8s.container.name": "redis", + + + // Pod LID has 2 attributes. + "k8s.pod.uid": "0c4cbbf8-d4b4-4e84-bc8b-b95f0d537fc7", + "k8s.cluster.name": "dev" +} +``` + +Note that we used 3 different LIDs above to compose the GID. The +attributes that are part of each LID are defined in Otel semantic +conventions. + +In this example we assume this to be a valid GID because Pod is the root +IDCONTEXT, since Pod's LID includes the cluster name, which is expected +to be globally unique. If this assumption about global uniqueness of +cluster names is wrong then another containing IDCONTEXT within which +cluster names are unique will need to be applied and so forth. + +Note also how we used a pair (`k8s.pod.uid`, `k8s.cluster.name`). +Alternatively, we could say that Kubernetes Cluster is a separate entity +we care about. This would mean the Pod's IDCONTEXT is the cluster. The +net result for process's GID would be exactly the same, but we would +arrive at it in a different way: + +```json5 +{ + // Process LID, unique per container. + "process.pid": 12345, + "process.start_time": 1714491491, + + // Container LID, unique per pod. + "k8s.container.name": "redis", + + + // Pod LID, unique per cluster. + "k8s.pod.uid": "0c4cbbf8-d4b4-4e84-bc8b-b95f0d537fc7", + + // Cluster LID, also globally unique since cluster is root entity. + "k8s.cluster.name": "dev" +} +``` + +#### Host in Cloud Account + +A host running in a cloud account (e.g. AWS) will have a LID that uses +the host instance id, unique within a single cloud account, e.g.: + +```json5 +{ + // Host LID, unique per cloud account. 
+ "host.id": "fdbf79e8af94cb7f9e8df36789187052" +} +``` + +Otel Collector with +[resourcedetection](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/resourcedetectionprocessor) +processor with "aws" detector enabled will add the IDCONTEXT of the +cloud account like this: + +```json5 +{ + // Host LID, unique per cloud account. + "host.id": "fdbf79e8af94cb7f9e8df36789187052", + + // Cloud account LID has 2 attributes: + "cloud.provider": "aws", + "cloud.account.id": "1234567890" +} +``` + +## Prototypes + +A set of prototypes that demonstrate this data model has been implemented: + +- [Go SDK Prototype](https://github.com/tigrannajaryan/opentelemetry-go/pull/244) +- [Collector Prototype](https://github.com/tigrannajaryan/opentelemetry-collector/pull/4) +- [Collector Contrib Prototype](https://github.com/tigrannajaryan/opentelemetry-collector-contrib/pull/1/files) +- [OTLP Protocol Buffer changes](https://github.com/tigrannajaryan/opentelemetry-proto/pull/2/files) + +## Prior Art + +An experimental entity data model was implemented in OpenTelemetry Collector as described +in [this document](https://docs.google.com/document/d/1Tg18sIck3Nakxtd3TFFcIjrmRO_0GLMdHXylVqBQmJA/edit). +The Collector's design uses LogRecord as the carrier of Entity events, with logical structure +virtually identical to what this OTEP proposes. + +There is also an implementation of this design in the Collector, see +[completed issue to add entity events](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23565) +and [the PR](https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/24419) +that implements entity event emitting for k8scluster receiver in the Collector. 
+ +## Alternatives + +### Different ID Structure + +Alternative proposals were made [here](https://docs.google.com/document/d/1PLPSAnWvFWCsm6meAj6OIVDBvxsk983V51WugF0NgVo/edit) and +[here](https://docs.google.com/document/d/1bLmkQSv35Fi6Wbe4bAqQ-_JS7XWIXWbvuirVmAkz4a4/edit) +to use a different structure for entity ID field. + +We rejected these proposals in favour of the ID field proposed in this OTEP for the +following reasons: + +- The flat set of key/value attributes is widely used elsewhere in OpenTelemetry as + Resource attributes, as Scope attributes, as Metric datapoint attributes, etc. so + it is conceptually consistent with the rest of Otel. + +- We already have a lot of machinery that works well with this definition of attributes, + for example OTTL language has syntax for working with attributes, or Collector's pdata + API or Attribute value types in SDKs. All this code would no longer work as is if we + had a different data structure and would need to be re-implemented in a different way. + +### No Entity Events + +Entity signal allows recording the state of the entities. As the entity's state changes +events are emitted that describe the new state. In this proposal the entity's state is +(type,id,attributes) tuple, but we envision that in the future we may also want to add +more information to the Entity signal, particularly to record the relationships between +entities (e.g. the fact that a Process runs on a Host). + +### Merge Entity Events data into Resource + +If we eliminate the Entity signal as a concept and put the entire entity's state into +the Resource then every time the entity's state changes we must emit one of +ResourceLogs/ResourceSpans/ResourceMetrics messages that includes the Resource that +represents the entity's state. + +However, what do we do if there are no logs or spans or metrics data points to +report? Do we emit a ResourceLogs/ResourceSpans/ResourceMetrics OTLP message with empty logs +or spans or metrics data points? 
Which one do we emit: ResourceLogs, ResourceSpans +or ResourceMetrics? + +What do we do when we want to add support for recording entity relationships in the +future? Do we add all that information to the Resource and bloat the Resource size? + +How do we report the EntityDelete event? + +None of these questions has a good answer. Attempting to shoehorn the entity +information into the Resource where it does not naturally fit is likely to result +in ugly and inefficient solutions. + +### Hierarchical ID Field + +We had an alternate proposal to retain the information about how the ID was +[composed from LID and IDCONTEXT](#entity-identification), essentially to record the +hierarchy of identification contexts in the ID data structure instead of flattening it +and losing the information about the composition process that resulted in the particular ID. + +We rejected this proposal for a couple of reasons: + +- The flat ID structure is simpler. + +- There are no known use cases that require a hierarchical ID structure. The use case + of "record parental relationship between entities" will be handled explicitly via + separate relationship data structures (see [Future Work](#future-work)). + +## Open questions + +### Attribute Data Type + +The data model requires the Attributes field to use the extended +[any](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#type-any) +attribute values, which allow more complex data types. This is different from the data +type used by the Id field, which is more restricted in shape. + +Are we happy with this discrepancy? + +### Classes of Entity Types + +Do we need to differentiate between infrastructure entities (e.g. Pod, Host, Process) +and non-infrastructure entities (logical entities) such as Service? Is this distinction +important? + +### Multiple Observers + +The same entity may be observed by different observers simultaneously. 
For example the +information about a Host may be reported by the agent that runs on the host. At the same +time more information about that same host may be obtained via cloud provider's API. + +The information obtained by different observers can be complementary, they don't +necessarily have access exactly to the same data. It can be very useful to combine this +information in the backend and make it all available to the user. + +However, it is not possible for multiple observers to simultaneously use EntityState +events as they are defined earlier in this document, since the information in the event +will overwrite information in the previously received event about that same entity. + +A possible way to allow multiple observers to report portions of information about the +same entity simultaneously is to indicate the observer in the EntityState event by adding +an "ObserverId" field. EntityState event will then look like this: + +|Field|Type| +|---|---| +|Timestamp|nanoseconds| +|Interval|milliseconds| +|Type|| +|Id|| +|Attributes|| +|ObserverId|string or bytes| + +ObserverId field can be optional. Attributes from EntityState events that contain +different ObserverId values will be merged in the backend. Attributes from EntityState +events that contain the same ObserverId value will overwrite attributes from the previous +reporting of the EntityState event from that observer. + +### Is Type part of Entity's identity? + +Is the Type field part of the entity's identity together with the Id field? + +For example let's assume we have a Host and an Otel Collector running on the Host. +The Host's Id will contain one attribute: `host.id`, and the Type of the entity will be +"host". The Collector technically speaking can be also identified by one attribute +`host.id` and the Type of the entity will be "otel.collector". This only works if we +consider the Type field to be part of the entity's identity. 
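
To make the question concrete, the sketch below indexes the Host and the Collector from this example in two hypothetical backend stores: one keyed by (Type, Id) and one keyed by Id alone. The `identity_key` function and both store layouts are assumptions for illustration, and the sketch assumes Id attribute values are hashable scalars.

```python
from typing import Any, Dict, Hashable, Mapping


def identity_key(
    entity_type: str, id_attrs: Mapping[str, Any], type_is_identifying: bool
) -> Hashable:
    """Build a dictionary key for an entity under either identity rule."""
    id_part = frozenset(id_attrs.items())  # order-independent, hashable
    return (entity_type, id_part) if type_is_identifying else id_part


shared_id = {"host.id": "fdbf79e8af94cb7f9e8df36789187052"}
observed = [("host", shared_id), ("otel.collector", shared_id)]

by_type_and_id: Dict[Hashable, str] = {}
by_id_only: Dict[Hashable, str] = {}
for entity_type, id_attrs in observed:
    by_type_and_id[identity_key(entity_type, id_attrs, True)] = entity_type
    # With Id alone, the second entity silently replaces the first.
    by_id_only[identity_key(entity_type, id_attrs, False)] = entity_type
```

After the loop `by_type_and_id` holds two distinct entities, while `by_id_only` holds a single entry because both share the same `host.id`.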
+ +If the Type field is not part of identity then in the above example we require that the +entity that describes the Collector has some other attribute in its Id (for example +`agent.type` attribute [if it gets accepted](https://github.com/open-telemetry/semantic-conventions/pull/950)). + +## Future Work + +This OTEP is step 1 of defining the Entities data model. It will be followed by other +OTEPs that cover the following topics: + +- How the existing Resource concept will be modified to link existing + signals to entities. + +- How relationships between entities are modeled. + +- Representation of entity data over the wire and the transmission + protocol for entities. + +- Add transformations that describe entity semantic convention changes in + OpenTelemetry Schema Files. + +We will possibly also submit additional OTEPs that address the Open Questions. + +## References + +- [OpenTelemetry Proposal: Resources and Entities](https://docs.google.com/document/d/1VUdBRInLEhO_0ABAoiLEssB1CQO_IcD5zDnaMEha42w/edit) +- [OpenTelemetry Entity Data Model](https://docs.google.com/document/d/1FdhTOvB1xhx7Ks7dFW6Ht1Vfw2myU6vyKtEfr_pqZPw/edit) +- [OpenTelemetry Entity Identification](https://docs.google.com/document/d/1hJIAIMsRCgZs-poRsw3lnirP14d3sMfn1LB08C4LCDw/edit) +- [OpenTelemetry Resources - Principles and Characteristics](https://docs.google.com/document/d/1Xd1JP7eNhRpdz1RIBLeA1_4UYPRJaouloAYqldCeNSc/edit)