Added documentation for select entries, map-to-list, and truncate processors. #6660

Merged
merged 18 commits on Mar 27, 2024
Changes from 9 commits

Commits (18)
eba0f77
Added documentation for select entries and truncate processors. Update…
kkondaka Mar 13, 2024
4a1c1cf
Update select-entries.md
Naarcha-AWS Mar 25, 2024
7038330
Update and rename truncate-processor.md to truncate.md
Naarcha-AWS Mar 25, 2024
53b4bce
Apply suggestions from code review
Naarcha-AWS Mar 25, 2024
d037519
Delete _data-prepper/pipelines/configuration/processors/map-to-list.md
Naarcha-AWS Mar 25, 2024
30443c8
Update _data-prepper/pipelines/configuration/processors/mutate-event.md
Naarcha-AWS Mar 25, 2024
8bb36c9
Merge branch 'main' into 27-update
Naarcha-AWS Mar 25, 2024
230ce17
Apply suggestions from code review
Naarcha-AWS Mar 25, 2024
dfcbecf
Apply suggestions from code review
Naarcha-AWS Mar 26, 2024
c394dfa
Apply suggestions from code review
Naarcha-AWS Mar 27, 2024
a8ac14b
Update _data-prepper/pipelines/configuration/buffers/kafka.md
Naarcha-AWS Mar 27, 2024
79f740f
Update _data-prepper/pipelines/configuration/buffers/kafka.md
Naarcha-AWS Mar 27, 2024
0f061af
Update _data-prepper/pipelines/configuration/buffers/kafka.md
Naarcha-AWS Mar 27, 2024
9c68116
Update _data-prepper/pipelines/configuration/processors/mutate-event.md
Naarcha-AWS Mar 27, 2024
3e858ad
Update _data-prepper/pipelines/configuration/sources/s3.md
Naarcha-AWS Mar 27, 2024
e186870
Apply suggestions from code review
Naarcha-AWS Mar 27, 2024
f3a5b17
Update _data-prepper/pipelines/configuration/processors/truncate.md
Naarcha-AWS Mar 27, 2024
bc86d25
Merge branch 'main' into 27-update
Naarcha-AWS Mar 27, 2024
15 changes: 12 additions & 3 deletions _data-prepper/pipelines/configuration/buffers/kafka.md
@@ -41,11 +41,12 @@

Option | Required | Type | Description
--- | --- | --- | ---
`authentication` | No | [Authentication](#authentication) | Sets the authentication options for both the pipeline and Kafka. For more information, see [Authentication](#authentication).
`aws` | No | [AWS](#aws) | The AWS configuration. For more information, see [aws](#aws).
`bootstrap_servers` | Yes | String list | The host and port for the initial connection to the Kafka cluster. You can configure multiple Kafka brokers by using the IP address or the port number for each broker. When using [Amazon Managed Streaming for Apache Kafka (Amazon MSK)](https://aws.amazon.com/msk/) as your Kafka cluster, the bootstrap server information is obtained from Amazon MSK using the Amazon Resource Name (ARN) provided in the configuration.
`encryption` | No | [Encryption](#encryption) | The encryption configuration for encryption in transit. For more information, see [Encryption](#encryption).
`producer_properties` | No | [Producer Properties](#producer_properties) | A list of configurable Kafka producer properties.
`topics` | Yes | List | A list of [topics](#topic) for the buffer to use. You must supply one topic per buffer.


### topic
@@ -73,6 +74,7 @@
`retry_backoff` | No | Integer | The amount of time to wait before attempting to retry a failed request to a given topic partition. Default is `10s`.
`max_poll_interval` | No | Integer | The maximum delay between invocations of a `poll()` when using group management through Kafka's `max.poll.interval.ms` option. Default is `300s`.
`consumer_max_poll_records` | No | Integer | The maximum number of records returned in a single `poll()` call through Kafka's `max.poll.records` setting. Default is `500`.
`max_message_bytes` | No | Integer | The maximum size of the message in bytes. Default is 1 MB.


### kms
@@ -123,6 +125,13 @@
`type` | No | String | The encryption type. Use `none` to disable encryption. Default is `ssl`.
`insecure` | No | Boolean | A Boolean flag used to turn off SSL certificate verification. If set to `true`, certificate authority (CA) certificate verification is turned off and insecure HTTP requests are sent. Default is `false`.

#### producer_properties

Use the following configuration options to configure a Kafka producer.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`max_request_size` | No | Integer | The maximum size of a request that the producer sends to Kafka. Default is 1 MB.
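For example, a `kafka` buffer that raises both the topic message size and the producer request size might look like the following. This is a minimal sketch, not part of the diff; the `bootstrap_servers` value, topic `name`, and `group_id` are placeholders:

```yaml
buffer:
  kafka:
    bootstrap_servers: ["localhost:9092"]
    topics:
      - name: data-prepper-buffer       # placeholder topic name
        group_id: data-prepper-group    # placeholder consumer group
        max_message_bytes: 2097152      # 2 MB, in bytes
    producer_properties:
      max_request_size: 2097152         # 2 MB, in bytes
```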


#### aws

_data-prepper/pipelines/configuration/processors/mutate-event.md
@@ -11,11 +11,13 @@ nav_order: 65
Mutate event processors allow you to modify events in Data Prepper. The following processors are available, and a combined usage sketch follows the list:

* [add_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/add-entries/) allows you to add entries to an event.
* [convert_entry_type]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/convert_entry_type/) allows you to convert value types in an event.
* [copy_values]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/copy-values/) allows you to copy values within an event.
* [delete_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/delete-entries/) allows you to delete entries from an event.
* [list_to_map]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/list-to-map) allows you to convert a list of objects from an event, where each object contains a `key` field, into a map of target keys.
* [map_to_list]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/map-to-list/) allows you to convert a map of key-value pairs in an event into a list of objects, each containing the key and its value.
* [rename_keys]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/rename-keys/) allows you to rename keys in an event.
* [select_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/select-entries/) allows you to select entries from an event.
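The following is a minimal sketch of how two of these processors might be chained in a pipeline; the key names (`environment` and `debug_info`) and values are hypothetical:

```yaml
pipeline:
  source:
    ...
  processor:
    # Add a static entry to every event.
    - add_entries:
        entries:
          - key: "environment"
            value: "production"
    # Remove a key that is not needed downstream.
    - delete_entries:
        with_keys: ["debug_info"]
  sink:
    ...
```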



51 changes: 51 additions & 0 deletions _data-prepper/pipelines/configuration/processors/select-entries.md
@@ -0,0 +1,51 @@
---
layout: default
title: select_entries
parent: Processors
grand_parent: Pipelines
nav_order: 59
---

# select_entries

The `select_entries` processor selects entries from a Data Prepper event. Only the selected entries remain in the event; all other entries are removed.

## Configuration

You can configure the `select_entries` processor using the following options.

| Option | Required | Description |
| :--- | :--- | :--- |
| `include_keys` | Yes | A list of keys to be selected from an event. |
| `select_when` | No | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"`, that will be evaluated to determine whether the processor will be run on the event. |

### Usage

The following example shows how to configure the `select_entries` processor in the `pipeline.yaml` file:

```yaml
pipeline:
  source:
    ...
  processor:
    - select_entries:
        include_keys: [ "key1", "key2" ]
        select_when: '/some_key == "test"'
  sink:
```
{% include copy.html %}


For example, when your source contains the following event record:

```json
{"message": "hello", "key1" : "value1", "key2" : "value2", "some_key" : "test"}
```

The `select_entries` processor includes only `key1` and `key2` in the processed output:

```json
{"key1": "value1", "key2": "value2"}
```
107 changes: 107 additions & 0 deletions _data-prepper/pipelines/configuration/processors/truncate.md
@@ -0,0 +1,107 @@
---
layout: default
title: truncate
parent: Processors
grand_parent: Pipelines
nav_order: 121
---

# truncate


The `truncate` processor truncates a key's value at the beginning, at the end, or on both sides of the value string, based on the processor's configuration. If the key's value is a list, each string member of the list is truncated; non-string members are left unchanged. When the `truncate_when` option is provided, the input is truncated only when the specified condition evaluates to `true` for the event being processed.

> **Collaborator:** In the last sentence, should "true" be in code font?


## Configuration

You can configure the `truncate` processor using the following options.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`entries` | Yes | List | A list of entries to truncate in an event.
`source_keys` | No | String list | The list of source keys that will be modified by the processor. Default is an empty list, which means that all keys' values are truncated.
`truncate_when` | No | Conditional expression | A condition that, when met, determines when the truncate operation is performed.
`start_at` | No | Integer | The index in the string value at which truncation starts. Default is `0`, which means that truncation of each key's value starts at the beginning of the string.
`length` | No | Integer | The length of the string after truncation. If not specified, truncation continues to the end of the string.

For the `truncate` processor to run, either the `start_at` or `length` option must be present in the configuration. You can define both options to keep a substring of a specific length starting at a specific position.
> **Collaborator:** "For greater customization" is a bit vague (Customization of what? Why?). Can this sentence be rephrased as "You can define both values in the configuration in order to [what specifically the user can do by defining both values]."


## Usage

The following examples show how to configure the `truncate` processor in the `pipeline.yaml` file.

## Example: Minimum configuration

The following example shows the minimum configuration of the `truncate` processor:

```yaml
pipeline:
  source:
    file:
      path: "/full/path/to/logs_json.log"
      record_type: "event"
      format: "json"
  processor:
    - truncate:
        entries:
          - source_keys: ["message1", "message2"]
            length: 5
          - source_keys: ["info"]
            length: 6
            start_at: 4
          - source_keys: ["log"]
            start_at: 5
  sink:
    - stdout:
```

For example, the following event contains several keys with string values:

```json
{"message1": "hello,world", "message2": "test message", "info", "new information", "log": "test log message"}
```

> **Collaborator:** "Then" doesn't really work here because we haven't referenced any action, only that the event contains keys with string values. I'd also prefer that we don't introduce the list with an incomplete sentence ending with "in which".

The `truncate` processor truncates each key's value as follows:

- For the `message1` and `message2` keys, `start_at` defaults to `0`, so truncation begins at the start of each string, and each string is truncated to a length of `5`.
- For the `info` key, `start_at` is set to `4`, so truncation begins at the letter `i`, and the string is truncated to a length of `6`.
- For the `log` key, `start_at` is set to `5` with no `length` specified, so truncation begins at the letter `l` and continues to the end of the string.

The processor produces the following output:

```json
{"message1":"hello", "message2":"test ", "info":"inform", "log": "log message"}
```


## Example: Using `truncate_when`

The following example configuration shows the `truncate` processor with the `truncate_when` option configured:

```yaml
pipeline:
  source:
    file:
      path: "/full/path/to/logs_json.log"
      record_type: "event"
      format: "json"
  processor:
    - truncate:
        entries:
          - source_keys: ["message"]
            length: 5
            start_at: 7
            truncate_when: '/id == 1'
  sink:
    - stdout:
```

The following example contains two events:

```json
{"message": "hello, world", "id": 1}
{"message": "hello, world,not-truncated", "id": 2}
```

When the `truncate` processor runs on these events, only the first event is truncated because the `id` key contains a value of `1`:

```json
{"message": "world", "id": 1}
{"message": "hello, world,not-truncated", "id": 2}
```
1 change: 1 addition & 0 deletions _data-prepper/pipelines/configuration/sinks/file.md
@@ -17,6 +17,7 @@ The following table describes options you can configure for the `file` sink.
Option | Required | Type | Description
:--- | :--- | :--- | :---
path | Yes | String | Path for the output file (e.g. `logs/my-transformed-log.log`).
append | No | Boolean | When `true`, the sink file is opened in append mode.
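For example, a `file` sink that appends to an existing output file instead of overwriting it might look like the following (a minimal sketch, not part of the diff; the path is a placeholder):

```yaml
sink:
  - file:
      # Placeholder output path; events are appended rather than overwritten.
      path: "logs/my-transformed-log.log"
      append: true
```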

## Usage

1 change: 1 addition & 0 deletions _data-prepper/pipelines/configuration/sources/s3.md
@@ -99,6 +99,7 @@ buffer_timeout | No | Duration | The amount of time allowed for writing events
`s3_select` | No | [s3_select](#s3_select) | The Amazon S3 Select configuration.
`scan` | No | [scan](#scan) | The S3 scan configuration.
`delete_s3_objects_on_read` | No | Boolean | When `true`, the S3 scan attempts to delete S3 objects after all events from the S3 object are successfully acknowledged by all sinks. `acknowledgments` should be enabled when deleting S3 objects. Default is `false`.
`workers` | No | Integer | The number of worker threads. Default is `1`, with a minimum of `1` and a maximum of `1000`. Each worker thread subscribes to Amazon SQS messages; when a worker receives a message from SQS, it processes that message independently of the other workers.
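For example, an `s3` source that processes SQS notifications across four worker threads might look like the following. This is a minimal sketch, not part of the diff; the queue URL and AWS Region are placeholders:

```yaml
source:
  s3:
    notification_type: "sqs"
    codec:
      newline:
    sqs:
      # Placeholder queue URL for the bucket's event notifications.
      queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
    workers: 4   # four threads consume SQS messages independently
    aws:
      region: "us-east-1"
```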


## sqs