Commit

Added documentation for select entries, map-to-list, and truncate processors. (#6660) (#6801)

* Added documentation for select entries and truncate processors. Updated other documents.

Signed-off-by: Kondaka <krishkdk@amazon.com>

* Update select-entries.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update and rename truncate-processor.md to truncate.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Delete _data-prepper/pipelines/configuration/processors/map-to-list.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _data-prepper/pipelines/configuration/processors/mutate-event.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _data-prepper/pipelines/configuration/buffers/kafka.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _data-prepper/pipelines/configuration/buffers/kafka.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _data-prepper/pipelines/configuration/buffers/kafka.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _data-prepper/pipelines/configuration/processors/mutate-event.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _data-prepper/pipelines/configuration/sources/s3.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _data-prepper/pipelines/configuration/processors/truncate.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: Kondaka <krishkdk@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit e697522)

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Krishna Kondaka <41027584+kkondaka@users.noreply.github.com>
Naarcha-AWS and kkondaka committed Mar 28, 2024
1 parent 1dc6b3a commit f72c14c
Showing 6 changed files with 176 additions and 9 deletions.
15 changes: 12 additions & 3 deletions _data-prepper/pipelines/configuration/buffers/kafka.md
@@ -41,11 +41,12 @@ Use the following configuration options with the `kafka` buffer.

Option | Required | Type | Description
--- | --- | --- | ---
`authentication` | No | [Authentication](#authentication) | Sets the authentication options for both the pipeline and Kafka. For more information, see [Authentication](#authentication).
`aws` | No | [AWS](#aws) | The AWS configuration. For more information, see [aws](#aws).
`bootstrap_servers` | Yes | String list | The host and port for the initial connection to the Kafka cluster. You can configure multiple Kafka brokers by using the IP address or the port number for each broker. When using [Amazon Managed Streaming for Apache Kafka (Amazon MSK)](https://aws.amazon.com/msk/) as your Kafka cluster, the bootstrap server information is obtained from Amazon MSK using the Amazon Resource Name (ARN) provided in the configuration.
`encryption` | No | [Encryption](#encryption) | The encryption configuration for encryption in transit. For more information, see [Encryption](#encryption).
`producer_properties` | No | [Producer Properties](#producer_properties) | A list of configurable Kafka producer properties.
`topics` | Yes | List | A list of [topics](#topic) for the buffer to use. You must supply one topic per buffer.


### topic
@@ -73,6 +74,7 @@ Option | Required | Type | Description
`retry_backoff` | No | Integer | The amount of time to wait before attempting to retry a failed request to a given topic partition. Default is `10s`.
`max_poll_interval` | No | Integer | The maximum delay between invocations of a `poll()` when using group management through Kafka's `max.poll.interval.ms` option. Default is `300s`.
`consumer_max_poll_records` | No | Integer | The maximum number of records returned in a single `poll()` call through Kafka's `max.poll.records` setting. Default is `500`.
`max_message_bytes` | No | Integer | The maximum size of the message, in bytes. Default is 1 MB.


### kms
@@ -123,6 +125,13 @@ Option | Required | Type | Description
`type` | No | String | The encryption type. Use `none` to disable encryption. Default is `ssl`.
`insecure` | No | Boolean | A Boolean flag used to turn off SSL certificate verification. If set to `true`, certificate authority (CA) certificate verification is turned off and insecure HTTP requests are sent. Default is `false`.

#### producer_properties

Use the following configuration options to configure a Kafka producer.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`max_request_size` | No | Integer | The maximum size of the request that the producer sends to Kafka. Default is 1 MB.
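
To illustrate how these options fit together, the following is a minimal sketch of a pipeline that buffers through Kafka. The broker address, topic name, and consumer group are placeholder values, and the `http` source and the `name`/`group_id` topic settings are assumptions made for this example:

```yaml
log-pipeline:
  source:
    http:
  buffer:
    kafka:
      bootstrap_servers: ["localhost:9092"]    # placeholder broker address
      encryption:
        type: none                             # disable in-transit encryption for a local test
      producer_properties:
        max_request_size: 2097152              # allow producer requests up to 2 MB
      topics:
        - name: data-prepper-buffer            # placeholder topic name
          group_id: data-prepper               # placeholder consumer group
          max_message_bytes: 2097152           # allow messages up to 2 MB
  sink:
    - stdout:
```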


#### aws

6 changes: 4 additions & 2 deletions _data-prepper/pipelines/configuration/processors/mutate-event.md
@@ -11,11 +11,13 @@ nav_order: 65
Mutate event processors allow you to modify events in Data Prepper. The following processors are available:

* [add_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/add-entries/) allows you to add entries to an event.
* [convert_entry_type]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/convert_entry_type/) allows you to convert value types in an event.
* [copy_values]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/copy-values/) allows you to copy values within an event.
* [delete_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/delete-entries/) allows you to delete entries from an event.
* [list_to_map]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/list-to-map) allows you to convert a list of objects from an event, where each object contains a `key` field, into a map of target keys.
* [map_to_list]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/map-to-list/) allows you to convert a map of objects from an event, where each object contains a `key` field, into a list of target keys.
* [rename_keys]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/rename-keys/) allows you to rename keys in an event.
* [select_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/select-entries/) allows you to select entries from an event.



51 changes: 51 additions & 0 deletions _data-prepper/pipelines/configuration/processors/select-entries.md
@@ -0,0 +1,51 @@
---
layout: default
title: select_entries
parent: Processors
grand_parent: Pipelines
nav_order: 59
---

# select_entries

The `select_entries` processor selects entries from a Data Prepper event. Only the selected entries remain in the event; all other entries are removed.

## Configuration

You can configure the `select_entries` processor using the following options.

| Option | Required | Description |
| :--- | :--- | :--- |
| `include_keys` | Yes | A list of keys to be selected from an event. |
| `select_when` | No | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"`, that is evaluated to determine whether the processor runs on the event. |

### Usage

The following example shows how to configure the `select_entries` processor in the `pipeline.yaml` file:

```yaml
pipeline:
  source:
    ...
  processor:
    - select_entries:
        entries:
          - include_keys: [ "key1", "key2" ]
            add_when: '/some_key == "test"'
  sink:
```
{% include copy.html %}

For example, when your source contains the following event record:

```json
{"message": "hello", "key1" : "value1", "key2" : "value2", "some_key" : "test"}
```

The `select_entries` processor includes only `key1` and `key2` in the processed output:

```json
{"key1": "value1", "key2": "value2"}
```
107 changes: 107 additions & 0 deletions _data-prepper/pipelines/configuration/processors/truncate.md
@@ -0,0 +1,107 @@
---
layout: default
title: truncate
parent: Processors
grand_parent: Pipelines
nav_order: 121
---

# truncate

The `truncate` processor truncates a key's value at the beginning, the end, or both ends of the value string, based on the processor's configuration. If the key's value is a list, then each string member of the list is truncated; non-string members are left unchanged. When the `truncate_when` option is provided, input is truncated only when the specified condition evaluates to `true` for the event being processed.

## Configuration

You can configure the `truncate` processor using the following options.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`entries` | Yes | List | A list of truncation entries. Each entry specifies the keys to truncate and how those keys' values are truncated.
`source_keys` | No | String list | The list of source keys that will be modified by the processor. The default value is an empty list, which indicates that all values will be truncated.
`truncate_when` | No | Conditional expression | A condition that, when met, determines when the truncate operation is performed.
`start_at` | No | Integer | The position in the string value at which to start truncation. Default is `0`, which starts truncation at the beginning of each key's value.
`length` | No | Integer | The length of the string after truncation. When not specified, truncation continues to the end of the string.

At least one of the `start_at` or `length` options must be present in the configuration in order for the `truncate` processor to run. You can define both options to further customize where truncation occurs in the string.

## Usage

The following examples show how to configure the `truncate` processor in the `pipeline.yaml` file.

## Example: Minimum configuration

The following example shows the minimum configuration for the `truncate` processor:

```yaml
pipeline:
  source:
    file:
      path: "/full/path/to/logs_json.log"
      record_type: "event"
      format: "json"
  processor:
    - truncate:
        entries:
          - source_keys: ["message1", "message2"]
            length: 5
          - source_keys: ["info"]
            length: 6
            start_at: 4
          - source_keys: ["log"]
            start_at: 5
  sink:
    - stdout:
```

For example, the following event contains several keys with string values:

```json
{"message1": "hello,world", "message2": "test message", "info": "new information", "log": "test log message"}
```

The `truncate` processor produces the following output, where:

- The `start_at` setting is `0` for the `message1` and `message2` keys, indicating that truncation will begin at the start of the string, with the string itself truncated to a length of `5`.
- The `start_at` setting is `4` for the `info` key, indicating that truncation will begin at letter `i` of the string, with the string truncated to a length of `6`.
- The `start_at` setting is `5` for the `log` key, with no length specified, indicating that truncation will begin at letter `l` of the string.

```json
{"message1":"hello", "message2":"test ", "info":"inform", "log": "log message"}
```


## Example: Using `truncate_when`

The following example configuration shows the `truncate` processor with the `truncate_when` option configured:

```yaml
pipeline:
  source:
    file:
      path: "/full/path/to/logs_json.log"
      record_type: "event"
      format: "json"
  processor:
    - truncate:
        entries:
          - source_keys: ["message"]
            length: 5
            start_at: 8
            truncate_when: '/id == 1'
  sink:
    - stdout:
```

The following example contains two events:

```json
{"message": "hello, world", "id": 1}
{"message": "hello, world,not-truncated", "id": 2}
```

When the `truncate` processor runs on the events, only the first event is truncated because the `id` key contains a value of `1`:

```json
{"message": "world", "id": 1}
{"message": "hello, world,not-truncated", "id": 2}
```
1 change: 1 addition & 0 deletions _data-prepper/pipelines/configuration/sinks/file.md
@@ -17,6 +17,7 @@ The following table describes options you can configure for the `file` sink.
Option | Required | Type | Description
:--- | :--- | :--- | :---
path | Yes | String | The path for the output file (for example, `logs/my-transformed-log.log`).
append | No | Boolean | When `true`, the sink file is opened in append mode.
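
As a minimal sketch, the following pipeline (with a hypothetical input file path) uses `append` so that the sink file is opened in append mode rather than overwritten:

```yaml
append-pipeline:
  source:
    file:
      path: "/full/path/to/input.log"        # hypothetical input file
  sink:
    - file:
        path: "logs/my-transformed-log.log"
        append: true                         # keep existing file contents and append new events
```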

## Usage

5 changes: 1 addition & 4 deletions _data-prepper/pipelines/configuration/sources/s3.md
@@ -104,11 +104,8 @@ Option | Required | Type | Description
`s3_select` | No | [s3_select](#s3_select) | The Amazon S3 Select configuration.
`scan` | No | [scan](#scan) | The S3 scan configuration.
`delete_s3_objects_on_read` | No | Boolean | When `true`, the S3 scan attempts to delete S3 objects after all events from the S3 object are successfully acknowledged by all sinks. `acknowledgments` should be enabled when deleting S3 objects. Default is `false`.
`workers` | No | Integer | Configures the number of worker threads that the source uses to read data from S3. Default is `1`, with a minimum of `1` and a maximum of `1000`. Each worker thread subscribes to Amazon SQS messages and processes them independently of the other workers. Leave this value at the default unless your S3 objects are less than 1 MB in size; performance may decrease for larger S3 objects. This setting affects only SQS-based sources.
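
As an illustrative sketch, the following SQS-based S3 source raises `workers` to process more queue messages in parallel; the queue URL, Region, and role ARN are placeholder values:

```yaml
s3-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
      workers: 4                             # four worker threads, each subscribing to SQS messages
      aws:
        region: "us-east-1"                                                     # placeholder
        sts_role_arn: "arn:aws:iam::123456789012:role/my-pipeline-role"         # placeholder
  sink:
    - stdout:
```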


## sqs
