Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ingest-attachment support for per document indexed_chars limit #31352

Merged
merged 3 commits into from
Jun 16, 2018

Conversation

dadoonet
Copy link
Member

@dadoonet dadoonet commented Jun 15, 2018

Note: this is a backport of #28977 which does not require a review.
As I forgot to backport it, I just like to double check that CI will be happy.

I'll update the PR if CI is unhappy and then will ask for a review.


We today support a global indexed_chars processor parameter. But in some cases, users would like to set this limit depending on the document itself.
It used to be supported in mapper-attachments plugin by extracting the limit value from a meta field in the document sent to indexation process.

We add an option which reads this limit value from the document itself
by adding a setting named indexed_chars_field.

Which allows running:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars_field" : "size"
      }
    }
  ]
}

Then index either:

PUT index/doc/1?pipeline=attachment
{
  "data": "BASE64"
}

Which will use the default value (or the one defined by indexed_chars)

Or

PUT index/doc/2?pipeline=attachment
{
  "data": "BASE64",
  "size": 1000
}

Backport of #28977 in 6.x branch (6.4.0)

We today support a global `indexed_chars` processor parameter. But in some cases, users would like to set this limit depending on the document itself.
It used to be supported in mapper-attachments plugin by extracting the limit value from a meta field in the document sent to indexation process.

We add an option which reads this limit value from the document itself
by adding a setting named `indexed_chars_field`.

Which allows running:

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars_field" : "size"
      }
    }
  ]
}
```

Then index either:

```
PUT index/doc/1?pipeline=attachment
{
  "data": "BASE64"
}
```

Which will use the default value (or the one defined by `indexed_chars`)

Or

```
PUT index/doc/2?pipeline=attachment
{
  "data": "BASE64",
  "size": 1000
}
```

Backport of elastic#28977 in 6.x branch (6.4.0)
@dadoonet dadoonet self-assigned this Jun 15, 2018
@colings86 colings86 added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Jun 15, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@martijnvg
Copy link
Member

@dadoonet The CI build failed because if a checkstyle violation: [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java:32:8: Unused import - java.io.IOException. [UnusedImports]

@dadoonet dadoonet merged commit 2df907c into elastic:6.x Jun 16, 2018
@dadoonet dadoonet deleted the backport/28977-6x branch June 16, 2018 03:35
dnhatn added a commit that referenced this pull request Jun 19, 2018
* 6.x:
  Add get stored script and delete stored script to high level REST API
  Increasing skip version for failing test on 6.x
  Skip get_alias tests for 5.x (#31397)
  Fix defaults in GeoShapeFieldMapper output (#31302)
  Test: better error message on failure
  Mute DefaultShardsIT#testDefaultShards test
  Fix reference to XContentBuilder.string() (#31337)
  [DOCS] Adds monitoring breaking change (#31369)
  [DOCS] Adds security breaking change (#31375)
  [DOCS] Backports breaking change (#31373)
  RestAPI: Reject forcemerge requests with a body (#30792)
  Docs: Use the default distribution to test docs (#31251)
  Use system context for cluster state update tasks (#31241)
  [DOCS] Adds testing for security APIs (#31345)
  [DOCS] Removes ML item from release highlights
  [DOCS] Removes breaking change (#31376)
  REST high-level client: add validate query API (#31077)
  Move language analyzers from server to analysis-common module. (#31300)
  Expose lucene's RemoveDuplicatesTokenFilter (#31275)
  [Test] Fix :example-plugins:rest-handler on Windows
  Delete typos in SAML docs (#31199)
  Ensure we don't use a remote profile if cluster name matches (#31331)
  Test: Skip alias tests that failed all weekend
  [DOCS] Fix version in SQL JDBC Maven template
  [DOCS] Improve install and setup section for SQL JDBC
  Add ingest-attachment support for per document `indexed_chars` limit (#31352)
  SQL: Fix rest endpoint names in node stats (#31371)
  [DOCS] Fixes small issue in release notes
  Support for remote path in reindex api Closes #22913
  [ML] Put ML filter API response should contain the filter (#31362)
  Remove trial status info from start trial doc (#31365)
  [DOCS] Added links in breaking changes pages
  [DOCS] Adds links to release notes and highlights
  Docs: Document changes in rest client
  QA: Fix tribe tests to use node selector
  REST Client: NodeSelector for node attributes (#31296)
  LLClient: Fix assertion on windows
  LLClient: Support host selection (#30523)
  Add QA project and fixture based test for discovery-ec2 plugin (#31107)
  [ML] Hold ML filter items in sorted set (#31338)
  [ML] Add description to ML filters (#31330)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v6.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants