Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest Attachment: Upgrade Tika to 1.18 #31252

Merged
merged 10 commits into from
Jun 24, 2018
Merged

Conversation

jdconrad
Copy link
Contributor

Fixes ES from hanging when a bad zip file is loaded through Tika.

Zip File found here: https://issues.apache.org/jira/browse/COMPRESS-432

Upgrades include
tika: 1.17 -> 1.18
commons-compress: 1.14 -> 1.16

Fixes ES from hanging when a bad zip file is loaded through Tika.
@jdconrad jdconrad added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.0.0 v6.4.0 v6.3.1 v5.6.11 labels Jun 11, 2018
@jdconrad jdconrad requested a review from jasontedor June 11, 2018 18:53
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@@ -23,7 +23,7 @@ esplugin {
}

versions << [
'tika': '1.17',
'tika': '1.18',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a note here about the discrepancy between ES's dependency on jackson, and tika's?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@@ -214,6 +214,10 @@ public void testAsciidocDocument() throws Exception {
assertThat(attachmentData.get("content_type").toString(), containsString("text/plain"));
}

public void testZipFileDoesNotHang() throws Exception {
Copy link
Contributor

@talevy talevy Jun 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here linking to the apache-issue?

so that it is clear where this "bad_tika" file came from

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@jdconrad
Copy link
Contributor Author

@talevy Thanks for the feedback. Added comments.

Copy link
Contributor

@talevy talevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if the build is happy

@jdconrad
Copy link
Contributor Author

@talevy Thanks again. Will wait for green and then commit.

@jdconrad jdconrad removed the request for review from jasontedor June 11, 2018 20:03
Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one comment about the transitive dependencies.

@@ -23,7 +23,7 @@ esplugin {
}

versions << [
'tika': '1.17',
'tika': '1.18',
'pdfbox': '2.0.8',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check if these have been bumped in tika itself as we don't pull in transitive dependencies automatically. If tika bumped its dependency versions here, we should too. The same goes for bouncycastle and poi.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jackson is at 2.9.5, but it seems like we're at 2.8.10? Is there anything I can/should do about this?

For the others --

  • Updated pdfbox 2.08 -> 2.09
  • Bouncycastle is already at 1.55 and the compile dependency is 1.54 (I assume higher is okay?)
  • poi and mime4j are already the same
  • org.tukaani:xz 1.6 -> 1.8
    *commons-io-commons-io 2.5 -> 2.6
  • org.apache.commons:commons-compress 1.14 -> 1.16.1

I only checked against the versions in the gradle file specifically for ingest-attachment. Are there any other places I need to check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Jackson one is tricky since we inherit the dependency from core and we are version locked there. I think that means we need to leave that one as-is for now. The rest look good to be but I think we should take Bouncy Castle to 1.54 since the Tika POM does not say 1.54+. I would rather play it safe.

@jdconrad
Copy link
Contributor Author

jdconrad commented Jun 11, 2018

@jasontedor Thanks for the review. Please see the response to your comment. Updated the dependencies based on maven as best I could.

@jdconrad
Copy link
Contributor Author

@jasontedor I traced the dependency tree as far as I could based on your suggestions yesterday. Here's what I came up with:

name expects / used

  • bouncycastle 1.54 / 1.55 (we use newer)
  • slf4j 1.7.24 / 1.6.2 (project setting, can't change)
  • httpclient 4.5.4 / 4.5.2 (project setting, can't change)
  • httpcore 4.4.7 / 4.4.5 (project setting, can't change)
  • commons-logging 1.2 / 1.1.3 (project setting, can't change)
  • joda-time 2.2 / 2.9.9 (project setting, can't change)
  • jackson-core 2.9.5 / 2.8.10 (project setting, can't change)
  • junit 4.10 / 4.12 (project setting, can't change)
  • commons-codec 1.9 / 1.10 (project setting, can't change)
  • randomized-test-runner 2.5.2 / 2.6 (project setting, can't change)

I think the dependencies as they are in this PR are the most up-to-date they can be now. Do you see anything that stands out to you?

@jdconrad
Copy link
Contributor Author

jdconrad commented Jun 12, 2018

@jasontedor As for the build failure - looks like it happens with pdfbox 2.0.9 when built with java 10 and run against java 8. Run against java 10 it works fine. I attempted to add the permission that seemed like it was missing, but that did not fix the issue in java 8. Pdfbox 2.0.8 seems to work in all cases. Do you have any recommendations?

Edit: Also tried adding the permission to the special tika permissions with no luck.

Edit 2: So I don't know the security manager very well, but it appears what's happening is the top level doPrivileged domain has the correct permission. As we go further down the stack a bunch of other domains get added without the appropriate permission as it seems the first domain's permissions aren't propagated. This in turn is leading to the failure. Is there a way to ensure the permission is propagated or somehow special case this one? (Note this only is happening when run against Java 8 and Java 9. Java 10 seems to be unaffected.)

Going to pause work on this for now as I need a bit of guidance to move forward.

@jdconrad
Copy link
Contributor Author

The commit that appears to be causing the issue:

apache/pdfbox@a031998

@jdconrad
Copy link
Contributor Author

jdconrad commented Jun 13, 2018

More info:

access: domain that failed ProtectionDomain  (file:/home/jdconrad/.gradle/caches/modules-2/files-2.1/org.apache.pdfbox/pdfbox/2.0.9/d0425578218624388f2ec84a0b3a11efd55df0f5/pdfbox-2.0.9.jar <no signer certificates>)
 jdk.internal.loader.ClassLoaders$AppClassLoader@1b9e1916
 <no principals>
 java.security.Permissions@63be9d18 (
 ("java.io.FilePermission" "/home/jdconrad/.gradle/caches/modules-2/files-2.1/org.apache.pdfbox/pdfbox/2.0.9/d0425578218624388f2ec84a0b3a11efd55df0f5/pdfbox-2.0.9.jar" "read")
 ("java.lang.RuntimePermission" "exitVM")
)

@jdconrad
Copy link
Contributor Author

Looks like this passes in 10 because the class doesn't exist, so it never gets far enough to check permissions during Class.forName

@rjernst
Copy link
Member

rjernst commented Jun 24, 2018

This LGTM and passes CI now (as well as local testing with java 10/8). I will commit and backport now.

@rjernst rjernst merged commit 9efb0fe into elastic:master Jun 24, 2018
rjernst pushed a commit that referenced this pull request Jun 24, 2018
Fixes ES from hanging when a bad zip file is loaded through Tika.
rjernst pushed a commit that referenced this pull request Jun 24, 2018
Fixes ES from hanging when a bad zip file is loaded through Tika.
rjernst pushed a commit that referenced this pull request Jun 24, 2018
Fixes ES from hanging when a bad zip file is loaded through Tika.
colings86 pushed a commit that referenced this pull request Jun 25, 2018
Fixes ES from hanging when a bad zip file is loaded through Tika.
jasontedor added a commit to hub-cap/elasticsearch that referenced this pull request Jun 25, 2018
* elastic/master: (92 commits)
  Reduce number of raw types warnings (elastic#31523)
  Migrate scripted metric aggregation scripts to ScriptContext design (elastic#30111)
  turn GetFieldMappingsResponse to ToXContentObject (elastic#31544)
  Close xcontent parsers (partial) (elastic#31513)
  Ingest Attachment: Upgrade Tika to 1.18 (elastic#31252)
  TEST: Correct the assertion arguments order (elastic#31540)
  Add get field mappings to High Level REST API Client (elastic#31423)
  [DOCS] Updates Watcher examples for code testing (elastic#31152)
  TEST: Add bwc recovery tests with synced-flush index
  [DOCS] Move sql to docs (elastic#31474)
  [DOCS] Move monitoring to docs folder (elastic#31477)
  Core: Combine doExecute methods in TransportAction (elastic#31517)
  IndexShard should not return null stats (elastic#31528)
  fix repository update with the same settings but different type (elastic#31458)
  Fix Mockito trying to mock IOException that isn't thrown by method (elastic#31433) (elastic#31527)
  Node selector per client rather than per request (elastic#31471)
  Core: Combine messageRecieved methods in TransportRequestHandler (elastic#31519)
  Upgrade to Lucene 7.4.0. (elastic#31529)
  [ML] Add ML filter update API (elastic#31437)
  Allow multiple unicast host providers (elastic#31509)
  ...
@jdconrad
Copy link
Contributor Author

@rjernst Thanks for finishing this!

dnhatn added a commit that referenced this pull request Jun 26, 2018
* 6.x:
  Fix broken backport of #31578 by adjusting constructor (#31587)
  ingest: Add ignore_missing property to foreach filter (#22147) (#31578)
  Add package pre-install check for java binary (#31343)
  Docs: Clarify sensitive fields watcher encryption (#31551)
  Watcher: Remove never executed code (#31135)
  Improve test times for tests using `RandomObjects::addFields` (#31556)
  Revert "Remove RestGetAllAliasesAction (#31308)"
  REST high-level client: add simulate pipeline API (#31158)
  Get Mapping API to honour allow_no_indices and ignore_unavailable (#31507)
  Fix Mockito trying to mock IOException that isn't thrown by method (#31433) (#31527)
  [Test] Add full cluster restart test for Rollup (#31533)
  Enhance thread context uniqueness assertion
  fix writeIndex evaluation for aliases (#31562)
  Add x-opaque-id to search slow logs (#31539)
  Watcher: Fix put watch action (#31524)
  [DOCS] Significantly improve SQL docs
  turn GetFieldMappingsResponse to ToXContentObject (#31544)
  TEST: Unmute testHistoryUUIDIsGenerated
  Ingest Attachment: Upgrade Tika to 1.18 (#31252)
  TEST: Correct the assertion arguments order (#31540)
dnhatn added a commit that referenced this pull request Jun 26, 2018
* master:
  ingest: Add ignore_missing property to foreach filter (#22147) (#31578)
  Fix a formatting issue in the docvalue_fields documentation. (#31563)
  reduce log level at gradle configuration time
  [TEST] Close additional clients created while running yaml tests (#31575)
  Docs: Clarify sensitive fields watcher encryption (#31551)
  Watcher: Remove never executed code (#31135)
  Add support for switching distribution for all integration tests (#30874)
  Improve robustness of geo shape parser for malformed shapes (#31449)
  QA: Create xpack yaml features (#31403)
  Improve test times for tests using `RandomObjects::addFields` (#31556)
  [Test] Add full cluster restart test for Rollup (#31533)
  Enhance thread context uniqueness assertion
  [DOCS] Fix heading format errors (#31483)
  fix writeIndex evaluation for aliases (#31562)
  Add x-opaque-id to search slow logs (#31539)
  Watcher: Fix put watch action (#31524)
  Add package pre-install check for java binary (#31343)
  Reduce number of raw types warnings (#31523)
  Migrate scripted metric aggregation scripts to ScriptContext design (#30111)
  turn GetFieldMappingsResponse to ToXContentObject (#31544)
  Close xcontent parsers (partial) (#31513)
  Ingest Attachment: Upgrade Tika to 1.18 (#31252)
  TEST: Correct the assertion arguments order (#31540)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v5.6.11 v6.3.1 v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants