Ingest Attachment: Upgrade Tika to 1.18 #31252

Merged
merged 10 commits into from
Jun 24, 2018
11 changes: 7 additions & 4 deletions plugins/ingest-attachment/build.gradle
@@ -23,7 +23,7 @@ esplugin {
}

versions << [
- 'tika': '1.17',
+ 'tika': '1.18',
Contributor

Should we add a note here about the discrepancy between ES's dependency on Jackson and Tika's?

Contributor Author

Added.

'pdfbox': '2.0.8',
Member

I think we need to check if these have been bumped in tika itself as we don't pull in transitive dependencies automatically. If tika bumped its dependency versions here, we should too. The same goes for bouncycastle and poi.

Contributor Author

Jackson is at 2.9.5 in Tika, but it seems like we're at 2.8.10. Is there anything I can or should do about this?

For the others:

  • Updated pdfbox 2.0.8 -> 2.0.9
  • Bouncycastle is already at 1.55 here, while Tika's compile dependency is 1.54 (I assume higher is okay?)
  • poi and mime4j are already the same
  • org.tukaani:xz 1.6 -> 1.8
  • commons-io:commons-io 2.5 -> 2.6
  • org.apache.commons:commons-compress 1.14 -> 1.16.1

I only checked against the versions in the gradle file specifically for ingest-attachment. Are there any other places I need to check?

Member

The Jackson one is tricky since we inherit the dependency from core and we are version locked there. I think that means we need to leave that one as-is for now. The rest look good to me, but I think we should take Bouncy Castle to 1.54 since the Tika POM does not say 1.54+. I would rather play it safe.
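The check discussed in this thread (compare each version the plugin pins against the version Tika declares, and leave version-locked dependencies alone) can be sketched in a few lines of Java. This is only an illustration: `VersionDriftCheck` and `findDrift` are hypothetical names, not part of the Elasticsearch build, and the version numbers are the ones quoted in the comments above, entered by hand rather than read from any POM.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Elasticsearch build.
final class VersionDriftCheck {

    /**
     * Returns the dependencies whose pinned version differs from the version
     * the upstream project declares. Dependencies that are not pinned, or that
     * are listed in versionLocked, are skipped. Values read "pinned -> declared".
     */
    static Map<String, String> findDrift(Map<String, String> pinned,
                                         Map<String, String> declared,
                                         Set<String> versionLocked) {
        Map<String, String> drift = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : declared.entrySet()) {
            String name = entry.getKey();
            String ours = pinned.get(name);
            if (ours == null || versionLocked.contains(name)) {
                continue; // not pinned by the plugin, or deliberately left as-is
            }
            // Exact match only: a higher pinned version still counts as drift.
            if (!ours.equals(entry.getValue())) {
                drift.put(name, ours + " -> " + entry.getValue());
            }
        }
        return drift;
    }

    public static void main(String[] args) {
        // Versions pinned by the plugin, as quoted in the review thread.
        Map<String, String> pinned = new LinkedHashMap<>();
        pinned.put("pdfbox", "2.0.8");
        pinned.put("bouncycastle", "1.55");
        pinned.put("xz", "1.6");
        pinned.put("commons-io", "2.5");
        pinned.put("jackson", "2.8.10");

        // Versions Tika 1.18 is reported to use, as quoted in the review thread.
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("pdfbox", "2.0.9");
        declared.put("bouncycastle", "1.54");
        declared.put("xz", "1.8");
        declared.put("commons-io", "2.6");
        declared.put("jackson", "2.9.5");

        // Jackson is inherited from core and version locked, so it is excluded.
        Set<String> versionLocked = Collections.singleton("jackson");
        System.out.println(findDrift(pinned, declared, versionLocked));
    }
}
```

The comparison is deliberately an exact match, mirroring the "play it safe" position on Bouncy Castle: a pinned 1.55 against a declared 1.54 is reported rather than assumed compatible.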

'bouncycastle': '1.55',
'poi': '3.17',
@@ -62,7 +62,7 @@ dependencies {
// MS Office
compile "org.apache.poi:poi-scratchpad:${versions.poi}"
// Apple iWork
- compile 'org.apache.commons:commons-compress:1.14'
+ compile 'org.apache.commons:commons-compress:1.16'
// Outlook documents
compile "org.apache.james:apache-mime4j-core:${versions.mime4j}"
compile "org.apache.james:apache-mime4j-dom:${versions.mime4j}"
@@ -118,6 +118,10 @@ thirdPartyAudit.excludes = [
'com.drew.metadata.jpeg.JpegDirectory',
'com.github.junrar.Archive',
'com.github.junrar.rarfile.FileHeader',
+ 'com.github.luben.zstd.ZstdInputStream',
+ 'com.github.luben.zstd.ZstdOutputStream',
+ 'com.github.openjson.JSONArray',
+ 'com.github.openjson.JSONObject',
'com.google.common.reflect.TypeToken',
'com.google.gson.Gson',
'com.googlecode.mp4parser.DataSource',
@@ -531,6 +535,7 @@ thirdPartyAudit.excludes = [
'org.apache.commons.exec.PumpStreamHandler',
'org.apache.commons.exec.environment.EnvironmentUtils',
'org.apache.commons.lang.StringUtils',
+ 'org.apache.commons.lang.SystemUtils',
'org.apache.ctakes.typesystem.type.refsem.UmlsConcept',
'org.apache.ctakes.typesystem.type.textsem.IdentifiedAnnotation',
'org.apache.cxf.jaxrs.client.WebClient',
@@ -635,8 +640,6 @@ thirdPartyAudit.excludes = [
'org.etsi.uri.x01903.v13.impl.UnsignedSignaturePropertiesTypeImpl$1SignatureTimeStampList',
'org.etsi.uri.x01903.v14.ValidationDataType$Factory',
'org.etsi.uri.x01903.v14.ValidationDataType',
- 'org.json.JSONArray',
- 'org.json.JSONObject',
'org.json.simple.JSONArray',
'org.json.simple.JSONObject',
'org.json.simple.parser.JSONParser',

This file was deleted.

@@ -0,0 +1 @@
2d874b2ecf9de74437edcfbd5138b168e9ca0d14
1 change: 0 additions & 1 deletion plugins/ingest-attachment/licenses/tika-core-1.17.jar.sha1

This file was deleted.

1 change: 1 addition & 0 deletions plugins/ingest-attachment/licenses/tika-core-1.18.jar.sha1
@@ -0,0 +1 @@
69556697de96cf0b22df846e970dafd29866eee0

This file was deleted.

@@ -0,0 +1 @@
7d9b6dea91d783165f3313d320d3aaaa9a4dfc13
@@ -214,6 +214,10 @@ public void testAsciidocDocument() throws Exception {
assertThat(attachmentData.get("content_type").toString(), containsString("text/plain"));
}

+ public void testZipFileDoesNotHang() throws Exception {
Contributor

@talevy Jun 11, 2018

Can you add a comment here linking to the Apache issue, so that it is clear where this "bad_tika" file came from?

Contributor Author

Added.

+ expectThrows(Exception.class, () -> parseDocument("bad_tika.zip", processor));
+ }

public void testParseAsBytesArray() throws Exception {
String path = "/org/elasticsearch/ingest/attachment/test/sample-files/text-in-english.txt";
byte[] bytes;
Binary file not shown.
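
The name `testZipFileDoesNotHang` suggests the concern is a parse call that never returns. As a minimal sketch of how a caller could bound such a call, the snippet below runs a task on a separate thread and waits for it with a timeout, using only `java.util.concurrent`. `BoundedCall` and `callWithTimeout` are hypothetical names, the two tasks in `main` are stand-ins rather than real Tika calls, and nothing here is taken from the plugin or its test suite.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of the plugin or its tests.
final class BoundedCall {

    /**
     * Runs the task on a separate thread and waits at most timeoutMillis.
     * Throws TimeoutException if the task has not finished in time, and
     * ExecutionException (wrapping the cause) if the task itself failed.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(task);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Interrupts the worker thread if it is still running.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that rejects a malformed file quickly.
        Callable<String> failsFast = () -> {
            throw new IllegalStateException("malformed archive");
        };
        try {
            callWithTimeout(failsFast, 1000);
        } catch (ExecutionException e) {
            System.out.println("failed fast: " + e.getCause().getMessage());
        }

        // Stand-in for a parse that hangs.
        Callable<String> hangs = () -> {
            Thread.sleep(10_000);
            return "never returned in time";
        };
        try {
            callWithTimeout(hangs, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

One caveat with this approach: `shutdownNow()` only interrupts the worker thread, so a task that ignores interruption would keep running in the background even after the caller has given up on it.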