Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrading tika to 2.7.0 #93759

Merged
merged 6 commits into from
Feb 14, 2023

Conversation

joegallo
Copy link
Contributor

@joegallo joegallo commented Feb 13, 2023

Upgrading tika and its dependencies.

The most interesting change here is that juniversalchardet got a major version bump and is coming from different maven coordinates now. But that just reflects reality:

joegallo@galactic:~/Desktop/tika versions/2.6.0 $ cat META-INF/DEPENDENCIES | grep juniversalchardet
  - juniversalchardet (http://juniversalchardet.googlecode.com/) com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3
joegallo@galactic:~/Desktop/tika versions/2.6.0 $ cd ../2.7.0
joegallo@galactic:~/Desktop/tika versions/2.7.0 $ cat META-INF/DEPENDENCIES | grep juniversalchardet
  - juniversalchardet (https://github.com/albfernandez/juniversalchardet) com.github.albfernandez:juniversalchardet:jar:2.4.0

^ in spite of the above change from 2.6.0 to 2.7.0, I am not changing the juniversalchardet dependency -- both libraries use the same API, so the one is a drop-in replacement for the other, but it's less a version bump and more a complete replacement. In terms of our actual tests, we see ever so slightly worse results (in terms of passing tests) with the bumped version, so I think it's defensible to stick with the existing version. See some of the links and discussion from #93759 (comment).

I wouldn't be opposed to a subject matter expert revisiting this version bump and perhaps replacing or updating some of our tests, but I think that sounds like an independent unit of work that's orthogonal to this PR.

Follow up to #93755, related to #93608 -- while in the course of that work, I noticed that our Tika dependency is out of date (new version released last week), so I figured we should update it. This feels like a feature, though, and not a bugfix, so I'm targeting 8.8.0 only.

@joegallo joegallo added >upgrade :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v8.8.0 labels Feb 13, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @joegallo, I've created a changelog YAML for you.

@joegallo joegallo marked this pull request as draft February 13, 2023 21:50
@joegallo
Copy link
Contributor Author

joegallo commented Feb 13, 2023

Gah. Back to draft. The tests fail if I upgrade juniversalchardet.

Specifically, these two test assertions fail:

joegallo@galactic:~/Code/elastic/elasticsearch $ git diff
diff --git a/modules/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java b/modules/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
index 4907b337720..30ede102214 100644
--- a/modules/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
+++ b/modules/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
@@ -524,9 +524,9 @@ public class AttachmentProcessorTests extends ESTestCase {

         attachmentData = parseDocument("text-cjk-big5.txt", processor, Map.of("max_length", 100), true);
         assertThat(attachmentData.keySet(), containsInAnyOrder("language", "content", "content_type", "content_length"));
-        assertThat(attachmentData.get("content").toString(), containsString("碩鼠碩鼠,無食我黍!"));
+        // assertThat(attachmentData.get("content").toString(), containsString("碩鼠碩鼠,無食我黍!"));
         assertThat(attachmentData.get("content_type").toString(), containsString("text/plain"));
-        assertThat(attachmentData.get("content_type").toString(), containsString("charset=Big5"));
+        // assertThat(attachmentData.get("content_type").toString(), containsString("charset=Big5"));
         assertThat(attachmentData.get("content_length"), is(100L));

I'm not sure whether that means that juniversalchardet is worse in the upgraded version, or whether it's better but there's something wrong with our tests, or some secret third thing that I haven't thought of yet. Regardless, there's a bit more poking to do here, so this is now a WIP PR.

See also https://issues.apache.org/jira/browse/TIKA-3213 and albfernandez/juniversalchardet#34, as well as #64389 and apache/tika#862.

@joegallo joegallo added the WIP label Feb 13, 2023
@masseyke
Copy link
Member

I don't know Chinese, but the results of that one test appear far better with juniversalchardet 1.0.3 -- (1) the results of 2.4.0 have a lot more unrecognized characters and (2) google translate produces something much better with the results of 1.0.3. It looks like 2.4.0 is guessing that it's the wrong charset (GB18030 instead of Big5).

@joegallo joegallo removed the WIP label Feb 14, 2023
@joegallo joegallo marked this pull request as ready for review February 14, 2023 16:03
@elasticsearchmachine
Copy link
Collaborator

Hi @joegallo, I've created a changelog YAML for you.

@joegallo
Copy link
Contributor Author

@masseyke +1 to our discussion on this, I'm comfortable bumping tika but not juniversalchardet, and I've updated the description of this PR to reflect that. Feel free to review once Jenkins gives the 👍.

@masseyke
Copy link
Member

Sounds good to me. It looks like the older version of universalchardet makes no prediction about that file, so Tika falls back on its own Icu4JEncodingDetector, which reports 100% confidence that it is Big5. The new universalchardet supports Chinese character encodings, but for some reason it rules out Big5 and returns that it is GB18030. Tika gives precedence to universalchardet over its own detector, so it incorrectly goes with GB18030. I don't know if this is just bad luck (since we have a very small sample size) or if universalchardet is not as good at Chinese character encodings. But it makes sense to me to skip upgrading universalchardet.

@joegallo joegallo merged commit e1ed406 into elastic:main Feb 14, 2023
@joegallo joegallo deleted the ingest-attachment-tika-version-270 branch February 14, 2023 16:57
carlosdelest pushed a commit to carlosdelest/elasticsearch that referenced this pull request Feb 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team >upgrade v8.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants