-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrading tika to 2.7.0 #93759
Upgrading tika to 2.7.0 #93759
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @joegallo, I've created a changelog YAML for you. |
Gah. Back to draft. The tests fail if I upgrade Specifically, these two test assertions fail: joegallo@galactic:~/Code/elastic/elasticsearch $ git diff
diff --git a/modules/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java b/modules/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
index 4907b337720..30ede102214 100644
--- a/modules/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
+++ b/modules/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
@@ -524,9 +524,9 @@ public class AttachmentProcessorTests extends ESTestCase {
attachmentData = parseDocument("text-cjk-big5.txt", processor, Map.of("max_length", 100), true);
assertThat(attachmentData.keySet(), containsInAnyOrder("language", "content", "content_type", "content_length"));
- assertThat(attachmentData.get("content").toString(), containsString("碩鼠碩鼠,無食我黍!"));
+ // assertThat(attachmentData.get("content").toString(), containsString("碩鼠碩鼠,無食我黍!"));
assertThat(attachmentData.get("content_type").toString(), containsString("text/plain"));
- assertThat(attachmentData.get("content_type").toString(), containsString("charset=Big5"));
+ // assertThat(attachmentData.get("content_type").toString(), containsString("charset=Big5"));
assertThat(attachmentData.get("content_length"), is(100L)); I'm not sure whether that means that See also https://issues.apache.org/jira/browse/TIKA-3213 and albfernandez/juniversalchardet#34, as well as #64389 and apache/tika#862. |
I don't know Chinese, but the results of that one test appear far better with juniversalchardet 1.0.3 -- (1) the results of 2.4.0 have a lot more unrecognized characters and (2) google translate produces something much better with the results of 1.0.3. It looks like 2.4.0 is guessing that it's the wrong charset (GB18030 instead of Big5). |
Hi @joegallo, I've created a changelog YAML for you. |
@masseyke +1 to our discussion on this, I'm comfortable bumping tika but not juniversalchardet, and I've updated the description of this PR to reflect that. Feel free to review once Jenkins gives the 👍. |
Sounds good to me. It looks like the older version of universalchardet makes no prediction about that file, so Tika falls back on its own Icu4JEncodingDetector, which reports 100% confidence that it is Big5. The new universalchardet supports Chinese character encodings, but for some reason it rules out Big5 and returns that it is GB18030. Tika gives precedence to universalchardet over its own detector, so it incorrectly goes with GB18030. I don't know if this is just bad luck (since we have a very small sample size) or if universalchardet is not as good at Chinese character encodings. But it makes sense to me to skip upgrading universalchardet. |
Upgrading tika and its dependencies.
The most interesting change here is thatjuniversalchardet
got a major version bump and is coming from different maven coordinates now. But that just reflects reality:^ in spite of the above change from 2.6.0 to 2.7.0, I am not changing the
juniversalchardet
dependency -- both libraries use the same API, so the one is a drop-in replacement for the other, but it's less a version bump and more a complete replacement. In terms of our actual tests, we see ever so slightly worse results (in terms of passing tests) with the bumped version, so I think it's defensible to stick with the existing version. See some of the links and discussion from #93759 (comment).I wouldn't be opposed to a subject matter expert revisiting this version bump and perhaps replacing or updating some of our tests, but I think that sounds like an independent unit of work that's orthogonal to this PR.
Follow up to #93755, related to #93608 -- while in the course of that work, I noticed that our Tika dependency is out of date (new version released last week), so I figured we should update it. This feels like a feature, though, and not a bugfix, so I'm targeting 8.8.0 only.