Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the validation on TikaDocTests #12887

Closed
sandeshkr419 opened this issue Mar 25, 2024 · 5 comments
Closed

Improve the validation on TikaDocTests #12887

sandeshkr419 opened this issue Mar 25, 2024 · 5 comments
Assignees
Labels
API Issues with external APIs bug Something isn't working good first issue Good for newcomers ingest-pipeline v2.15.0 Issues and PRs related to version 2.15.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@sandeshkr419
Copy link
Contributor

sandeshkr419 commented Mar 25, 2024

Describe the issue

https://github.com/opensearch-project/OpenSearch/blob/main/plugins/ingest-attachment/src/test/java/org/opensearch/ingest/attachment/TikaDocTests.java#L79

This test tries to parse a bunch of different documents and asserts whether the resulting parsed content in not null/empty - that's it.

            String parsedContent = TikaImpl.parse(bytes, new Metadata(), -1);
            assertNotNull(parsedContent);
            assertFalse(parsedContent.isEmpty());
            logger.debug("extracted content: {}", parsedContent);

Any dependency change in tika libraries or else which could possibly impact the content of parsed content will be untracked and can introduce unknown bugs.

Recommending storing of parsed content at present version for future comparison by introduction of test which verifies the same. Also, comparing it with the parsed content from an older version to ensure that the content has remain unchanged from older OpenSearch versions.

Related component

Indexing

Additional Details

Plugins
ingest-attachment plugin in core.

@sandeshkr419 sandeshkr419 added bug Something isn't working untriaged labels Mar 25, 2024
@github-actions github-actions bot added the Indexing Indexing, Bulk Indexing and anything related to indexing label Mar 25, 2024
@sandeshkr419 sandeshkr419 added the good first issue Good for newcomers label Mar 25, 2024
@ankitkala ankitkala added ingest-pipeline API Issues with external APIs and removed Indexing Indexing, Bulk Indexing and anything related to indexing labels Apr 11, 2024
@peternied peternied changed the title [Weak Testing] TikaDocTests Improve the validation on TikaDocTests May 1, 2024
@peternied
Copy link
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@sandeshkr419 Thanks for creating this issue

@finnegancarroll
Copy link
Contributor

@peternied I'de like to work on this issue.

@mch2
Copy link
Member

mch2 commented May 1, 2024

thanks for picking this up @finnegancarroll. I've assigned to you.

@finnegancarroll
Copy link
Contributor

Hi @sandeshkr419! Can I pick your brain a little more regarding this issue? Is there functionality which opensearch adds to apache tika that we hope to capture with these tests? That is to say, how would you recommend I avoid redundancy with tika's own unit tests? Thanks!

@sandeshkr419
Copy link
Contributor Author

Hi @finnegancarroll!

We certainly don't want things to be redundant for sure. I have not been able to look into the tika package to check what testing has been available there.

Consider the scenario - where tika changes the end result of parsed content somehow in a later release and OpenSearch depends on the exact state of the resultant parsed content - is there a way to capture and identify it other than checking non-null value. In the present scenario, we may end up just upgrading the library because we will see no failures.

I understand this may be a futile exercise if we have other mechanisms that can capture it - somehow I am not aware of them at the moment.

Please take my comments with a pinch of salt that I don't have the expertise on this plugin. Probably we should tag in some experts of the area as well to fill in.

@reta reta added v2.15.0 Issues and PRs related to version 2.15.0 v3.0.0 Issues and PRs related to version 3.0.0 labels May 16, 2024
@reta reta closed this as completed May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Issues with external APIs bug Something isn't working good first issue Good for newcomers ingest-pipeline v2.15.0 Issues and PRs related to version 2.15.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
None yet
Development

No branches or pull requests

6 participants