Enable Boolean queries for BASE #412

Merged
merged 14 commits into from
Mar 19, 2020

Conversation

pkraker
Member

@pkraker pkraker commented Mar 13, 2020

This PR enables all Boolean queries described in the BASE Interface Guide (see section 4 on page 13) in verbatim mode. The changes have been extensively tested, but further tests would be very useful. Examples include:

  • 2019-ncov or sars-cov-2
  • (cats and dogs) or (sun and moon)
  • research -science
  • "knowledge domain visualization" or ("knowledge map" and (research or science))

@pkraker pkraker requested a review from chreman March 13, 2020 21:23
@chreman
Member

chreman commented Mar 16, 2020

Thank you! In general, this leads to much better retrieval for complex queries and allows finer tuning of search results.
The query parsing can be tricky and may require more extensive documentation so that users can get the most out of it.
Examples:

  • 2019-ncov or sars-cov-2: fails. The problem is that the hyphen (-) is interpreted as an operator (textus:2019-textus:ncov or textus:sars-textus:cov-textus:2)
  • "2019-ncov" or "sars-cov-2": success
  • "2019-ncov" and "sars-cov-2": success
  • "2019-ncov" + "sars-cov-2": fails - can be solved, see below
  • "2019-ncov"+"sars-cov-2": success (identical map to the "and" query, but with a different ID)

Likewise, (cats+dogs) or (sun+moon) is equivalent to (cats and dogs) or (sun and moon), but (cats + dogs) or (sun + moon) fails.
A quick test showed that allowing optional space characters around the + shorthand (changing the last matching group to \\s?\\+\\s?) makes the spaced form produce the same verbatim query as the unspaced one.

Additionally, it may make sense to note in the documentation that negative keywords should be added as consecutive - terms, e.g. science -research -knowledge -theory, because something like science -(research or knowledge or theory) is not parsed correctly
(-> textus:science -textus:(textus:research or textus:knowledge or textus:theory))
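To illustrate the whitespace suggestion above, here is a minimal Python sketch. It is not the project's actual regex, which is not shown in this thread; the function name `normalize_plus` is hypothetical. It only demonstrates that the pattern `\s?\+\s?` matches the + shorthand with or without one surrounding space, so both spellings can be reduced to the same string.

```python
import re

# Pattern quoted in the review: a plus sign with at most one
# whitespace character on either side.
PLUS = re.compile(r"\s?\+\s?")


def normalize_plus(query: str) -> str:
    """Hypothetical helper: collapse ' + ' and '+' to the same form."""
    return PLUS.sub("+", query)


if __name__ == "__main__":
    for q in ["(cats + dogs) or (sun + moon)", "(cats+dogs) or (sun+moon)"]:
        print(normalize_plus(q))
```

Because `\s?` matches at most one whitespace character, runs of several spaces around the + would not be fully collapsed; `\s*` would be the more permissive variant.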

@pkraker
Member Author

pkraker commented Mar 17, 2020

Thanks for the thorough review! I updated the regex to cover the problems you described. In addition, I added a preprocessing step that removes unnecessary pluses and minuses. The regex now also deals correctly with multiple parentheses and inverted commas (single and double).

Please check again. There is now a test file in /test that provides a number of test cases.
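As a rough illustration of what handling parentheses and quoted phrases involves, the following Python sketch tokenizes a query and prefixes each term with a field name. This is an assumption-laden sketch, not the regex from this PR: the names `TOKEN`, `OPERATORS`, and `add_field_prefix` are hypothetical, the `textus:` prefix is taken from the outputs quoted in this thread, and the exact form of the output for phrases is a guess. It also assumes that + shorthands have already been dealt with.

```python
import re

# Hypothetical tokenizer: double-quoted phrase, single-quoted phrase,
# a single parenthesis, or a bare term without spaces or parentheses.
TOKEN = re.compile(r'"[^"]*"|\'[^\']*\'|[()]|[^\s()]+')
OPERATORS = {"and", "or"}


def add_field_prefix(query: str, field: str = "textus") -> str:
    """Prefix every term or phrase with `field:`, leaving operators alone."""
    out = []
    for tok in TOKEN.findall(query):
        if tok in ("(", ")") or tok.lower() in OPERATORS:
            out.append(tok)  # structure and Boolean operators pass through
        elif tok.startswith("-") and len(tok) > 1:
            out.append(f"-{field}:{tok[1:]}")  # negated term: keep the minus outside
        else:
            out.append(f"{field}:{tok}")
    # Join with spaces, then remove the padding inside parentheses.
    return " ".join(out).replace("( ", "(").replace(" )", ")")


if __name__ == "__main__":
    print(add_field_prefix("research -science"))
    print(add_field_prefix('"knowledge map" and (research or science)'))
```

Treating a quoted phrase as a single token is what keeps hyphens inside it from being read as operators.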

@chreman
Member

chreman commented Mar 19, 2020

Thank you very much, I think this is a great improvement. I took the liberty of adding functionality to the test script: it now compares actual and expected results and reports any errors.
Please check that the expected results specified in the queries list are correct.
I already made one change here for the following test case:
"Processing fails for a+b . Result: textus:a+textus:b Expected: textus:a textus:b"

@pkraker
Member Author

pkraker commented Mar 19, 2020

Thanks, I introduced several changes to the preprocessing to accommodate the queries that did not produce the expected result; as a result, pluses are now always replaced by spaces. I also made sure that phrases are not altered in the preprocessing step, so that the user can force a search for the exact term. Please check again!
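The behavior described here (pluses become spaces, quoted phrases stay untouched) could look roughly like the Python sketch below. It is not the code from this PR; `PHRASE` and `replace_pluses` are hypothetical names, and the sketch assumes that phrases are delimited by matching double or single quotes.

```python
import re

# Capturing group so that re.split keeps the quoted phrases in the result.
PHRASE = re.compile(r'("[^"]*"|\'[^\']*\')')


def replace_pluses(query: str) -> str:
    """Replace '+' with a space everywhere except inside quoted phrases."""
    parts = PHRASE.split(query)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # odd indices are quoted phrases: keep verbatim
        else:
            # Outside phrases: pluses become spaces, whitespace runs collapse.
            out.append(re.sub(r"\s+", " ", part.replace("+", " ")))
    return "".join(out).strip()


if __name__ == "__main__":
    print(replace_pluses("a+b"))
    print(replace_pluses('"2019-ncov" + "sars-cov-2"'))
    print(replace_pluses('"c++" + templates'))
```

Splitting on phrases first means a plus inside quotes, as in "c++", survives preprocessing, which matches the stated goal of letting users force an exact-term search.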

@chreman
Member

chreman commented Mar 19, 2020

Thank you for the expansion, all tests are now successful. In my view, this is good to go.
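For readers unfamiliar with this kind of check, the comparison between actual and expected results mentioned earlier in the thread can be sketched in Python as follows. The real test script lives in /test and is not reproduced here; `run_cases` and `naive_process` are hypothetical, and `naive_process` is a deliberately simplistic stand-in that reproduces the a+b mismatch quoted above.

```python
import re
from typing import Callable, List, Tuple


def run_cases(process: Callable[[str], str],
              cases: List[Tuple[str, str]]) -> List[str]:
    """Run each (query, expected) pair and collect mismatch messages."""
    errors = []
    for query, expected in cases:
        result = process(query)
        if result != expected:
            errors.append(
                f"Processing fails for {query}. "
                f"Result: {result} Expected: {expected}"
            )
    return errors


def naive_process(query: str) -> str:
    """Stand-in processor: prefixes every run of non-space, non-plus
    characters with 'textus:' and leaves '+' in place."""
    return re.sub(r"[^\s+]+", lambda m: "textus:" + m.group(0), query)


if __name__ == "__main__":
    cases = [("a b", "textus:a textus:b"), ("a+b", "textus:a textus:b")]
    for message in run_cases(naive_process, cases):
        print(message)
```

Passing the processing function as a parameter keeps the harness independent of how the query rewriting is implemented.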

@chreman chreman merged commit 87a3c2f into master Mar 19, 2020
@pkraker pkraker deleted the base-boolean-queries branch March 31, 2020 13:46