Make use of Jena3 text index for better performance #273

Closed
osma opened this issue Aug 19, 2015 · 4 comments


osma commented Aug 19, 2015

There are significant changes (implemented by Alexis Miara and myself) in the jena-text module of Jena 3.0.0 / Fuseki 1.3.0 / Fuseki 2.3.0. These include:

  • support for storing language tags of literals and limiting queries to a specific language
  • support for storing full literal values in the index and accessing them at query time
  • support for deleting obsolete entries from the text index

These together enable a new way of using the text index from Skosmos:

  • Text queries could, in most cases, be limited to a specific language. This avoids false hits from the text index that would have to be filtered out using SPARQL, and should thus speed up queries, particularly the alphabetical display for large vocabularies.
  • Since the text index can return full literal values, there is less need to find out which literal value actually matched the query (using regular expressions or string matching functions, as is done currently). This should make text index related SPARQL queries both simpler and faster.
  • The uidField should be enabled, so that stale entries will be dropped from the text index. Currently the performance of text index related queries deteriorates slightly each time the vocabulary data is updated. This is probably due to stale entries. Cleaning them up should prevent this performance deterioration.
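To illustrate the first two points, here is a minimal sketch of what a language-limited text query could look like, written as a small Python helper that builds the SPARQL string. The helper name `build_text_query`, its parameters and the variable names `?concept`, `?score` and `?match` are hypothetical and are not taken from the Skosmos code. The sketch assumes the text index stores language tags and literal values, so that the `"lang:xx"` argument and the third output variable of `text:query` are available.

```python
def build_text_query(term, lang=None, prop="skos:prefLabel", limit=10):
    """Build a SPARQL query string that uses the jena-text text:query property function.

    Hypothetical helper for illustration only; not part of Skosmos.
    """
    # Escape backslashes and double quotes so the term stays inside the SPARQL literal.
    safe_term = term.replace("\\", "\\\\").replace('"', '\\"')

    # Arguments of text:query: the indexed property, the query string and,
    # optionally, a language restriction of the form "lang:xx".
    args = [prop, f'"{safe_term}"']
    if lang:
        args.append(f'"lang:{lang}"')

    # The third subject variable (?match) receives the stored literal value,
    # assuming the index was built with literal value storage enabled.
    return "\n".join([
        "PREFIX text: <http://jena.apache.org/text#>",
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>",
        "SELECT ?concept ?score ?match WHERE {",
        f"  (?concept ?score ?match) text:query ({' '.join(args)}) .",
        "}",
        f"LIMIT {int(limit)}",
    ])


if __name__ == "__main__":
    print(build_text_query("cat*", lang="en"))
```

With the language restriction pushed into the text index and the matched literal returned directly, the query needs neither a `FILTER(langMatches(...))` step nor a regular expression to work out which label matched.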

Text index related code in JenaTextSparql (and possibly GenericSparql) will need to be heavily rewritten. Luckily the new code should be simpler than the old one and we already have pretty good unit tests for this functionality, so it is easy to verify what works and what doesn't.

Text index configuration needs to be changed to enable the new features, and text indexes must then be rebuilt. Fuseki 1.3.0/2.3.0 will require Java 8 to be installed on servers, development machines and the Travis CI environment (where it should be available, but not used by default).
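As a rough sketch of the configuration change, the helper below renders a Turtle fragment for a jena-text index definition with the three features switched on: `text:storeValues` on the index, and `text:langField` and `text:uidField` on the entity map. The function name `render_text_index_config`, the resource names `<#indexLucene>` and `<#entMap>`, the field names `"lang"` and `"uid"`, and the index directory are all assumptions chosen for illustration. A real Fuseki configuration also needs the dataset and service definitions around this fragment.

```python
def render_text_index_config(directory, store_values=True,
                             lang_field="lang", uid_field="uid"):
    """Render a Turtle fragment for a jena-text Lucene index definition.

    Hypothetical helper for illustration; resource and field names are assumptions.
    """
    # Turtle booleans are written in lowercase.
    store = str(bool(store_values)).lower()
    return "\n".join([
        "<#indexLucene> a text:TextIndexLucene ;",
        f"    text:directory <{directory}> ;",
        f"    text:storeValues {store} ;   # keep full literal values in the index",
        "    text:entityMap <#entMap> .",
        "",
        "<#entMap> a text:EntityMap ;",
        '    text:entityField "uri" ;',
        '    text:defaultField "pref" ;',
        f'    text:langField "{lang_field}" ;   # store language tags of literals',
        f'    text:uidField "{uid_field}" ;     # allows stale entries to be deleted',
        '    text:map ( [ text:field "pref" ; text:predicate skos:prefLabel ] ) .',
    ])


if __name__ == "__main__":
    print(render_text_index_config("file:/path/to/lucene"))
```

Since existing index entries have no language or uid fields, changing these settings only takes effect after the text index has been rebuilt.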

(Finto project note: this is a way of implementing FINTO-85: Tuki hyvin suurille tietovarannoille [Support for very large data resources])


osma commented Nov 12, 2015

Started work on this in the jena3-text-index branch.

Travis tests are not currently working. Travis doesn't seem to provide an environment that supports both PHP and Java 8. See travis-ci/travis-ci#4750


osma commented Nov 12, 2015

Got the Travis tests working again by switching to the old, non-container-based Ubuntu 12.04 environment and installing Oracle Java 8 via the webupd8 repository installer. It's slow and inelegant (every test run downloads 180MB from Oracle) but works for the moment as a stopgap. We can switch to the Trusty environment once the PHP issues are fixed by Travis.


osma commented Dec 2, 2015

Merged the jena3-text-index branch to master. Still needs documentation and possible bugfixes.


osma commented Dec 7, 2015

Documented in wiki: InstallFusekiJenaText and Upgrading

@osma osma closed this as completed Dec 7, 2015