FILTER keyword #129

Open
rogerlucena opened this issue Aug 7, 2020 · 1 comment · Fixed by #149 or #150

rogerlucena commented Aug 7, 2020

It would be very useful for BQL to have a FILTER keyword that lets us filter out part of a query's results at a level closer to the storage (closer to the driver), improving performance.

Some of the functionality this FILTER keyword could support:

  1. Keeping only the triples with immutable predicates in the query result.

The syntax for that could be something like:

FILTER isImmutable(?p)

It would come inside or after the WHERE clause, specifying that the predicate bound to ?p in the query must be immutable.

Another function, isTemporal, could work similarly.

This could address what was asked in issue #115.

  2. Keeping only the triples with the latest time anchor in the query result.

It is common to be interested only in the latest triple of a time series. Instead of fetching all the triples, sorting them by time anchor in decreasing order, and limiting the result to 1, as one might do today, we could use FILTER to do that much more cheaply, with a syntax like:

FILTER latest(?p)

This is a common use case, already highlighted by issues #86 and #85.

It also opens the possibility of supporting the opposite: keeping only the earliest time anchor, as illustrated below.

FILTER earliest(?p)

One could also opt for a syntax that uses the time binding directly for filtering, as in:

FILTER latest(?date)

  3. Allowing regular expressions for matching.

We could use regex for filtering too. The syntax could be something like:

FILTER match(?obj, "ab+"^^type:text)

This also relates to what was asked in issue #122.
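
A minimal sketch of what such a match could do, using Go's standard `regexp` package. The `matchFilter` helper is a hypothetical name for this example, and treating the object as a plain string is an assumption. Note that Go's `MatchString` is unanchored, so whether the proposed `match` should match anywhere in the text or the whole text is a design choice the issue leaves open.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFilter keeps the objects whose text matches pattern.
// Go's regexp matching is unanchored, so "ab+" matches anywhere in the text.
func matchFilter(objs []string, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid FILTER pattern %q: %w", pattern, err)
	}
	var out []string
	for _, o := range objs {
		if re.MatchString(o) {
			out = append(out, o)
		}
	}
	return out, nil
}

func main() {
	got, err := matchFilter([]string{"abb", "ac", "xab"}, "ab+")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(got) // [abb xab]
}
```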

  4. Filtering to satisfy comparisons (evaluated as boolean conditions).

For example:

FILTER greaterThan(?obj, "37"^^type:int64)

The functions lowerThan and equal would work analogously.
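
The three comparisons could share one evaluation routine, as in the sketch below. The `Comparison` type and the `compare` function are hypothetical names for this example, and restricting the operands to `int64` is a simplifying assumption (the proposal uses a `type:int64` literal in its example).

```go
package main

import "fmt"

// Comparison names a hypothetical comparison function usable in FILTER.
type Comparison string

const (
	GreaterThan Comparison = "greaterThan"
	LowerThan   Comparison = "lowerThan"
	Equal       Comparison = "equal"
)

// compare evaluates op(obj, value) as a boolean condition.
func compare(op Comparison, obj, value int64) (bool, error) {
	switch op {
	case GreaterThan:
		return obj > value, nil
	case LowerThan:
		return obj < value, nil
	case Equal:
		return obj == value, nil
	}
	return false, fmt.Errorf("unknown comparison %q", op)
}

func main() {
	// Roughly: FILTER greaterThan(?obj, "37"^^type:int64)
	for _, obj := range []int64{12, 37, 73} {
		ok, _ := compare(GreaterThan, obj, 37)
		fmt.Println(obj, ok)
	}
}
```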

  5. Filtering to satisfy a combination of functions.

For example, one could write something like:

FILTER latest(lowerThan(?date, 2005-01-02T15:04:05.999999999Z07:00))

This would return only the latest element of a time series while also restricting the time interval to be before a given date.

Another approach would be to build a function like:

FILTER latestBeforeUpperBound(?date, 2005-01-02T15:04:05.999999999Z07:00)

  6. Others.

Other ideas for filtering functions could be along the lines of:

FILTER isToday(?date)

This would compare a binding with a value computed at runtime (the current day in this example).

These are just some examples. The FILTER keyword could make room for other functionality in the future, as we discover new use cases and implement them as filtering functions (just like isImmutable and latest above).

The idea is for all FILTER functions to have a signature like the one below:

FILTER myFunction(?binding, <value>)

The <value> argument above is optional: some functions do not need it (isImmutable, for example).

This way, adding a new function requires no changes to the parser or to lookupOptions (defined in storage.go), which communicates with the driver. All FILTER functions should be mapped to three variables there: operation, field (for the binding or its position in the clause), and value.
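
One way to picture this mapping is sketched below. Everything in it is hypothetical: the `FilterOptions` struct, the `Operation` constants, and the `supportedFilters` table are illustration-only names and do not describe the actual contents of storage.go. The point is only that a new function becomes one new table entry, with the three variables carried uniformly.

```go
package main

import "fmt"

// Operation identifies a hypothetical FILTER function at the driver level.
type Operation int

const (
	OpIsImmutable Operation = iota
	OpLatest
	OpGreaterThan
)

// FilterOptions is a hypothetical carrier for the three variables named in
// the proposal: operation, field, and value.
type FilterOptions struct {
	Operation Operation
	Field     string // the binding, e.g. "?p"
	Value     string // optional; empty when the function takes no value
}

// supportedFilters maps function names to operations. Supporting a new
// function means adding one entry here.
var supportedFilters = map[string]Operation{
	"isImmutable": OpIsImmutable,
	"latest":      OpLatest,
	"greaterThan": OpGreaterThan,
}

// newFilterOptions builds the options for `FILTER function(binding, value)`.
func newFilterOptions(function, binding, value string) (FilterOptions, error) {
	op, ok := supportedFilters[function]
	if !ok {
		return FilterOptions{}, fmt.Errorf("unsupported FILTER function %q", function)
	}
	return FilterOptions{Operation: op, Field: binding, Value: value}, nil
}

func main() {
	fo, err := newFilterOptions("greaterThan", "?obj", `"37"^^type:int64`)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", fo)
}
```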

For other general ideas, one could take inspiration from SPARQL's FILTER keyword.

N.B.: Note how this FILTER keyword differs from HAVING: FILTER would work closer to the storage/driver level to improve query performance while filtering results, whereas HAVING works on aggregated data at a higher level, farther from the driver (as when using functions such as sum and count in HAVING conditions).

@rogerlucena
Contributor Author

At the moment, BadWolf already supports the following FILTER functions:
