Execute a SPARQL basic federated query as if it was a constant to be … #365

JervenBolleman · 2020-11-10T19:58:01Z

…propegated like a values clause

JervenBolleman · 2020-11-10T20:00:38Z

So this is the most basic idea regarding federated queries. At parse time we invoke the basic federated query and use its results to populate a BindingSetAssignment, which is then fed into the normal query machinery.

This won't scale well, as these result sets can be humongous and lead to all kinds of out-of-resource exceptions.
I would like to continue working on this, but would first like some feedback on how best to proceed.
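
A minimal sketch of the materialization step described above, with a row cap added as one possible guard against the resource problem. This is not the code in this PR: the class name, the `Map<String, String>` stand-in for a binding set, and the `maxRows` parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Ontop or RDF4J. */
final class ServiceResultMaterializer {

    /** Thrown when the remote result is larger than the configured cap. */
    static final class TooManyRowsException extends RuntimeException {
        TooManyRowsException(int maxRows) {
            super("SERVICE result exceeds the cap of " + maxRows + " rows");
        }
    }

    /**
     * Drains the remote solutions into an in-memory list (duplicates are kept),
     * failing fast as soon as a row beyond maxRows shows up.
     * Each Map is a simplified stand-in for one binding set.
     */
    static List<Map<String, String>> materialize(Iterator<Map<String, String>> remote, int maxRows) {
        List<Map<String, String>> rows = new ArrayList<>();
        while (remote.hasNext()) {
            if (rows.size() >= maxRows) {
                throw new TooManyRowsException(maxRows);
            }
            rows.add(remote.next());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, String>> remote = List.of(
                Map.of("ex", "http://example.org/ours/1"),
                Map.of("ex", "http://example.org/ours/2"));
        System.out.println(materialize(remote.iterator(), 10).size());
    }
}
```

The materialized rows would then play the role of the values that populate the BindingSetAssignment; a cap only turns an out-of-memory failure into a controlled one, it does not make the approach scale.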

…s. Fix some implementation issues, regarding ordering consistency

JervenBolleman · 2020-11-11T09:18:08Z

So the next step would be to still execute this in advance, but at a different point: insert the remote query results into a temporary table and join on that.

Then, for selected databases, add a function that performs the federated query inside the database, which would be faster in the common cases.

Either way this looks like a bit of a layer violation, so I would love to discuss how best to go about it in the Ontop team's opinion.

bcogrel · 2020-11-13T07:11:01Z

.../inf/ontop/answering/reformulation/input/translation/impl/RDF4JInputQueryTranslatorImpl.java

+ }
+ }
+ }
+ Set<BindingSet> bs = new LinkedHashSet<>();


Shouldn't it be a List instead of a Set?
In principle, the SPARQL subquery could return duplicates.

JervenBolleman · 2020-11-13

Good question. I always assumed it should be a set because the sparql-11-federation spec mentions "the multiset of solution mappings corresponding to the results of executing query".

Let me confirm that for you.

bcogrel · 2020-11-13

Multiset is another name for bag: it accepts duplicates (unlike a set).
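
To make the difference concrete, here is a small sketch using plain strings as stand-ins for binding sets (the class and method names are made up for illustration). Collecting remote rows into a `LinkedHashSet` silently collapses duplicates, while a `List` preserves the bag semantics.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustration only: strings stand in for BindingSet instances. */
final class BagVersusSet {

    /** Collects rows into an insertion-ordered set: duplicates collapse. */
    static Set<String> asSet(List<String> rows) {
        return new LinkedHashSet<>(rows);
    }

    /** Collects rows into a list: duplicates survive, order is kept. */
    static List<String> asBag(List<String> rows) {
        return new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        List<String> remoteRows = List.of("ex:a", "ex:a", "ex:b");
        System.out.println(asSet(remoteRows).size()); // 2
        System.out.println(asBag(remoteRows).size()); // 3
    }
}
```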

bcogrel · 2020-11-13T08:05:11Z

Hi Jerven,

Thanks for sharing this interesting work!

The first solution makes sense to me as a first implementation. At the moment, the BindingSetAssignment will be translated into a large union, which is obviously not very efficient. However, we started to work on expanding our internal algebra to an in-memory table (called ValuesNode, see https://github.com/ontop/ontop/tree/feature/values-node), which should help.

In terms of integration, I think I would prefer to have it disabled by default, so as to make sure the endpoint administrator is well aware of the presence of this feature. As you said, this feature could have a significant impact on the performance of the endpoint and on its dimensioning. We could return an explicit error message explaining how to turn on this feature to users issuing a query with a SERVICE clause. We would also need to set up a query timeout when specified.
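
As a rough sketch of what "disabled by default, explicit error, timeout" could look like: the property key below is hypothetical (it is not an existing Ontop setting), and the timeout wrapper uses only the JDK's `ExecutorService`.

```java
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical guard, not part of Ontop. */
final class ServiceClauseGuard {

    // Hypothetical property key, chosen for this example only.
    static final String ENABLE_KEY = "example.federation.enableService";

    /** Rejects SERVICE clauses unless the administrator has opted in. */
    static void checkEnabled(Properties settings) {
        boolean enabled = Boolean.parseBoolean(settings.getProperty(ENABLE_KEY, "false"));
        if (!enabled) {
            throw new UnsupportedOperationException(
                    "SERVICE clauses are disabled on this endpoint; set "
                            + ENABLE_KEY + "=true to enable them");
        }
    }

    /** Runs the remote call and gives up once the timeout has elapsed. */
    static <T> T callWithTimeout(Callable<T> remoteCall, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(remoteCall);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            executor.shutdownNow(); // interrupts the call if it is still running
        }
    }

    public static void main(String[] args) throws Exception {
        Properties settings = new Properties();
        settings.setProperty(ENABLE_KEY, "true");
        checkEnabled(settings);
        System.out.println(callWithTimeout(() -> "remote rows", 1_000));
    }
}
```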

The second solution based on a temporary table is interesting but definitely challenging. Here are a few points to consider:

1. Would it be possible to have a DB like PostgreSQL fetching the results from the SPARQL subquery on its own?
2. Ontop decomposes most of the time joins over IRIs into joins over primary keys, which is essential for performance (and for triggering other optimizations). If the IRI strings are directly stored in the temporary table, we won't be able to take advantage of that.
3. Ontop extensively uses the information about IRI templates to prune the query so as to perform joins only over compatible templates. Here, the risk is that the subquery will return IRIs coming from heterogeneous templates, which could not be anticipated.
4. Ontop has typically only a read-only access to the DB. A priori, it should probably have the right to create temporary tables, but we need to check.

So yes, definitely, there is a bit of layer violations. I am curious to see what we can expect to get in terms of performance.

Best,
Benjamin

JervenBolleman · 2020-11-13T09:21:27Z

> Hi Jerven,
>
> Thanks for sharing this interesting work!
>
> The first solution makes sense to me as a first implementation. At the moment, the BindingSetAssignment will be translated into a large union, which is obviously not very efficient. However, we started to work on expanding our internal algebra to an in-memory table (called ValuesNode, see https://github.com/ontop/ontop/tree/feature/values-node), which should help.
>
> In terms of integration, I think I would prefer to have it disabled by default, so as to make sure the endpoint administrator is well aware of the presence of this feature. As you said, this feature could have a significant impact on the performance of the endpoint and on its dimensioning. We could return an explicit error message explaining how to turn on this feature to users issuing a query with a SERVICE clause. We would also need to set up a query timeout when specified.

I agree. Also, SERVICE SILENT has slightly different implementation requirements.

> The second solution based on a temporary table is interesting but definitely challenging. Here are a few points to consider:
> 1. Would it be possible to have a DB like PostgreSQL fetching the results from the SPARQL subquery on its own?

Yes. PostgreSQL can make REST/HTTP calls from stored procedures, so we should be able to do the same to fetch the results of a SPARQL query.
I suspect that, if the right PL language extension is installed, it could even be done in an inline function.

> 2. Ontop decomposes most of the time joins over IRIs into joins over primary keys, which is essential for performance (and for triggering other optimizations). If the IRI strings are directly stored in the temporary table, we won't be able to take advantage of that.

I think there are ways around this. We know which IRI patterns need to match in the next result, so we could build the temporary table accordingly.

For example, suppose we run a federated query like this:

...
WHERE
{
  SERVICE <http://example/sparql> {
    ?ex a ?type .
  }
  ?ex a ex:OurType .
}

and we have a mapping on our side that says:

[] rr:subjectMap [ rr:template "http://example.org/ours/{id}" ; rr:class ex:OurType ] .

We can then generate a temp table with three columns: the first holds the result, the second says whether it matches the template (as a function or virtual column), and the third holds the value extracted from the template (also as a function or virtual column).

ex	ex_matches_template	ex_without_template
http://example.org/ours/1	true	1
http://example.com/ours/lala	false	null
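
The two derived columns can be sketched as a pair of functions over the IRI string. The following is only an illustration in Java of what such database functions would compute; the `IriTemplate` class is hypothetical, supports a single `{placeholder}`, and assumes the placeholder value never contains a `/`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: one-placeholder IRI template, e.g. "http://example.org/ours/{id}". */
final class IriTemplate {
    private final Pattern pattern;

    IriTemplate(String template) {
        int open = template.indexOf('{');
        int close = template.indexOf('}', open);
        if (open < 0 || close < 0) {
            throw new IllegalArgumentException("template needs one {placeholder}");
        }
        String prefix = template.substring(0, open);
        String suffix = template.substring(close + 1);
        // Literal parts are quoted; the placeholder captures one slash-free segment (assumption).
        this.pattern = Pattern.compile(Pattern.quote(prefix) + "([^/]+)" + Pattern.quote(suffix));
    }

    /** Counterpart of the "matches template" column. */
    boolean matches(String iri) {
        return pattern.matcher(iri).matches();
    }

    /** Counterpart of the "without template" column; empty when the IRI does not match. */
    Optional<String> extract(String iri) {
        Matcher m = pattern.matcher(iri);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        IriTemplate ours = new IriTemplate("http://example.org/ours/{id}");
        for (String iri : new String[] {"http://example.org/ours/1", "http://example.com/ours/lala"}) {
            System.out.println(iri + "\t" + ours.matches(iri) + "\t" + ours.extract(iri).orElse(null));
        }
    }
}
```

With the extracted value available as its own column, a join could in principle target the primary key rather than the full IRI string, which is the point of the third column.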

> 3. Ontop extensively uses the information about IRI templates to prune the query so as to perform joins only over compatible templates. Here, the risk is that the subquery will return IRIs coming from heterogeneous templates, which could not be anticipated.

Indeed. There are probably ways around it, but that would require some experimentation.

> 4. Ontop has typically only a read-only access to the DB. A priori, it should probably have the right to create temporary tables, but we need to check.

I think going for stored procedures will be more successful (performance-wise and stability-wise).

> So yes, definitely, there is a bit of layer violations. I am curious to see what we can expect to get in terms of performance.

Basic query federation is never great ;) but often sufficient. And better than nothing.

> Best,
> Benjamin

Regards,
Jerven

bcogrel · 2020-11-13T14:38:54Z

OK, I see it better now, thanks.

The second solution seems feasible but quite involved.

If I understand correctly, Ontop would propagate the structural constraints coming from the mapping, such as the IRI templates, to the SPARQL subqueries and their corresponding temporary tables.
As for the stored procedures, they would be independent of the input SPARQL queries, am I right?

Ease of deployment would, in my view, be a crucial aspect for the success of this solution. I have limited experience with stored procedures, so let's see how it goes.

Best,
Benjamin

JervenBolleman · 2020-11-18T16:04:15Z

FYI: I won't have time to work on this for quite a while (last week was Elixir European Biohackathon), but calling a SPARQL endpoint from a function in PostgreSQL would rely on the basic HTTP/REST call idea shown in this Stack Overflow answer
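
For reference, the HTTP side of that idea is small. The sketch below shows it in Java rather than in a PostgreSQL procedure: it only builds a SPARQL 1.1 Protocol "query via GET" request with the JDK's `java.net.http` API and does not send it. The class name is made up, and the endpoint URL is the placeholder used earlier in this thread.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, for illustration only. */
final class SparqlHttpCall {

    /** Builds a GET request carrying the query in the "query" parameter; sending is left to the caller. */
    static HttpRequest buildQueryRequest(String endpoint, String sparql) {
        // URLEncoder emits '+' for spaces (a literal '+' becomes %2B), so this swap is safe.
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8).replace("+", "%20");
        return HttpRequest.newBuilder(URI.create(endpoint + "?query=" + encoded))
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildQueryRequest(
                "http://example/sparql", "SELECT ?ex ?type WHERE { ?ex a ?type }");
        System.out.println(request.uri());
    }
}
```

A database-side function would need to do the equivalent: URL-encode the query, issue the GET, and turn the returned result rows into tuples.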

JervenBolleman · 2023-01-30T08:54:00Z

I have not had time to work on this, and it looks unlikely that I will :(
Still, I wanted to drop a note about using SPARQL within a PostgreSQL procedure, as I spotted an implementation:
https://github.com/lacanoid/pgsparql/

Execute a SPARQL basic federated query as if it was a constant to be …

9e4bf0c

…propegated like a values clause

Use the localhost test instance to test basic federated sparql querie…

6ab7f8f

…s. Fix some implementation issues, regarding ordering consistency
