Execute a SPARQL basic federated query as if it was a constant to be … #365

Draft
wants to merge 2 commits into
base: version4

JervenBolleman
Contributor

…propagated like a values clause

@JervenBolleman
Contributor Author

So this is the most basic idea regarding federated queries: at parse time we invoke the basic federated query and use its results to populate a BindingSetAssignment, which is then fed into the normal query machinery.

This won't scale well, as these result sets can be humongous and lead to all kinds of out-of-resource exceptions.
I would like to continue working on this, but would like some feedback on how best to proceed next.
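
To illustrate the idea (not the PR's actual code), here is a minimal sketch of what treating the remote results "like a values clause" amounts to: the remote solutions are materialized in memory and joined with local solutions on their shared variables. Plain `Map<String, String>` objects stand in for RDF4J `BindingSet`s, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ValuesJoinSketch {

    // Two solution mappings are compatible when they agree on every shared variable.
    static boolean compatible(Map<String, String> a, Map<String, String> b) {
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }

    // Nested-loop join over fully materialized inputs: fine for a sketch,
    // but memory and time grow with the size of the remote result set.
    static List<Map<String, String>> join(List<Map<String, String>> remote,
                                          List<Map<String, String>> local) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> r : remote) {
            for (Map<String, String> l : local) {
                if (compatible(r, l)) {
                    Map<String, String> merged = new LinkedHashMap<>(r);
                    merged.putAll(l);
                    out.add(merged);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Pretend these came back from the SERVICE endpoint at parse time.
        List<Map<String, String>> remote = List.of(
            Map.of("ex", "http://example.org/ours/1", "type", "http://example.org/T1"),
            Map.of("ex", "http://example.org/ours/2", "type", "http://example.org/T2"));
        // Pretend these are the local solutions for the rest of the query.
        List<Map<String, String>> local = List.of(
            Map.of("ex", "http://example.org/ours/1"));
        System.out.println(join(remote, local));
    }
}
```

Because everything is held in memory before the join starts, a large remote result set translates directly into the resource problems mentioned above.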

…s. Fix some implementation issues, regarding ordering consistency
@JervenBolleman
Contributor Author

So the next step would be to still execute the remote query in advance, but at a different point: insert the remote query results into a temporary table and join on that.

Then, for selected databases, add a function that performs the federated query itself, which would be faster in the common cases.

Either way, this looks like a bit of a layer violation, so I would love to discuss how best to go about it in the Ontop team's opinion.

}
}
}
Set<BindingSet> bs = new LinkedHashSet<>();
Member

Shouldn't it be a List instead of a Set?
In principle, the SPARQL subquery could return duplicates.

Contributor Author

Good question. I always assumed it should be a set because the sparql-11-federation mentions "the multiset of solution mappings corresponding to the results of executing query".

Let me confirm that for you.

Member

Multiset is another name for bag; it accepts duplicates (unlike sets).
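
A small sketch of the difference, using plain `Map<String, String>` values as stand-ins for `BindingSet`s (the class name and sample data are made up for illustration): collecting the remote solutions into a `LinkedHashSet` silently drops a duplicate solution, while a `List` keeps the multiset intact.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MultisetDemo {

    // Stand-in for the solutions returned by the remote endpoint;
    // the first two solution mappings are identical on purpose.
    static List<Map<String, String>> remoteResults() {
        return List.of(
            Map.of("ex", "http://example.org/ours/1"),
            Map.of("ex", "http://example.org/ours/1"),
            Map.of("ex", "http://example.org/ours/2"));
    }

    // Set semantics: equal solution mappings collapse into one.
    static int sizeAsSet() {
        Set<Map<String, String>> set = new LinkedHashSet<>(remoteResults());
        return set.size();
    }

    // Bag (multiset) semantics: duplicates are preserved.
    static int sizeAsList() {
        List<Map<String, String>> list = new ArrayList<>(remoteResults());
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println("set: " + sizeAsSet() + ", list: " + sizeAsList());
        // prints "set: 2, list: 3"
    }
}
```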

@bcogrel
Member

bcogrel commented Nov 13, 2020

Hi Jerven,

Thanks for sharing this interesting work!

The first solution makes sense to me as a first implementation. At the moment, the BindingSetAssignment will be translated into a large union, which is obviously not very efficient. However, we started to work on expanding our internal algebra to an in-memory table (called ValuesNode, see https://github.com/ontop/ontop/tree/feature/values-node), which should help.

In terms of integration, I think I would prefer to have it disabled by default, so as to make sure the endpoint administrator is well aware of the presence of this feature. As you said, this feature could have a significant impact on the performance of the endpoint and on its dimensioning. We could return an explicit error message explaining how to turn on this feature to users issuing a query with a SERVICE clause. We would also need to set up a query timeout when specified.
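
As a rough sketch of such an opt-in switch (the property keys below are hypothetical and are not existing Ontop settings), a guard could reject SERVICE clauses with an explanatory message unless the administrator has enabled the feature, and expose an optional timeout:

```java
import java.time.Duration;
import java.util.Optional;
import java.util.Properties;

public class ServiceClauseGuard {

    // Hypothetical property names, chosen for illustration only.
    static final String ENABLED_KEY = "example.federation.enabled";
    static final String TIMEOUT_KEY = "example.federation.timeoutSeconds";

    // Throws with an explicit message when the feature is off (the default).
    static void checkAllowed(Properties settings) {
        boolean enabled = Boolean.parseBoolean(settings.getProperty(ENABLED_KEY, "false"));
        if (!enabled) {
            throw new UnsupportedOperationException(
                "SERVICE clauses are disabled on this endpoint; set "
                    + ENABLED_KEY + "=true to enable them.");
        }
    }

    // Returns the configured timeout, if the administrator specified one.
    static Optional<Duration> timeout(Properties settings) {
        String raw = settings.getProperty(TIMEOUT_KEY);
        if (raw == null) {
            return Optional.empty();
        }
        return Optional.of(Duration.ofSeconds(Long.parseLong(raw.trim())));
    }

    public static void main(String[] args) {
        Properties settings = new Properties();
        settings.setProperty(ENABLED_KEY, "true");
        settings.setProperty(TIMEOUT_KEY, "30");
        checkAllowed(settings);
        System.out.println("timeout: " + timeout(settings));
    }
}
```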

The second solution based on a temporary table is interesting but definitely challenging. Here are a few points to consider:

  1. Would it be possible to have a DB like PostgreSQL fetching the results from the SPARQL subquery on its own?
  2. Ontop decomposes most of the time joins over IRIs into joins over primary keys, which is essential for performance (and for triggering other optimizations). If the IRI strings are directly stored in the temporary table, we won't be able to take advantage of that.
  3. Ontop extensively uses the information about IRI templates to prune the query so as to perform joins only over compatible templates. Here, the risk is that the subquery will return IRIs coming from heterogeneous templates, which could not be anticipated.
  4. Ontop has typically only a read-only access to the DB. A priori, it should probably have the right to create temporary tables, but we need to check.

So yes, definitely, there is a bit of layer violations. I am curious to see what we can expect to get in terms of performance.

Best,
Benjamin

@JervenBolleman
Contributor Author

> Hi Jerven,
>
> Thanks for sharing this interesting work!
>
> The first solution makes sense to me as a first implementation. At the moment, the BindingSetAssignment will be translated into a large union, which is obviously not very efficient. However, we started to work on expanding our internal algebra to an in-memory table (called ValuesNode, see https://github.com/ontop/ontop/tree/feature/values-node), which should help.
>
> In terms of integration, I think I would prefer to have it disabled by default, so as to make sure the endpoint administrator is well aware of the presence of this feature. As you said, this feature could have a significant impact on the performance of the endpoint and on its dimensioning. We could return an explicit error message explaining how to turn on this feature to users issuing a query with a SERVICE clause. We would also need to set up a query timeout when specified.

I agree. Also, SERVICE SILENT has a slightly different implementation requirement.

> The second solution based on a temporary table is interesting but definitely challenging. Here are a few points to consider:
>
> 1. Would it be possible to have a DB like PostgreSQL fetching the results from the SPARQL subquery on its own?

Yes. PostgreSQL can make REST/HTTP calls from stored procedures, so we should be able to do the same to run a SPARQL query.
I suspect that, if the right PL/language extension is installed, it could be done in an inline function.
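
Whatever language ends up making the call, the request itself is a plain HTTP GET as defined by the SPARQL 1.1 Protocol. The sketch below only shows the shape of that request, written in Java rather than a PostgreSQL procedural language; the class name and the endpoint URL are placeholders, and the request is built but not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class SparqlHttpRequestSketch {

    // Builds a SPARQL Protocol "query via GET" request: the query text goes
    // into the URL-encoded "query" parameter and results are requested as JSON.
    static HttpRequest build(String endpoint, String query, Duration timeout) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(endpoint + "?query=" + encoded))
            .header("Accept", "application/sparql-results+json")
            .timeout(timeout)
            .GET()
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build(
            "http://example/sparql",               // placeholder endpoint
            "SELECT ?ex WHERE { ?ex a ?type }",
            Duration.ofSeconds(30));
        System.out.println(request.method() + " " + request.uri());
    }
}
```

A stored procedure would need to issue the equivalent request and then parse the returned result set into rows.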

> 2. Ontop decomposes most of the time joins over IRIs into joins over primary keys, which is essential for performance (and for triggering other optimizations). If the IRI strings are directly stored in the temporary table, we won't be able to take advantage of that.

I think there are ways around this. We know which IRI patterns need to match in the next result, so we could build the temporary table accordingly.

For example, suppose we run a federated query like this:

...
WHERE
{
  SERVICE <http://example/sparql> {
    ?ex a ?type .
  }
  ?ex a ex:OurType .
}

On our side, we have a mapping that says:

[] rr:subjectMap [ rr:template "http://example.org/ours/{id}" ; rr:class ex:OurType ] .

We can generate a temporary table with three columns: the first holds the result, the second says whether it matches the template (as a function/virtual column), and the third holds the decomposed template value (also as a function/virtual column).

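| ex                           | ex_matches_template | ex_without_template |
|------------------------------|---------------------|---------------------|
| http://example.org/ours/1    | true                | 1                   |
| http://example.com/ours/lala | false               | null                |
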
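
The two derived columns could be computed along these lines. This is only a sketch in Java of the matching logic (in the proposal it would live in the database as a function or virtual column); it assumes a template with a single `{placeholder}` whose value contains no slash, and the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IriTemplateSketch {

    // Turns a template such as "http://example.org/ours/{id}" into a regex
    // with one capture group. Assumes exactly one {placeholder}.
    static Pattern compile(String template) {
        int open = template.indexOf('{');
        int close = template.indexOf('}', open);
        String prefix = template.substring(0, open);
        String suffix = template.substring(close + 1);
        return Pattern.compile(Pattern.quote(prefix) + "([^/]+)" + Pattern.quote(suffix));
    }

    // Returns the placeholder value when the IRI matches the template,
    // or an empty Optional when it does not (the "null" in the third column).
    static Optional<String> extract(Pattern template, String iri) {
        Matcher m = template.matcher(iri);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        Pattern ours = compile("http://example.org/ours/{id}");
        for (String iri : new String[] {
                "http://example.org/ours/1",
                "http://example.com/ours/lala"}) {
            Optional<String> id = extract(ours, iri);
            System.out.println(iri + " " + id.isPresent() + " " + id.orElse(null));
        }
    }
}
```

With the sample data, the first IRI matches and yields `1`, while the second comes from a different host and does not match.
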
> 3. Ontop extensively uses the information about IRI templates to prune the query so as to perform joins only over compatible templates. Here, the risk is that the subquery will return IRIs coming from heterogeneous templates, which could not be anticipated.

Indeed. Probably ways around it, but would require some experimentation.

> 4. Ontop has typically only a read-only access to the DB. A priori, it should probably have the right to create temporary tables, but we need to check.

I think going for the stored procedures will be more successful (performance-wise and stability-wise).

> So yes, definitely, there is a bit of layer violations. I am curious to see what we can expect to get in terms of performance.

Basic query federation is never great ;) but often sufficient. And better than nothing.

> Best,
> Benjamin

Regards,
Jerven

@bcogrel
Member

bcogrel commented Nov 13, 2020

OK, I see better now, thanks.

The second solution seems feasible but quite involved.

If I understand correctly, Ontop would propagate the structural constraints coming from the mapping, such as the IRI templates, to the SPARQL subqueries and their corresponding temporary tables.
As for the stored procedures, they would be independent from the input SPARQL queries, am I right?

Ease of deployment would be, in my view, a crucial aspect for the success of this solution. I have limited experience with stored procedures, so let's see how it goes.

Best,
Benjamin

@JervenBolleman
Contributor Author

JervenBolleman commented Nov 18, 2020

FYI, I won't have time to work on this for quite a while (last week was the Elixir European Biohackathon), but calling a SPARQL endpoint from a PostgreSQL function would rely on the basic HTTP/REST call idea shown in this Stack Overflow answer.

@JervenBolleman
Contributor Author

I have not had time to work on this, and it looks unlikely that I will :(
Still, I wanted to drop a note about using SPARQL within a PostgreSQL procedure, as I spotted an implementation:
https://github.com/lacanoid/pgsparql/
