Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a Blaze-PostgreSQL adapter #1543

Open
davidmscholz opened this issue Mar 11, 2024 · 8 comments
Open

Implement a Blaze-PostgreSQL adapter #1543

davidmscholz opened this issue Mar 11, 2024 · 8 comments
Assignees
Labels
question Further information is requested

Comments

@davidmscholz
Copy link

davidmscholz commented Mar 11, 2024

As stated in the documentation (https://github.com/samply/blaze/blob/master/docs/architecture.md), it should be possible to use PostgreSQL as the Resource Store for Blaze.

I think having that option would be very beneficial to us if it would enable users to perform SQL queries on the resources in order to more easily assess data quality in their Blaze instance.

On the other hand, using PostgreSQL as the Resource Store might come with decreased CQL-query performance. I imagine the performance hit wouldn't be too large for us since the datasets we (CCP) use are typically not very large (< 1GB) and are not expected to grow very rapidly.

Could we implement a Blaze-PostgreSQL adapter and compare the CQL-query performances on different datasets between two instances of Blaze? One instance would use RocksDB as the Resource Store as a gold standard, and the other would use PostgreSQL as the Resource Store.

@alexanderkiel
Copy link
Member

Hi David,

it is perfectly possible to implement a Postgres resource store adapter. Blaze is designed to support more than one implementation. For example, the distributed storage variant uses Cassandra as resource store.

However I would not use the resource store for queries without the index. Let me explain that. In the resource store, blaze stores versions of resources by their content hash. That means that a resource with an update appears twice in the resource store and there is no way to tell which version is the current one. That information is only available in the index.

For data quality queries, I would suggest to use CQL directly. I have some examples in the CQL Documentation. More in depth data quality queries should be possible with CQL.

Another way, I could imagine is to implement SQL on FHIR in Blaze.

So at the end, I don't see a performance problem using Postgres as resource store, but the way Blaze stores resources in the resource store is not applicable for queries.

@alexanderkiel alexanderkiel added the question Further information is requested label Mar 11, 2024
@alexanderkiel alexanderkiel self-assigned this Mar 11, 2024
@davidmscholz
Copy link
Author

davidmscholz commented Mar 11, 2024

Hi Alex, I didn't know about SQL on FHIR. That sounds definitely useful! However, would that mean that we would need to duplicate the Blaze contents in order to be able to analyze them using standard SQL tools?

If so wouldn't it be possible to "just" store some version-counter or timestamp together with the hash and FHIR resource upon insert when using Postgres as Resource Store?

@alexanderkiel
Copy link
Member

SQL on FHIR should be implemented on top of the Blaze storage architecture, the same as I did with CQL. So no duplicate storage.

Why isn't CQL an option for the data quality queries?

@davidmscholz
Copy link
Author

davidmscholz commented Mar 12, 2024

Very nice.
SQL is just what people are used to and the official CQL docs aren't great to be honest.

@alexanderkiel
Copy link
Member

Yes ok, got the point with SQL. But implementing SQL on FHIR will not go fast, even if you like to do it. So can I help you with solving your data quality queries with CQL?

@davidmscholz
Copy link
Author

Yes, I can imagine that implementing SQL on FHIR will not be accomplished overnight. Thank you for offering to help with the CQL queries. I have built a workaround solution for the time being. That was mainly motivated by other limitations that have nothing to do with Blaze. My main goal is to make the raw data more accessible to our users (bridgehead responsibles) so that they can more easily perform exploratory analyses themselves.

@alexanderkiel
Copy link
Member

I would try to tech people CQL. We already did one such event at the TMF last autumn and we plan a new event for this year. So I would be interested in examples of data quality queries we can use for the lectures. In my opinion CQL has advantages over SQL for FHIR. It's also available in HAPI and Firefly Server. So I would suggest not to ignore it.

@davidmscholz
Copy link
Author

I'd be interested in such a CQL event. Especially if it was possible to participate remotely since I currently don't have much time for travel. I will try to write down the most important questions and examples and send them to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants