
Implementation of DICOMs3Client based on DICOMwebProtocol #57

Open · hackermd opened this issue Nov 16, 2021 · 27 comments
Labels: enhancement (New feature or request)

@hackermd (Collaborator)

@pieper @chafey As discussed in #56, it would be great to implement a DICOMs3Client based on DICOMwebProtocol.

In general, I think we should consider the following aspects for the design of an S3 client:

  1. Index of metadata for efficient query: DICOM attributes needed for response payloads of DICOMweb Search Transaction (Study Resource, Series Resource, and Instance Resource)
  2. Offset to pixel data and individual frames for efficient frame retrieval: for natively encoded Pixel Data, the client can just compute the offset of individual frames, but for an encapsulated Pixel Data element, the client would require the Basic Offset Table or Extended Offset Table
  3. Read and write: DICOMweb Store Transaction

(1) I've been thinking that we could put the metadata into the S3 Object Metadata. The PUT request header is limited to 8 KB in size, but that should be sufficient for the metadata. This could also include the offset to the Pixel Data, Float Pixel Data, or Double Float Pixel Data element (see 2).
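As a rough illustration, here is a minimal sketch (assuming boto3; the bucket name, key layout, and metadata keys are all hypothetical) of attaching selected DICOM attributes as user-defined S3 object metadata at upload time:

```python
import boto3

s3 = boto3.client("s3")
with open("instance.dcm", "rb") as fp:
    s3.put_object(
        Bucket="my-dicom-bucket",
        Key="studies/1.2.3/series/4.5.6/instances/7.8.9.dcm",
        Body=fp,
        # User-defined metadata is sent as x-amz-meta-* request headers
        # and therefore counts against the request-header size limit.
        Metadata={
            "studyinstanceuid": "1.2.3",
            "seriesinstanceuid": "4.5.6",
            "sopinstanceuid": "7.8.9",
            "pixeldataoffset": "5134",  # byte offset of the Pixel Data element
        },
    )
```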

(2) If the Pixel Data element contains a Basic Offset Table item, then the client could just read its value and cache the offsets locally. Otherwise the client would have to determine the offset of frame items by reading the header of each frame item. That could result in a lot of HTTP calls, so ideally we would make sure that the Basic Offset Table gets included into image data sets.
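Once the frame offsets are known (from the Basic Offset Table or a local cache), each frame could then be fetched with a single ranged read; a minimal sketch assuming boto3, with illustrative names:

```python
import boto3

s3 = boto3.client("s3")

def retrieve_frame(bucket: str, key: str, offset: int, length: int) -> bytes:
    """Fetch a single frame item via an HTTP Range request (inclusive bounds)."""
    response = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return response["Body"].read()
```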

(3) To support writes (including parallel writes), we shouldn't store any aggregate study- or series-level information such as Number of Study Related Series in the object metadata, but dynamically compute it client-side.
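For instance, Number of Study Related Series could be derived on the fly from an instance-level index; a sketch, where the `index` structure (SOP Instance UID mapped to an attribute dict) is hypothetical:

```python
from collections import defaultdict

def number_of_study_related_series(index: dict) -> dict:
    """Count distinct series per study instead of storing the aggregate."""
    series_per_study = defaultdict(set)
    for record in index.values():
        series_per_study[record["StudyInstanceUID"]].add(
            record["SeriesInstanceUID"]
        )
    return {uid: len(series) for uid, series in series_per_study.items()}
```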

What do you think?

@hackermd (Collaborator, Author)

cc @cgorman @CPBridge

@chafey commented Nov 16, 2021

It isn't clear to me what you are trying to do here - can you ELI5? An architecture diagram would help (sorry I am not familiar with this project at all)

@chafey commented Nov 16, 2021

As @pieper mentioned, I am working on a DICOMweb based archive format as well as a serverless DICOMweb implementation. Here are some links:
https://github.com/chafey/dicomweb-archive-direct
https://github.com/chafey/dicomp10-to-dicomweb-js
https://github.com/chafey/dicomweb-dump

@ntenenz (Contributor) commented Nov 16, 2021

@chafey, (AFAIK) this would be a Python client for fetching data from a serverless DICOMweb implementation (rather than specifying paths directly).

@chafey commented Nov 16, 2021

@ntenenz I am still confused - this is a client library that speaks DICOMweb - shouldn't it work as-is with any DICOMweb implementation, regardless of its architecture? I feel like I am missing something here.

@ntenenz (Contributor) commented Nov 16, 2021

I don't have enough context on the proposal to answer your question intelligently (does it match the existing DICOMweb spec and simply leverage a different form of implementation via static S3 blobs?). @pieper referenced it here: #56 (comment), so perhaps he's best positioned to discuss what he's thinking?

@chafey commented Nov 16, 2021

The concept is still coming together, but the general idea is to leverage object storage as much as possible for scalability. Storing pregenerated DICOMweb responses in object storage enables many access use cases, with serverless compute (Lambda, DynamoDB, Object Lambda) layered on to implement the remaining DICOMweb functionality. This has been a regular topic of the Monday morning meetings: https://github.com/chafey/medical-imaging-community

@ntenenz (Contributor) commented Nov 16, 2021

I've only been able to attend a couple of meetings thus far due to a standing conflict. That being said, if the proposal ends up being API-compatible with the existing DICOMweb standard (i.e., a new method of implementation), then the existing DICOMwebClient would be sufficient. If not, however, a new client would be required.

@chafey commented Nov 16, 2021

In terms of this specific issue/request, I suppose another approach could be to implement the existing API, but have it directly use AWS (S3, maybe other services). I am not sure if this is really worth it, though, as we should be able to get most if not all of DICOMweb implemented using serverless as described above.

@pieper (Member) commented Nov 16, 2021

I think the pre-computed metadata works for archives (rarely changing data, like IDC where we plan monthly versions).

@hackermd mentioned he wanted to use this client in a dynamic environment and was concerned about keeping static content in sync.

@chafey commented Nov 16, 2021

We do want to trigger rebuilding of the metadata any time the source data changes, so that shouldn't be a problem. Another option is to store DICOM P10 in S3 with WADO-URI paths and then implement DICOMweb using serverless on top of it. It wouldn't be as fast, but it should be just as scalable.

@hackermd (Collaborator, Author)

@chafey the DICOMwebClient class is a Python client for interfacing with a DICOMweb server over HTTP. The new DICOMfileClient allows users to access DICOM data sets that are stored locally as DICOM Part 10 files via the same Python API (DICOMwebProtocol). As a next step, I would like to extend this concept to DICOM data sets stored in an S3 bucket.
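To make the intended symmetry concrete (DICOMs3Client does not exist yet; the URL values are placeholders, and import paths may vary by package version):

```python
from dicomweb_client.api import DICOMwebClient, DICOMfileClient

web_client = DICOMwebClient(url="https://dicomweb.example.org")
file_client = DICOMfileClient(url="file:///data/dicom")
# s3_client = DICOMs3Client(url="s3://my-dicom-bucket")  # proposed, not yet implemented

# All clients would expose the same DICOMwebProtocol methods:
studies = file_client.search_for_studies()
```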

@chafey commented Nov 16, 2021

Maybe you could start by taking a similar approach: just use the S3 ListObjects API and parse each object to build the master list. After that is done, you can figure out how to optimize it (e.g., cache offsets/tags in the metadata as you suggested, using an object lambda?).
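A minimal sketch of that bootstrap, assuming boto3 and pydicom (note that this reads each object in full; a ranged read of just the header portion would avoid downloading bulk data):

```python
import io

import boto3
import pydicom

s3 = boto3.client("s3")

def build_index(bucket: str) -> dict:
    """Map SOP Instance UID to object key and hierarchy UIDs."""
    index = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            # stop_before_pixels skips parsing (not downloading) the Pixel Data
            ds = pydicom.dcmread(io.BytesIO(body), stop_before_pixels=True)
            index[ds.SOPInstanceUID] = {
                "key": obj["Key"],
                "StudyInstanceUID": ds.StudyInstanceUID,
                "SeriesInstanceUID": ds.SeriesInstanceUID,
            }
    return index
```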

@hackermd (Collaborator, Author) commented Nov 16, 2021

> Maybe you could start by taking a similar approach: just use the S3 ListObjects API and parse each object to build the master list. After that is done, you can figure out how to optimize it (e.g., cache offsets/tags in the metadata as you suggested, using an object lambda?).

The main question to me is to what extent the data should be pre-indexed versus indexed dynamically by the client.

The DICOMfileClient internally uses a SQLite database to index the DICOM metadata and caches the index on disk along with the files. This indexing operation is potentially expensive (and is skipped by default), but the advantage is that the DICOMfileClient does not depend on a particular directory structure or file naming convention. In that sense it is truly serverless.

@hackermd (Collaborator, Author)

I was hoping that we could provide the same functionality for S3 buckets via the DICOMs3Client, i.e., allow a Python program to point it to any bucket that contains DICOM data sets and then query, retrieve, and store data via the Python DICOMwebProtocol interface. The DICOMs3Client would use the S3 API under the covers, but expose the data using DICOMweb semantics.

@chafey commented Nov 16, 2021

Better to pre-index it, since S3 is a shared resource and building the index is expensive. Ideally this architecture would be portable to other languages, so an index that is universally understandable would be good. Alternatively, you could package SQLite in a WASM module and create bindings for each language. Just thinking out loud here...

@chafey commented Nov 16, 2021

Also - if you have < 100,000 studies, you could simply store the contents of the index in a JSON file and build the indexes on the fly in memory.
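A sketch of that approach, with the file layout assumed to follow the DICOM JSON model (tag 00100020 is Patient ID; the file name is illustrative):

```python
import json

# Load the whole study-level index once at startup.
with open("index.json") as fp:
    studies = json.load(fp)  # a list of DICOM JSON attribute dicts

def search_for_studies(patient_id: str) -> list:
    """Linear scan; adequate when the index fits comfortably in memory."""
    return [
        study for study in studies
        if study.get("00100020", {}).get("Value") == [patient_id]
    ]
```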

@hackermd (Collaborator, Author)

> Better to pre-index it, since S3 is a shared resource and building the index is expensive. Ideally this architecture would be portable to other languages, so an index that is universally understandable would be good. Alternatively, you could package SQLite in a WASM module and create bindings for each language. Just thinking out loud here...

In principle, the DICOM standard already provides such indices via DICOMDIR files. They are not pleasant to work with, but are very compact and already supported by most DICOM libraries.

The Study Resource, Series Resource, and Instance Resource are (more or less) a JSON representation of DICOMDIR files.

Either way, I agree that it would be highly desirable to be able to share the indices between applications. We could just add methods to DICOMfileClient and DICOMs3Client to marshal and unmarshal the index to/from JSON (or binary DICOMDIR).

That would even be very useful for DICOMweb origin servers, because they could then also just be pointed to an S3 bucket and could import the index into their database without having to re-import all the data!
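Usage might look like the following (both method names are purely illustrative; neither exists in dicomweb-client today, and DICOMs3Client itself is the proposal under discussion):

```python
# Build the index once with DICOMfileClient, then hand it to other consumers.
index_json = file_client.export_index()  # hypothetical: serialize index to JSON
s3_client.import_index(index_json)       # hypothetical: reuse without re-indexing
```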

@ntenenz (Contributor) commented Nov 17, 2021

@chafey, could you elaborate on the motivation (either here or in Monday's meeting) for not leveraging a database for metadata? I completely understand how serving of pixel data could fully utilize a PACS's resources. Is the assumption that QIDO requests (could) do the same?

@chafey commented Nov 17, 2021

Databases are typically the performance and scalability choke point for medical image archives. You want to avoid them when possible and minimize them when you can't avoid them completely. For smallish archives (< 100,000 studies), they are not needed at all as the entire data set can easily fit in memory and you can implement the queries using simple scans (e.g. iterate over the series in a study to implement QIDO-RS Study's Series queries) or an in-memory index (e.g. a map of patient IDs to implement QIDO-RS Search for Studies queries). Any queries that are constrained to a single study can be done entirely in memory without a database. The only time you need an index is when doing study-level queries for large archives, and something like Redis or NoSQL is a far better option than an RDBMS.
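For example, the in-memory index for QIDO-RS Search for Studies could be as simple as a dictionary built once at startup; a sketch, where the record layout is illustrative:

```python
from collections import defaultdict
from typing import Iterable, Mapping

def index_studies_by_patient(records: Iterable[Mapping]) -> dict:
    """Map Patient ID to the set of Study Instance UIDs, built once at startup."""
    studies = defaultdict(set)
    for record in records:
        studies[record["PatientID"]].add(record["StudyInstanceUID"])
    return studies

# A Search for Studies query constrained by Patient ID is then a dict lookup:
# index_studies_by_patient(records).get("PID-1234", set())
```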

@hackermd (Collaborator, Author) commented Nov 17, 2021

> For smallish archives (< 100,000 studies), they are not needed at all as the entire data set can easily fit in memory

That is a big assumption. In pathology, not even a single study may fit in memory!

Or do you just refer to the indexed metadata?

@chafey commented Nov 17, 2021

> That is a big assumption. In pathology, not even a single study may fit in memory!
>
> Or do you just refer to the indexed metadata?

I am just referring to metadata

@ntenenz (Contributor) commented Nov 17, 2021

> Databases are typically the performance and scalability choke point for medical image archives. You want to avoid them when possible and minimize them when you can't avoid them completely. For smallish archives (< 100,000 studies), they are not needed at all as the entire data set can easily fit in memory and you can implement the queries using simple scans (e.g. iterate over the series in a study to implement QIDO-RS Study's Series queries) or an in-memory index (e.g. a map of patient IDs to implement QIDO-RS Search for Studies queries). Any queries that are constrained to a single study can be done entirely in memory without a database. The only time you need an index is when doing study-level queries for large archives, and something like Redis or NoSQL is a far better option than an RDBMS.

A few questions:

  • Is the DB bottleneck from QIDO-RS queries or from the metadata needed for series/study retrieval? If it's the latter, would this bottleneck disappear if the data were pre-computed?
  • Does this bottleneck remain in a scalable, sharded DB or is this something that is limited to a more vanilla clinical deployment?
  • If the metadata were held in memory client side, how would the client become aware of the latest studies? Synchronization seems like it could be tough.

I'm wondering if an easier solution would be to lean on a vanilla datastore, either a database or Redis, for DICOM metadata (e.g., QIDO-RS queries) and pre-compute the instances/series/studies (or a subset, with lambdas for aggregation/de-aggregation). I'd be curious to see whether any clinical or research workflow could strain such a setup.

@chafey commented Nov 17, 2021

Many PACS systems store everything they need to respond to QIDO-RS/CFIND in an RDBMS. This is problematic because the DB becomes very large and also a chokepoint for data ingestion (each SOP Instance is an insert, possibly causing rebalancing of indexes). The strategy I have used to reach incredibly high scale is to store the metadata on disk and read from it to handle queries below the study level - this can be done via pre-generation of metadata as we are discussing with the dicomweb-static project. When you take that approach, you just need to provide indexes for study-level attributes (patient name, patient ID, study date, etc.) and can use any technology you want (RDBMS, NoSQL, Redis, etc.) as the problem is much simpler.

The strategy is independent of DB technology - object storage is always cheaper than a DB and performs well enough for the target use cases. The other advantage is that you can do all sorts of analytics on the storage tier with map-reduce without having to ETL/sync it to another reporting DB.

Synchronization from server to client is a solved problem - there are many solutions for this (basically use pub/sub to distribute data store changes to clients over WebSockets and such)

@hackermd (Collaborator, Author)

> Many PACS systems store everything they need to respond to QIDO-RS/CFIND in an RDBMS. This is problematic because the DB becomes very large and also a chokepoint for data ingestion (each SOP Instance is an insert, possibly causing rebalancing of indexes). The strategy I have used to reach incredibly high scale is to store the metadata on disk and read from it to handle queries below the study level - this can be done via pre-generation of metadata as we are discussing with the dicomweb-static project.

I am fully on board with storing instance metadata as documents and keeping those in memory for fast queries. We've run into several issues with RDBMS-based implementations when storing many instances via the DICOMweb Store Transaction.

The main question to me is whether those documents should be cached (relative to the full instances) and whether other applications (in particular serverless clients such as the DICOMs3Client) should be able to leverage them, too. That's basically what DICOMDIR files of the DICOM File Service are intended for. Unfortunately, they are a pain to work with and difficult to update (binary offsets), and I am wondering whether we want the DICOMweb response JSON documents to serve as a DICOMDIR alternative.

@ntenenz (Contributor) commented Nov 17, 2021

> Many PACS systems store everything they need to respond to QIDO-RS/CFIND in an RDBMS. This is problematic because the DB becomes very large and also a chokepoint for data ingestion (each SOP Instance is an insert, possibly causing rebalancing of indexes). The strategy I have used to reach incredibly high scale is to store the metadata on disk and read from it to handle queries below the study level - this can be done via pre-generation of metadata as we are discussing with the dicomweb-static project. When you take that approach, you just need to provide indexes for study-level attributes (patient name, patient ID, study date, etc.) and can use any technology you want (RDBMS, NoSQL, Redis, etc.) as the problem is much simpler.
>
> The strategy is independent of DB technology - object storage is always cheaper than a DB and performs well enough for the target use cases. The other advantage is that you can do all sorts of analytics on the storage tier with map-reduce without having to ETL/sync it to another reporting DB.
>
> Synchronization from server to client is a solved problem - there are many solutions for this (basically use pub/sub to distribute data store changes to clients over WebSockets and such)

Let's try to find some time to discuss on Monday if the agenda isn't already spoken for.

@hackermd (Collaborator, Author)

@chafey we may want to consider Supplement 223: Inventory IOD and Related Services. The proposed Inventory IOD could be used to represent and share a database index.
