
Implementation of DICOMs3Client based on DICOMwebProtocol #57

Open · hackermd opened this issue Nov 16, 2021 · 27 comments
Labels: enhancement (New feature or request)

@hackermd (Collaborator)

@pieper @chafey As discussed in #56, it would be great to implement a DICOMs3Client based on DICOMwebProtocol.

In general, I think we should consider the following aspects for the design of an S3 client:

  1. Index of metadata for efficient query: DICOM attributes needed for response payloads of DICOMweb Search Transaction (Study Resource, Series Resource, and Instance Resource)
  2. Offset to pixel data and individual frames for efficient frame retrieval: for natively encoded Pixel Data, the client can just compute the offset of individual frames, but for an encapsulated Pixel Data element, the client would require the Basic Offset Table or Extended Offset Table
  3. Read and write: DICOMweb Store Transaction

(1) I've been thinking that we could put the metadata into the S3 Object Metadata. The PUT request header is limited to 8 KB in size, but that should be sufficient for the metadata. This could also include the offset to the Pixel Data, Float Pixel Data, or Double Float Pixel Data element (see 2).
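As a rough illustration, here is a minimal sketch (assuming boto3; the bucket name, key layout, and metadata keys are all hypothetical) of attaching selected DICOM attributes as user-defined S3 object metadata at upload time:

```python
import boto3

s3 = boto3.client("s3")
with open("instance.dcm", "rb") as fp:
    s3.put_object(
        Bucket="my-dicom-bucket",
        Key="studies/1.2.3/series/4.5.6/instances/7.8.9.dcm",
        Body=fp,
        # User-defined metadata is sent as x-amz-meta-* request headers
        # and therefore counts against the request-header size limit.
        Metadata={
            "studyinstanceuid": "1.2.3",
            "seriesinstanceuid": "4.5.6",
            "sopinstanceuid": "7.8.9",
            "pixeldataoffset": "5134",  # byte offset of the Pixel Data element
        },
    )
```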

(2) If the Pixel Data element contains a Basic Offset Table item, then the client could just read its value and cache the offsets locally. Otherwise the client would have to determine the offset of frame items by reading the header of each frame item. That could result in a lot of HTTP calls, so ideally we would make sure that the Basic Offset Table gets included into image data sets.
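Once the frame offsets are known (from the Basic Offset Table or a local cache), each frame could then be fetched with a single ranged read; a minimal sketch assuming boto3, with illustrative names:

```python
import boto3

s3 = boto3.client("s3")

def retrieve_frame(bucket: str, key: str, offset: int, length: int) -> bytes:
    """Fetch a single frame item via an HTTP Range request (inclusive bounds)."""
    response = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return response["Body"].read()
```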

(3) To support writes (including parallel writes), we shouldn't store any aggregate study- or series-level information such as Number of Study Related Series in the object metadata, but dynamically compute it client-side.
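For instance, Number of Study Related Series could be derived on the fly from an instance-level index; a sketch, where the `index` structure (SOP Instance UID mapped to an attribute dict) is hypothetical:

```python
from collections import defaultdict

def number_of_study_related_series(index: dict) -> dict:
    """Count distinct series per study instead of storing the aggregate."""
    series_per_study = defaultdict(set)
    for record in index.values():
        series_per_study[record["StudyInstanceUID"]].add(
            record["SeriesInstanceUID"]
        )
    return {uid: len(series) for uid, series in series_per_study.items()}
```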

What do you think?

@hackermd (Collaborator, Author)

cc @cgorman @CPBridge

@chafey commented Nov 16, 2021

It isn't clear to me what you are trying to do here - can you ELI5? An architecture diagram would help (sorry I am not familiar with this project at all)

@chafey commented Nov 16, 2021

As @pieper mentioned, I am working on a DICOMweb based archive format as well as a serverless DICOMweb implementation. Here are some links:
https://github.com/chafey/dicomweb-archive-direct
https://github.com/chafey/dicomp10-to-dicomweb-js
https://github.com/chafey/dicomweb-dump

@ntenenz (Contributor) commented Nov 16, 2021

@chafey, (AFAIK) this would be a Python client for fetching data from a serverless DICOMweb implementation (rather than specifying paths directly).

@chafey commented Nov 16, 2021

@ntenenz I am still confused - this is a client library that speaks DICOMweb - shouldn't it work as-is with any DICOMweb implementation, regardless of its architecture? I feel like I am missing something here.

@ntenenz (Contributor) commented Nov 16, 2021

I don't have enough context on the proposal to answer your question intelligently (does it match the existing DICOMweb spec and simply leverage a different form of implementation via static S3 blobs?). @pieper referenced it here: #56 (comment), so perhaps he's best positioned to discuss what he's thinking?

@chafey commented Nov 16, 2021

The concept is still coming together, but the general idea is to leverage object storage as much as possible for scalability. Storing pregenerated DICOMweb responses in object storage enables many access use cases, with serverless compute (Lambda, DynamoDB, Object Lambda) layered on to implement the remaining DICOMweb functionality. This has been a regular topic of the Monday morning meetings: https://github.com/chafey/medical-imaging-community

@ntenenz (Contributor) commented Nov 16, 2021

I've only been able to attend a couple of meetings thus far due to a standing conflict. That being said, if the proposal ends up being API-compatible with the existing DICOMweb standard (i.e., a new method of implementation), then the existing DICOMwebClient would be sufficient. If not, however, a new client would be required.

@chafey commented Nov 16, 2021

In terms of this specific issue/request, I suppose another approach could be to implement the existing API, but have it directly use AWS (S3, maybe other services). I am not sure if this is really worth it, though, as we should be able to get most if not all of DICOMweb implemented using serverless as described above.

@pieper (Member) commented Nov 16, 2021

I think the pre-computed metadata works for archives (rarely changing data, like IDC where we plan monthly versions).

@hackermd mentioned he wanted to use this client in a dynamic environment and was concerned about keeping static content in sync.

@chafey commented Nov 16, 2021

We do want to trigger rebuilding of the metadata any time the source data changes, so that shouldn't be a problem. Another option is to store DICOM P10 in S3 with WADO-URI paths and then implement DICOMweb using serverless on top of it. It wouldn't be as fast, but it should be just as scalable.

@hackermd (Collaborator, Author)

@chafey the DICOMwebClient class is a Python client for interfacing with a DICOMweb server over HTTP. The new DICOMfileClient allows users to access DICOM data sets that are stored locally as DICOM Part 10 files via the same Python API (DICOMwebProtocol). As a next step, I would like to extend this concept to DICOM data sets stored in an S3 bucket.
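To make the intended symmetry concrete (DICOMs3Client does not exist yet; the URL values are placeholders, and import paths may vary by package version):

```python
from dicomweb_client.api import DICOMwebClient, DICOMfileClient

web_client = DICOMwebClient(url="https://dicomweb.example.org")
file_client = DICOMfileClient(url="file:///data/dicom")
# s3_client = DICOMs3Client(url="s3://my-dicom-bucket")  # proposed, not yet implemented

# All clients would expose the same DICOMwebProtocol methods:
studies = file_client.search_for_studies()
```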

@chafey commented Nov 16, 2021

Maybe you could start by taking a similar approach: just use the S3 ListObjects API and parse each object to build the master list. After that is done, you can figure out how to optimize it (e.g., cache offsets/tags in the metadata as you suggested, using an object lambda?).
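A minimal sketch of that bootstrap, assuming boto3 and pydicom (note that this reads each object in full; a ranged read of just the header portion would avoid downloading bulk data):

```python
import io

import boto3
import pydicom

s3 = boto3.client("s3")

def build_index(bucket: str) -> dict:
    """Map SOP Instance UID to object key and hierarchy UIDs."""
    index = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            # stop_before_pixels skips parsing (not downloading) the Pixel Data
            ds = pydicom.dcmread(io.BytesIO(body), stop_before_pixels=True)
            index[ds.SOPInstanceUID] = {
                "key": obj["Key"],
                "StudyInstanceUID": ds.StudyInstanceUID,
                "SeriesInstanceUID": ds.SeriesInstanceUID,
            }
    return index
```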

@hackermd (Collaborator, Author) commented Nov 16, 2021

> Maybe you could start by taking a similar approach: just use the S3 ListObjects API and parse each object to build the master list. After that is done, you can figure out how to optimize it (e.g., cache offsets/tags in the metadata as you suggested, using an object lambda?).

The main question to me is to what extent the data should be pre-indexed versus indexed dynamically by the client.

The DICOMfileClient internally uses a SQLite database to index the DICOM metadata and caches the index on disk along with the files. This indexing operation is potentially expensive (and is skipped by default), but the advantage is that the DICOMfileClient does not depend on a particular directory structure or file naming convention. In that sense it is truly serverless.

@hackermd (Collaborator, Author)

I was hoping that we could provide the same functionality for S3 buckets via the DICOMs3Client, i.e., allow a Python program to point it to any bucket that contains DICOM data sets and then query, retrieve, and store data via the Python DICOMwebProtocol interface. The DICOMs3Client would use the S3 API under the covers, but expose the data using DICOMweb semantics.

@chafey commented Nov 16, 2021

Better to pre-index it, since S3 is a shared resource and building the index is expensive. Ideally this architecture would be portable to other languages, so an index that is universally understandable would be good. Alternatively, you could package SQLite in a WASM module and create bindings for each language. Just thinking out loud here...

@chafey commented Nov 16, 2021

Also - if you have < 100,000 studies, you could simply store the contents of the index in a JSON file and build the indexes on the fly in memory.
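A sketch of that approach, with the file layout assumed to follow the DICOM JSON model (tag 00100020 is Patient ID; the file name is illustrative):

```python
import json

# Load the whole study-level index once at startup.
with open("index.json") as fp:
    studies = json.load(fp)  # a list of DICOM JSON attribute dicts

def search_for_studies(patient_id: str) -> list:
    """Linear scan; adequate when the index fits comfortably in memory."""
    return [
        study for study in studies
        if study.get("00100020", {}).get("Value") == [patient_id]
    ]
```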

@hackermd (Collaborator, Author)

> Better to pre-index it, since S3 is a shared resource and building the index is expensive. Ideally this architecture would be portable to other languages, so an index that is universally understandable would be good. Alternatively, you could package SQLite in a WASM module and create bindings for each language. Just thinking out loud here...

In principle, the DICOM standard already provides such indices via DICOMDIR files. They are not pleasant to work with, but are very compact and already supported by most DICOM libraries.

The Study Resource, Series Resource, and Instance Resource are (more or less) a JSON representation of DICOMDIR files.

Either way, I agree that it would be highly desirable to be able to share the indices between applications. We could just add methods to DICOMfileClient and DICOMs3Client to marshal and unmarshal the index to/from JSON (or binary DICOMDIR).

That would even be very useful for DICOMweb origin servers, because they could then also just be pointed to an S3 bucket and could import the index into their database without having to re-import all the data!
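Usage might look like the following (both method names are purely illustrative; neither exists in dicomweb-client today, and DICOMs3Client itself is the proposal under discussion):

```python
# Build the index once with DICOMfileClient, then hand it to other consumers.
index_json = file_client.export_index()  # hypothetical: serialize index to JSON
s3_client.import_index(index_json)       # hypothetical: reuse without re-indexing
```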

@ntenenz (Contributor) commented Nov 17, 2021

@chafey, could you elaborate on the motivation (either here or in Monday's meeting) for not leveraging a database for metadata? I completely understand how serving of pixel data could fully utilize a PACS's resources. Is the assumption that QIDO requests (could) do the same?

@chafey commented Nov 17, 2021

Databases are typically the performance and scalability choke point for medical image archives. You want to avoid them when possible and minimize them when you can't avoid them completely. For smallish archives (< 100,000 studies), they are not needed at all as the entire data set can easily fit in memory and you can implement the queries using simple scans (e.g. iterate over the series in a study to implement QIDO-RS Study's Series queries) or an in-memory index (e.g. a map of patient IDs to implement QIDO-RS Search for Studies queries). Any queries that are constrained to a single study can be done entirely in memory without a database. The only time you need an index is when doing study-level queries for large archives, and something like Redis or NoSQL is a far better option than an RDBMS.
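For example, the in-memory index for QIDO-RS Search for Studies could be as simple as a dictionary built once at startup; a sketch, where the record layout is illustrative:

```python
from collections import defaultdict
from typing import Iterable, Mapping

def index_studies_by_patient(records: Iterable[Mapping]) -> dict:
    """Map Patient ID to the set of Study Instance UIDs, built once at startup."""
    studies = defaultdict(set)
    for record in records:
        studies[record["PatientID"]].add(record["StudyInstanceUID"])
    return studies

# A Search for Studies query constrained by Patient ID is then a dict lookup:
# index_studies_by_patient(records).get("PID-1234", set())
```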

@hackermd (Collaborator, Author) commented Nov 17, 2021

> For smallish archives (< 100,000 studies), they are not needed at all as the entire data set can easily fit in memory

That is a big assumption. In pathology, not even a single study may fit in memory!

Or do you just refer to the indexed metadata?

@chafey commented Nov 17, 2021

> That is a big assumption. In pathology, not even a single study may fit in memory!
>
> Or do you just refer to the indexed metadata?

I am just referring to metadata

@ntenenz (Contributor) commented Nov 17, 2021

> Databases are typically the performance and scalability choke point for medical image archives. You want to avoid them when possible and minimize them when you can't avoid them completely. For smallish archives (< 100,000 studies), they are not needed at all as the entire data set can easily fit in memory and you can implement the queries using simple scans (e.g. iterate over the series in a study to implement QIDO-RS Study's Series queries) or an in-memory index (e.g. a map of patient IDs to implement QIDO-RS Search for Studies queries). Any queries that are constrained to a single study can be done entirely in memory without a database. The only time you need an index is when doing study-level queries for large archives, and something like Redis or NoSQL is a far better option than an RDBMS.

A few questions:

  • Is the DB bottleneck from QIDO-RS queries or from the metadata needed for series/study retrieval? If it's the latter, would this bottleneck disappear if the data were pre-computed?
  • Does this bottleneck remain in a scalable, sharded DB or is this something that is limited to a more vanilla clinical deployment?
  • If the metadata were held in memory client side, how would the client become aware of the latest studies? Synchronization seems like it could be tough.

I'm wondering if an easier solution would be to lean on a vanilla datastore, either a database or Redis, for DICOM metadata (e.g., QIDO-RS queries) and pre-compute the instances/series/studies (or a subset, with lambdas for aggregation/de-aggregation). I'd be curious to see whether any clinical or research workflow could strain such a setup.

@chafey commented Nov 17, 2021

Many PACS systems store everything they need to respond to QIDO-RS/CFIND in an RDBMS. This is problematic because the DB becomes very large and also a chokepoint for data ingestion (each SOP Instance is an insert, possibly causing rebalancing of indexes). The strategy I have used to reach incredibly high scale is to store the metadata on disk and read from it to handle queries below the study level - this can be done via pre-generation of metadata as we are discussing with the dicomweb-static project. When you take that approach, you just need to provide indexes for study-level attributes (patient name, patient ID, study date, etc.) and can use any technology you want (RDBMS, NoSQL, Redis, etc.) as the problem is much simpler.

The strategy is independent of DB technology - object storage is always cheaper than a DB and performs well enough for the target use cases. The other advantage is that you can do all sorts of analytics on the storage tier with map-reduce without having to ETL/sync it to another reporting DB.

Synchronization from server to client is a solved problem - there are many solutions for this (basically use pub/sub to distribute data store changes to clients over WebSockets and such)

@hackermd (Collaborator, Author)

> Many PACS systems store everything they need to respond to QIDO-RS/CFIND in an RDBMS. This is problematic because the DB becomes very large and also a chokepoint for data ingestion (each SOP Instance is an insert, possibly causing rebalancing of indexes). The strategy I have used to reach incredibly high scale is to store the metadata on disk and read from it to handle queries below the study level - this can be done via pre-generation of metadata as we are discussing with the dicomweb-static project.

I am fully on board with storing instance metadata as documents and keeping those in memory for fast queries. We've run into several issues with RDBMS-based implementations when storing many instances via the DICOMweb Store Transaction.

The main question to me is whether those documents should be cached (relative to the full instances) and whether other applications (in particular serverless clients such as the DICOMs3Client) should be able to leverage them, too. That's basically what DICOMDIR files of the DICOM File Service are intended for. Unfortunately, they are a pain to work with and difficult to update (binary offsets), and I am wondering whether we want the DICOMweb response JSON documents to serve as a DICOMDIR alternative.

@ntenenz (Contributor) commented Nov 17, 2021

> Many PACS systems store everything they need to respond to QIDO-RS/CFIND in an RDBMS. This is problematic because the DB becomes very large and also a chokepoint for data ingestion (each SOP Instance is an insert, possibly causing rebalancing of indexes). The strategy I have used to reach incredibly high scale is to store the metadata on disk and read from it to handle queries below the study level - this can be done via pre-generation of metadata as we are discussing with the dicomweb-static project. When you take that approach, you just need to provide indexes for study-level attributes (patient name, patient ID, study date, etc.) and can use any technology you want (RDBMS, NoSQL, Redis, etc.) as the problem is much simpler.
>
> The strategy is independent of DB technology - object storage is always cheaper than a DB and performs well enough for the target use cases. The other advantage is that you can do all sorts of analytics on the storage tier with map-reduce without having to ETL/sync it to another reporting DB.
>
> Synchronization from server to client is a solved problem - there are many solutions for this (basically use pub/sub to distribute data store changes to clients over WebSockets and such)

Let's try to find some time to discuss on Monday if the agenda isn't already spoken for.

@hackermd (Collaborator, Author)

@chafey we may want to consider Supplement 223: Inventory IOD and Related Services. The proposed Inventory IOD could be used to represent and share a database index.
