WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole Slide Images [MICCAI2024]

=====

WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole Slide Images.
Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Lin Yang

Summary:

1. We propose a pipeline to curate high-quality WSI-text pairs from TCGA. The resulting dataset, TCGA-PathoText, contains about ten thousand pairs and will be publicly accessible. It can potentially promote the development of visual-language models in pathology.
2. We design a multiple instance generation framework (MI-Gen). By incorporating a position-aware module, our model is more sensitive to the spatial information in WSIs.

Pre-requisites:

We will share our collected slide-level captions, but the WSIs themselves must still be downloaded separately because of their size.

Downloading TCGA Slides

To download diagnostic WSIs (formatted as .svs files), please refer to the NIH Genomic Data Commons Data Portal. WSIs for each cancer type can be downloaded using the GDC Data Transfer Tool.
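As a minimal sketch, the download step can be scripted by assembling a `gdc-client` call from a manifest file exported from the GDC Data Portal. The helper names, the manifest filename, and the output directory are illustrative assumptions and not part of this repository. The `-m` (manifest) and `-d` (download directory) options belong to the `download` command of the GDC Data Transfer Tool, but please check `gdc-client download --help` for your installed version.

```python
import shutil
import subprocess
from pathlib import Path


def build_gdc_download_cmd(manifest, out_dir):
    """Return the argument list that downloads every file in a GDC manifest."""
    return ["gdc-client", "download", "-m", str(manifest), "-d", str(out_dir)]


def download_slides(manifest, out_dir):
    """Run the download; requires gdc-client to be installed and on PATH."""
    if shutil.which("gdc-client") is None:
        raise RuntimeError("gdc-client not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the transfer tool fails.
    subprocess.run(build_gdc_download_cmd(manifest, out_dir), check=True)
```

For example, `download_slides("gdc_manifest.txt", "slides/TCGA_BLCA")` would fetch the slides listed in a hypothetical manifest for one cancer type.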

Processing Whole Slide Images

To process WSIs, the tissue regions in each biopsy slide are first segmented using Otsu's method on a downsampled WSI read with OpenSlide. Non-overlapping 256 x 256 patches are then extracted from the segmented tissue regions at the desired magnification. Subsequently, a pretrained truncated ResNet50 is used to encode the raw image patches into 1024-dim feature vectors, which we then save as .pt files for each WSI. We perform this pre-processing with CLAM.
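The two geometric steps above (choosing an Otsu threshold and laying out non-overlapping patches) can be illustrated with a small pure-Python sketch. This is not CLAM's implementation: the function names are hypothetical, the threshold works on a plain intensity histogram, and which side of the threshold counts as tissue depends on the image channel used.

```python
def otsu_threshold(hist):
    """Return the bin t that maximizes the between-class variance.

    Pixels with value <= t form one class, all other pixels the second class.
    """
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the lower class
    sum0 = 0  # intensity sum of the lower class
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def patch_grid(width, height, patch=256):
    """Top-left corners of non-overlapping patch x patch tiles that fit fully."""
    return [(x, y)
            for y in range(0, height - patch + 1, patch)
            for x in range(0, width - patch + 1, patch)]
```

In a real pipeline the grid would be restricted to coordinates that fall inside the segmented tissue mask before the patches are read and encoded.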

TCGA-PathoText: Slide-Text captions

We notice that TCGA includes scanned copies of pathology reports in PDF format. However, they are too long, contain redundant information, and have a complex structure. Therefore, we propose a pipeline to extract and clean pathological texts from TCGA, which converts complex PDF files into concise WSI-text pairs with the assistance of large language models (LLMs). We also use a classifier to remove low-quality pairs.

Our dataset can be downloaded online now. The following folder structure is assumed for the TCGA-PathoText:

TCGA-PathoText/
    └──TCGA_BLCA/
        ├── case_1
              ├──annotation ##(slide-level captions we obtained by OCR and GPT)
              ├──case_1.pdf ##(softlink to the corresponding raw TCGA report)
              └── ...
        ├── case_2
        └── ...
    └──TCGA_BRCA/
        ├── case_1
        ├── case_2
        └── ...
    ...

TCGA-Slide-Features/
    └──TCGA_BLCA/
        ├── case_1.pt
        ├── case_2.pt
        └── ...
    └──TCGA_BRCA/
        ├── case_1.pt
        ├── case_2.pt
        └── ...
    ...

TCGA-PathoText contains the captions and TCGA-Slide-Features includes the extracted features of WSIs.
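Given the layout above, captions and features can be matched by cohort and case name. The sketch below is an illustration under assumptions: the function name `pair_cases` is hypothetical, and it only checks that an `annotation` entry and a matching `.pt` file exist, without assuming anything about their contents.

```python
from pathlib import Path


def pair_cases(text_root, feat_root):
    """Yield (cohort, case_id, annotation_path, feature_path) for complete pairs."""
    text_root, feat_root = Path(text_root), Path(feat_root)
    for cohort_dir in sorted(p for p in text_root.iterdir() if p.is_dir()):
        for case_dir in sorted(p for p in cohort_dir.iterdir() if p.is_dir()):
            annotation = case_dir / "annotation"
            feature = feat_root / cohort_dir.name / f"{case_dir.name}.pt"
            # Skip cases that lack either the caption or the feature file.
            if annotation.exists() and feature.exists():
                yield cohort_dir.name, case_dir.name, annotation, feature
```

Calling `list(pair_cases("TCGA-PathoText", "TCGA-Slide-Features"))` gives a quick way to check how many cases have both a caption and extracted features before training.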

More details about the dataset: (a) Histogram of text lengths, showing that TCGA-PathoText includes longer pathology reports than ARCH, which only describes small patches. (b) Word cloud of the 100 most frequent tokens.

Running Experiments

Experiments can be run using the following generic command-line:

Training model

python main.py --mode 'Train' --n_gpu <GPUs to be used, e.g. '0,1,2,3' for training on 4 GPUs> --image_dir <SLIDE FEATURE PATH> --ann_path <CAPTION PATH> --split_path <PATH to the directory containing the train/val/test splits>

Testing model

python main.py --mode 'Test' --image_dir <SLIDE FEATURE PATH> --ann_path <CAPTION PATH> --split_path <PATH to the directory containing the train/val/test splits> --checkpoint_dir <PATH TO CKPT>
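When running several experiments, it can help to assemble these command lines programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository. It only uses the flags shown in the commands above, and it assumes that `--checkpoint_dir` is required in 'Test' mode because the test command includes it.

```python
import sys


def build_command(mode, image_dir, ann_path, split_path,
                  n_gpu=None, checkpoint_dir=None):
    """Assemble the main.py argument list for mode 'Train' or 'Test'."""
    if mode not in ("Train", "Test"):
        raise ValueError("mode must be 'Train' or 'Test'")
    cmd = [sys.executable, "main.py", "--mode", mode,
           "--image_dir", image_dir, "--ann_path", ann_path,
           "--split_path", split_path]
    if n_gpu is not None:
        cmd += ["--n_gpu", n_gpu]  # e.g. "0,1,2,3"
    if mode == "Test":
        # Assumption: testing always needs a trained checkpoint.
        if checkpoint_dir is None:
            raise ValueError("Test mode needs checkpoint_dir")
        cmd += ["--checkpoint_dir", checkpoint_dir]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.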

Basic Environment

Linux (Tested on Ubuntu 18.04)
NVIDIA GPU (Tested on NVIDIA A100) with CUDA 12.0
Python (3.8)
