Official Implementation of 《WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole Slide Images》(MICCAI 2024)

cpystan/Wsi-Caption

WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole Slide Images [MICCAI2024]

=====

WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole Slide Images. [Link]
Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Lin Yang
Summary:
1. We propose a pipeline to curate high-quality WSI-text pairs from TCGA. The resulting dataset, TCGA-PathoText, contains about ten thousand pairs and will be publicly accessible. It can potentially promote the development of visual-language models in pathology.
2. We design a multiple instance generation framework (MI-Gen). By incorporating a position-aware module, our model is more sensitive to the spatial information in WSIs.

Pre-requisites:

We will share our collected slide-level captions, but the WSIs themselves still need to be downloaded separately because of their large size.

Downloading TCGA Slides

To download diagnostic WSIs (formatted as .svs files), please refer to the NIH Genomic Data Commons Data Portal. WSIs for each cancer type can be downloaded using the GDC Data Transfer Tool.
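As a rough illustration of the download step, the sketch below assembles a `gdc-client download` call from a manifest file exported from the GDC Data Portal. The manifest name and output directory are placeholders chosen for this example, not paths used by this repository, and the command is only printed here rather than executed.

```python
import shlex


def build_gdc_download_command(manifest, out_dir, client="gdc-client"):
    """Return the argument list for a manifest-based GDC download.

    `manifest` and `out_dir` are hypothetical paths supplied by the caller.
    `-m` points at the manifest and `-d` at the download directory.
    """
    return [client, "download", "-m", str(manifest), "-d", str(out_dir)]


if __name__ == "__main__":
    cmd = build_gdc_download_command("gdc_manifest.txt", "TCGA-Slides/TCGA_BLCA")
    # Print the command so it can be inspected before running it.
    print(shlex.join(cmd))
    # To actually download, pass the list to subprocess.run(cmd, check=True).
```

Downloading one cancer type per manifest keeps the slides grouped by cohort, which matches the per-cohort folder layout used later in this README.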

Processing Whole Slide Images

To process WSIs, the tissue regions in each biopsy slide are first segmented with Otsu's method on a downsampled version of the WSI read through OpenSlide. Non-overlapping 256 x 256 patches are then extracted from the segmented tissue regions at the desired magnification. Subsequently, a pretrained truncated ResNet50 encodes the raw image patches into 1024-dim feature vectors, which we save as one .pt file per WSI. We perform this pre-processing with CLAM.
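The two geometric steps above (thresholding and non-overlapping tiling) can be sketched in plain Python. This is a simplified illustration and not CLAM's implementation: it assumes an 8-bit grayscale thumbnail flattened into a list, treats pixels at or below the threshold as tissue, and tiles a rectangular bounding box rather than tissue contours. The function names are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the level t that maximizes between-class variance.

    `pixels` is a flat list of integers in [0, levels).
    Class 0 holds values <= t, class 1 holds values > t.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in class 0 yet
        if w1 == 0:
            break     # all pixels are already in class 0
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def grid_coords(x0, y0, width, height, patch=256):
    """Yield top-left corners of non-overlapping patches inside a box.

    Patches that would cross the right or bottom edge are skipped.
    """
    for y in range(y0, y0 + height - patch + 1, patch):
        for x in range(x0, x0 + width - patch + 1, patch):
            yield (x, y)


if __name__ == "__main__":
    # Toy thumbnail: dark "tissue" pixels and bright "background" pixels.
    thumbnail = [10] * 50 + [200] * 50
    t = otsu_threshold(thumbnail)
    tissue = [p for p in thumbnail if p <= t]
    print(t, len(tissue))
    print(list(grid_coords(0, 0, 512, 600)))
```

Using a stride equal to the patch size is what makes the patches non-overlapping; a smaller stride would produce overlapping tiles.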

TCGA-PathoText: Slide-Text captions

TCGA includes scanned copies of pathology reports in PDF format, but they are long, contain redundant information, and have a complex structure. We therefore propose a pipeline to extract and clean pathological texts from TCGA, converting the complex PDF files into concise WSI-text pairs with the assistance of large language models (LLMs). We also use a classifier to remove low-quality pairs.

Figure: dataset construction.

Our dataset can be downloaded online now. The following folder structure is assumed for the TCGA-PathoText:

TCGA-PathoText/
    ├── TCGA_BLCA/
    │   ├── case_1/
    │   │   ├── annotation   ## (slide-level captions obtained by OCR and GPT)
    │   │   ├── case_1.pdf   ## (softlink to the corresponding raw TCGA report)
    │   │   └── ...
    │   ├── case_2/
    │   └── ...
    ├── TCGA_BRCA/
    │   ├── case_1/
    │   ├── case_2/
    │   └── ...
    └── ...

TCGA-Slide-Features/
    ├── TCGA_BLCA/
    │   ├── case_1.pt
    │   ├── case_2.pt
    │   └── ...
    ├── TCGA_BRCA/
    │   ├── case_1.pt
    │   ├── case_2.pt
    │   └── ...
    └── ...

TCGA-PathoText contains the captions and TCGA-Slide-Features includes the extracted features of WSIs.
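Given this layout, captions and features can be matched by cohort and case name. The sketch below is one possible way to do so and is not the repository's data loader; `pair_cases` is a hypothetical helper, and it only assumes that each case folder holds an entry named `annotation` and that the matching feature file is `<case>.pt` under the same cohort name.

```python
from pathlib import Path


def pair_cases(text_root, feature_root):
    """Return (case_name, annotation_path, feature_path) for matched cases.

    Cases missing either the annotation or the .pt feature file are skipped.
    """
    text_root, feature_root = Path(text_root), Path(feature_root)
    pairs = []
    for cohort_dir in sorted(text_root.iterdir()):
        if not cohort_dir.is_dir():
            continue
        for case_dir in sorted(cohort_dir.iterdir()):
            annotation = case_dir / "annotation"
            feature = feature_root / cohort_dir.name / f"{case_dir.name}.pt"
            if annotation.exists() and feature.exists():
                pairs.append((case_dir.name, annotation, feature))
    return pairs


if __name__ == "__main__":
    # Example roots; adjust to wherever the two folders were downloaded.
    for name, annotation, feature in pair_cases("TCGA-PathoText", "TCGA-Slide-Features"):
        print(name, annotation, feature)
```

Listing the matched pairs before training is a quick check that every caption has a feature file, since a slide that failed to download or process would otherwise surface only as an error at load time.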

More details about the dataset are shown below. (a) Histogram of text lengths. It shows that TCGA-PathoText includes longer pathology reports than ARCH, which only describes small patches. (b) Word cloud showing the 100 most frequent tokens.

Running Experiments

Experiments can be run using the following generic command lines:

Training model

python main.py --mode 'Train' --n_gpu <GPUs to be used, e.g. '0,1,2,3' for 4-GPU training> --image_dir <SLIDE FEATURE PATH> --ann_path <CAPTION PATH> --split_path <PATH to the directory containing the train/val/test splits>

Testing model

python main.py --mode 'Test' --image_dir <SLIDE FEATURE PATH> --ann_path <CAPTION PATH> --split_path <PATH to the directory containing the train/val/test splits> --checkpoint_dir <PATH TO CKPT>

Basic Environment

  • Linux (Tested on Ubuntu 18.04)
  • NVIDIA GPU (Tested on NVIDIA A100) with CUDA 12.0
  • Python (3.8)
