
t-SNE optimization using scikit-learn-intelex #3061

Open · wants to merge 5 commits into base: main

Conversation


@ashish615 ashish615 commented May 14, 2024

This pull request accelerates t-SNE using the scikit-learn-intelex library, yielding roughly a 10x runtime improvement for the package's t-SNE implementation on the example below. The experiment was run on an AWS r7i.24xlarge instance.

import os
import time
import warnings

import numpy as np
import pandas as pd
import scanpy as sc
import wget
from sklearn.cluster import KMeans

warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
input_file = "./1M_brain_cells_10X.sparse.h5ad"

if not os.path.exists(input_file):
    print('Downloading import file...')
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad',input_file)


# marker genes
MITO_GENE_PREFIX = "mt-" # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed

# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 4000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 35 # Number of clusters for k-means

# Gene ranking

ranking_n_top_genes = 50 # Number of differential genes to compute for each cluster

# Number of parallel jobs
sc.settings.n_jobs = os.cpu_count()  # set on the live settings instance; assigning to the ScanpyConfig class has no effect

start=time.time()
tr=time.time()
adata = sc.read(input_file)
adata.var_names_make_unique()
adata.shape
print("Total read time : %s" % (time.time()-tr))



tr=time.time()
# To reduce the number of cells:
USE_FIRST_N_CELLS = 1300000
adata = adata[0:USE_FIRST_N_CELLS]
adata.shape

sc.pp.filter_cells(adata, min_genes=min_genes_per_cell)
sc.pp.filter_cells(adata, max_genes=max_genes_per_cell)
sc.pp.filter_genes(adata, min_cells=min_cells_per_gene)
sc.pp.normalize_total(adata, target_sum=1e4)
print("Total filter and normalize time : %s" % (time.time()-tr))


tr=time.time()
sc.pp.log1p(adata)
print("Total log time : %s" % (time.time()-tr))


# Select highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor = "cell_ranger")

# Retain marker gene expression
for marker in markers:
    adata.obs[marker + "_raw"] = adata.X[:, adata.var.index == marker].toarray().ravel()

# Filter matrix to only variable genes
adata = adata[:, adata.var.highly_variable]

ts=time.time()
#Regress out confounding factors (number of counts, mitochondrial gene expression)
mito_genes = adata.var_names.str.startswith(MITO_GENE_PREFIX)
n_counts = np.array(adata.X.sum(axis=1))
adata.obs['percent_mito'] = np.array(np.sum(adata[:, mito_genes].X, axis=1)) / n_counts
adata.obs['n_counts'] = n_counts


sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
print("Total regress out time : %s" % (time.time()-ts))

#scale

ts=time.time()
sc.pp.scale(adata, max_value=10)
print("Total scale time : %s" % (time.time()-ts))
t0 = time.time()
sc.tl.pca(adata, n_comps=n_components)
print("Total PCA time : %s" % (time.time() - t0))

t0 = time.time()
sc.settings.n_jobs = os.cpu_count()
sc.tl.tsne(adata, n_pcs=tsne_n_pcs, use_fast_tsne=True)

'''
from sklearn.manifold import TSNE
from scanpy.tools._utils import _choose_representation
X = _choose_representation(adata, n_pcs=tsne_n_pcs)
X_tsne = TSNE().fit_transform(X.astype(np.float32))
adata.obsm['X_tsne'] = X_tsne
'''
print("Tsne time:", time.time()-t0)


codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 66.66667%, with 1 line in your changes missing coverage. Please review.

Project coverage is 75.19%. Comparing base (3ba3f46) to head (d8d4763).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3061      +/-   ##
==========================================
- Coverage   75.87%   75.19%   -0.68%     
==========================================
  Files         110      110              
  Lines       12533    12536       +3     
==========================================
- Hits         9509     9427      -82     
- Misses       3024     3109      +85     
Files                   Coverage Δ
scanpy/tools/_tsne.py   68.08% <66.66%> (-25.10%) ⬇️

... and 5 files with indirect coverage changes

@ashish615 ashish615 closed this May 15, 2024
@ashish615 ashish615 reopened this May 15, 2024
@sanchit-misra

For the above code, the time spent in t-SNE went down from 2252 seconds to 210 seconds due to this PR.

@flying-sheep (Member) commented May 16, 2024

> For the above code, the time spent in t-SNE went down from 2252 seconds to 210 seconds due to this PR.

What’s the comparison to MulticoreTSNE?

Defaults

It would probably make sense to use a flavor: Literal['auto', 'sklearn', 'intelex', 'multicore'] = 'auto' parameter here, where 'auto' would try to import the speedup packages one by one and use the preferred one (a sketch follows below).
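
A sketch of how such an 'auto' flavor could resolve; the helper name is hypothetical and not part of this PR:

# Hypothetical resolver for flavor='auto': prefer a speedup package
# if it is importable, otherwise fall back to stock scikit-learn.
import importlib.util

def _resolve_tsne_flavor(flavor: str) -> str:
    if flavor != 'auto':
        return flavor
    for name, module in (('intelex', 'sklearnex'), ('multicore', 'MulticoreTSNE')):
        if importlib.util.find_spec(module) is not None:
            return name
    return 'sklearn'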

use_fast_tsne could be deprecated and made to default to None, with this logic (too bad we can’t use match yet)

if use_fast_tsne is not None:
    warnings.warn("...", FutureWarning)
match (use_fast_tsne, flavor):
    case (None, 'auto'): ...  # try importing 'intelex', fall back to 'sklearn'
    case (None, _): ...  # use specified flavor
    case (True, 'auto'): ...  # use 'multicore'
    case (True, 'sklearn'): ...  # throw error
    case (True, _): ...  # use specified flavor
    case (False, 'auto' | 'sklearn'): ...  # use 'sklearn'
    case (False, _): ...  # throw error
    case _: raise AssertionError()

In the future, we can change 'auto' to try both intelex and MulticoreTSNE.

@ilan-gold what do you think?

@ilan-gold (Contributor)

This seems reasonable to me, @flying-sheep. Does using the patched version change the results relative to the unpatched one, @ashish615, i.e., for a given random seed, are the unpatched and patched outputs the same? If they are identical for a given seed/state, then I think what @flying-sheep is proposing could be done separately (even if we make the dependency optional, IMO). However, if the new version does change the results, we will need the handling that @flying-sheep describes.
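
A minimal way to run that seed check, as a sketch (it assumes sklearnex's documented patch_sklearn toggle; run the script twice and compare the saved arrays):

# Sketch: run once with PATCH=1 and once with PATCH=0, then compare
# the two saved embeddings for the same random_state.
import os
import numpy as np

if os.environ.get("PATCH") == "1":
    from sklearnex import patch_sklearn
    patch_sklearn()  # must be applied before TSNE is imported

from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20)).astype(np.float32)

emb = TSNE(n_components=2, random_state=42).fit_transform(X)
tag = "patched" if os.environ.get("PATCH") == "1" else "stock"
np.save(f"emb_{tag}.npy", emb)
# afterwards: np.allclose(np.load("emb_patched.npy"), np.load("emb_stock.npy"))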

@ashish615 (Author) commented May 16, 2024

> What’s the comparison to MulticoreTSNE?

This comparison is against the use_fast_tsne flag, which uses MulticoreTSNE.

@ashish615 (Author) commented May 16, 2024

> This seems reasonable to me, @flying-sheep. Does using the patched version change the results relative to the unpatched one, @ashish615, i.e., for a given random seed, are the unpatched and patched outputs the same? If they are identical for a given seed/state, then I think what @flying-sheep is proposing could be done separately (even if we make the dependency optional, IMO). However, if the new version does change the results, we will need the handling that @flying-sheep describes.

@ilan-gold, I didn't check that. I will check and let you know.

@narendrachaudhary51

These t-SNE optimizations are described in the following paper, added here for reference:
N. Chaudhary, A. Pivovar, P. Yakovlev, A. Gorshkov, et al., "Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs," arXiv preprint arXiv:2212.11506, 2022. https://arxiv.org/abs/2212.11506

@ilan-gold (Contributor)

@narendrachaudhary51 You can add the reference here https://github.com/scverse/scanpy/blob/main/docs/references.bib (thank you @flying-sheep for improving this!)

@ashish615 (Author)

Hi @flying-sheep @ilan-gold,
Based on our previous discussion, we checked whether the patch changes the results for a fixed seed: applying the patch does change the t-SNE output relative to the unpatched version. In our experiment, we ran t-SNE on 1.3 million data points and compared the patched and unpatched versions by examining the KL divergence of both runs. The results are summarized in the table below.

In the code above, use USE_FIRST_N_CELLS to set the number of records, and call sc.tl.tsne(adata, n_pcs=tsne_n_pcs, use_fast_tsne=False) to run the optimized version with the latest commit. The KL divergence numbers can be obtained by logging the fitted estimator's kl_divergence_ attribute (see the sketch after the table).

[Image: table of KL divergence values for the patched and unpatched runs]
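
A sketch of how those numbers can be reproduced, mirroring the commented-out block in the benchmark above (it assumes the adata and tsne_n_pcs from that script; _choose_representation is a scanpy-internal helper):

# Sketch: fit sklearn's TSNE directly so kl_divergence_ is accessible;
# sc.tl.tsne itself does not return the fitted estimator.
import numpy as np
from sklearn.manifold import TSNE
from scanpy.tools._utils import _choose_representation

X = _choose_representation(adata, n_pcs=tsne_n_pcs)  # PCA representation
tsne = TSNE(random_state=42)
adata.obsm['X_tsne'] = tsne.fit_transform(X.astype(np.float32))
print("KL divergence:", tsne.kl_divergence_)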

@sanchit-misra

As the KL divergence values in the above table show, the output of the Intel-optimized t-SNE is different but equivalent in quality.
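
For reference, the quantity in the table is the objective that Barnes-Hut t-SNE minimizes, the Kullback-Leibler divergence between the high-dimensional affinities P and the low-dimensional affinities Q:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

so similar final values indicate embeddings of comparable quality, even if the coordinates differ.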

@ilan-gold (Contributor)

@sanchit-misra @ashish615 one thing we wondered: why does this need to be in scanpy? Can't users turn this on/off outside scanpy?

@ilan-gold (Contributor)

Notwithstanding the above question: since your method produces different results than the default, we will need to adopt the conventions outlined in #3061 (comment). This will ensure continued reproducibility with defaults.

@ashish615 (Author)

Hi @ilan-gold,

Regarding your thought in this comment: we can enable or disable the Intel optimization from outside scanpy (a sketch follows below). However, users might not be aware that this is possible. If we instead add it to scanpy directly, all scanpy users will know the option is available.
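
For completeness, the user-side opt-in would look roughly like this; a sketch that assumes the patch must be applied before scanpy imports sklearn's TSNE:

# Sketch: enabling the Intel optimization outside scanpy.
from sklearnex import patch_sklearn
patch_sklearn()  # patch before sklearn.manifold.TSNE is imported

import scanpy as sc

# ... load and preprocess adata as usual, then:
# sc.tl.tsne(adata, n_pcs=20)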

If we agree on the option discussed in this comment, I can proceed with updating the t-SNE file.
