
t-SNE optimization using scikit-learn-intelex #3061

Open · wants to merge 5 commits into base: main

Conversation


@ashish615 ashish615 commented May 14, 2024

This pull request accelerates t-SNE using the scikit-learn-intelex library, yielding roughly a 10x runtime improvement for the package's t-SNE implementation on the example below. The experiment was run on an AWS r7i.24xlarge instance.

import os
import time
import warnings

import numpy as np
import pandas as pd
import scanpy as sc
import wget
from sklearn.cluster import KMeans

warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
input_file = "./1M_brain_cells_10X.sparse.h5ad"

if not os.path.exists(input_file):
    print('Downloading import file...')
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad',input_file)


# marker genes
MITO_GENE_PREFIX = "mt-" # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed

# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 4000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 35 # Number of clusters for k-means

# Gene ranking

ranking_n_top_genes = 50 # Number of differential genes to compute for each cluster

# Number of parallel jobs
sc.settings.n_jobs = os.cpu_count()  # set on the live settings instance; assigning to the ScanpyConfig class has no effect

start=time.time()
tr=time.time()
adata = sc.read(input_file)
adata.var_names_make_unique()
adata.shape
print("Total read time : %s" % (time.time()-tr))



tr=time.time()
# To reduce the number of cells:
USE_FIRST_N_CELLS = 1300000
adata = adata[0:USE_FIRST_N_CELLS]
adata.shape

sc.pp.filter_cells(adata, min_genes=min_genes_per_cell)
sc.pp.filter_cells(adata, max_genes=max_genes_per_cell)
sc.pp.filter_genes(adata, min_cells=min_cells_per_gene)
sc.pp.normalize_total(adata, target_sum=1e4)
print("Total filter and normalize time : %s" % (time.time()-tr))


tr=time.time()
sc.pp.log1p(adata)
print("Total log time : %s" % (time.time()-tr))


# Select highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor = "cell_ranger")

# Retain marker gene expression
for marker in markers:
    adata.obs[marker + "_raw"] = adata.X[:, adata.var.index == marker].toarray().ravel()

# Filter matrix to only variable genes
adata = adata[:, adata.var.highly_variable]

ts=time.time()
#Regress out confounding factors (number of counts, mitochondrial gene expression)
mito_genes = adata.var_names.str.startswith(MITO_GENE_PREFIX)
n_counts = np.array(adata.X.sum(axis=1))
adata.obs['percent_mito'] = np.array(np.sum(adata[:, mito_genes].X, axis=1)) / n_counts
adata.obs['n_counts'] = n_counts


sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
print("Total regress out time : %s" % (time.time()-ts))

#scale

ts=time.time()
sc.pp.scale(adata, max_value=10)
print("Total scale time : %s" % (time.time()-ts))
t0 = time.time()
sc.tl.pca(adata, n_comps=n_components)
print("Total PCA time : %s" % (time.time() - t0))

t0 = time.time()
sc.settings.n_jobs = os.cpu_count()
sc.tl.tsne(adata, n_pcs=tsne_n_pcs, use_fast_tsne=True)

'''
from sklearn.manifold import TSNE
from scanpy.tools._utils import _choose_representation
X = _choose_representation(adata, n_pcs=tsne_n_pcs)
X_tsne = TSNE().fit_transform(X.astype(np.float32))
adata.obsm['X_tsne'] = X_tsne
'''
print("Tsne time:", time.time()-t0)


codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 66.66667%, with 1 line in your changes missing coverage. Please review.

Project coverage is 75.19%. Comparing base (3ba3f46) to head (d8d4763).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3061      +/-   ##
==========================================
- Coverage   75.87%   75.19%   -0.68%     
==========================================
  Files         110      110              
  Lines       12533    12536       +3     
==========================================
- Hits         9509     9427      -82     
- Misses       3024     3109      +85     
Files                   Coverage Δ
scanpy/tools/_tsne.py   68.08% <66.66%> (-25.10%) ⬇️

... and 5 files with indirect coverage changes

@ashish615 ashish615 closed this May 15, 2024
@ashish615 ashish615 reopened this May 15, 2024
@sanchit-misra

For the above code, the time spent in t-SNE went down from 2252 seconds to 210 seconds due to this PR.

@flying-sheep (Member) commented May 16, 2024

> For the above code, the time spent in t-SNE went down from 2252 seconds to 210 seconds due to this PR.

What’s the comparison to MulticoreTSNE?

Defaults

It would probably make sense to use a flavor: Literal['auto', 'sklearn', 'intelex', 'multicore'] = 'auto' parameter here, where 'auto' would try to import the speedup packages one by one and use the preferred one (a sketch follows below).
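
A sketch of how such an 'auto' flavor could resolve; the helper name is hypothetical and not part of this PR:

# Hypothetical resolver for flavor='auto': prefer a speedup package
# if it is importable, otherwise fall back to stock scikit-learn.
import importlib.util

def _resolve_tsne_flavor(flavor: str) -> str:
    if flavor != 'auto':
        return flavor
    for name, module in (('intelex', 'sklearnex'), ('multicore', 'MulticoreTSNE')):
        if importlib.util.find_spec(module) is not None:
            return name
    return 'sklearn'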

use_fast_tsne could be deprecated and made to default to None, with this logic (too bad we can’t use match yet)

if use_fast_tsne is not None:
    warnings.warn("...", FutureWarning)
match (use_fast_tsne, flavor):
    case (None, 'auto'): ...  # try importing 'intelex', fall back to 'sklearn'
    case (None, _): ...  # use specified flavor
    case (True, 'auto'): ...  # use 'multicore'
    case (True, 'sklearn'): ...  # throw error
    case (True, _): ...  # use specified flavor
    case (False, 'auto' | 'sklearn'): ...  # use 'sklearn'
    case (False, _): ...  # throw error
    case _: raise AssertionError()

In the future, we can change 'auto' to try both intelex and MulticoreTSNE.

@ilan-gold what do you think?

@ilan-gold (Contributor)

This seems reasonable to me, @flying-sheep. Does using the patched version change the results relative to the unpatched one, @ashish615, i.e., for a given random seed, are the unpatched and patched outputs the same? If they are identical for a given seed/state, then I think what @flying-sheep is proposing could be done separately (even if we make the dependency optional, IMO). However, if the new version does change the results, we will need the handling that @flying-sheep describes.
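
A minimal way to run that seed check, as a sketch (it assumes sklearnex's documented patch_sklearn toggle; run the script twice and compare the saved arrays):

# Sketch: run once with PATCH=1 and once with PATCH=0, then compare
# the two saved embeddings for the same random_state.
import os
import numpy as np

if os.environ.get("PATCH") == "1":
    from sklearnex import patch_sklearn
    patch_sklearn()  # must be applied before TSNE is imported

from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20)).astype(np.float32)

emb = TSNE(n_components=2, random_state=42).fit_transform(X)
tag = "patched" if os.environ.get("PATCH") == "1" else "stock"
np.save(f"emb_{tag}.npy", emb)
# afterwards: np.allclose(np.load("emb_patched.npy"), np.load("emb_stock.npy"))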

@ashish615 (Author) commented May 16, 2024

> What’s the comparison to MulticoreTSNE?

This comparison is against the use_fast_tsne flag, which uses MulticoreTSNE.

@ashish615 (Author) commented May 16, 2024

> This seems reasonable to me, @flying-sheep. Does using the patched version change the results relative to the unpatched one, @ashish615, i.e., for a given random seed, are the unpatched and patched outputs the same? If they are identical for a given seed/state, then I think what @flying-sheep is proposing could be done separately (even if we make the dependency optional, IMO). However, if the new version does change the results, we will need the handling that @flying-sheep describes.

@ilan-gold, I didn't check that. I will check and let you know.

@narendrachaudhary51

These t-SNE optimizations are described in the following paper, added here for reference:
N. Chaudhary, A. Pivovar, P. Yakovlev, A. Gorshkov, et al., "Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs," arXiv preprint arXiv:2212.11506, 2022. https://arxiv.org/abs/2212.11506

@ilan-gold (Contributor)

@narendrachaudhary51 You can add the reference here https://github.com/scverse/scanpy/blob/main/docs/references.bib (thank you @flying-sheep for improving this!)

@ashish615 (Author)

Hi @flying-sheep @ilan-gold,
Based on our previous discussion, we checked whether the patch changes the results for a fixed seed: applying the patch does change the t-SNE output relative to the unpatched version. In our experiment, we ran t-SNE on 1.3 million data points and compared the patched and unpatched versions by examining the KL divergence of both runs. The results are summarized in the table below.

In the code above, use USE_FIRST_N_CELLS to set the number of records, and call sc.tl.tsne(adata, n_pcs=tsne_n_pcs, use_fast_tsne=False) to run the optimized version with the latest commit. The KL divergence numbers can be obtained by logging the fitted estimator's kl_divergence_ attribute (see the sketch after the table).

[Image: table of KL divergence values for the patched and unpatched runs]
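
A sketch of how those numbers can be reproduced, mirroring the commented-out block in the benchmark above (it assumes the adata and tsne_n_pcs from that script; _choose_representation is a scanpy-internal helper):

# Sketch: fit sklearn's TSNE directly so kl_divergence_ is accessible;
# sc.tl.tsne itself does not return the fitted estimator.
import numpy as np
from sklearn.manifold import TSNE
from scanpy.tools._utils import _choose_representation

X = _choose_representation(adata, n_pcs=tsne_n_pcs)  # PCA representation
tsne = TSNE(random_state=42)
adata.obsm['X_tsne'] = tsne.fit_transform(X.astype(np.float32))
print("KL divergence:", tsne.kl_divergence_)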

@sanchit-misra

As the KL divergence values in the above table show, the output of the Intel-optimized t-SNE is different but equivalent in quality.
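
For reference, the quantity in the table is the objective that Barnes-Hut t-SNE minimizes, the Kullback-Leibler divergence between the high-dimensional affinities P and the low-dimensional affinities Q:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

so similar final values indicate embeddings of comparable quality, even if the coordinates differ.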

@ilan-gold (Contributor)

@sanchit-misra @ashish615 one thing we wondered: why does this need to be in scanpy? Can't users turn this on/off outside scanpy?

@ilan-gold (Contributor)

Notwithstanding the above question: since your method produces different results than the default, we will need to adopt the conventions outlined in #3061 (comment). This will ensure continued reproducibility with defaults.

@ashish615 (Author)

Hi @ilan-gold,

Regarding your thought in this comment: we can enable or disable the Intel optimization from outside scanpy (a sketch follows below). However, users might not be aware that this is possible. If we instead add it to scanpy directly, all scanpy users will know the option is available.
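
For completeness, the user-side opt-in would look roughly like this; a sketch that assumes the patch must be applied before scanpy imports sklearn's TSNE:

# Sketch: enabling the Intel optimization outside scanpy.
from sklearnex import patch_sklearn
patch_sklearn()  # patch before sklearn.manifold.TSNE is imported

import scanpy as sc

# ... load and preprocess adata as usual, then:
# sc.tl.tsne(adata, n_pcs=20)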

If we agree on the option discussed in this comment, I can proceed with updating the t-SNE file.
