Score genes - control genes index of wrong format #2153

Hrovatin · 2022-02-23T20:48:29Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of scanpy.
(optional) I have confirmed this bug exists on the master branch of scanpy.

We ran across a case where score genes does not work for one specific gene, but works for others; the gene is in anndata. The same code works on other datasets preprocessed in the same way.

genes=['INS']
score_name='ins'
'INS' in adata_rawnorm.var_names
True
IndexError                                Traceback (most recent call last)
Input In [109], in <module>
----> 1 sc.tl.score_genes(adata_rawnorm, gene_list=genes, score_name=score_name+'_score',  use_raw=False)

File ~/miniconda3/envs/block/lib/python3.8/site-packages/scanpy/tools/_score_genes.py:167, in score_genes(adata, gene_list, ctrl_size, gene_pool, n_bins, score_name, random_state, copy, use_raw)
    164 else:
    165     X_list = np.nanmean(X_list, axis=1, dtype='float64')
--> 167 X_control = _adata[:, control_genes].X
    168 if issparse(X_control):
    169     X_control = np.array(_sparse_nanmean(X_control, axis=1)).flatten()

File ~/miniconda3/envs/block/lib/python3.8/site-packages/anndata/_core/anndata.py:625, in AnnData.X(self)
    622         X = _subset(X, (self._oidx, self._vidx))
    623 elif self.is_view:
    624     X = as_view(
--> 625         _subset(self._adata_ref.X, (self._oidx, self._vidx)),
    626         ElementRef(self, "X"),
    627     )
    628 else:
    629     X = self._X

File ~/miniconda3/envs/block/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
    871 if not args:
    872     raise TypeError(f'{funcname} requires at least '
    873                     '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)

File ~/miniconda3/envs/block/lib/python3.8/site-packages/anndata/_core/index.py:127, in _subset(a, subset_idx)
    125 if all(isinstance(x, cabc.Iterable) for x in subset_idx):
    126     subset_idx = np.ix_(*subset_idx)
--> 127 return a[subset_idx]

IndexError: arrays used as indices must be of integer (or boolean) type

padix-key · 2024-01-25T15:14:46Z

I encountered the same problem, when I created a small artificial AnnData with a single gene in gene_list for some unit test. Here is my analysis of the problem:

In this line

scanpy/scanpy/tools/_score_genes.py

Line 182 in 6314190

X_control = _adata[:, control_genes].X

control_genes was actually empty, hence the index error.
The reason for the empty control_genes genes is

scanpy/scanpy/tools/_score_genes.py

Line 173 in 6314190

control_genes = list(control_genes - gene_list)

control_genes contained some genes before (in my artificial case only one), but they are removed here, since the genes in control_genes also appeared in gene_list.

I think this is where the bug resides:
I assume control_genes should not contain genes from gene_list in the first place. Hence, this line

scanpy/scanpy/tools/_score_genes.py

Line 167 in 6314190

r_genes = np.array(obs_cut[obs_cut == cut].index)

would need to be changed/complemented:
An additional filter for not being a gene in gene_list should fix this issue, if I understand this code correctly.

That being said, I suppose that this issue appears rather rarely in the realistically sized datasets. I assume, that the probability of accidentally picking genes from gene_list as control_genes decreases with increasing number of genes.
At least I have not encountered this exception in my experimental datasets.
Furthermore, this issue does not make the result wrong, as far as I understand the algorithm, because the control genes are selected randomly anyway.

mumichae · 2024-02-21T14:11:05Z

This error happened to me too when working on a small dataset and scoring a single gene with ctrl_size=1. This happens at random in the following line:

scanpy/scanpy/tools/_score_genes.py

Line 168 in 383a61b

np.random.shuffle(r_genes)

I was working on a small test dataset with limited features and calling the score_genes_cell_cycle function, where only 1 cell cycle gene is left and ctrl_size is set as follows:

scanpy/scanpy/tools/_score_genes.py

Line 258 in 383a61b

ctrl_size = min(len(s_genes), len(g2m_genes))

mumichae · 2024-02-21T14:40:55Z

I’d consider the issue to be a bug, can we label it as such?

lazappi · 2024-03-14T08:05:00Z

I agree that this could be handled but does it actually make sense to score only 1 gene using this function? I feel like it assumes there is a group of genes.

jfoster17 mentioned this issue Oct 12, 2022

Module score does not work with backed data gluesolutions/glue-single-cell#37

Closed

mumichae linked a pull request Feb 21, 2024 that will close this issue

Bugfix: Gene score edge case where gene_list gene is chosen as control gene #2875

Open

3 tasks

flying-sheep self-assigned this Jun 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Score genes - control genes index of wrong format #2153

Score genes - control genes index of wrong format #2153

Hrovatin commented Feb 23, 2022 •

edited by flying-sheep

padix-key commented Jan 25, 2024 •

edited

mumichae commented Feb 21, 2024

mumichae commented Feb 21, 2024

lazappi commented Mar 14, 2024

Score genes - control genes index of wrong format #2153

Score genes - control genes index of wrong format #2153

Comments

Hrovatin commented Feb 23, 2022 • edited by flying-sheep

padix-key commented Jan 25, 2024 • edited

mumichae commented Feb 21, 2024

mumichae commented Feb 21, 2024

lazappi commented Mar 14, 2024

Hrovatin commented Feb 23, 2022 •

edited by flying-sheep

padix-key commented Jan 25, 2024 •

edited