Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Score genes - control genes index of wrong format #2153

Open
1 of 3 tasks
Hrovatin opened this issue Feb 23, 2022 · 4 comments · May be fixed by #2875
Open
1 of 3 tasks

Score genes - control genes index of wrong format #2153

Hrovatin opened this issue Feb 23, 2022 · 4 comments · May be fixed by #2875
Assignees

Comments

@Hrovatin
Copy link
Contributor

Hrovatin commented Feb 23, 2022

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the master branch of scanpy.

We ran across a case where score genes does not work for one specific gene, but works for others; the gene is in anndata. The same code works on other datasets preprocessed in the same way.

genes=['INS']
score_name='ins'
'INS' in adata_rawnorm.var_names
True
IndexError                                Traceback (most recent call last)
Input In [109], in <module>
----> 1 sc.tl.score_genes(adata_rawnorm, gene_list=genes, score_name=score_name+'_score',  use_raw=False)

File ~/miniconda3/envs/block/lib/python3.8/site-packages/scanpy/tools/_score_genes.py:167, in score_genes(adata, gene_list, ctrl_size, gene_pool, n_bins, score_name, random_state, copy, use_raw)
    164 else:
    165     X_list = np.nanmean(X_list, axis=1, dtype='float64')
--> 167 X_control = _adata[:, control_genes].X
    168 if issparse(X_control):
    169     X_control = np.array(_sparse_nanmean(X_control, axis=1)).flatten()

File ~/miniconda3/envs/block/lib/python3.8/site-packages/anndata/_core/anndata.py:625, in AnnData.X(self)
    622         X = _subset(X, (self._oidx, self._vidx))
    623 elif self.is_view:
    624     X = as_view(
--> 625         _subset(self._adata_ref.X, (self._oidx, self._vidx)),
    626         ElementRef(self, "X"),
    627     )
    628 else:
    629     X = self._X

File ~/miniconda3/envs/block/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
    871 if not args:
    872     raise TypeError(f'{funcname} requires at least '
    873                     '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)

File ~/miniconda3/envs/block/lib/python3.8/site-packages/anndata/_core/index.py:127, in _subset(a, subset_idx)
    125 if all(isinstance(x, cabc.Iterable) for x in subset_idx):
    126     subset_idx = np.ix_(*subset_idx)
--> 127 return a[subset_idx]

IndexError: arrays used as indices must be of integer (or boolean) type
@padix-key
Copy link
Contributor

padix-key commented Jan 25, 2024

I encountered the same problem, when I created a small artificial AnnData with a single gene in gene_list for some unit test. Here is my analysis of the problem:

In this line

X_control = _adata[:, control_genes].X

control_genes was actually empty, hence the index error.
The reason for the empty control_genes genes is
control_genes = list(control_genes - gene_list)

control_genes contained some genes before (in my artificial case only one), but they are removed here, since the genes in control_genes also appeared in gene_list.

I think this is where the bug resides:
I assume control_genes should not contain genes from gene_list in the first place. Hence, this line

r_genes = np.array(obs_cut[obs_cut == cut].index)

would need to be changed/complemented:
An additional filter for not being a gene in gene_list should fix this issue, if I understand this code correctly.

That being said, I suppose that this issue appears rather rarely in the realistically sized datasets. I assume, that the probability of accidentally picking genes from gene_list as control_genes decreases with increasing number of genes.
At least I have not encountered this exception in my experimental datasets.
Furthermore, this issue does not make the result wrong, as far as I understand the algorithm, because the control genes are selected randomly anyway.

@mumichae
Copy link
Contributor

This error happened to me too when working on a small dataset and scoring a single gene with ctrl_size=1. This happens at random in the following line:

np.random.shuffle(r_genes)

I was working on a small test dataset with limited features and calling the score_genes_cell_cycle function, where only 1 cell cycle gene is left and ctrl_size is set as follows:

ctrl_size = min(len(s_genes), len(g2m_genes))

@mumichae
Copy link
Contributor

I’d consider the issue to be a bug, can we label it as such?

@lazappi
Copy link
Contributor

lazappi commented Mar 14, 2024

I agree that this could be handled but does it actually make sense to score only 1 gene using this function? I feel like it assumes there is a group of genes.

@flying-sheep flying-sheep self-assigned this Jun 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants