better expose anndata dataframe in the single-cell dataloaders #267

Open
amva13 opened this issue May 19, 2024 · 0 comments
Labels
good first issue (good for newcomers)
high-priority-post-neurips (high priority, to be completed after NeurIPS)
new-feature
new-function (request new data function)

amva13 commented May 19, 2024

Describe the problem
Though self.adata exists, there is no obvious getter method. Also, the splits don't provide an AnnData option.

Describe the solution you'd like
Add getter method(s) for the AnnData object, and implement the splits for AnnData as well.
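A minimal sketch of what such a getter could look like. The class name `AnnDataLoaderSketch` and the method name `get_adata` are hypothetical and are not part of TDC; the only detail taken from this issue is that the loader stores the data on `self.adata`.

```python
class AnnDataLoaderSketch:
    """Hypothetical stand-in for a loader that keeps its data on self.adata."""

    def __init__(self, adata=None):
        # In the real loader this attribute is filled when the dataset is read.
        self.adata = adata

    def get_adata(self):
        """Return the stored AnnData object (hypothetical getter name)."""
        if self.adata is None:
            raise ValueError("no AnnData object has been loaded")
        return self.adata
```

A caller would then write `loader.get_adata()` instead of reaching into the `adata` attribute directly.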

Additional context
From Slack:

Oh, is there a function to load that already? I checked, and when we download the raw file it is in the adata format.

11:12 AM
yes
11:12
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/anndata_dataset.py#L10

anndata_dataset.py
self.adata = self.df # this is in AnnData format
11:12
self.adata will contain the anndata dataframe
11:12
apologies, i should expose that better via a getter function or something
11:14
The existing loader for perturboutcome inherits from the anndata loader
11:14
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/single_cell.py#L11

single_cell.py
class CellXGeneTemplate(DataLoader):
11:14
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/perturboutcome.py#L16

perturboutcome.py
class PerturbOutcome(CellXGeneTemplate):
11:15
so self.adata will be anndata 🙂
11:17
though i suppose for the benchmark, the splits are not implemented for anndata

11:18 AM
Ah I see, I can take the data split from the split function and then return a dictionary of train, val, and test adata
11:18
Do you think it makes sense to set this as the default for the benchmark? I believe most method developers are using adata for model training
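The idea above (take row assignments from a split and return a dictionary of train, validation, and test subsets) can be sketched roughly as follows. The helper names `split_rows` and `subset_by_split` are hypothetical, and the random split is only a placeholder for whatever the existing split function produces. The sketch assumes the object supports AnnData-style positional row indexing, `adata[rows, :]`.

```python
import random


def split_rows(n_rows, frac=(0.8, 0.1, 0.1), seed=42):
    """Shuffle row positions and cut them into train/valid/test lists.

    Placeholder for an existing split function; only frac[0] and frac[1]
    are used, and the test split receives the remaining rows.
    """
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_train = int(frac[0] * n_rows)
    n_valid = int(frac[1] * n_rows)
    return {
        "train": positions[:n_train],
        "valid": positions[n_train:n_train + n_valid],
        "test": positions[n_train + n_valid:],
    }


def subset_by_split(adata, split_positions):
    """Return {split_name: subset}, selecting rows with adata[rows, :]."""
    return {name: adata[rows, :] for name, rows in split_positions.items()}
```

With a real AnnData object, row indexing returns a view, so calling `.copy()` on each subset may be preferable if the subsets will be modified independently.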

11:18 AM
I see
11:19
Ok. Well, let’s make a flag for use_anndata and set it to True by default?

11:19 AM
Sounds good
11:19
I will do that

11:19 AM
I’d rather not get rid of the pandas code
11:19
cool!

11:20 AM
Sounds good
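A rough sketch of the flag agreed on above: `use_anndata` defaults to True, and the pandas path is kept as the alternative. Everything except the flag name is an assumption: the class `PerturbLoaderSketch`, its constructor, and the way the pandas frame is obtained (here via AnnData's `to_df()`) are illustrative and do not reflect the actual TDC code.

```python
class PerturbLoaderSketch:
    """Hypothetical loader; shows only how the flag selects a code path."""

    def __init__(self, adata, split_positions):
        self.adata = adata
        # Assumed shape: {"train": [row positions], "valid": [...], "test": [...]}
        self.split_positions = split_positions

    def _split_anndata(self):
        # AnnData-style positional row indexing.
        return {k: self.adata[rows, :] for k, rows in self.split_positions.items()}

    def _split_pandas(self):
        # Assumption: the pandas path works on the expression matrix as a DataFrame.
        df = self.adata.to_df()
        return {k: df.iloc[rows] for k, rows in self.split_positions.items()}

    def get_split(self, use_anndata=True):
        """Return AnnData subsets by default, pandas subsets when the flag is off."""
        if use_anndata:
            return self._split_anndata()
        return self._split_pandas()
```

Keeping both private methods behind one flag means existing pandas users only need to pass `use_anndata=False`.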

11:20 AM
Sorry for these discrepancies, the lab has been moving away from anndata, so I forget we still currently have some dependencies on it

11:24 AM
I see, no worries, but I do want to point out that for most single-cell analysis/ML models people still use adata, because there is a lot of cell observation metadata (e.g. perturbation) and gene metadata that needs to be stored. For ease of use, I feel like we can still provide an adata flag if people need it!

11:25 AM
absolutely. i’ll add an action item to better expose the getters for anndata

amva13 added the good first issue, new-function, new-feature, and high-priority-post-neurips labels on May 19, 2024