Feature selection taking magnitudes longer than it should #1073

Open

Sarius2009 opened this issue May 19, 2024 · 7 comments
Sarius2009 commented May 19, 2024

When extracting data from the same dataset and selecting from the extracted features, one set of extraction parameters, which results in 154 classes, 10,400 time series (longest one 10,000 data points) and 1.2 GB of JSON data, works fine with EfficientFCParameters. But another set, which results in <90 classes, 550 sessions and 8 MB of data, takes 40 seconds to extract the features, as expected, but 3 hours to select them. This is the smallest sample I could create that exhibited this behaviour; larger samples (~200 MB) take >60 GB of RAM and crash the program. Using n_jobs=0 did not help.

Profiling my example below, almost all of the time seems to be spent in _recv_bytes and _get_more_data.

Running Windows 11 on a system with 16 GB of RAM, tsfresh 0.20.2 installed via pip; I tested the code both in a Jupyter notebook and in plain Python 3.12.

Minimal example with provided data:
problem_features_eff.csv
problem_id_to_userid_eff.csv

from tsfresh import select_features
import pandas as pd

if __name__ == '__main__':
    X = pd.read_csv('problem_features_eff.csv')
    # Transpose and take the second row to get one user ID per time series id
    id_to_userID = pd.read_csv('problem_id_to_userID_eff.csv').T.iloc[1]
    print('Starting')
    select_features(X, id_to_userID, ml_task='classification')

Edit: Changed Parameters to efficient, attached problematic data
Edit: Found out it wasn't inherently a memory issue, adjusted title and text accordingly

@Sarius2009 Sarius2009 added the bug label May 19, 2024
@Sarius2009 Sarius2009 changed the title Feature selection works fine on bigger data, but causes MemoryError on smaller Feature selection works fine on bigger data, but causes MemoryError on smaller (Infinite loop/Memory leak?) May 21, 2024
@Sarius2009 Sarius2009 changed the title Feature selection works fine on bigger data, but causes MemoryError on smaller (Infinite loop/Memory leak?) Feature selection taking magnitudes longer than it should May 21, 2024
@nils-braun
Collaborator

Hi @Sarius2009!
Your feature selection is taking so long because your id_to_userID (the series you use as y in the select_features method) contains more than two distinct values and you selected "classification" as your ml_task. This means tsfresh performs a 1-vs-all feature selection for each of your distinct values, and as you have 47 distinct values, this takes quite some time...

Or maybe I misunderstood: Are you referring to the number of distinct values in id_to_userID as your classes?
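The 1-vs-all expansion described above can be sketched as follows (a minimal illustration of the idea, not tsfresh's actual code):

```python
import pandas as pd

def one_vs_all_targets(y: pd.Series) -> dict:
    # One binary target per distinct class: class c vs. everyone else.
    # A full binary relevance test then runs for each of these targets,
    # so runtime grows roughly linearly with the number of classes.
    return {c: (y == c) for c in y.unique()}

# Made-up user IDs standing in for the real id_to_userID series
y = pd.Series([324, 324, 558, 559, 1082])
targets = one_vs_all_targets(y)
print(len(targets))  # 4 distinct classes -> 4 binary selections
```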

@Sarius2009
Author

Sarius2009 commented May 26, 2024

Hi @nils-braun,
Yes, the values in id_to_userID are what I referred to as classes. I seem to have attached the wrong file; it is updated now and should hopefully make sense. I noticed it might look confusing (it confused me as well): the comma does not mean the values are floating-point numbers, the user IDs just start with leading zeros, e.g. the features for id 0 belong to user 000000324, as do the ones for id 2, etc.
As for your description of the problem, it sounds like using multiclass=True should solve it, and even though I think I have already tried this, I am open to trying again. I didn't until now, because later on I select the 200 most relevant features, and having p-values split by class makes this more annoying.
Also, that wouldn't really explain why it finishes much faster with more classes and more time series.
Update: multiclass=True did not fix the issue.

@nils-braun
Collaborator

I am indeed a bit confused (not because of your question or data, but because of the issue you see). I played around with the data a bit.
This function

%%time
select_features(X, (id_to_userID <= 324), ml_task="classification");

finishes in about 3s on my laptop (note: I purposely reduced the problem to a binary classification), but this

%%time
select_features(X, (id_to_userID < 324), ml_task="classification");

(the only change is the <) takes forever!

@Sarius2009
Author

Good to know I am not the only one confused. I can confirm your observations, and 3 s is right around what I would expect. The reverse also happens for 12899, and in general the problem often occurs when only 1-3 users form one side of the binary classification; the fewer users, the more likely it is to occur, and it is basically guaranteed with just one user.

new = []
for x in id_to_userID:
    new.append(x in [66, 17953, 668])  # alternatives: [558, 559], [1082]
id_to_userID = pd.Series(new)
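The loop above can also be written with pandas' isin, which builds the same boolean mask (shown here with made-up IDs in place of the real id_to_userID series):

```python
import pandas as pd

# Hypothetical user IDs standing in for the attached data
id_to_userID = pd.Series([66, 324, 17953, 668, 558])

# True for the selected users, False for everyone else -- the same
# binary split the loop builds element by element.
mask = id_to_userID.isin([66, 17953, 668])
print(mask.tolist())  # [True, False, True, True, False]
```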

@Sarius2009
Author

@nils-braun
I looked further into this, and it seems to be related to this issue in scipy: scipy/scipy#19692
When I posted this issue, I was running scipy 1.12.0, the version that supposedly fixed it, but both 1.13.0 and 1.14.0rc1 mention further improvements to the mannwhitneyu test. After updating to 1.14.0rc1, the issue is fixed, and even the original data, which was too big to attach here, works as expected. So maybe the config should be updated to require a newer scipy version?
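For context, the hot path boils down to scipy.stats.mannwhitneyu, which tsfresh's relevance test relies on for real-valued features against a binary target. A minimal sketch of such a call on a heavily imbalanced split (synthetic data, not the attached CSVs):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
feature = rng.normal(size=550)   # one extracted feature for 550 sessions
y = np.zeros(550, dtype=bool)
y[:5] = True                     # extreme imbalance: one tiny group vs. the rest

# Compare the feature values of the small group against everyone else;
# the resulting p-value feeds the feature selection.
stat, p = mannwhitneyu(feature[y], feature[~y])
print(0.0 <= p <= 1.0)  # True
```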

@nils-braun
Collaborator

Oh this is really great to hear. Thanks for looking further into this! Yes, let's update the requirements. Would you like to do the PR (because you found it)?

@Sarius2009
Author

As I only tested with 1.14.0rc1, I will wait two weeks for the full release and then open the PR.
