Feature selection taking magnitudes longer than it should #1073

Open

Sarius2009 opened this issue May 19, 2024 · 7 comments
Sarius2009 commented May 19, 2024

When extracting data from the same dataset and selecting from the extracted features, one set of extraction parameters, which results in 154 classes, 10,400 time series (longest one 10,000 data points) and 1.2 GB of JSON data, works fine with EfficientFCParameters. But another set, which results in <90 classes, 550 sessions and 8 MB of data, takes 40 seconds to extract the features, as expected, but 3 hours to select them. This is the smallest sample I could create that exhibited this behaviour; larger samples (~200 MB) take >60 GB of RAM and crash the program. Using n_jobs=0 did not help.

Profiling my example below, almost all of the time seems to be spent in _recv_bytes and _get_more_data.

Running Windows 11 on a system with 16 GB of RAM, tsfresh 0.20.2 installed via pip; I tested the code both in a Jupyter notebook and in plain Python 3.12.

Minimal example with provided data:
problem_features_eff.csv
problem_id_to_userid_eff.csv

from tsfresh import select_features
import pandas as pd

if __name__ == '__main__':
    X = pd.read_csv('problem_features_eff.csv')
    # Transpose and take the second row to get one user ID per time series id
    id_to_userID = pd.read_csv('problem_id_to_userID_eff.csv').T.iloc[1]
    print('Starting')
    select_features(X, id_to_userID, ml_task='classification')

Edit: Changed Parameters to efficient, attached problematic data
Edit: Found out it wasn't inherently a memory issue, adjusted title and text accordingly

@Sarius2009 Sarius2009 added the bug label May 19, 2024
@Sarius2009 Sarius2009 changed the title Feature selection works fine on bigger data, but causes MemoryError on smaller Feature selection works fine on bigger data, but causes MemoryError on smaller (Infinite loop/Memory leak?) May 21, 2024
@Sarius2009 Sarius2009 changed the title Feature selection works fine on bigger data, but causes MemoryError on smaller (Infinite loop/Memory leak?) Feature selection taking magnitudes longer than it should May 21, 2024
@nils-braun
Collaborator

Hi @Sarius2009!
Your feature selection is taking so long because your id_to_userID (the series you use as y in the select_features method) contains more than two distinct values and you selected "classification" as your ml_task. This means tsfresh performs a 1-vs-all feature selection for each of your distinct values, and as you have 47 distinct values, this takes quite some time...

Or maybe I misunderstood: Are you referring to the number of distinct values in id_to_userID as your classes?
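The 1-vs-all expansion described above can be sketched as follows (a minimal illustration of the idea, not tsfresh's actual code):

```python
import pandas as pd

def one_vs_all_targets(y: pd.Series) -> dict:
    # One binary target per distinct class: class c vs. everyone else.
    # A full binary relevance test then runs for each of these targets,
    # so runtime grows roughly linearly with the number of classes.
    return {c: (y == c) for c in y.unique()}

# Made-up user IDs standing in for the real id_to_userID series
y = pd.Series([324, 324, 558, 559, 1082])
targets = one_vs_all_targets(y)
print(len(targets))  # 4 distinct classes -> 4 binary selections
```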

@Sarius2009
Author

Sarius2009 commented May 26, 2024

Hi @nils-braun,
Yes, the values in id_to_userID are what I referred to as classes. I seem to have attached the wrong file; it is updated now and should hopefully make sense. I noticed it might look confusing (it confused me as well): the comma does not mean the values are floating-point numbers, the user IDs just start with leading zeros, e.g. the features for id 0 belong to user 000000324, as do the ones for id 2, etc.
As for your description of the problem, it sounds like using multiclass=True should solve it, and even though I think I have already tried this, I am open to trying again. I didn't until now, because later on I select the 200 most relevant features, and having p-values split by class makes this more annoying.
Also, that wouldn't really explain why it finishes much faster with more classes and more time series.
Update: multiclass=True did not fix the issue.

@nils-braun
Collaborator

I am indeed a bit confused (not because of your question or data, but because of the issue you see). I played around with the data a bit.
This function

%%time
select_features(X, (id_to_userID <= 324), ml_task="classification");

finishes in about 3s on my laptop (note: I purposely reduced the problem to a binary classification), but this

%%time
select_features(X, (id_to_userID < 324), ml_task="classification");

(the only change is the <) takes forever!

@Sarius2009
Author

Good to know I am not the only one confused. I can confirm your observations, and 3 s is right around what I would expect. The reverse also happens for 12899, and in general the problem often occurs when only 1-3 users form one side of the binary classification; the fewer users, the more likely it is to occur, and it is basically guaranteed with just one user.

new = []
for x in id_to_userID:
    new.append(x in [66, 17953, 668])  # alternatives: [558, 559], [1082]
id_to_userID = pd.Series(new)
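The loop above can also be written with pandas' isin, which builds the same boolean mask (shown here with made-up IDs in place of the real id_to_userID series):

```python
import pandas as pd

# Hypothetical user IDs standing in for the attached data
id_to_userID = pd.Series([66, 324, 17953, 668, 558])

# True for the selected users, False for everyone else -- the same
# binary split the loop builds element by element.
mask = id_to_userID.isin([66, 17953, 668])
print(mask.tolist())  # [True, False, True, True, False]
```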

@Sarius2009
Author

@nils-braun
I looked further into this, and it seems to be related to this issue in scipy: scipy/scipy#19692
When I posted this issue, I was running scipy 1.12.0, the version that supposedly fixed it, but both 1.13.0 and 1.14.0rc1 mention further improvements to the mannwhitneyu test. After updating to 1.14.0rc1, the issue is fixed, and even the original data, which was too big to attach here, works as expected. So maybe the config should be updated to require a newer scipy version?
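For context, the hot path boils down to scipy.stats.mannwhitneyu, which tsfresh's relevance test relies on for real-valued features against a binary target. A minimal sketch of such a call on a heavily imbalanced split (synthetic data, not the attached CSVs):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
feature = rng.normal(size=550)   # one extracted feature for 550 sessions
y = np.zeros(550, dtype=bool)
y[:5] = True                     # extreme imbalance: one tiny group vs. the rest

# Compare the feature values of the small group against everyone else;
# the resulting p-value feeds the feature selection.
stat, p = mannwhitneyu(feature[y], feature[~y])
print(0.0 <= p <= 1.0)  # True
```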

@nils-braun
Collaborator

Oh this is really great to hear. Thanks for looking further into this! Yes, let's update the requirements. Would you like to do the PR (because you found it)?

@Sarius2009
Author

As I only tested with 1.14.0rc1, I will wait two weeks for the full release and then open the PR.
