extract_relevant_features hangs on specific data #1063

Open
bulldog5046 opened this issue Feb 18, 2024 · 1 comment
bulldog5046 commented Feb 18, 2024

session_3000.csv
The problem:

Using extract_relevant_features on a financial data set, I've hit a roadblock: the feature significance test (test_feature_significance in the warnings below) never seems to return. The hang is reproducible on multiple systems and always occurs on the same data column (Volume).

Anything else we need to know?:

EfficientFCParameters

/home/ryan/xgboost/venv/lib/python3.10/site-packages/tsfresh/utilities/dataframe_functions.py:198: RuntimeWarning: The columns ['Volume__ar_coefficient__coeff_0__k_10'
 'Volume__ar_coefficient__coeff_1__k_10'
 'Volume__ar_coefficient__coeff_2__k_10'
 'Volume__ar_coefficient__coeff_3__k_10'
 'Volume__ar_coefficient__coeff_4__k_10'
 'Volume__ar_coefficient__coeff_5__k_10'
 'Volume__ar_coefficient__coeff_6__k_10'
 'Volume__ar_coefficient__coeff_7__k_10'
 'Volume__ar_coefficient__coeff_8__k_10'
 'Volume__ar_coefficient__coeff_9__k_10'
 'Volume__query_similarity_count__query_None__threshold_0.0'] did not have any finite values. Filling with zeros.
  warnings.warn(
/home/ryan/xgboost/venv/lib/python3.10/site-packages/tsfresh/feature_selection/relevance.py:222: RuntimeWarning: [test_feature_significance] Constant features: Volume__symmetry_looking__r_0.0, Volume__large_standard_deviation__r_0.5, Volume__large_standard_deviation__r_0.55, Volume__large_standard_deviation__r_0.6000000000000001, Volume__large_standard_deviation__r_0.65, Volume__large_standard_deviation__r_0.7000000000000001, Volume__large_standard_deviation__r_0.75, Volume__large_standard_deviation__r_0.8, Volume__large_standard_deviation__r_0.8500000000000001, Volume__large_standard_deviation__r_0.9, Volume__large_standard_deviation__r_0.9500000000000001, Volume__partial_autocorrelation__lag_0, Volume__number_peaks__n_10, Volume__number_peaks__n_50, Volume__ar_coefficient__coeff_0__k_10, Volume__ar_coefficient__coeff_1__k_10, Volume__ar_coefficient__coeff_2__k_10, Volume__ar_coefficient__coeff_3__k_10, Volume__ar_coefficient__coeff_4__k_10, Volume__ar_coefficient__coeff_5__k_10, Volume__ar_coefficient__coeff_6__k_10, Volume__ar_coefficient__coeff_7__k_10, Volume__ar_coefficient__coeff_8__k_10, Volume__ar_coefficient__coeff_9__k_10, Volume__ar_coefficient__coeff_10__k_10, Volume__value_count__value_0, Volume__value_count__value_-1, Volume__range_count__max_1__min_-1, Volume__range_count__max_0__min_-1000000000000.0, Volume__number_crossing_m__m_0, Volume__number_crossing_m__m_-1, Volume__count_above__t_0, Volume__count_below__t_0, Volume__query_similarity_count__query_None__threshold_0.0
  warnings.warn(

The log seems incomplete, but this is all the output I have been able to capture before the hang.

I've tried removing some of the feature calculators that produced many repetitive errors, but the issue persists:

params = EfficientFCParameters()
del params['fft_coefficient']
del params['agg_linear_trend']
del params['ratio_beyond_r_sigma']

Minimal Example

from tsfresh.utilities.dataframe_functions import make_forecasting_frame
from tsfresh import extract_relevant_features, feature_extraction
from tsfresh.feature_extraction import EfficientFCParameters
import pandas as pd

data = pd.read_csv('session_3000.csv')

column = 'Volume'
x = data[['id', 'Datetime', column]].rename(columns={column: 'value'})
x['kind'] = column  # Add kind column to differentiate between series

df_shifted, y = make_forecasting_frame(x['value'], kind=column, max_timeshift=20, rolling_direction=1)
extracted_features = extract_relevant_features(
                df_shifted,
                y=y,
                default_fc_parameters=EfficientFCParameters(),
                column_id='id',
                column_sort='time',
                column_kind='kind',
                column_value='value',
                n_jobs=16,
            )
kind_to_fc_parameters = feature_extraction.settings.from_columns(extracted_features)

Environment:

  • Python version: 3.10.12
  • Operating System: Ubuntu 22.04.3 LTS & WSL2
  • tsfresh version: 0.20.2
  • Install method (conda, pip, source): pip

nils-braun (Collaborator) commented

Hi @bulldog5046 - sorry for the late response.
The problem in your case is that your target is integer-valued but has many distinct values. Our internal automatic ML task detection therefore assumes you want a classification task with a multiclass target, which requires many 1-vs-rest comparisons (and probably hundreds of feature selection runs). By setting ml_task="regression", you can tell tsfresh to treat your problem as a regression problem (which it is), and feature selection will finish much faster :)
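
To illustrate the point above, here is a minimal sketch. The helpers guess_ml_task and one_vs_rest_targets are hypothetical simplifications written for this example; they are not tsfresh's internal code and only mimic the idea that an integer target is read as class labels, with one binary problem per distinct value. The function run_as_regression shows how the call from the minimal example might look with ml_task="regression" added; it assumes df_shifted and y come from make_forecasting_frame as in the original post.

```python
def guess_ml_task(y):
    """Hypothetical, simplified task guess: all-integer targets are
    treated as class labels, anything else as a regression target."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in y):
        return "classification"
    return "regression"


def one_vs_rest_targets(y):
    """Build one binary target per distinct label (1 = label, 0 = rest)."""
    return {label: [int(v == label) for v in y] for label in sorted(set(y))}


def run_as_regression(df_shifted, y, n_jobs=1):
    """Sketch of the suggested fix: same call as in the minimal example,
    but with the task stated explicitly instead of inferred."""
    # Imported lazily so the illustrative helpers above run without tsfresh.
    from tsfresh import extract_relevant_features
    from tsfresh.feature_extraction import EfficientFCParameters

    return extract_relevant_features(
        df_shifted,
        y=y,
        default_fc_parameters=EfficientFCParameters(),
        column_id="id",
        column_sort="time",
        column_kind="kind",
        column_value="value",
        n_jobs=n_jobs,
        ml_task="regression",  # skip the automatic task guess
    )


# Made-up integer volumes, only to show the effect of the guess.
volume = [1200, 3400, 560, 7800, 1200, 910]
task_for_ints = guess_ml_task(volume)
binary_problems = one_vs_rest_targets(volume)
task_for_floats = guess_ml_task([float(v) for v in volume])
```

With a real Volume column, the number of distinct values (and therefore binary problems) can run into the hundreds or thousands, each of which would need its own pass over every extracted feature. Stating the task explicitly avoids that fan-out.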
