
Even 512 GiB of memory not enough for extract_features on 7895800 rows × 28 columns? #947

Open
dsstex opened this issue Jun 2, 2022 · 5 comments


dsstex commented Jun 2, 2022

The problem:

This is my code.

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import (
    ComprehensiveFCParameters
)
import multiprocessing


rolled = pd.read_csv('rolled.csv', index_col=0, parse_dates=['time'])
rolled.set_index("time", inplace=True)

X = extract_features(
    rolled,
    column_id="id",
    default_fc_parameters=ComprehensiveFCParameters(),
    n_jobs=multiprocessing.cpu_count()
)

X.to_csv("extracted.csv", index=True, header=True)

rolled.csv contains data that has been rolled with max_timeshift=96, min_timeshift=96. It contains 7895800 rows × 28 columns.

rolled = roll_time_series(df, 
                          column_id='id', 
                          rolling_direction=1,
                          column_kind=None,
                          column_sort='time',
                          max_timeshift=96,
                          min_timeshift=96).reset_index(drop=True)
rolled.to_csv("rolled.csv", index=True, header=True)

The input df before rolling had around 82500 rows, which resulted in 7895800 rows × 28 columns after rolling.

After feature extraction, I'm expecting ~82500 rows × 21438 columns.

I have tested with 10 extracted rows. The size for 10 rows × 21438 columns is 3.5 MB.

So for 82500 rows, I presume I need ~30 GB of disk space.
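That estimate can be sanity-checked with a quick extrapolation (the numbers below are the ones from this issue; actual sizes will vary with the CSV float formatting):

```python
# Extrapolate the output CSV size from a small extracted sample.
sample_rows = 10
sample_mb = 3.5          # measured size of 10 rows x 21438 columns
total_rows = 82_500      # expected rows after feature extraction

est_gb = sample_mb / sample_rows * total_rows / 1024
print(f"estimated extracted.csv size: ~{est_gb:.0f} GB")  # ~28 GB
```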

I tried to extract the features on an AWS EC2 r6i.16xlarge instance, which comes with 64 vCPUs and 512 GiB of memory. I also added a 100 GB EBS gp3 volume. I thought that would be enough.

The problems:

(1) Only 12.5% of the CPU got used; 87.5% was idle. Is this because of n_jobs=multiprocessing.cpu_count()? Should I use n_jobs=multiprocessing.cpu_count() - 1 instead?

(2) Feature extraction progressed to 75%, after which the script was terminated due to lack of memory. Is 512 GiB of memory not enough?

Anything else we need to know?:

Yes. It took my script 4 hours to reach 75%, and since I'm using r6i.16xlarge, that's expensive in my case.

Since I'm using max_timeshift=96 and min_timeshift=96, at prediction/inference time I'll only have 96 rows to extract features from for a single prediction. So I'm wondering why 512 GiB and 4 hours are not enough for the full extraction, when it takes only 1 second to extract features for a single inference (96 rows).

If it takes 1 second per id, then for 82500 ids on 64 vCPUs: 82500 / 64 ≈ 1290 seconds (~21.5 minutes). So I would expect anything under 30 minutes to be normal in my case.

I could use LocalDaskDistributor. However, according to this comment, it's not for production use.

Is there any way we can estimate the system requirements (e.g. memory) and runtime (e.g. based on vCPU count) from the input dataframe?
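As a rough lower bound (my own back-of-envelope, not an official tsfresh estimator): the final pivoted frame alone, stored as float64, needs rows × columns × 8 bytes, and the pivot step plus per-worker partial results typically multiply that figure several times over:

```python
# Rough lower bound on memory for the pivoted output (float64 cells).
rows, cols = 82_500, 21_438     # expected output shape from this issue
bytes_per_cell = 8              # float64

final_gib = rows * cols * bytes_per_cell / 1024**3
print(f"final wide frame alone: ~{final_gib:.1f} GiB")  # ~13.2 GiB
# The pivot typically needs transient copies on top of this, and each
# worker process holds its own intermediate results as well.
```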

Environment:

  • Python version: 3.7
  • Operating System: Amazon Linux 2
  • tsfresh version: 0.19.0
  • Install method (conda, pip, source): pip
  • Cloud: AWS EC2
  • Instance: r6i.16xlarge
  • Memory: 512 GiB
  • vCPU: 64
  • Storage: 100 GB gp3 Volume
  • n_jobs: multiprocessing.cpu_count()
@dsstex dsstex added the bug label Jun 2, 2022

dsstex commented Jun 2, 2022

Just tried without the n_jobs parameter, which seems to utilise 50% of the available CPU by default. I'm using r6i.24xlarge at the moment; it comes with 96 vCPUs and 768 GiB of memory.

I can confirm tsfresh is not utilising the CPU well.

Most of the time, CPU utilisation stays below 12.5%; more than 87.5% of the CPU is always idle. Also, as you can see below, I have sufficient memory.

top - 10:13:46 up 31 min,  2 users,  load average: 11.62, 13.63, 16.19
Tasks: 813 total,  11 running, 381 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.4 us,  0.0 sy,  0.0 ni, 88.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 78017574+total, 52574860+free, 25019536+used,  4231816 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 52583718+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                 
18604 root      20   0 8117352   5.1g  21312 R 100.0  0.7  14:01.11 python3                                                                 
18606 root      20   0 7997288   4.9g  21312 R 100.0  0.7  13:18.86 python3                                                                 
18489 root      20   0   94.6g  91.1g  99204 S 100.0 12.2  24:27.62 python3                                                                 
18562 root      20   0 7333480   4.3g  21312 R 100.0  0.6  17:16.38 python3                                                                 
18563 root      20   0 7694440   4.7g  21312 R 100.0  0.6  18:44.78 python3                                                                 
18565 root      20   0 7058792   4.1g  21312 R 100.0  0.5  15:54.73 python3                                                                 
18567 root      20   0 7213672   4.2g  21312 R 100.0  0.6  16:42.11 python3                                                                 
18568 root      20   0 7526248   4.5g  21312 R 100.0  0.6  17:57.02 python3                                                                 
18569 root      20   0 6890088   3.9g  21312 R 100.0  0.5  15:17.46 python3                                                                 
18573 root      20   0 6727272   3.7g  21312 R 100.0  0.5  14:29.92 python3                                                                 
18608 root      20   0 7791208   4.7g  21312 R 100.0  0.6  12:28.75 python3                                                                 
   14 root      20   0       0      0      0 I   0.4  0.0   0:00.32 rcu_sched                                                               
18837 ec2-user  20   0  171848   5064   3704 R   0.4  0.0   0:00.81 top                                                                     
    1 root      20   0  191096   5472   3900 S   0.0  0.0   0:01.72 systemd                                                                 
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd                                                                
    3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp                                                                  
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp                                                              
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-kb                                                         
    7 root      20   0       0      0      0 I   0.0  0.0   0:00.00 kworker/0:1-rcu                                                         
    8 root      20   0       0      0      0 I   0.0  0.0   0:00.00 kworker/u192:0-                                                         
   10 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq                                                            
   11 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_tasks_rude_                                                         
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_tasks_trace                                                         
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/0                                                             
   15 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 migration/0                                                             
   16 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0                                                                 
   17 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1                                                                 
   18 root      rt   0       0      0      0 S   0.0  0.0   0:00.24 migration/1   


dsstex commented Jun 2, 2022

This line seems like the issue.

return_df = data.pivot(result)

https://github.com/blue-yonder/tsfresh/blob/main/tsfresh/feature_extraction/extraction.py#L304
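A minimal pandas-only illustration of why that line is expensive (this is not tsfresh's actual code path, just the same pivot pattern at toy scale): the long result frame holds one row per (id, feature) pair, and the pivot must materialise the full wide matrix in a single allocation.

```python
import numpy as np
import pandas as pd

# Long-format results, one row per (id, feature) pair -- the shape
# held internally before the final pivot (toy sizes here).
n_ids, n_features = 1_000, 50
long_df = pd.DataFrame({
    "id": np.repeat(np.arange(n_ids), n_features),
    "variable": np.tile([f"f_{i}" for i in range(n_features)], n_ids),
    "value": np.random.rand(n_ids * n_features),
})

# The pivot allocates the entire wide matrix at once; at
# 82500 ids x 21438 features this is the peak-memory step.
wide = long_df.pivot(index="id", columns="variable", values="value")
print(wide.shape)  # (1000, 50)
```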


dsstex commented Jun 2, 2022

It took 7 hours on r6i.24xlarge [96 vCPUs and 768 GiB of memory].

Output: the extracted.csv file is 20 GB for the 7895800 rows × 28 columns input.

Hope that info helps someone.

Thanks.


b-y-f commented Feb 27, 2023

How many features were extracted? I'm facing the same problem: with long time-series data (only 3 ids), memory overflows on a 16 GB laptop.
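One workaround for both reports (my own sketch, not a documented tsfresh recipe) is to extract features for a batch of ids at a time and append each batch to disk, so the full wide frame never has to fit in memory at once. The tsfresh call is stubbed out below so the batching logic itself is runnable; in practice `extract_fn` would wrap `tsfresh.extract_features`.

```python
import numpy as np
import pandas as pd

def extract_in_batches(rolled, extract_fn, batch_size, out_csv):
    """Run extract_fn on batch_size ids at a time, appending to out_csv."""
    ids = rolled["id"].unique()
    for start in range(0, len(ids), batch_size):
        batch = rolled[rolled["id"].isin(ids[start:start + batch_size])]
        features = extract_fn(batch)        # e.g. tsfresh.extract_features
        features.to_csv(out_csv, mode="a" if start else "w",
                        header=(start == 0))

# Toy demonstration with a stand-in for extract_features:
rolled = pd.DataFrame({"id": np.repeat([1, 2, 3], 4),
                       "value": np.arange(12.0)})
stub = lambda df: df.groupby("id")["value"].agg(["mean", "max"])
extract_in_batches(rolled, stub, batch_size=2, out_csv="extracted_demo.csv")
print(pd.read_csv("extracted_demo.csv", index_col=0).shape)  # (3, 2)
```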

@nils-braun (Collaborator) commented

Thanks @dsstex for the analysis and the posted numbers (and really sorry for the long delay).
How did you know that pivoting is the issue? Have you tried running without it?
