percent_change fails on top of aggregation primitives #2634

Open
enfeizhan opened this issue Nov 16, 2023 · 0 comments
Labels
bug Something isn't working

Comments

@enfeizhan

The transform primitive percent_change fails to work on top of an aggregation primitive, returning NaN or seemingly arbitrary results.

Code Sample, a copy-pastable example to reproduce your bug.

These are the transactions:

transaction_id | customer_id | quantity | transaction_date
101758183 | abc | 15 | 2021-12-15
101862984 | abc | 15 | 2022-01-15
101960142 | abc | 15 | 2022-02-15
102062271 | abc | 15 | 2022-03-15
102179828 | abc | 15 | 2022-04-15
102301689 | abc | 15 | 2022-05-15
102434267 | abc | 15 | 2022-06-15
102540706 | abc | 15 | 2022-07-15
102662863 | abc | 15 | 2022-08-15
102783888 | abc | 15 | 2022-09-15
102901638 | abc | 15 | 2022-10-15
103041277 | abc | 15 | 2022-11-15
103199236 | abc | 15 | 2022-12-15
103336795 | abc | 15 | 2023-01-15
103478291 | abc | 15 | 2023-02-15
103604244 | abc | 15 | 2023-03-15
103738142 | abc | 15 | 2023-04-15
103895757 | abc | 15 | 2023-05-15
104073119 | abc | 15 | 2023-06-15
104233610 | abc | 15 | 2023-07-15
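
The code further down refers to this table as `tt1` without showing how it was built. A minimal sketch of one way to construct it with pandas, assuming `transaction_date` is parsed as a datetime column (the column dtypes are an assumption, not something stated in the report):

```python
import pandas as pd

# Transaction ids copied from the table above, in chronological order.
transaction_ids = [
    101758183, 101862984, 101960142, 102062271, 102179828,
    102301689, 102434267, 102540706, 102662863, 102783888,
    102901638, 103041277, 103199236, 103336795, 103478291,
    103604244, 103738142, 103895757, 104073119, 104233610,
]

# One transaction on the 15th of each month, 2021-12 through 2023-07.
dates = (
    ["2021-12-15"]
    + [f"2022-{m:02d}-15" for m in range(1, 13)]
    + [f"2023-{m:02d}-15" for m in range(1, 8)]
)

tt1 = pd.DataFrame(
    {
        "transaction_id": transaction_ids,
        "customer_id": "abc",  # single customer, broadcast to every row
        "quantity": 15,        # constant quantity, broadcast to every row
        "transaction_date": pd.to_datetime(dates),  # assumed datetime dtype
    }
)
```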

Creating the labels:
import composeml as cp  # assuming `cp` is the composeml package

def is_churned(df):
    # A customer is labeled as churned when the window holds no transactions.
    return len(df) == 0

label_maker = cp.LabelMaker(
    target_dataframe_index='customer_id',
    time_index='transaction_date',
    labeling_function=is_churned,
    window_size='60d'
)

labels = label_maker.search(
    df=tt1,
    num_examples_per_instance=-1,
    gap='1MS',
    drop_empty=False,
    minimum_data=6,
    verbose=True
)
The labels look like this:
customer_id time is_churned
abc 2022-06-15 False
abc 2022-07-01 False
abc 2022-08-01 False
abc 2022-09-01 False
abc 2022-10-01 False
abc 2022-11-01 False
abc 2022-12-01 False
abc 2023-01-01 False
abc 2023-02-01 False
abc 2023-03-01 False
abc 2023-04-01 False
abc 2023-05-01 False
abc 2023-06-01 False
abc 2023-07-01 False

Create the EntitySet:
import featuretools as ft

es = ft.EntitySet('bug')

es.add_dataframe(
    dataframe=tt1,
    dataframe_name='transactions',
    time_index='transaction_date',
    index='transaction_id'
)

# Split customers out into their own dataframe, one row per customer_id.
es.normalize_dataframe(
    base_dataframe_name='transactions',
    new_dataframe_name='persons',
    index='customer_id',
    make_time_index=True
)

es.add_last_time_indexes()

Creating the features:
fm, fd = ft.dfs(
    entityset=es,
    target_dataframe_name='persons',
    agg_primitives=['count'],
    trans_primitives=['percent_change'],
    cutoff_time=labels,
    max_depth=2,
    cutoff_time_in_index=True,
    include_cutoff_time=False,
    verbose=True,
)

customer_id time COUNT(transactions) PERCENT_CHANGE(COUNT(transactions)) is_churned
abc 2022-06-15 6 NaN False
abc 2022-07-01 7 NaN False
abc 2022-08-01 8 NaN False
abc 2022-09-01 9 NaN False
abc 2022-10-01 10 NaN False
abc 2022-11-01 11 NaN False
abc 2022-12-01 12 NaN False
abc 2023-01-01 13 NaN False
abc 2023-02-01 14 NaN False
abc 2023-03-01 15 NaN False
abc 2023-04-01 16 NaN False
abc 2023-05-01 17 NaN False
abc 2023-06-01 18 NaN False
abc 2023-07-01 19 NaN False
Note that the feature PERCENT_CHANGE(COUNT(transactions)) is NaN in every row. It should be 0 for the first row and roughly 0.05 to 0.2 for the remaining rows.
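
For reference, the expected values can be worked out by hand from the tables in this report. The sketch below is plain Python and does not use featuretools. It rebuilds the COUNT column by counting transactions strictly before each cutoff time (matching `include_cutoff_time=False`), then takes the fractional change between consecutive cutoffs. Setting the first row to 0 follows the expectation stated in this report; the names `counts` and `expected` are illustrative only.

```python
from datetime import date

# Transaction dates from the table in this report: the 15th of each month.
transaction_dates = (
    [date(2021, 12, 15)]
    + [date(2022, m, 15) for m in range(1, 13)]
    + [date(2023, m, 15) for m in range(1, 8)]
)

# Cutoff times from the labels table.
cutoffs = (
    [date(2022, 6, 15)]
    + [date(2022, m, 1) for m in range(7, 13)]
    + [date(2023, m, 1) for m in range(1, 8)]
)

# include_cutoff_time=False: only rows strictly before the cutoff are counted.
counts = [sum(d < c for d in transaction_dates) for c in cutoffs]

# Fractional change between consecutive cutoffs; the first row has no
# predecessor and is set to 0 here, as this report expects.
expected = [0.0] + [
    (cur - prev) / prev for prev, cur in zip(counts, counts[1:])
]

for cutoff, count, change in zip(cutoffs, counts, expected):
    print(cutoff, count, round(change, 4))
```

The counts come out as 6 through 19, matching the COUNT(transactions) column, and the changes run from 1/6 at the second cutoff down to 1/18 at the last one.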

I also noticed that the results can look quite random on a large real dataset, which is hard to reproduce here.


Output of featuretools.show_info()

Featuretools version: 1.28.0
Featuretools installation directory: /Users/feizhan/Installs/miniconda3/envs/generic/lib/python3.9/site-packages/featuretools

SYSTEM INFO

python: 3.9.13.final.0
python-bits: 64
OS: Darwin
OS-release: 23.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

INSTALLED VERSIONS

numpy: 1.23.4
pandas: 1.5.1
tqdm: 4.65.0
cloudpickle: 2.2.1
dask: 2023.3.2
distributed: 2023.3.2
psutil: 5.9.3
pip: 22.3
setuptools: 65.5.0

@enfeizhan enfeizhan added the bug Something isn't working label Nov 16, 2023