improve accuracy #145

Open
ClimbsRocks opened this issue Dec 14, 2015 · 1 comment

@ClimbsRocks
Owner

there's a bug somewhere that's killing the accuracy.

scores on homesite and telstra are lagging dramatically behind what others are getting with xgboost alone.

my initial reaction was that we needed to tweak the parameters we're tuning for each algorithm, but i don't think that alone would explain the huge gap between others' scores and mine.

i have a feeling it's something in data-formatter.

i am introducing overfitting at the moment by calculating summary statistics on the entire dataset, rather than within each cross-validation fold, so information from the validation rows leaks into the training features.
this is particularly true for the groupBy columns. it's probably alright for the script that imputes missing values, but groupBy is probably introducing a lot of overfitting.

there could also just be a bug in data-formatter somewhere. in particular, check that train and test have the same columns in the same order. they should, but since the test dataset may or may not include the output column or any ignored columns, we might be off by 1.
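
a rough sketch of that column check, assuming we can get at the train and test headers as plain lists of names. `compare_columns` and the example headers are made up for illustration; they are not part of data-formatter.

```python
def compare_columns(train_cols, test_cols, drop=()):
    """Compare feature column order between train and test headers.

    drop: names (e.g. the output column or ignored columns) that should be
    excluded from both sides before comparing. Hypothetical helper.
    """
    dropped = set(drop)
    train_features = [c for c in train_cols if c not in dropped]
    test_features = [c for c in test_cols if c not in dropped]

    problems = []
    if len(train_features) != len(test_features):
        problems.append(
            "column count differs: train=%d test=%d"
            % (len(train_features), len(test_features))
        )
    # Report the first position where the names diverge, which is where an
    # off-by-one shift would start.
    for i, (a, b) in enumerate(zip(train_features, test_features)):
        if a != b:
            problems.append(
                "first mismatch at index %d: train=%r test=%r" % (i, a, b)
            )
            break
    return problems


# Hypothetical headers: train kept an "id" column that test does not have.
train_header = ["id", "f1", "f2", "target"]
test_header = ["f1", "f2"]
for problem in compare_columns(train_header, test_header, drop=["target"]):
    print(problem)
```

an empty list means the two headers line up once the dropped columns are removed.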

if possible, look into calculating stats on each cv fold individually. this would apply just to groupBy i guess.
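
a minimal sketch of what per-fold stats could look like, assuming the groupBy feature is a per-group mean of a numeric column (i haven't confirmed that is exactly what data-formatter computes). the function names, the interleaved fold assignment, and the toy rows are all assumptions for illustration.

```python
from collections import defaultdict


def group_means(rows, group_key, value_key):
    """Mean of value_key per group, computed only from the rows given."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    return {g: totals[g] / counts[g] for g in totals}


def add_group_mean_per_fold(rows, group_key, value_key, n_folds=3):
    """Return copies of rows with a group-mean feature fitted out of fold.

    For each fold, the group means come from the other folds only, so a
    held-out row never contributes to its own feature value.
    """
    if n_folds < 2 or len(rows) < 2:
        raise ValueError("need at least 2 folds and 2 rows")
    feature = group_key + "_mean_" + value_key
    out = [None] * len(rows)
    for fold in range(n_folds):
        # Simple interleaved split (no shuffling): row i belongs to fold i % n_folds.
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        means = group_means(train, group_key, value_key)
        # Fallback for groups that never appear in the training folds.
        overall = sum(r[value_key] for r in train) / len(train)
        for i, r in enumerate(rows):
            if i % n_folds == fold:
                new_row = dict(r)
                new_row[feature] = means.get(r[group_key], overall)
                out[i] = new_row
    return out


# Toy rows, not real data.
rows = [
    {"city": "a", "price": 10.0},
    {"city": "a", "price": 20.0},
    {"city": "b", "price": 30.0},
    {"city": "a", "price": 60.0},
]
for row in add_group_mean_per_fold(rows, "city", "price"):
    print(row)
```

the same idea would apply to whatever statistic groupBy actually computes: fit it on the training folds, then look it up for the held-out fold.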

steps:

  1. manually run xgboost on the raw dataset.
  2. manually run xgboost on the results from data-formatter.
    this should help us narrow down whether the error is coming from data-formatter or xgboost.
  3. run again with groupBy removed.

@ClimbsRocks
Owner Author

yeah, assuming it's something in data-formatter, just follow the standard debugging process: comment out the parts we think might be introducing the error, run it, and see if it does any better.
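
instead of commenting things out by hand, the same process could be sketched as a small loop that disables one step at a time. this assumes the pipeline can be wrapped in a function that takes the set of enabled steps and returns a score where higher is better; `ablate`, `suspects`, the step names, and the stand-in scoring function are all hypothetical, and the numbers are placeholders rather than real results.

```python
def ablate(steps, run_pipeline):
    """Score the pipeline with every step on, then with each step disabled.

    steps: list of step names.
    run_pipeline: callable taking the set of enabled step names and
        returning a score where higher is better.
    """
    results = {"<all steps>": run_pipeline(set(steps))}
    for name in steps:
        results["without " + name] = run_pipeline(set(steps) - {name})
    return results


def suspects(steps, results):
    """Steps whose removal scores better than the full pipeline."""
    baseline = results["<all steps>"]
    return [s for s in steps if results["without " + s] > baseline]


def stand_in_pipeline(enabled):
    """Placeholder for a real train-and-score run; values are made up."""
    return 0.5 if "step_b" in enabled else 0.75


step_names = ["step_a", "step_b", "step_c"]
scores = ablate(step_names, stand_in_pipeline)
for label, score in scores.items():
    print(label, score)
print("suspects:", suspects(step_names, scores))
```

in practice `run_pipeline` would format the data with only the enabled steps, train xgboost, and return the cross-validated score.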
