
Proposal for reordering the contents in M1 and M2 #398

Closed
ArturoAmorQ opened this issue Jul 16, 2021 · 4 comments
Comments

ArturoAmorQ commented Jul 16, 2021

Proposal to be voted on for v.2: some changes to the ordering of topics in M1 and M2 may make the content clearer.

Motivation:

  • Introducing cross-validation in M1 makes the content heavier; a train-test split could suffice as a starting point
  • M1 focuses on data handling whereas M2 focuses on scoring a model; CV seems to fit better in the latter
  • Q6 of the wrap-up quiz in M1 relies heavily on a concept that was barely motivated (though there are easier solutions to that particular issue)
  • CV is revisited anyway in the section Overfitting and underfitting, giving a sense of being redundant
  • CV provides a smooth transition for introducing the notion of improvement/deterioration when selecting models
  • Having CV presented as a subtopic of the sections Fitting a scikit-learn model on numerical data and Overfitting and underfitting, in two different modules, understates its importance and makes it harder to find for those who want to revisit the topic

Notation:

Module

  • Section
    • Topics covered in that section, in their order of appearance

Current contents:

Module 1. The Predictive Modeling Pipeline

  • Module overview
  • Tabular data exploration
    • Features vs target, categorical vs numerical, visual inspection, decision rules
  • Fitting a scikit-learn model on numerical data
    • Train-test split, score, (linear vs non-linear models), class imbalance (majority class, DummyClassifier), scaling, pipelines, cross-validation
  • Handling categorical data
    • Encoding, selection based on types, pipeline, compare two models (gradient boosting vs linear), handling data for tree-based models
  • Wrap-up quiz
    • Tabular data exploration, features, numerical data, ordinal categories, pipeline with imputer and cross-validation, pipeline with imputer + cross-validation and score distributions
  • Main takeaway

Module 2. Selecting the Best Model

  • Module overview
  • Overfitting and underfitting
    • Model complexity, train and test errors, cross-validation in detail, error distributions, target distribution
  • Validation and learning curves
    • Inductive bias and complexity, overfitting and underfitting, validation curve, learning curve
  • Bias vs variance trade-off
  • Wrap-up quiz
    • Classification vs regression, class imbalance, balanced accuracy, scaling, validation curve, underfitting, overfitting, generalizing
  • Main takeaway

Proposed contents:

Module 1. The Predictive Modeling Pipeline

  • Module overview
  • Tabular data exploration
    • Features vs target, categorical vs numerical, visual inspection, decision rules
  • Numerical data preprocessing
    • Train-test split, score, (linear vs non-linear models), class imbalance (majority class, DummyClassifier), scaling, pipelines, cross-validation
  • Categorical data preprocessing
    • Encoding, selection based on types, pipeline, compare two models (gradient boosting vs linear), handling data for tree-based models
  • Wrap-up quiz
    • Tabular data exploration, features, numerical data, ordinal categories, pipeline with imputer and cross-validation, pipeline with imputer + cross-validation and score distributions, (🔺)more on handling missing data
  • Main takeaway

Module 2. Selecting the Best Model

  • Module overview
  • (🔺) Cross-validation and score distributions
    • Train and test errors, notion of improvement/deterioration for selecting models
  • Overfitting and underfitting
    • Model complexity, train and test errors, cross-validation in detail, error distributions, target distribution
  • Validation and learning curves
    • Inductive bias and complexity, overfitting and underfitting, validation curve, learning curve
  • Bias vs variance trade-off
  • Wrap-up quiz
    • Classification vs regression, class imbalance, balanced accuracy, scaling, validation curve, underfitting, overfitting, generalizing
  • Main takeaway
ArturoAmorQ (Collaborator, Author) commented:
This proposal would potentially address the issues in #124, #340, #361 and #366

lesteve commented Jul 20, 2021

So I see at least four different things here:

Removing cross-validation from M1

With M1 we aimed for a tight session that goes as efficiently as possible from knowing almost nothing about scikit-learn to a realistic scikit-learn pipeline (i.e. something that we would use in practice). Removing cross-validation completely from M1 does not seem great from this point of view. Basically, anyone involved in this MOOC will tell you that cross-validation is super important.

CV is revisited anyway in the section Overfitting and underfitting, giving a sense of being redundant.

Generally speaking, repeating things or covering the same thing in a slightly different way is completely fine, and I would argue it is actually a good thing pedagogically. So I don't see a huge problem with this, especially given the cost associated with moving things around (in our repo: changing the exercises that use cross-validation, checking all the notebooks that may mention cross-validation, deciding where to move this or say "in the next module we will see this in more detail", and probably also in FUN for the quizzes).

Light refactoring within M2

It seems like you want to move train and test scores into their own lesson; why not? Can you explain a bit more why you don't like the way it is currently done? I guess at one point we had in mind that the videos would come first to give the intuitions, with the code coming later to reinforce them.

Adding content about score distributions and model improvement/deterioration

Mentioned in #366, let's keep the discussion there.

More content on handling missing data

Let's do it in #361.

glemaitre commented Jul 23, 2021

Regarding the cross-validation and evaluation => #415

  1. add a new notebook between notebooks 1 and 2 to introduce model evaluation. It should discuss train_test_split and then cross_validate -> "Evaluate your first model" (see the sketch after this list).
  2. Look into reducing the discussion, since we no longer use a Pipeline there.
  3. If necessary, add a small section specifically about using a Pipeline inside cross_validate.
  4. "Exercise M1.03" -> check whether we should add an additional exercise that uses cross-validation.

Missing values

  1. rework the first wrap-up quiz to remove the missing values (see the sketch after this list) => Simplify first module wrap-up quiz to not need a SimpleImputer #361
  2. wait and see whether we add a new section => Add "advanced pipeline", missing value, imputing module, maybe more #414
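For context, a minimal sketch of the SimpleImputer-plus-cross-validation pattern that the current wrap-up quiz relies on and that #361 proposes to drop. The toy data below is not the actual quiz dataset.

```python
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Toy data with missing values; most estimators raise an error on NaN,
# hence the need for an imputation step.
rng = np.random.RandomState(0)
X_full = rng.normal(size=(300, 5))
y = (X_full[:, 0] + X_full[:, 1] > 0).astype(int)
X = X_full.copy()
X[rng.uniform(size=X.shape) < 0.1] = np.nan

# Imputation happens inside the pipeline, so the fill values are learned
# on each training fold only.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
cv_results = cross_validate(model, X, y, cv=5)
print(cv_results["test_score"])
```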

Module 2 => #416

Arturo's proposal is good:

  1. move score distributions and their variations to the beginning (see the sketch after this list)
  2. cover the variability between two models
  3. then move on to underfitting/overfitting, etc.
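A minimal sketch of what items 1 and 2 could look like in code. The synthetic data and model choices are placeholders; the point is comparing score distributions rather than single scores.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

X, y = make_classification(n_samples=1_000, random_state=0)

# Many random splits give a distribution of test scores for each model.
cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)

for name, model in [
    ("dummy baseline", DummyClassifier(strategy="most_frequent")),
    ("logistic regression", LogisticRegression()),
]:
    scores = cross_validate(model, X, y, cv=cv)["test_score"]
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Comparing the two score distributions, not just their means, shows whether
# the improvement of one model over the other exceeds the cross-validation
# variability.
```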

lesteve commented Jul 23, 2021

I opened several issues to split up the work we agreed on, so I'm closing this one.

lesteve closed this as completed Jul 23, 2021
lesteve mentioned this issue Jul 23, 2021