
Proposal for reordering the contents in M1 and M2 #398

Closed
ArturoAmorQ opened this issue Jul 16, 2021 · 4 comments
Comments

ArturoAmorQ commented Jul 16, 2021

Proposal to be voted on for v.2: some changes to the ordering of topics in M1 and M2 may make the content clearer.

Motivation:

  • Introducing cross-validation in M1 makes the content heavier; a train-test split could suffice as a starting point
  • M1 focuses on data handling whereas M2 focuses on scoring a model; CV seems to fit better in the latter
  • Q6 of the wrap-up quiz in M1 relies heavily on a concept that was barely motivated (though there are easier solutions to that particular issue)
  • CV is revisited anyway in the section Overfitting and underfitting, giving a sense of being redundant
  • CV provides a smooth transition for introducing the notion of improvement/deterioration when selecting models
  • Having CV presented as a subtopic of the sections Fitting a scikit-learn model on numerical data and Overfitting and underfitting, in two different modules, understates its importance and makes it harder to find for those who want to revisit the topic

Notation:

Module

  • Section
    • Topics covered in that section, in their order of appearance

Current contents:

Module 1. The Predictive Modeling Pipeline

  • Module overview
  • Tabular data exploration
    • Features vs target, categorical vs numerical, visual inspection, decision rules
  • Fitting a scikit-learn model on numerical data
    • Train-test split, score, (linear vs non-linear models), class imbalance (majority class, DummyClassifier), scaling, pipelines, cross-validation
  • Handling categorical data
    • Encoding, selection based on types, pipeline, compare two models (gradient boosting vs linear), handling data for tree-based models
  • Wrap-up quiz
    • Tabular data exploration, features, numerical data, ordinal categories, pipeline with imputer and cross-validation, pipeline with imputer + cross-validation and score distributions
  • Main takeaway

Module 2. Selecting the Best Model

  • Module overview
  • Overfitting and underfitting
    • Model complexity, train and test errors, cross-validation in detail, error distributions, target distribution
  • Validation and learning curves
    • Inductive bias and complexity, overfitting and underfitting, validation curve, learning curve
  • Bias vs variance trade-off
  • Wrap-up quiz
    • Classification vs regression, class imbalance, balanced accuracy, scaling, validation curve, underfitting, overfitting, generalizing
  • Main takeaway

Proposed contents:

Module 1. The Predictive Modeling Pipeline

  • Module overview
  • Tabular data exploration
    • Features vs target, categorical vs numerical, visual inspection, decision rules
  • Numerical data preprocessing
    • Train-test split, score, (linear vs non-linear models), class imbalance (majority class, DummyClassifier), scaling, pipelines, cross-validation
  • Categorical data preprocessing
    • Encoding, selection based on types, pipeline, compare two models (gradient boosting vs linear), handling data for tree-based models
  • Wrap-up quiz
    • Tabular data exploration, features, numerical data, ordinal categories, pipeline with imputer and cross-validation, pipeline with imputer + cross-validation and score distributions, (🔺)more on handling missing data
  • Main takeaway

Module 2. Selecting the Best Model

  • Module overview
  • (🔺) Cross-validation and score distributions
    • Train and test errors, notion of improvement/deterioration for selecting models
  • Overfitting and underfitting
    • Model complexity, train and test errors, cross-validation in detail, error distributions, target distribution
  • Validation and learning curves
    • Inductive bias and complexity, overfitting and underfitting, validation curve, learning curve
  • Bias vs variance trade-off
  • Wrap-up quiz
    • Classification vs regression, class imbalance, balanced accuracy, scaling, validation curve, underfitting, overfitting, generalizing
  • Main takeaway
ArturoAmorQ (Collaborator, Author) commented:
This proposal would potentially address the issues in #124, #340, #361 and #366

lesteve commented Jul 20, 2021

So I see at least four different things here:

Removing cross-validation from M1

With M1 we aimed for a tight session that goes as efficiently as possible from knowing almost nothing about scikit-learn to a realistic scikit-learn pipeline (i.e. something that we would use in practice). Removing cross-validation completely from M1 does not seem great from this point of view. Basically, anyone involved in this MOOC will tell you that cross-validation is super important.

CV is revisited anyway in the section Overfitting and underfitting, giving a sense of being redundant.

Generally speaking, repeating things or covering the same thing in a slightly different way is completely fine, and I would argue it is actually a good thing pedagogically. So I don't see a huge problem with this, especially given the cost associated with moving things around (in our repo: changing the exercises that use cross-validation, checking all the notebooks that may mention cross-validation, deciding where to move this or say "in the next module we will see this in more detail", and probably also in FUN for the quizzes).

Light refactoring within M2

It seems like you want to move train and test scores into their own lesson; why not? Can you explain a bit more why you don't like the way it is currently done? I guess at one point we had in mind that the videos would come first to give the intuitions, with the code coming later to reinforce them.

Adding content about score distributions and model improvement/deterioration

Mentioned in #366, let's keep the discussion there.

More content on handling missing data

Let's do it in #361.

glemaitre commented Jul 23, 2021

Regarding the cross-validation and evaluation => #415

  1. add a new notebook between notebooks 1 and 2 to introduce model evaluation. It should discuss train_test_split and then cross_validate -> "Evaluate your first model" (see the sketch after this list).
  2. Look into reducing the discussion, since we no longer use a Pipeline there.
  3. If necessary, add a small section specifically about using a Pipeline inside cross_validate.
  4. "Exercise M1.03" -> check whether we should add an additional exercise that uses cross-validation.

Missing values

  1. rework the first wrap-up quiz to remove the missing values (see the sketch after this list) => Simplify first module wrap-up quiz to not need a SimpleImputer #361
  2. wait and see whether we add a new section => Add "advanced pipeline", missing value, imputing module, maybe more #414
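For context, a minimal sketch of the SimpleImputer-plus-cross-validation pattern that the current wrap-up quiz relies on and that #361 proposes to drop. The toy data below is not the actual quiz dataset.

```python
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Toy data with missing values; most estimators raise an error on NaN,
# hence the need for an imputation step.
rng = np.random.RandomState(0)
X_full = rng.normal(size=(300, 5))
y = (X_full[:, 0] + X_full[:, 1] > 0).astype(int)
X = X_full.copy()
X[rng.uniform(size=X.shape) < 0.1] = np.nan

# Imputation happens inside the pipeline, so the fill values are learned
# on each training fold only.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
cv_results = cross_validate(model, X, y, cv=5)
print(cv_results["test_score"])
```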

Module 2 => #416

Arturo's proposal is good:

  1. move score distributions and their variations to the beginning (see the sketch after this list)
  2. cover the variability between two models
  3. then move on to underfitting/overfitting, etc.
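A minimal sketch of what items 1 and 2 could look like in code. The synthetic data and model choices are placeholders; the point is comparing score distributions rather than single scores.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

X, y = make_classification(n_samples=1_000, random_state=0)

# Many random splits give a distribution of test scores for each model.
cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)

for name, model in [
    ("dummy baseline", DummyClassifier(strategy="most_frequent")),
    ("logistic regression", LogisticRegression()),
]:
    scores = cross_validate(model, X, y, cv=cv)["test_score"]
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Comparing the two score distributions, not just their means, shows whether
# the improvement of one model over the other exceeds the cross-validation
# variability.
```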

lesteve commented Jul 23, 2021

I opened several issues to split up the work we agreed on, so I'm closing this one.

lesteve closed this as completed Jul 23, 2021
lesteve mentioned this issue Jul 23, 2021