Allow ModelVisualizers to wrap Pipeline objects #498

Open
4 tasks
bbengfort opened this issue Jul 13, 2018 · 4 comments · May be fixed by #955

@bbengfort
Member

Describe the solution you'd like

Our model visualizers expect to wrap classifiers, regressors, or clusterers in order to visualize the model under the hood; they even check that the right kind of estimator is passed in. Unfortunately, in many cases passing a Pipeline object as the model prevents the visualizer from working, even when the pipeline is an acceptable model, e.g. it ends in a classifier and is passed to a classification score visualizer (more on this below). This is primarily because the Pipeline wrapper masks the attributes needed by the visualizer.

I propose that we modify the ModelVisualizer to change the ModelVisualizer.estimator attribute to a @property - when setting the estimator property, we can perform a check to ensure that the Pipeline has a final_estimator attribute (e.g. that it is not a transformer pipeline). When getting the estimator property, we can return the final estimator instead of the entire Pipeline. This should ensure that we can use pipelines in our model visualizers.

NOTE, however, that we will still have to call fit(), predict(), and score() on the entire pipeline, so this is a bit more nuanced than it seems at first glance. There will probably have to be is_pipeline() checks and other estimator access utilities.
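The proposal above can be sketched roughly as follows. This is a minimal illustration, not Yellowbrick's implementation: `PipelineAwareWrapper`, `is_pipeline`, and the `Fake*` classes are hypothetical names, and `FakePipeline` is a stdlib-only stand-in that mimics just the `steps` list of scikit-learn's `Pipeline`. The check for a `predict` method is an assumed way of rejecting transformer-only pipelines.

```python
class FakeScaler:
    """Hypothetical stand-in transformer: fit/transform, but no predict."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class FakeClassifier:
    """Hypothetical stand-in classifier that stores the labels it saw."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        return [self.classes_[0] for _ in X]


class FakePipeline:
    """Stand-in for sklearn.pipeline.Pipeline: a `steps` list plus fit()."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self


def is_pipeline(model):
    # Assumption: duck-typed check; real code would more likely use
    # isinstance(model, sklearn.pipeline.Pipeline).
    return hasattr(model, "steps")


class PipelineAwareWrapper:
    """Hypothetical sketch of a visualizer base with an `estimator` property."""

    def __init__(self, model):
        self.estimator = model  # routed through the setter below

    @property
    def estimator(self):
        # Getting: hand back the final step so its attributes are visible.
        if is_pipeline(self._wrapped):
            return self._wrapped.steps[-1][1]
        return self._wrapped

    @estimator.setter
    def estimator(self, model):
        # Setting: reject pipelines that end in a transformer (assumed rule).
        if is_pipeline(model) and not hasattr(model.steps[-1][1], "predict"):
            raise TypeError("pipeline must end in a predictive estimator")
        self._wrapped = model

    def fit(self, X, y=None):
        # Fitting still has to run on the whole pipeline, not the final step.
        self._wrapped.fit(X, y)
        return self
```

With this shape, `PipelineAwareWrapper(pipe).fit(X, y).estimator.classes_` reads the attribute from the final step, while fitting still goes through every transformer first.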

Is your feature request related to a problem? Please describe.

Consider the following, fairly common code:

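from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

from yellowbrick.classifier import ClassificationReport

# Text classification pipeline: vectorize the documents, then classify them
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('mlp', MLPClassifier()),
])

# X_train, y_train, X_test, y_test are assumed to be an already split
# corpus of raw documents and their labels
oz = ClassificationReport(model)
oz.fit(X_train, y_train)
oz.score(X_test, y_test)
oz.poof()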

This seems to be a valid model for a classification report. Unfortunately, the classification report is not able to access the MLPClassifier's classes_ attribute, since the Pipeline doesn't know how to pass that lookup on to the final estimator.
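The masking described here can be reproduced without scikit-learn or Yellowbrick. The sketch below assumes the visualizer forwards unknown attribute lookups to whatever it wraps, and that the pipeline does not re-export its final step's attributes; whether a real `Pipeline` exposes a given attribute may depend on the scikit-learn version. All class names are hypothetical stand-ins.

```python
class StubClassifier:
    """Hypothetical stand-in for a fitted classifier; only carries classes_."""

    def __init__(self, classes):
        self.classes_ = list(classes)


class StubPipeline:
    """Stand-in pipeline that keeps (name, step) pairs in `steps` and, by
    assumption, does not re-export the final step's attributes."""

    def __init__(self, steps):
        self.steps = steps


class ForwardingWrapper:
    """Hypothetical wrapper that forwards unknown attributes to its model."""

    def __init__(self, model):
        self._wrapped = model

    def __getattr__(self, name):
        # Only reached when normal lookup on the wrapper itself fails.
        return getattr(self._wrapped, name)


clf = StubClassifier(["ham", "spam"])
direct = ForwardingWrapper(clf)
piped = ForwardingWrapper(StubPipeline([("clf", clf)]))

# Wrapping the classifier directly: the lookup is forwarded and succeeds.
print(direct.classes_)

# Wrapping the pipeline: the lookup stops at the pipeline and fails.
try:
    piped.classes_
except AttributeError as err:
    print("masked:", err)

# Reaching through to the last step recovers the attribute.
print(piped.steps[-1][1].classes_)
```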

I think the original idea for the ScoreVisualizers was that they would be inside of Pipelines, e.g.

model = Pipeline([
    ('tfidf', TfidfVectorizer()), 
    ('clf', ClassificationReport(MLPClassifier())), 
]) 

model.fit(X, y)
model.score(X_test, y_test)
model.named_steps['clf'].poof() 

But this makes it difficult to use more than one visualizer on the same model, e.g. a ROCAUC visualizer and a ClassificationReport visualizer.

Definition of Done

  • Update ModelVisualizer class with pipeline helpers
  • Ensure current tests pass
  • Add test to all model visualizer subclasses to pass in a pipeline as the estimator
  • Add documentation about using visualizers with pipelines
@bbengfort bbengfort added type: feature a new visualizer or utility for yb priority: medium can wait until after next release level: intermediate python coding expertise required labels Jul 13, 2018
@bbengfort bbengfort added this to the v1.0 pre-work milestone Jan 2, 2019
@Yogayu

Yogayu commented Apr 7, 2019

Dear Mentor @bbengfort ,

I am very interested in this project for GSoC 2019, so I spent two days studying the core concepts of Yellowbrick and the structure of the library. However, some parts of this idea are still unclear to me, so it would help greatly if you could discuss them with me.

Brief introduction

First of all, I would like to introduce my understanding of the project. Since I am just beginning to get familiar with this project, there may be some mistakes in my understanding. If there is something wrong with my understanding, I really hope you can correct it.

Yellowbrick's core concept is to better assist decision-making in various processes of machine learning through a visual approach. So the visual object is the core foundation.

Visualizer, the root of the visual objects, inherits from Scikit-Learn's BaseEstimator class and adds the Visualizer interface to enable visualization.

ModelVisualizer inherits from Visualizer and Wrapper. It works somewhat like the adapter in the adapter pattern (also known as a wrapper): ModelVisualizer wraps a model. Further classes, such as ScoreVisualizer, inherit from ModelVisualizer.

Their relationship in UML looks like this:

[image: relationship]

The Feature: Allow ModelVisualizers to wrap Pipeline objects

Above, you describe that the ModelVisualizer wraps the model and that, in many cases, a pipeline object can be passed to the ModelVisualizer as the model. The problem, however, is that the Pipeline encapsulates the real model, final_estimator, so the ModelVisualizer cannot interact with final_estimator directly, only via the Pipeline.

Their relationship looks like this:

[image: assess]

Current Status

I read the source code, and it shows that the current implementation works as follows: when a pipeline object is passed to a ModelVisualizer, the pipeline is assigned directly to the ModelVisualizer's estimator attribute. The related fit(), predict(), and score() calls then go directly to the methods of the Pipeline class.

My Question

Q1: Do you mean that ModelVisualizer should provide an interface (get final_estimator) through which subclasses can access the final_estimator's classes_ attribute?

[image: code]

Q2: The source code of Pipeline shows that we can get the final estimator through pipeline.steps[-1][1]. Is this the right way?
Q3: Under what circumstances do we need to use the classes_ attribute? Could you give me an example?
Q4: The Pipeline itself already has fit(), predict(), and score() methods. When these are called in the ModelVisualizer class and its subclasses, do they use the Pipeline's methods directly, or do we have to implement them ourselves?
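One way to picture how Q2 to Q4 fit together is a per-class tally of predictions: the predictions come from the whole pipeline, while the row labels come from the final step's `classes_`. The sketch below is only an illustration under those assumptions, not an answer from the maintainers; `MiniPipeline`, `LengthTransformer`, `LengthClassifier`, and `predicted_support` are hypothetical stdlib-only stand-ins rather than scikit-learn or Yellowbrick APIs.

```python
from collections import Counter


class LengthTransformer:
    """Hypothetical stand-in transformer: maps each string to its length."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [len(x) for x in X]


class LengthClassifier:
    """Hypothetical stand-in classifier: 'long' if the length is >= 5."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        return ["long" if x >= 5 else "short" for x in X]


class MiniPipeline:
    """Stand-in pipeline: transforms the data, then delegates to the last step."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


def predicted_support(pipeline, X):
    """Count predictions per class, with one entry per known class."""
    final = pipeline.steps[-1][1]          # Q2: reach the final estimator
    counts = Counter(pipeline.predict(X))  # Q4: predict with the whole pipeline
    # Q3: classes_ supplies a stable label for every row, even empty ones
    return {label: counts.get(label, 0) for label in final.classes_}


pipe = MiniPipeline([("len", LengthTransformer()), ("clf", LengthClassifier())])
pipe.fit(["hi", "hello", "goodbye"], ["short", "long", "long"])
print(predicted_support(pipe, ["a", "abcdef", "xy"]))
```

Calling `predict` on the final step alone would skip the transformers and receive raw strings, which is why the prediction goes through the pipeline while only the label lookup reaches into the last step.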

I'm sorry that I wrote so much, but I wanted to express myself clearly. If my questions are too simple, please forgive me. You know, everyone has a beginning.

Thank you for your time and patience. I look forward to your reply.


Thank you and best regards!
Xinyu You

@Yogayu

Yogayu commented Apr 8, 2019

Respected Mentors @rebeccabilbro @bbengfort @lwgray @ndanielsen @pdamodaran @wagner2010,
Sorry to bother you again. Is there anyone who can help me with this problem? There is little time left before the GSoC proposal has to be submitted.

@wagner2010
Contributor

Hi @Yogayu, wow, this is great! Thanks so much for your work here in such a short period of time. I also saw your posts on the Google group, which I will be responding to shortly. Because, as you mentioned, time is running short, and because we’re not available to respond in detail, please go ahead and incorporate this into your proposal and submit it. We believe you’re on the right track for a strong proposal, and we’re not expecting perfection so much as seeing your code and your vision for this problem. Regardless of the outcome as it relates to GSoC, we welcome your involvement with Yellowbrick! Thanks and stay well.

@Yogayu

Yogayu commented Apr 8, 2019

@wagner2010 I understand. Thank you.
