Adds a semi-supervised (specifically a combination of supervised and weakly-supervised data) version of weak algorithms #268

Closed
wants to merge 9 commits into from

RobinVogel
Contributor

@RobinVogel RobinVogel commented Nov 28, 2019

Closes #233

For now, I have only implemented what I believe #233 expects, and only for the RCA algorithm.
It is a simple modification of the supervised version of RCA, and the test is very basic as well.

It simply concatenates the user-provided weakly supervised information with the weakly supervised information derived from the labeled (strongly supervised) data.

This is convenient, but it increases the amount of code and documentation.
RCA's fit function takes a random_state parameter that is marked
as deprecated and increases the number of tests needed for the semi-supervised algorithms.
I will check whether random_state is present in other algorithms to understand its relevance.

I will cover the other algorithms and write better tests if we agree on this structure.

chunks = cons.chunks(num_chunks=20)
rca_semisupervised.fit(X[:n], y[:n],
X[n:], chunks)
rca_semisupervised.fit(X[:n], y[:n],
Member

Probably add more tests around what rca_semisupervised looks like after fitting

@bellet
Member

bellet commented Dec 5, 2019

Just a quick reminder: "solves" is not part of the keywords that GitHub recognizes to automatically close issues ;-)

@bellet
Member

bellet commented Dec 5, 2019

I think this creates a major API problem due to the fact that fit takes as input 4 arguments X, y, X_u, chunks where X and y do not generally have the same number of rows as X_u and chunks. This likely breaks compatibility with model selection routines from sklearn.

Furthermore, this strong supervision + weak supervision is not a major use-case in practice. So indeed the overhead induced by introducing new classes, having to test and document them etc, is probably too large compared to the benefits.

I would favor a solution based on helper functions which combine pairs/quadruplets/chunks provided by the user with those generated from labeled data so that users can then easily fit RCA with the output of this helper function. So essentially something similar to what you wrote for RCA but without creating a new class. We can then add a short paragraph to mention the existence of such helper functions in the doc and we're good.
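
To make the suggestion above concrete, here is a minimal sketch of what such a helper could look like for chunks. The name `combine_chunks` and its signature are hypothetical and not part of metric-learn. It assumes the chunk convention used by RCA (`chunks[i] == -1` means point `i` belongs to no chunklet) and, as a simplification, turns each class of the labeled data into a single new chunklet. It is written in plain Python on lists so it runs without dependencies.

```python
def combine_chunks(X_labeled, y, X_weak, chunks_weak):
    """Hypothetical helper (not part of metric-learn).

    Merges user-provided chunks with chunks derived from labeled data.
    Assumes chunks[i] == -1 means "point i is in no chunklet" and that
    each class in y becomes one extra chunklet (a simplifying assumption).
    """
    if len(X_labeled) != len(y):
        raise ValueError("X_labeled and y must have the same length")
    if len(X_weak) != len(chunks_weak):
        raise ValueError("X_weak and chunks_weak must have the same length")

    # New chunk ids start right after the largest id the user already uses.
    next_id = max((c for c in chunks_weak if c >= 0), default=-1) + 1

    # Give each distinct label a fresh chunk id, in order of first appearance.
    label_to_chunk = {}
    chunks_from_labels = []
    for label in y:
        if label not in label_to_chunk:
            label_to_chunk[label] = next_id + len(label_to_chunk)
        chunks_from_labels.append(label_to_chunk[label])

    # Stack the weakly supervised points first, then the labeled points,
    # so both outputs have one entry per point.
    X_all = list(X_weak) + list(X_labeled)
    chunks_all = list(chunks_weak) + chunks_from_labels
    return X_all, chunks_all


X_weak = [[0.0], [0.1], [5.0]]
chunks_weak = [0, 0, -1]
X_labeled = [[1.0], [1.1], [2.0]]
y = ["a", "a", "b"]
X_all, chunks_all = combine_chunks(X_labeled, y, X_weak, chunks_weak)
```

Because the two outputs have the same number of rows, they could be passed as a single `(X, chunks)` pair to the existing weakly supervised `fit`, which avoids the four-argument signature discussed above.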

Note: as pointed out by @hansen7 on #233, semi-supervised is probably not the right term to describe this. This is more a combination of supervised and weakly supervised.

@bellet
Member

bellet commented Dec 5, 2019

Of course I am happy to hear whether @terrytangyuan @perimosocordiae @wdevazelhes have a different opinion

@terrytangyuan
Member

I agree. In this case, API compatibility is more important, especially now that we are in scikit-learn-contrib. We can start with the helper function, and if it becomes popular with users we can reconsider this.

@RobinVogel RobinVogel changed the title from "Adds a semi-supervised version of weak algorithms" to "Adds a semi-supervised (specifically a combination of supervised and weakly-supervised data) version of weak algorithms" Dec 10, 2019
@RobinVogel RobinVogel closed this by deleting the head repository Jun 15, 2024
Successfully merging this pull request may close these issues.

Combination of supervised and weakly-supervised data
3 participants