[BUG] UMAP random_state doesn't provide consistency #5892
Comments
The UMAP docstring indicates that random_state should give consistent results, but it doesn't in the example below. @dantegd, is it possible we have a bug, or is the documentation wrong?

    import cuml
    from sklearn.datasets import make_blobs

    N = 1000
    X, y = make_blobs(
        n_samples=N
    )

    NREP = 3
    for i in range(NREP):
        reducer = cuml.manifold.umap.UMAP(
            random_state=12
        )
        X_t = reducer.fit_transform(X)
        print(reducer.random_state)
        print(X_t[:5])
        print()

662124363
[[-2.5505848 -0.63661003]
[-5.3669243 -0.07881355]
[-4.428316 1.4433041 ]
[-0.9989338 10.929661 ]
[ 6.8667793 -9.262173 ]]
662124363
[[ -1.9667425 -2.6903896 ]
[ -3.396501 -0.25006104]
[ -1.6785622 0.13145828]
[ 3.3643045 11.314904 ]
[ -2.0715647 -11.898888 ]]
662124363
[[ 0.3823166 2.5653324 ]
[ 0.5335636 -0.0426445 ]
[ 2.2950068 0.81112003]
[ -7.4286957 10.400803 ]
[ 8.3242235 -10.5068655 ]]
Dear cuml team, another cuml-related issue has just popped up. I need the topic distribution of each document, so I followed the BERTopic instructions for approximate_distribution, but it returns an ndarray containing nothing but 0s. I have just realized that this issue may be due to cuml: approximate_distribution can generate topic distributions if I use
But approximate_distribution returns only 0s if I use
Any help or advice is much appreciated!
@beckernick I am not quite sure if it works with spectral initialization. Could you try using
That looks like a bug to me. Oddly, we also have Python tests for reproducibility, and those appear to be passing... Victor's got a good point: it's very possible the spectral embedding is not honoring the random state, and that's why we are using random init in the pytests.
Looks like that's the bug:

    import cuml
    from sklearn.datasets import make_blobs

    N = 1000
    X, y = make_blobs(
        n_samples=N
    )

    NREP = 3
    for i in range(NREP):
        reducer = cuml.manifold.umap.UMAP(
            random_state=12,
            init="random"
        )
        X_t = reducer.fit_transform(X)
        print(reducer.random_state)
        print(X_t[:5])
        print()

662124363
[[ -4.766629 8.464443 ]
[ 8.891461 1.2006083]
[ -7.211566 -7.8680773]
[ -5.811491 -12.208349 ]
[ -6.8120937 7.2288113]]
662124363
[[ -4.766629 8.464443 ]
[ 8.891461 1.2006083]
[ -7.211566 -7.8680773]
[ -5.811491 -12.208349 ]
[ -6.8120937 7.2288113]]
662124363
[[ -4.766629 8.464443 ]
[ 8.891461 1.2006083]
[ -7.211566 -7.8680773]
[ -5.811491 -12.208349 ]
[ -6.8120937 7.2288113]]
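
For anyone who wants to automate the comparison instead of eyeballing printed rows, here is a minimal sketch of a reproducibility check. The names is_reproducible, as_nested_list, and FakeReducer are hypothetical helpers written for this illustration and are not part of cuML; FakeReducer only stands in for a real reducer so the snippet runs without a GPU. The check assumes that a reproducible reducer returns exactly equal embeddings, which matches the init="random" output above but may be too strict in other setups.

```python
import random


def as_nested_list(embedding):
    """Convert an array-like result to nested Python lists for comparison."""
    if hasattr(embedding, "tolist"):  # NumPy-style arrays
        return embedding.tolist()
    return [list(row) for row in embedding]


def is_reproducible(make_reducer, X, n_runs=3):
    """Fit a fresh reducer n_runs times; True if every embedding equals the first."""
    runs = [as_nested_list(make_reducer().fit_transform(X)) for _ in range(n_runs)]
    return all(run == runs[0] for run in runs[1:])


class FakeReducer:
    """Stand-in reducer for illustration only (hypothetical, not a cuML class)."""

    def __init__(self, seed=None):
        # seed=None draws entropy from the OS, so results differ between instances.
        self._rng = random.Random(seed)

    def fit_transform(self, X):
        # Emit one pseudo-random 2-D point per input row.
        return [[self._rng.random(), self._rng.random()] for _ in X]


data = [[0.0, 0.0]] * 5
print(is_reproducible(lambda: FakeReducer(seed=12), data))
print(is_reproducible(lambda: FakeReducer(), data))
```

On a machine with cuML installed, the same helper could be called as is_reproducible(lambda: cuml.manifold.umap.UMAP(random_state=12, init="random"), X), and again without init="random" to compare the two initializations.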
Dear cuml team,
I am utilizing BERTopic for topic modeling. I understand that when I import UMAP from umap, and HDBSCAN from hdbscan, I can reproduce the results of topic modeling by setting random_state in UMAP.
But I realized that if I import HDBSCAN from cuml.cluster, and UMAP from cuml.manifold, then the results of topic modeling can no longer be replicated even when I set random_state in UMAP.
This was done on the Colab platform, and I upgraded BERTopic to 0.16.2.
Any ideas on how I can reproduce topic modeling results using cuml? Thanks much!
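
Based on the discussion elsewhere in this thread, one thing that may be worth trying is passing init="random" to the cuML UMAP that BERTopic uses. The sketch below assumes BERTopic and cuML are both installed; build_topic_model is a hypothetical helper, and the hyperparameter values are placeholders chosen for illustration. It is not a confirmed fix: cuML's HDBSCAN or other steps in the pipeline may still introduce run-to-run variation.

```python
def build_topic_model(seed=12):
    """Build a BERTopic model backed by cuML UMAP and HDBSCAN (hypothetical helper)."""
    # Imported lazily so the function can be defined without a GPU stack present.
    from bertopic import BERTopic
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP

    # init="random" avoids the spectral initialization that this thread
    # suspects of not honoring random_state.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        random_state=seed,
        init="random",
    )
    hdbscan_model = HDBSCAN(
        min_cluster_size=10,
        gen_min_span_tree=True,
        prediction_data=True,
    )
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```

Calling build_topic_model() twice with the same seed and fitting both models on the same documents would show whether the topics now match between runs.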