[BUG] UMAP random_state doesn't provide consistency #5892

Open
chentitus opened this issue May 16, 2024 · 6 comments
Labels
bug Something isn't working CUDA / C++ CUDA issue Cython / Python Cython or Python issue

Comments

@chentitus

Dear cuml team,

I am utilizing BERTopic for topic modeling. I understand that when I import UMAP from umap, and HDBSCAN from hdbscan, I can reproduce the results of topic modeling by setting random_state in UMAP.

But I realized that if I import HDBSCAN from cuml.cluster, and UMAP from cuml.manifold, then the results of topic modeling can no longer be replicated even when I set random_state in UMAP.

I am running this on Colab, with BERTopic upgraded to 0.16.2.

Any ideas on how I can reproduce topic modeling results using cuml? Thanks much!

@chentitus chentitus added ? - Needs Triage Need team to review and classify question Further information is requested labels May 16, 2024
@beckernick
Member

beckernick commented May 16, 2024

The UMAP docstring indicates that random_state can't provide exact determinism but should provide consistency up to about 3 digits of precision.

@dantegd, is it possible that we have a bug, or that the documentation is wrong?

import cuml
from sklearn.datasets import make_blobs

N = 1000

X, y = make_blobs(
    n_samples=N
)

NREP = 3
for i in range(NREP):
    reducer = cuml.manifold.umap.UMAP(
        random_state=12
    )
    X_t = reducer.fit_transform(X)
    print(reducer.random_state)
    print(X_t[:5])
    print()
Output:
662124363
[[-2.5505848  -0.63661003]
 [-5.3669243  -0.07881355]
 [-4.428316    1.4433041 ]
 [-0.9989338  10.929661  ]
 [ 6.8667793  -9.262173  ]]

662124363
[[ -1.9667425   -2.6903896 ]
 [ -3.396501    -0.25006104]
 [ -1.6785622    0.13145828]
 [  3.3643045   11.314904  ]
 [ -2.0715647  -11.898888  ]]

662124363
[[  0.3823166    2.5653324 ]
 [  0.5335636   -0.0426445 ]
 [  2.2950068    0.81112003]
 [ -7.4286957   10.400803  ]
 [  8.3242235  -10.5068655 ]]
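
The "about 3 digits of precision" expectation can be checked programmatically rather than by eye. The following is a minimal sketch, not part of cuML: `as_rows` and `embeddings_agree` are hypothetical helper names, and reading "3 digits" as an absolute tolerance of `1e-3` is an assumption. The sample values are the first two rows of the first two runs printed above.

```python
import math


def as_rows(embedding):
    """Convert a NumPy/CuPy array (anything with .tolist()) or a nested sequence to lists."""
    return embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)


def embeddings_agree(runs, abs_tol=1e-3):
    """Return True if every run matches the first run elementwise within abs_tol."""
    reference = as_rows(runs[0])
    for run in runs[1:]:
        rows = as_rows(run)
        if len(rows) != len(reference):
            return False
        for ref_row, row in zip(reference, rows):
            if len(ref_row) != len(row):
                return False
            if not all(math.isclose(a, b, abs_tol=abs_tol) for a, b in zip(ref_row, row)):
                return False
    return True


# First two rows of two runs from the thread (default init, random_state=12).
run_1 = [[-2.5505848, -0.63661003], [-5.3669243, -0.07881355]]
run_2 = [[-1.9667425, -2.6903896], [-3.396501, -0.25006104]]

print(embeddings_agree([run_1, run_2]))  # False: the runs differ far beyond 1e-3
print(embeddings_agree([run_1, run_1]))  # True: identical runs agree
```

In the reproduction loop, the same check could be applied by collecting each `X_t` into a list and passing that list to `embeddings_agree`.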

@beckernick beckernick added CUDA / C++ CUDA issue Cython / Python Cython or Python issue bug Something isn't working and removed ? - Needs Triage Need team to review and classify question Further information is requested labels May 16, 2024
@beckernick beckernick changed the title [QST] Can't reproduce same BERTopic results when using cuml version of UMAP and HDBSCAN [BUG] UMAP random_state doesn't provide consistency when used May 16, 2024
@beckernick beckernick changed the title [BUG] UMAP random_state doesn't provide consistency when used [BUG] UMAP random_state doesn't provide consistency May 16, 2024
@chentitus
Author

Dear cuml team,

Another cuml-related issue has just popped up:

I need the topic distribution of each document, so I followed the BERTopic instructions for approximate_distribution, but it returns an ndarray containing nothing but 0s.

I have just realized that this issue may be due to cuml.

approximate_distribution generates a topic distribution if I use

from umap import UMAP
from hdbscan import HDBSCAN

But approximate_distribution returns only 0s if I use

from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

Any help or advice is much appreciated!

@viclafargue
Contributor

@beckernick I am not sure random_state works with spectral initialization. Could you try using init="random"?

@cjnolet
Member

cjnolet commented May 16, 2024

That looks like a bug to me. Oddly, we also have Python tests for reproducibility, and those appear to be passing...

Victor's got a good point: it's very possible the spectral embedding is not honoring the random state, and that's why we are using random init in the pytests.

@beckernick
Member

Looks like that's the bug:

import cuml
from sklearn.datasets import make_blobs

N = 1000

X, y = make_blobs(
    n_samples=N
)

NREP = 3
for i in range(NREP):
    reducer = cuml.manifold.umap.UMAP(
        random_state=12,
        init="random"
    )
    X_t = reducer.fit_transform(X)
    print(reducer.random_state)
    print(X_t[:5])
    print()
Output:
662124363
[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]

662124363
[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]

662124363
[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]
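
For the original BERTopic use case, the workaround above could be passed through to the topic model roughly as follows. This is a sketch under assumptions rather than a verified fix: `reproducible_umap_kwargs` and `build_topic_model` are hypothetical helper names, the remaining UMAP and HDBSCAN settings are illustrative values that do not come from this thread, and the thread only confirms that UMAP itself becomes repeatable with `init="random"`. Whether cuML's HDBSCAN and the rest of the BERTopic pipeline are then fully reproducible is not established here.

```python
def reproducible_umap_kwargs(seed=12):
    """UMAP keyword arguments that gave identical runs in the thread: fixed seed plus random init."""
    return {"random_state": seed, "init": "random"}


def build_topic_model(seed=12):
    """Build a BERTopic model on cuML's UMAP and HDBSCAN (needs a GPU, cuml, and bertopic)."""
    # Imported lazily so this module can still be loaded on a machine without a GPU.
    from bertopic import BERTopic
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP

    # n_components, n_neighbors, and min_dist are illustrative, not taken from the thread.
    umap_model = UMAP(
        n_components=5,
        n_neighbors=15,
        min_dist=0.0,
        **reproducible_umap_kwargs(seed),
    )
    # HDBSCAN settings are likewise illustrative assumptions.
    hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```

Calling `build_topic_model(seed=12).fit_transform(docs)` twice on the same list of documents and comparing the resulting topics is one way to see whether the workaround is sufficient in a given environment.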

@viclafargue
Contributor

Are we planning to fix spectral initialization now, or should I open a PR to document this limitation in the meantime?

cc @cjnolet @dantegd
