[BUG] UMAP random_state doesn't provide consistency #5892

chentitus · 2024-05-16T10:26:30Z

Dear cuml team,

I am utilizing BERTopic for topic modeling. I understand that when I import UMAP from umap, and HDBSCAN from hdbscan, I can reproduce the results of topic modeling by setting random_state in UMAP.

But I realized that if I import HDBSCAN from cuml.cluster, and UMAP from cuml.manifold, then the results of topic modeling can no longer be replicated even when I set random_state in UMAP.

I am running this on Colab, with BERTopic upgraded to 0.16.2.

Any ideas on how I can reproduce topic modeling results using cuml? Thanks much!

beckernick · 2024-05-16T14:04:36Z

The UMAP docstring indicates that random_state can't provide exact determinism but should provide consistency up to about 3 digits of precision.

@dantegd, is it possible that we have a bug, or that the documentation is wrong?

import cuml
from sklearn.datasets import make_blobs

N = 1000

X, y = make_blobs(
    n_samples=N
)

NREP = 3
for i in range(NREP):
    reducer = cuml.manifold.umap.UMAP(
        random_state=12
    )
    X_t = reducer.fit_transform(X)
    print(reducer.random_state)
    print(X_t[:5])
    print()

Output:

662124363
[[-2.5505848  -0.63661003]
 [-5.3669243  -0.07881355]
 [-4.428316    1.4433041 ]
 [-0.9989338  10.929661  ]
 [ 6.8667793  -9.262173  ]]

662124363
[[ -1.9667425   -2.6903896 ]
 [ -3.396501    -0.25006104]
 [ -1.6785622    0.13145828]
 [  3.3643045   11.314904  ]
 [ -2.0715647  -11.898888  ]]

662124363
[[  0.3823166    2.5653324 ]
 [  0.5335636   -0.0426445 ]
 [  2.2950068    0.81112003]
 [ -7.4286957   10.400803  ]
 [  8.3242235  -10.5068655 ]]
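
To put a number on the mismatch, here is a minimal sketch that checks whether repeated runs agree to a given number of decimal places. The helper names `max_run_deviation` and `consistent_to_digits` are hypothetical, and reading "3 digits of precision" as an absolute tolerance of 1e-3 is an assumption, not the docstring's definition. The data are the first two rows of each run printed above.

```python
import numpy as np


def max_run_deviation(runs):
    """Largest absolute coordinate difference between the first run and any later run."""
    reference = np.asarray(runs[0], dtype=float)
    return max(
        float(np.max(np.abs(np.asarray(run, dtype=float) - reference)))
        for run in runs[1:]
    )


def consistent_to_digits(runs, digits=3):
    """True if every run matches the first run to `digits` decimal places.

    Assumption: "digits of precision" is read as an absolute tolerance of 10**-digits.
    """
    return max_run_deviation(runs) < 10.0 ** (-digits)


# First two rows of each default-init run, copied from the output above.
spectral_runs = [
    [[-2.5505848, -0.63661003], [-5.3669243, -0.07881355]],
    [[-1.9667425, -2.6903896], [-3.396501, -0.25006104]],
    [[0.3823166, 2.5653324], [0.5335636, -0.0426445]],
]

print(consistent_to_digits(spectral_runs))  # prints False
```

The runs differ by several whole units, so this is far outside any rounding-level tolerance.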

chentitus · 2024-05-16T15:57:46Z

Dear cuml team,

Another cuml-related issue has just popped up:

I need to know the topic distribution of each document, so I followed the BERTopic instructions for approximate_distribution, but it returns an ndarray containing nothing but 0s.

I have just realized that this issue may be due to cuml.

approximate_distribution can generate topic distribution if I use

from umap import UMAP
from hdbscan import HDBSCAN

But approximate_distribution returns only 0s if I use

from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

Any help or advice is much appreciated!

viclafargue · 2024-05-16T15:58:44Z

@beckernick I am not sure random_state works with spectral initialization. Could you try using init="random"?

cjnolet · 2024-05-16T19:11:59Z

That looks like a bug to me. Oddly, we also have Python tests for reproducibility, and those appear to be passing...

Victor's got a good point: it's very possible the spectral embedding is not honoring the random state, and that's why we are using random init in the pytests.

beckernick · 2024-05-30T18:36:27Z

Looks like that's the bug:

import cuml
from sklearn.datasets import make_blobs

N = 1000

X, y = make_blobs(
    n_samples=N
)

NREP = 3
for i in range(NREP):
    reducer = cuml.manifold.umap.UMAP(
        random_state=12,
        init="random"
    )
    X_t = reducer.fit_transform(X)
    print(reducer.random_state)
    print(X_t[:5])
    print()

Output:

662124363
[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]

662124363
[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]

662124363
[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]
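
For the original BERTopic question, a possible workaround until spectral initialization honors the seed is to force init="random" whenever the cuML UMAP is constructed. This is a sketch under assumptions: `reproducible_umap_params` and `build_gpu_models` are hypothetical helper names, and the thread only confirms that the UMAP embedding repeats with init="random"; it does not confirm that cuML HDBSCAN or the full BERTopic pipeline becomes reproducible.

```python
def reproducible_umap_params(seed, **overrides):
    """Keyword arguments for cuml.manifold.UMAP that avoid spectral initialization.

    Caller-supplied overrides are kept, except that random_state and init
    are always forced to the reproducible combination shown in this thread.
    """
    params = dict(overrides)
    params.update(random_state=seed, init="random")
    return params


def build_gpu_models(seed=12):
    """Construct cuML UMAP and HDBSCAN models (requires a GPU and cuML)."""
    # Imported lazily so the helper above can be used without cuML installed.
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP

    umap_model = UMAP(**reproducible_umap_params(seed))
    hdbscan_model = HDBSCAN()  # defaults; its reproducibility is not verified here
    return umap_model, hdbscan_model


print(reproducible_umap_params(12))
```

The two models returned by `build_gpu_models` can then be passed to BERTopic through its `umap_model` and `hdbscan_model` arguments.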

viclafargue · 2024-06-13T15:55:32Z

Are we planning to fix spectral initialization soon, or should I open a PR to document this limitation for now?

cc @cjnolet @dantegd

chentitus added the "? - Needs Triage" and "question" labels on May 16, 2024

beckernick added the "CUDA / C++", "Cython / Python", and "bug" labels and removed the "? - Needs Triage" and "question" labels on May 16, 2024

beckernick changed the title from "[QST] Can't reproduce same BERTopic results when using cuml version of UMAP and HDBSCAN" to "[BUG] UMAP random_state doesn't provide consistency when used" on May 16, 2024

beckernick changed the title from "[BUG] UMAP random_state doesn't provide consistency when used" to "[BUG] UMAP random_state doesn't provide consistency" on May 16, 2024
