
Clustering freezing when assigning noise points: #1190

Open
adrianlyjak opened this issue Feb 22, 2024 · 4 comments

Comments

@adrianlyjak

I'm attempting to cluster ~600k short texts (reviews). The process runs fine up until it logs that it's assigning noise points to clusters; by that point it has spent close to an hour embedding and clustering.

I successfully clustered a much smaller dataset (1,000 items) sampled from this data.

I'm running Lilac in Docker with the latest tag, which currently appears to be lilacai/lilac:0.3.5, on a system with a GeForce RTX 3090.

Here are the logs:

jinaai/jina-embeddings-v2-small-en using device: cuda:0
[local/reviews][1 shards] map "cluster_documents" to "('review__cluster',)"Computing embeddings: 100%|██████████| 646314/646314 [35:16<00:00, 305.37it/s]
Computing embeddings took 2122.877s.
/usr/local/lib/python3.11/site-packages/umap/umap_.py:1943: UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
UMAP: Reducing dim from 512 to 5 of 646314 vectors took 1435.081s.
HDBSCAN: Clustering took 151.289s.
237724 noise points (36.8%) will be assigned to nearest cluster.

After this point it freezes, and the UI and server become very slow. The main thread appears to be very busy.

(screenshot: process monitor showing the main thread at high CPU)

Looking at the code, it seems like it may be skipping the label assignment entirely:

    with DebugTimer('HDBSCAN: Computing membership for the noise points'):

    if num_noisy > 0 and num_noisy < len(clusterer.labels_):
The num_noisy count here seems quite high, so I assume this condition isn't true and label assignment is actually being skipped. My read-through of the code falls apart after that, and I'm unsure where the process is spending its time.
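For reference, here is a minimal sketch of what "assigning noise points to the nearest cluster" typically means — a hypothetical reconstruction using scikit-learn, not Lilac's actual code. With ~237k noise points queried against ~408k clustered points in 5-D, this step is expensive, which may explain the long busy period even if it isn't truly frozen:

```python
# Hypothetical sketch, not Lilac's implementation: give each HDBSCAN noise
# point (label -1) the label of its nearest non-noise neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def assign_noise_to_nearest_cluster(points: np.ndarray, labels: np.ndarray) -> np.ndarray:
  labels = labels.copy()
  noise = labels == -1
  num_noisy = int(noise.sum())
  # Same guard as in the snippet above: skip when there are no noise
  # points, or when *every* point is noise (no clusters to assign to).
  if num_noisy > 0 and num_noisy < len(labels):
    # One nearest-neighbor query per noise point over all clustered points —
    # this is the potentially slow part at 600k+ vectors.
    nn = NearestNeighbors(n_neighbors=1).fit(points[~noise])
    _, idx = nn.kneighbors(points[noise])
    labels[noise] = labels[~noise][idx[:, 0]]
  return labels
```
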

@deepfates

I get this on a Mac M3 as well. I tried clustering the same data on CPU on Linux, but it failed differently there, at a later step:

 UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
UMAP: Reducing dim from 512 to 5 of 60844 vectors took 59.686s.
HDBSCAN: Clustering took 3.125s.
25147 noise points (41.3%) will be assigned to nearest cluster.

@kostum123

kostum123 commented Feb 25, 2024

Same for me. It's stuck at "noise points (42.3%) will be assigned to nearest cluster." Also, bge-m3 can't be used as the clustering model: Lilac defaults to a weak, English-only model regardless of my preferred embedding model, and instead of reusing the embeddings already computed for the same dataset field, it creates new ones. I hope both get fixed.

@dsmilkov
Collaborator

We just added dataset.cluster(skip_noisy_assignment=...) (UI support too) in #1194

When set to True, it will skip assigning noisy points to the nearest cluster, to speed up clustering. This will be available in the next release (in a couple of days).

For much faster (100x) clustering, please apply for our Lilac Garden pilot.

@kostum123

Will I be able to use bge-m3 for clustering?
