Fix hierarchical_topics(...) when the distances between three clusters are the same #1929

Merged 9 commits on Jun 13, 2024

Contributor

@azikoss azikoss commented Apr 17, 2024

Adds functionality that ensures the distances between clusters are unique (by adding a small amount of noise) while hierarchical_topics(...) is calculated. Otherwise, flattening the hierarchy would produce incorrect "Topics" values for these clusters (#1907).

Owner

@MaartenGr MaartenGr left a comment

Thanks for your work on this! I left a few notes with things that we might need to consider.

tests/test_utils.py
assert len(unique_dists) == len(dists), "The number of elements must be the same"
assert len(dists) == len(np.unique(unique_dists)), "The distances must be unique"

check_dists([0, 0, 0.5, 0.75, 1, 1], noise_max=1e-7)
Owner

Have you checked the actual values of the updated distance list? When I run it, I get the following updated values:

[0.00000000e+00 8.32483552e-08 5.00000000e-01 7.50000000e-01
 1.00000000e+00 2.00000008e+00]

The last value is twice as big, which should not happen. I have a feeling the code for get_unique_distances could be simplified a bit. What about simply doing something like this:

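import numpy as np

def get_unique_distances(dists):
    # Small random step used to push each tied distance past the previous one
    increment = np.random.uniform(low=1e-6, high=1e-5)
    last_val = -float('inf')
    # Keep a distance as is unless it does not exceed the previous output value
    return [last_val := max(dist, last_val + increment) for dist in dists]

my_list = [0, 0, 0, 0.5, 0.75, 1, 1]
get_unique_distances(my_list)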

Contributor Author

This is a nice simplification.

Are we ok with changing distances that do not have a duplicate?

For example, check_dists([0, 0, 0, 0, 0, 0, 0, 1e-7], noise_max=1e-7) changes the last value, because otherwise the distances would not be in increasing order.
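
To make the concern concrete, here is a minimal sketch of the running-maximum approach suggested above, using a fixed step instead of a random one so the result is reproducible. The name `bump_ties` and the `step` parameter are illustrative assumptions, not code from the PR.

```python
def bump_ties(dists, step):
    """Push every value that does not exceed its predecessor up by `step`.

    Hypothetical helper: mirrors the running-maximum idea with a fixed step.
    """
    out = []
    last = float("-inf")
    for dist in dists:
        # Keep the value if it is already larger than the previous output,
        # otherwise move it just past the previous output.
        last = max(dist, last + step)
        out.append(last)
    return out


dists = [0, 0, 0, 0, 0, 0, 0, 1e-7]
result = bump_ties(dists, step=1e-7)

# The seven tied zeros become strictly increasing ...
assert all(a < b for a, b in zip(result, result[1:]))
# ... but the last value, which had no duplicate, is pushed above 1e-7.
assert result[-1] > dists[-1]
```

With a step this close to the gap between distinct values, the bumped ties overtake the next value and force it to move as well.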

I had a bug in the code (it should assign rather than add), which is why the last value was 2.00000008e+00.
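
The remark about assigning instead of adding is consistent with the values reported earlier in this thread: adding the bumped value on top of an existing 1 gives roughly 2.00000008, while at 0 the mistake is invisible. The sketch below is a hypothetical reconstruction of that kind of bug, not the code from the PR; `perturb_ties`, `noise`, and `assign` are made-up names.

```python
def perturb_ties(dists, noise, assign=True):
    """Nudge each value that equals its predecessor upward by `noise`.

    Hypothetical reconstruction: `assign=False` reproduces an add-instead-of-assign bug.
    """
    out = list(dists)
    for i in range(len(out) - 1):
        if dists[i] == dists[i + 1]:
            bumped = out[i] + noise
            if assign:
                out[i + 1] = bumped   # intended: replace the duplicate
            else:
                out[i + 1] += bumped  # bug: adds on top of the old value
    return out


dists = [0, 0, 0.5, 0.75, 1, 1]
buggy = perturb_ties(dists, noise=8e-8, assign=False)
fixed = perturb_ties(dists, noise=8e-8, assign=True)

# At 0 both variants agree; at 1 the buggy variant roughly doubles the value.
assert buggy[1] == fixed[1]
assert abs(buggy[-1] - 2.00000008) < 1e-9
assert abs(fixed[-1] - 1.00000008) < 1e-9
```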

Owner

> Are we ok with changing distances that do not have a duplicate?

Hmmm, my preference would indeed be to keep them as is as long as it requires no more than one or two lines of code. I would like to simplify this as much as possible.
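
One possible way to leave non-duplicate distances untouched is to cap the noise at the gap to the next distinct value and spread each run of ties inside that gap. The sketch below assumes the input is sorted in ascending order and uses evenly spaced offsets rather than random noise. `spread_ties` is a hypothetical name, this is not the implementation that was merged, and it is longer than the one or two lines hoped for here.

```python
from itertools import groupby


def spread_ties(dists, noise_max=1e-7):
    """Break ties in an ascending list without changing already-unique values.

    Hypothetical helper, shown only to illustrate the idea.
    """
    # Collapse consecutive equal values into (value, count) runs.
    runs = [(value, len(list(group))) for value, group in groupby(dists)]
    out = []
    for idx, (value, count) in enumerate(runs):
        # Room available before the next distinct value (noise_max after the last run).
        next_value = runs[idx + 1][0] if idx + 1 < len(runs) else value + noise_max
        gap = min(noise_max, next_value - value)
        # The first copy keeps its value; later copies are spaced evenly inside the gap.
        out.extend(value + gap * j / count for j in range(count))
    return out


result = spread_ties([0, 0, 0, 0, 0, 0, 0, 1e-7], noise_max=1e-7)

# All eight values are now distinct and the non-duplicate 1e-7 is unchanged.
assert len(set(result)) == 8
assert result[-1] == 1e-7
```

For distances that are very large relative to `noise_max`, floating-point rounding could still collapse the offsets, so a real implementation would need to check for that.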

Contributor Author

I simplified the code. Please have a look and let me know if you have any ideas.

Owner

Thanks for the changes! I just tested it a bunch of times and it all looks good to me. Thanks for simplifying the code. I'll re-run the workflow to check whether everything passes. If it does, I will go ahead and merge the PR.

Owner

The tests failed, but I believe that is because you used list[float], which is not supported in Python 3.8. Removing that should make the tests pass, I think.
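
For context, subscripting built-in types such as list[float] was introduced in Python 3.9 (PEP 585). On Python 3.8 the annotation is evaluated when the function is defined and raises a TypeError. If the hint is worth keeping rather than removing, typing.List is a common alternative that also works on 3.8. The function below is a hypothetical stand-in, not the signature used in the PR.

```python
from typing import List


def as_floats(dists: List[float]) -> List[float]:
    """Hypothetical helper; only the annotations matter here.

    `List[float]` from `typing` works on Python 3.8, whereas the built-in
    form `list[float]` requires Python 3.9 or newer.
    """
    return [float(d) for d in dists]


converted = as_floats([0, 0, 1])
```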

Contributor Author

Ah, yeah, you are right! I just changed it! Thank you!

Contributor Author

azikoss commented Jun 3, 2024

@MaartenGr could you please run the tests and see if we can merge this?

@MaartenGr MaartenGr mentioned this pull request Jun 6, 2024
@MaartenGr
Owner

@azikoss After merging #1894 there are now a couple of small conflicts in this PR. Could you take a look? As soon as those are resolved I will go ahead and merge this PR.

Contributor Author

azikoss commented Jun 12, 2024

Yes, done!

@MaartenGr
Owner

Awesome, thank you for the work on this, it is greatly appreciated! 😄

@MaartenGr MaartenGr merged commit 0a28916 into MaartenGr:master Jun 13, 2024
5 checks passed