GitHub - mynameisfiber/pyxmeans: Quick implementation of xmeans in python and C

pyxmeans

This is a quick implementation of XMeans for using kmeans type clustering with an unknown number of clusters. In order to make this code runnable, I chose to use MiniBatchKMeans instead of KMeans, but they should be swappable.

Currently MiniBatch and XMeans are supported. XMeans uses multiple MiniBatch and trial MiniBatch runs in order to infer how many clusters the data has. It does so by taking the population of a given cluster center, splitting it, and seeing if the resulting labels have a better BIC (Bayesian Information Criterion) than before. This can be done successively until we find the number of clusters.

In addition to providing XMeans, the MiniBatch implementation in this package is exceedingly fast. Below are benchmarks for all the provided clustering methods and sklearn's MiniBatch routine.

$ python -m pyxmeans.benchmark
Creating data
Number of points:  10000
Number of dimensions:  2
Number of clusters:  32
initial BIC:  -50214.4559857
initial variance:  0.00164148581105
initial RMS Error:  2.31798411948

Clustering with single-threaded pyxmeans
singlethreaded pyxmeans took 0.043875s
BIC:  -50762.9827994
Variance:  0.00115765494439
RMS Error:  2.31035692593

Clustering with multi-threaded pyxmeans
multithreaded pyxmeans took 0.326129s
BIC:  -50982.8001508
Variance:  0.00104848000929
RMS Error:  2.3113536455

Clustering with multi-threaded pyxmeans (starting k at 20)
multithreaded pyxmeans took 79.005781s
Num Clusters:  30
BIC:  -50352.8461421
Variance:  0.00104986238957
RMS Error:  2.31100693171

Clustering with sklearn
scikitlearn took 9.241426s
BIC:  -50679.1763114
Variance:  0.00112580908789
RMS Error:  2.31050074192

NOTES:

* `max_no_improvement` is set to `None` for MiniBatchKMeans to properly
  compare per-iteration speeds since we currently do not support
  early-stopping.
* RMS Error for the multi-threaded pymeans is higher because that function
  aims at minimizing the variance of the resulting model.

Dependencies

numpy

Todo:

Optimize data layout when dealing with / comparing computed clusters
Better memory management in XMeans (we're copying things everywhere)
Pool out children tests in XMeans

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
pyxmeans		pyxmeans
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
benchmark.png		benchmark.png
benchmark.py		benchmark.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pyxmeans

pyxmeans

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

benchmark.png

benchmark.png

benchmark.py

benchmark.py

setup.py

setup.py

Repository files navigation

pyxmeans

Dependencies

Todo:

About

Releases

Packages

Contributors 3

Languages

License

mynameisfiber/pyxmeans

Folders and files

Latest commit

History

Repository files navigation

pyxmeans

Dependencies

Todo:

About

Resources

License

Stars

Watchers

Forks

Languages