
BICO

BICO is a fast streaming algorithm that computes high-quality solutions for the k-means problem on very large sets of points. It combines the tree data structure of the SIGMOD Test of Time Award-winning algorithm BIRCH with insights from clustering theory to obtain solutions quickly while keeping the error with respect to the k-means cost function low.
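
As a rough intuition for the data structure (a hedged sketch with assumed names, not the package's actual internals): like BIRCH, BICO summarizes groups of points in clustering features that store the point count, the linear sum, and the sum of squared norms. Two features can be merged by adding their fields, and the k-means cost of replacing all summarized points by their centroid has a closed form.

import numpy as np

class ClusteringFeature:
    """BIRCH-style summary of a set of points (illustrative only)."""

    def __init__(self, point: np.ndarray):
        self.n = 1                                     # number of summarized points
        self.linear_sum = point.astype(float).copy()   # sum of the points
        self.squared_sum = float(point @ point)        # sum of squared norms

    def merge(self, other: "ClusteringFeature") -> None:
        # Merging two summaries is component-wise addition.
        self.n += other.n
        self.linear_sum += other.linear_sum
        self.squared_sum += other.squared_sum

    def centroid(self) -> np.ndarray:
        return self.linear_sum / self.n

    def cost(self) -> float:
        # Cost of replacing all summarized points by their centroid:
        # sum ||x - mu||^2 = squared_sum - ||linear_sum||^2 / n
        return self.squared_sum - float(self.linear_sum @ self.linear_sum) / self.n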

Installation

pip install bico

Example

from bico import BICO
import numpy as np
import time

np.random.seed(42)

data = np.random.rand(10000, 10)

start = time.time()
bico = BICO(n_clusters=3, random_state=0, fit_coreset=True)
bico.fit(data)

print("Time:", time.time() - start)
# Time: 0.08275651931762695

print(bico.coreset_points_)
# BICO returns a set of points that act as a summary of the entire dataset.
# By default, at most 200 * n_clusters points are returned.
# This behaviour can be changed by setting the `summary_size` parameter.

# [[0.45224018 0.70183673 0.55506671 ... 0.70132665 0.57244196 0.66789088]
#  [0.73712952 0.5250208  0.43809322 ... 0.61427161 0.67910981 0.56207661]
#  [0.89905336 0.46942062 0.20677639 ... 0.74210482 0.75714522 0.49651055]
#  ...
#  [0.68744494 0.41508081 0.39197623 ... 0.44093386 0.21983902 0.37237243]
#  [0.60820965 0.29406341 0.67067782 ... 0.66435474 0.2390822  0.20070476]
#  [0.67385626 0.33474823 0.68238779 ... 0.3581703  0.65646253 0.41386131]]

print(bico.cluster_centers_)
# If the `fit_coreset` parameter is set to True, the cluster centers are computed by running scikit-learn's KMeans on the coreset.

# [[0.46892639 0.41968333 0.47302945 0.51782955 0.39390839 0.56209413
#   0.4481691  0.49521457 0.31394509 0.5104331 ]
#  [0.54384638 0.518978   0.49456809 0.56677848 0.63881783 0.33627504
#   0.49873782 0.5541338  0.52913562 0.56017203]
#  [0.48639347 0.55542596 0.54350474 0.41931257 0.48117255 0.60089563
#   0.55457724 0.44833238 0.67583389 0.43069267]]

Example with Large Datasets

For very large datasets, the data may not fit into memory. In this case, you can use partial_fit to stream the data in chunks. The following example streams the US Census Data (1990) dataset with pandas. You can find more examples in the tests folder.

from bico import BICO
import pandas as pd
import time

start = time.time()
bico = BICO(n_clusters=3, random_state=0)
for chunk in pd.read_csv(
    "census.txt", delimiter=",", header=None, chunksize=10000
):
    bico.partial_fit(chunk.to_numpy(copy=False))
# A final call to `partial_fit` with no arguments computes the coreset
bico.partial_fit()

print("Time:", time.time() - start)
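
If `fit_coreset` was not set, you can cluster the computed summary yourself. Here is a hedged sketch using scikit-learn's KMeans on the coreset points; note that coresets are generally weighted, so if the package exposes per-point weights, they should be passed via `sample_weight`:

from sklearn.cluster import KMeans

# Illustrative only: run k-means on the summary instead of the full stream.
# If the package exposes coreset weights, pass them as `sample_weight` to fit.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
kmeans.fit(bico.coreset_points_)
print(kmeans.cluster_centers_)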

Development

Install Poetry

curl -sSL https://install.python-poetry.org | python3 -

Install clang

sudo apt-get install clang

Set clang variables

export CXX=/usr/bin/clang++
export CC=/usr/bin/clang

Install the package

poetry install

If the installation fails and you do not see the C++ compiler output, build the package directly to see the full stack trace

poetry build

Run the tests

poetry run python -m unittest discover tests -v

Citation

If you use this code, please cite the following paper:

Hendrik Fichtenberger, Marc Gillé, Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. "BICO: BIRCH Meets Coresets for k-Means Clustering." ESA 2013.