k-means-clustering

This project is a C implementation of the k-means clustering algorithm that has been parallelized to run across multiple threads with OpenMP and uses silhouette coefficients to find an optimal number of clusters.
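As an illustration of how a k-means step lends itself to OpenMP, the following is a minimal sketch of the assignment step, where each point is matched to its nearest centroid. The function names (`sq_dist`, `assign_clusters`) and the row-major array layout are assumptions made for this example and are not taken from the project's source; the project may distribute work differently, for instance across independent k-means runs.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two points of dimension dim. */
static double sq_dist(const double *a, const double *b, size_t dim)
{
    double sum = 0.0;
    for (size_t d = 0; d < dim; ++d) {
        double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

/*
 * Hypothetical helper: assign each of the n points (row-major, n x dim)
 * to its nearest centroid (row-major, k x dim).
 * Returns the number of points whose label changed.
 */
size_t assign_clusters(const double *points, size_t n, size_t dim,
                       const double *centroids, size_t k, int *labels)
{
    assert(k > 0);
    long changed = 0;

    /* Each iteration writes only labels[i], so iterations are independent. */
    #pragma omp parallel for reduction(+:changed)
    for (long i = 0; i < (long)n; ++i) {
        int best = 0;
        double best_dist = DBL_MAX;
        for (size_t c = 0; c < k; ++c) {
            double dist = sq_dist(&points[(size_t)i * dim],
                                  &centroids[c * dim], dim);
            if (dist < best_dist) {
                best_dist = dist;
                best = (int)c;
            }
        }
        if (labels[i] != best) {
            labels[i] = best;
            ++changed;
        }
    }
    return (size_t)changed;
}
```

A caller would alternate this step with a centroid update until the function returns 0 or an iteration limit is reached. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially.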

The program first attempts to identify an optimal number of clusters, k, using silhouette coefficients averaged over the folds of a k-fold cross-validation. The dataset is parsed from a file and split into training and testing sets for each fold. Once silhouette coefficients have been calculated for a range of k values, a target k is selected and the final centroids are calculated on the entire dataset.

This implementation can handle datasets of arbitrary dimension and length. The expected input format is comma-separated, but the delimiter can be changed with the '-d' flag. For an example dataset, see data/iris.csv.
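To make the input format concrete, here is a minimal sketch of splitting one line of the dataset on a configurable delimiter and converting the fields to numbers. The function `parse_row` is hypothetical and written for this example only; it is not the project's parser, and it assumes every field is numeric.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse one delimited line into at most max_vals
 * doubles. Returns the number of values parsed, or -1 if a field is
 * not a valid number. The line buffer is modified in place, because
 * strtok writes terminators into it.
 */
int parse_row(char *line, const char *delim, double *out, size_t max_vals)
{
    assert(line != NULL && delim != NULL && out != NULL);
    size_t count = 0;

    /* Strip a trailing newline or carriage return, if present. */
    line[strcspn(line, "\r\n")] = '\0';

    for (char *tok = strtok(line, delim);
         tok != NULL && count < max_vals;
         tok = strtok(NULL, delim)) {
        char *end;
        double value = strtod(tok, &end);
        if (end == tok || *end != '\0')
            return -1; /* empty conversion or trailing garbage */
        out[count++] = value;
    }
    return (int)count;
}
```

Note that `strtok` treats the delimiter string as a set of single characters and skips empty fields, so a dataset with missing values would need a different splitting strategy.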

Two output files are generated in the directory of the binary. The first, 'output_clusters.csv', is the dataset with an additional column indicating which cluster each point belongs to. The second, 'output_centroids.csv', contains the coordinates of the centroids.

Options

-i [filepath]
    Input filename/path.
    Default is 'input.csv'.
-d [delimiter]
    Delimiter used when parsing the input dataset file.
    Default delimiter is ",".
-k [num_clusters]
    Specify the number of clusters to identify, k. If you know the number of clusters that should be identified, you can pass this option to bypass using silhouette analysis.
    Must be a positive integer.
-m [min]
    Specify the minimum number of clusters to analyze during silhouette analysis.
    Must be a positive integer.
    Default is 2.
-M [max]
    Specify the maximum number of clusters to analyze during silhouette analysis.
    Must be a positive integer.
    Default is 10.
-b [max_iterations]
    Maximum number of iterations allowed in each k-means run.
    Must be a positive integer.
    Default is 100.
-e [num_kmeans]
    Number of k-means runs to execute in parallel.
    Must be a positive integer.
    Default is 100.
-f [num_folds]
    Number of folds for cross-validation.
    Must be a positive integer.
    Default is 5.
-t [num_threads]
    Number of threads to spread the workload across.
    Must be a positive integer.
    By default, all available threads are used.
-r
    Randomize the dataset order. It is important that the dataset is randomized for cross-validation.
-n
    Normalize the dataset. This is a good idea if the dataset is not already normalized.
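The README does not say which normalization the -n flag applies. As one common possibility, the sketch below rescales every column to the range [0, 1] (min-max normalization), so that features with large numeric ranges do not dominate the distance calculation. The function name `normalize_min_max` and the choice of min-max scaling are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: rescale each column of an n x dim row-major
 * matrix to [0, 1] in place. Constant columns are set to 0.
 */
void normalize_min_max(double *points, size_t n, size_t dim)
{
    assert(points != NULL);
    if (n == 0)
        return;

    for (size_t d = 0; d < dim; ++d) {
        /* Find the minimum and maximum of column d. */
        double lo = points[d];
        double hi = points[d];
        for (size_t i = 1; i < n; ++i) {
            double x = points[i * dim + d];
            if (x < lo)
                lo = x;
            if (x > hi)
                hi = x;
        }

        /* Map [lo, hi] onto [0, 1]; avoid dividing by zero. */
        double range = hi - lo;
        for (size_t i = 0; i < n; ++i) {
            double x = points[i * dim + d];
            points[i * dim + d] = range > 0.0 ? (x - lo) / range : 0.0;
        }
    }
}
```

Centroids computed on data rescaled this way are in normalized units and would need to be mapped back with the stored minimum and range of each column to be reported in the original scale.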

Getting Started

Linux:

  • Clone the repository with git clone https://github.com/lmarzen/k-means-clustering.git, or download and extract the ZIP.

  • Open a terminal (or command prompt on Windows) in the src directory and run make to build the program.

  • Run the program by typing ./kmeans followed by any valid arguments.

  • Done.
