k-means-clustering

This project is a C implementation of the k-means clustering algorithm that has been parallelized to run across multiple threads with OpenMP and uses silhouette coefficients to find an optimal number of clusters.
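As a rough illustration of how OpenMP can spread k-means work across threads, here is a minimal sketch of the point-to-centroid assignment step. The function name `assign_clusters`, the row-major data layout, and the choice to parallelize over points are assumptions made for this example. They are not taken from this repository's source, which may instead parallelize across independent k-means runs (see the -e option).

```c
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two points of dimension dim. */
static double sq_dist(const double *a, const double *b, size_t dim)
{
    double sum = 0.0;
    for (size_t d = 0; d < dim; d++) {
        double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

/*
 * Hypothetical helper (not from this repository).
 * Assigns each of the n points (row-major, n x dim) to its nearest
 * centroid (row-major, k x dim) and stores the result in labels.
 * Returns the number of points whose label changed.
 */
size_t assign_clusters(const double *points, size_t n, size_t dim,
                       const double *centroids, size_t k, int *labels)
{
    long changed = 0;

    /* Each iteration writes only labels[i], so the points can be
     * divided among threads; the counter is combined by reduction. */
    #pragma omp parallel for reduction(+:changed)
    for (long i = 0; i < (long)n; i++) {
        int best = 0;
        double best_dist = DBL_MAX;
        for (size_t c = 0; c < k; c++) {
            double dist = sq_dist(&points[(size_t)i * dim],
                                  &centroids[c * dim], dim);
            if (dist < best_dist) {
                best_dist = dist;
                best = (int)c;
            }
        }
        if (labels[i] != best) {
            labels[i] = best;
            changed++;
        }
    }
    return (size_t)changed;
}
```

When compiled without OpenMP support the pragma is ignored and the loop runs serially with the same result. A return value of 0 means no point changed cluster, which is a common convergence test.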

The program first tries to identify an optimal number of clusters, k, using silhouette coefficients averaged over the folds of a k-fold cross-validation. The dataset is parsed from a file and split into training and testing sets for each fold. Once silhouette coefficients have been calculated for a range of k values, a target k is selected and the final centroids are calculated on the entire dataset.

This implementation can handle datasets of arbitrary dimension and length. The expected input format is comma-separated, but the delimiter can be changed with the '-d' flag. For an example dataset, see data/iris.csv.
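To make the input format concrete, the following sketch parses one line of delimiter-separated numbers into an array. The function name `parse_row`, the single-character delimiter, and the assumption that every field is numeric are illustrative choices, not a description of this project's parser.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from this repository).
 * Parses up to max_vals numbers separated by delim from one line.
 * Returns the number of values stored in out, or -1 if a field is
 * not numeric or the line has more than max_vals fields.
 */
int parse_row(const char *line, char delim, double *out, int max_vals)
{
    int count = 0;
    const char *p = line;

    while (*p != '\0' && *p != '\n' && *p != '\r') {
        if (count == max_vals)
            return -1; /* too many fields */

        char *end;
        double value = strtod(p, &end);
        if (end == p)
            return -1; /* field is not a number */
        out[count++] = value;

        p = end;
        if (*p == delim)
            p++; /* step over the delimiter to the next field */
        else if (*p != '\0' && *p != '\n' && *p != '\r')
            return -1; /* unexpected character after the number */
    }
    return count;
}
```

For example, `parse_row("5.1,3.5,1.4,0.2\n", ',', values, 4)` stores four values, and passing `';'` as the delimiter handles a semicolon-separated file in the same way.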

Two output files are generated in the directory of the binary. The first, 'output_clusters.csv', is the dataset with an additional column indicating which cluster each point belongs to. The second, 'output_centroids.csv', contains the coordinates of the centroids.

Options

-i [filepath]

Input filename/path.

Default is 'input.cvs'.

-d [delimiter]

Delimiter used when parsing the input dataset file.

Default delimiter is ",".

-k [num_clusters]

Specify the number of clusters to identify, k. If you know the number of clusters that should be identified, you can pass this option to bypass using silhouette analysis.

Must be a positive integer.

-m [min]

Specify the minimum number of clusters to analyze during silhouette analysis.

Must be a positive integer.

Default is 2.

-M [max]

Specify the maximum number of clusters to analyze during silhouette analysis.

Must be a positive integer.

Default is 10.

-b [max_iterations]

Maximum number of iterations allowed in each k-means run.

Must be a positive integer.

Default is 100.

-e [num_kmeans]

Number of k-means runs executed in parallel.

Must be a positive integer.

Default is 100.

-f [num_folds]

Number of folds for cross-validation.

Must be a positive integer.

Default is 5.

-t [num_threads]

Number of threads to spread the workload across.

Must be a positive integer.

Default behavior will use all available threads.

-r

Randomize the dataset order. It is important that the dataset is randomized for cross-validation.

-n

Normalize the dataset. This is a good idea if the dataset is not already normalized.
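Normalization matters because k-means relies on distances, so a feature with a large numeric range can dominate the others. The README does not say which scheme the -n flag applies. The sketch below assumes min-max scaling of each column to the range [0, 1], and the function name `normalize_min_max` is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not from this repository).
 * Rescales each column of an n x dim row-major dataset in place so
 * that its values span [0, 1]. Constant columns are set to 0.
 */
void normalize_min_max(double *points, size_t n, size_t dim)
{
    if (n == 0)
        return;

    for (size_t d = 0; d < dim; d++) {
        /* Find the smallest and largest value in column d. */
        double lo = points[d];
        double hi = points[d];
        for (size_t i = 1; i < n; i++) {
            double v = points[i * dim + d];
            if (v < lo)
                lo = v;
            if (v > hi)
                hi = v;
        }

        /* Map [lo, hi] onto [0, 1]; avoid dividing by zero. */
        double range = hi - lo;
        for (size_t i = 0; i < n; i++) {
            double v = points[i * dim + d];
            points[i * dim + d] = range > 0.0 ? (v - lo) / range : 0.0;
        }
    }
}
```

Other schemes, such as z-score standardization, serve the same purpose and may be what the program actually uses.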

Getting Started

Linux:

Clone the repository with git clone https://github.com/lmarzen/k-means-clustering.git, or download and extract the ZIP.
Open a terminal (or command prompt on Windows) in the src directory and run make to build the program.
Run the program by typing ./kmeans followed by any valid arguments.
Done.
