Hey guys! I wrote a summary of the code + explanations, so that I could use it fast and understand what I am doing. Feel free to use it! It was especially helpful during the exam.
library(MASS) # for geyser data
library(flexclust) # bundestag data
library(pdfCluster) # oliveoil data
library(fpc) # cluster.stats and other cluster validation tools
library(sn) # Skew-normal and skew-t distributions
library(cluster) # Needed for gap statistic (clusGap), Gower (daisy) and PAM
library(smacof) # MDS (Multi-dimensional scaling)
library(mclust) # ARI (Adjusted Rand Index), Gaussian Mixture Models
library(teigen) # For t-distribution heavy-tailed Mixture Models
library(pracma) # For K-means++ algorithm
library(mixsmsn) # For Skew-Normal Mixture Models (smsn)
library(robustbase) # For huberM
library(mixtools) # For ellipse illustration
library(poLCA) # For categorical data
library(fda) # For functional data
library(funFEM) # For FEM algorithm
library(nomclust) # For simple matching
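The data objects used below (e.g. olive) are not defined in this excerpt; a hedged guess, assuming olive is the numeric part of the pdfCluster oliveoil data:
olive <- oliveoil[, 3:10] # assumption: the eight fatty acid measurements; columns 1-2 are macro.area and region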
Principal component analysis finds the (orthogonal) directions in which the variance is highest within the data.
Scaling is important and will alter the results.
################
#### Modify ####
################
data <- scale(olive)
clustering <- oliveoil[,2]
################
pr <- princomp(data) # Create Principal component object
#### Summaries ####
summary(pr)
pr$sdev^2 / sum(pr$sdev^2) # Proportion of variance explained by each principal component
cumsum(pr$sdev^2 / sum(pr$sdev^2)) # Cumulative proportion of variance explained
#### Visualize components ####
plot(pr, main = "Principal components") # Screeplot: variances of the principal components
biplot(pr, cex = 0.7) # Observations and variable loadings on the first two principal components
#### Visualize clustering ####
plot(pr$scores, col = clustering, pch = clusym[clustering]) # Plot clustering on first two components. Similar to MDS
MDS
Not a cluster analysis method, but useful for displaying multi-dimensional data.
We use Ratio MDS in this course.
Scaling is important and will alter the results.
################
#### Modify ####
################
dist_matr <- as.matrix(dist(olive))
################
#### Compute MDS object ####
mds_obj <- mds(dist_matr, ndim = 2)
#### Goodness of MDS ####
mds_obj$stress # The lower the stress, the closer the distances are to the true distances
#### Visualize MDS ####
plot(mds_obj$conf, asp = 1)
plot(mds_obj$conf, type = "n", asp = 1) # Make empty plot, asp means "aspect ratio" (correct distances)
text(mds_obj$conf, labels = 1:nrow(dist_matr)) # Add observation labels
Heatmap
################
#### Modify ####
################
data <- scale(olive)
################
data <- as.matrix(data)
#### Heatmap of data ####
heatmap(data, Rowv = NA, Colv = NA, col = grey(seq(1, 0, -0.01))) # Grey-scale display of the data matrix (as for gene expression data)
#### Heatmap of covariance ####
heatmap(cov(data), scale = "none", col = cm.colors(20)) # Correlations between variables (covariance of scaled data = correlation matrix)
Functional data
Distance measures
Overview
Terminology:
A dissimilarity is a measure that fulfills the following two properties:
1. $d(x, y) \ge 0$
2. $d(x, x) = 0$
A dissimilarity is a distance measure when it additionally fulfills the third property (triangle inequality):
3. $d(x, z) \le d(x, y) + d(y, z)$
Categorization
Dissimilarities:
Jaccard (binary variables)
Simple matching (binary variables)
Gower coefficient (mixed variable types, can handle missing values)
Correlation
Distances:
Euclidean
Manhattan
Minkowski (general form, including Euclidean and Manhattan)
Mahalanobis
Continuous variables
Scaling is important and will alter the results, except for the Mahalanobis distance.
################
#### Modify ####
################
data <- scale(olive)
################
#### Euclidean ####
eucl <- dist(data, method = "euclidean")
eucl_m <- as.matrix(eucl)
#### Manhattan ####
manh <- dist(data, method = "manhattan")
manh_m <- as.matrix(manh)
#### Mahalanobis ####
maha_m <- matrix(0, ncol = nrow(data), nrow = nrow(data))
data_cov <- cov(data)
for (i in 1:nrow(data)) {
maha_m[i,] = mahalanobis(data, as.numeric(data[i,]), data_cov)
}
maha <- as.dist(maha_m)
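Note (added): mahalanobis() returns squared Mahalanobis distances, so maha_m contains squared values; if actual distances are wanted, use for example:
maha <- as.dist(sqrt(maha_m))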
#### Correlation dissimilarity ####
correl <- cor(data) # Correlations between variables (use cor(t(data)) if a dissimilarity between observations is wanted)
## Two options for what a strong negative correlation should indicate:
# Negative correlation counts as dissimilar
cor_dist <- 0.5 - correl / 2
# Negative correlation counts as similar
cor_distsim <- 1 - abs(correl)
Binary and categorical variables
################
#### Modify ####
################
data <- housing
################
#### Simple matching ####
simp_m <- sm(data)
#### Jaccard ####
jacc <- dist(data, method = "binary")
#### Gower ####
# The keyword "type" determines which metric to use for each variable. Defaults are:
# Binary and factor variables -> simple matching
# Ordered factors -> Replaced with 1, 2, 3,... and then treated as numerical
# For "type" = asymm -> Jaccard (binary variables only)
# For "type" = symm -> Simple matching (binary variables only) -> Default
# For "type" = factor (nominal) -> Same as "symm" except when missing values are present
gow <- daisy(data, metric = "gower", type = list(symm = 4, factor = c(9, 10)))
Clustering methods
Needed for determination of number of clusters.
################
#### Modify ####
################
data <- scale(olive)
dist_obj <- dist(data) # Needed for the silhouette; use the same distance as for the clustering
max_nc <- 15
################
clus <- list()
sil <- list()
asw <- NA
Partitioning
K-means
Partitioning method for Euclidean data only. The squared Euclidean distance is used in the objective function (not as a distance matrix!).
Cluster concept: homogeneous, spherical (round)
Scaling is important and will alter the results.
################
#### Modify ####
################
data <- scale(olive)
max_nc <- 15
################
set.seed(12345)
clus <- list()
sil <- list()
asw <- NA
for (k in 2:max_nc) {
clus[[k]] <- kmeans(data, k, iter.max = 100, nstart = 100)
sil[[k]] <- silhouette(clus[[k]]$cluster, dist = dist_obj)
asw[k] <- summary(sil[[k]])$avg.width
}
plot(1:max_nc, asw, type = "l", xlab = "Number of clusters", ylab = "ASW")
# asw[1] stays NA (the silhouette is not defined for k = 1)
nc <- which.max(asw)
nc
plot(sil[[nc]])
km <- kmeans(data, centers = nc, nstart = 100, iter.max = 100) # Cluster the data
#### Results ####
clustering <- km$cluster
#### Examine clustering ####
table(clustering)
mds_obj <- mds(dist(data))
plot(mds_obj$conf, asp = 1, col = clustering, pch = clusym[clustering])
pairs(data, col = clustering)
PAM
Partitioning method based on a dissimilarity measure.
It minimizes the sum of dissimilarities of the observations to their cluster medoids (which are themselves observations) and can be used with any dissimilarity, e.g. the Manhattan distance.
Cluster concept: homogeneous
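No PAM code appears in this copy; a minimal sketch, assuming the usual dist_obj / max_nc setup from the Modify blocks above, selecting the number of clusters by ASW (pam() reports the ASW itself in $silinfo):
set.seed(12345)
pam_asw <- NA
for (k in 2:max_nc) {
pam_asw[k] <- pam(dist_obj, k)$silinfo$avg.width # average silhouette width for k clusters
}
plot(1:max_nc, pam_asw, type = "l", xlab = "Number of clusters", ylab = "ASW")
nc <- which.max(pam_asw)
pam_clus <- pam(dist_obj, nc)
clustering <- pam_clus$clustering
pam_clus$id.med # row indices of the medoid observations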
Hierarchical
Agglomerative method based on a dissimilarity matrix; the chosen linkage ("method") determines the cluster concept.
################
#### Modify ####
################
dist_obj <- dist(scale(geyser))
method = "average" # Alternatives: "single", "complete", "ward.D2"
max_nc = 15
################
set.seed(12345)
clus <- list()
sil <- list()
asw <- NA
#### Compute clustering ####
hier_clus <- hclust(dist_obj, method = method)
#### Visualize Dendrogram ####
# Use this to choose the numbers of clusters (additional to ASW)
plot(hier_clus)
for (k in 2:max_nc) {
clus[[k]] <- cutree(hier_clus, k)
sil[[k]] <- silhouette(clus[[k]], dist = dist_obj)
asw[k] <- summary(sil[[k]])$avg.width
}
plot(1:max_nc, asw, type = "l", xlab = "Number of clusters", ylab = "ASW")
# asw[1] stays NA (the silhouette is not defined for k = 1)
nc <- which.max(asw)
nc
plot(sil[[nc]])
#### Results ####
clustering <- cutree(hier_clus, nc)
#### Examine clustering ####
table(clustering)
mds_obj <- mds(dist_obj)
plot(mds_obj$conf, col = clustering, pch = clusym[clustering])
pairs(scale(geyser), col = clustering)
Number of clusters
Overview
Visualizations give a good hint as to which number of clusters is appropriate.
This includes the MDS plot as well as the dendrogram and other suitable plots.
Use the following methods to determine the number of clusters:
K-means
GAP
ASW
PAM
ASW
Hierarchical
ASW
GAP (for Ward only)
Mixture model
BIC
funFEM
BIC
ASW
The same distance matrix should be used. Otherwise, silhouettes are not comparable!
################
#### Modify ####
################
data <- scale(olive)
dist_obj <- dist(data) # Needed for the silhouette; use the same distance as for the clustering
max_nc <- 15
################
clus <- list()
sil <- list()
asw <- NA
GAP
The squared Euclidean distance is used for the computation; the objective function of K-means is likewise based on minimizing the squared Euclidean distance.
################
#### Modify ####
################
data <- scale(geyser)
max_nc <- 15
clus_fun = kmeans
################
#### Elbow plot ####
# Choose k where the decrease is clearly smaller than before (the curve looks like an elbow)
# This is often not clearly defined
sk <- numeric(0)
for (k in 1:max_nc) {
sk[k] <- kmeans(data, k, nstart = 100)$tot.withinss
}
plot(1:max_nc, sk, xlab = "k", ylab = "S_k", type = "l")
#### Gap statistic ####
# The statistic is based on the logarithm of the within-cluster sum of squares
# "Gap" refers to the gap between the expected and the observed (log) within-cluster sum of squares
# d.power = 2 indicates squared euclidean distances
# spaceH0 = "scaledPCA" indicates space for uniform sampling - alternative "original"
# B = 50 indicates 50 simulations from uniform distribution
cg <- clusGap(data, clus_fun, K.max = max_nc, B = 50, d.power = 2, spaceH0 = "scaledPCA", nstart = 100)
plot(cg) # Gap values
print(cg, method = "globalSEmax", SE.factor = 2) # All important values in table
plot(1:max_nc, exp(cg$Tab[,1]), xlab = "k", ylab = "S_k", type = "l") # S_k values
#### Gapnc function ####
# Automatically detects the optimal number of clusters, according to the Gap statistic
# It finds the GLOBAL optimum (one could also be interested in LOCAL optimum -> change "method" keyword)
# Alternatives for "method": "firstmax", "globalmax", "firstSEmax"
gapnc <- function(data,clus_fun=kmeans,
K.max=10, B = 100, d.power = 2,
spaceH0 ="scaledPCA",
method ="globalSEmax", SE.factor = 2,...) {
gap1 <- clusGap(data, clus_fun, K.max, B, d.power, spaceH0, ...)
nc <- maxSE(gap1$Tab[,3], gap1$Tab[,4], method, SE.factor) # Find optimal number of clusters
out <- list()
out$gapout <- gap1
out$nc <- nc
out$kmopt <- clus_fun(data,nc,...) # Re-run clus_fun with optimal nc
out
}
gapnc_result <- gapnc(data, clus_fun = clus_fun, B = 50, K.max = max_nc)
print(gapnc_result$nc)
plot(gapnc_result$gapout, main = "Gap values") # Values of Gap statistic (expected log(S) - observed log(S))
print(gapnc_result$gapout, method = "globalSEmax", SE.factor = 2)
#### Results ####
nc <- gapnc_result$nc
gap_clustering <- gapnc_result$kmopt$cluster
BIC
BIC only gives a consistent estimator of the number of clusters if the model assumptions are exactly fulfilled.
This is often not the case with real data.
Therefore BIC tends to prefer more clusters (more parameters) the more data is added.
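A small hedged sketch for inspecting BIC directly with mclust (larger BIC is better in mclust's convention):
bic <- mclustBIC(scale(olive), G = 1:15)
plot(bic) # BIC for each model type and number of clusters
summary(bic) # best few model / cluster-number combinations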
Visualize clusterings
General
################
#### Modify ####
################
data <- scale(olive)
dist_obj <- dist(data)
dist_matr <- as.matrix(dist_obj)
clustering <- kmeans(olive, 3)$cluster # The clustering to be visualized (here K-means with 3 clusters)
hier_clus <- hclust(dist_obj, method = "average")
################
#### Pair plot ####
# Look for well-separated clusters
pairs(data, col = clustering)
#### MDS ####
mds_obj <- mds(dist_obj)
plot(mds_obj$conf, col = clustering, pch = clusym[clustering])
#### PCA ####
# Needs data and not only distance matrix like MDS
pr <- princomp(data)
plot(pr$scores, col = clustering, pch = clusym[clustering])
#### Heatmap ####
# Include hierarchical clustering to order the observations and/or variables
heatmap(data, Rowv = as.dendrogram(hier_clus),
Colv = NA, col = cm.colors(200)) # Rowv/Colv expect dendrograms: wrap an hclust object in as.dendrogram()
Algorithm specific
K-means
################
#### Modify ####
################
data <- scale(olive)
nc <- 3
################
km <- kmeans(data, nc)
clustering <- km$cluster
#### Visualize cluster means ####
# Only the first two variables are displayed here. Modify for more insights.
plot(data, col = clustering, pch = clusym[clustering])
points(km$centers, pch = "M", cex = 2, col = 4)
Mixture Models
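This subsection carries no code in this copy; a minimal sketch using mclust's built-in plots (gaus_mm as fitted in the Gaussian Mixture Models section below):
plot(gaus_mm, what = "classification") # clusters with covariance ellipses
plot(gaus_mm, what = "uncertainty") # point size reflects classification uncertainty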
Clustering choice
Compare clusterings
################
#### Modify ####
################
clustering <- kmeans(olive, 3)$cluster # The clustering proposed by our analysis or algorithm
category <- oliveoil[,1] # The known categories against which the clustering is compared
################
#### Table overview ####
table(clustering, category)
#### Adjusted Rand Index ####
adjustedRandIndex(clustering, category)
Big data sets
K-means
The K-means++ algorithm picks good starting points and thus does
not need nstart = 100 or similar to perform well.
It picks initial centers that are themselves observations.
They are picked iteratively, with selection probability proportional
to the distance to the closest center picked so far (the standard formulation uses the squared distance).
km_pp <- function(X, k) {
  n <- nrow(X)
  C <- numeric(k)           # indices of the chosen centers
  C[1] <- sample(1:n, 1)    # first center: a random observation
  for (i in 2:k) {
    dm <- distmat(X, X[C, ])             # distances to the centers picked so far (pracma::distmat)
    pr <- apply(dm, 1, min); pr[C] <- 0  # distance to the closest center; already-picked points get weight 0
    C[i] <- sample(1:n, 1, prob = pr)    # pick the next center with probability proportional to this distance
  }
  kmeans(X, X[C, ])         # run K-means with these observations as starting centers
}
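A hedged usage example (the function expects a numeric matrix):
set.seed(12345)
km_pp_result <- km_pp(as.matrix(scale(olive)), 3)
table(km_pp_result$cluster)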
Mixture Models
The EM algorithm and hierarchical initialisation may be too slow (slide 247). A possible procedure (a partial sketch follows below):
1. Choose a random subset (e.g. 2000 observations) and run the EM algorithm on it.
2. Run a single EM iteration on all data.
3. Take the 1% of observations with the lowest densities as a new cluster.
4. Run a single EM iteration on all data; keep the result only if it improved.
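The full procedure above is not spelled out in code here; as a partial sketch, mclust can at least be initialised from a random subset (here 2000 observations) via its initialization argument:
set.seed(12345)
sub <- sample(1:nrow(data), 2000) # assumes nrow(data) > 2000
big_mm <- Mclust(data, G = 1:max_nc, initialization = list(subset = sub))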
MDS and PCA
If the number of observations $n$ is very big and the number of variables $p$ is small, then:
The covariance matrix $\Sigma$ is a $p \times p$ matrix (small) -> PCA
The distance matrix is an $n \times n$ matrix (very big) -> MDS
Thus, PCA may be fast to compute while MDS takes very long for large $n$.
Clustering statistics
################
#### Modify ####
################
dist_obj <- dist(scale(olive))
clustering <- kmeans(scale(olive), 3)$cluster # The clustering proposed by our analysis or algorithm
alternative <- kmeans(scale(olive), 9)$cluster # An alternative clustering to compare against
################
set.seed(12345)
#### Compute S (within sum of squares) ####
clus_stats <- cluster.stats(dist_obj, clustering)
clus_stats$within.cluster.ss
cat_stats <- cluster.stats(dist_obj, alternative)
cat_stats$within.cluster.ss
Full cycle Analysis
"Classical" models
################
#### Modify ####
################
data <- scale(olive)
distance = "euclidean"
dist_obj = dist(data, method = distance)
clustering <- kmeans(data, 3)$cluster # The clustering proposed by our analysis or algorithm
alternative <- kmeans(data, 9)$cluster # An alternative clustering to compare against
################
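The rest of this section is missing in this copy; a hedged sketch of a full cycle with the tools used above (validate, compare, visualise):
set.seed(12345)
sil <- silhouette(clustering, dist = dist_obj)
summary(sil)$avg.width # ASW of the proposed clustering
cluster.stats(dist_obj, clustering)$within.cluster.ss # within-cluster sum of squares
adjustedRandIndex(clustering, alternative) # agreement with the alternative clustering
mds_obj <- mds(dist_obj)
plot(mds_obj$conf, col = clustering, pch = clusym[clustering]) # MDS display of the clustering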
Mixture Models
Overview
Scaling
Only models whose covariance matrices have a non-unit shape are invariant to scaling!
-> In theory at least; in practice there is a clear difference.
BIC is not scale invariant!
Gaussian Mixture Models
Overview
Model specification
The 3 letters refer to these 3 properties of the Covariance matrix:
Volume ($\lambda$ -> geometric mean of eigenvalues)
Shape ($A$ -> Diagonal matrix of scaled eigenvalues)
Orientation ($D$ -> orthogonal matrix of eigenvectors)
The related formula for the Covariance matrix is:
$\Sigma = \lambda D A D^T$
Computation
Gaussian Mixture Models are in general not scale invariant. Scaling will alter the results!
################
#### Modify ####
################
data <- scale(olive)
max_nc <- 15
dist_obj <- dist(data)
category <- oliveoil$region
################
set.seed(12345)
#### Compute clustering ####
gaus_mm <- Mclust(data, G=1:max_nc)
# gaus_mm <- Mclust(data, G=1:max_nc, modelNames = "VVE")
#### Results ####
## Display results
# Best model
# Large BIC is best!
summary(gaus_mm)
# Best 3 models (BIC)
summary(gaus_mm$BIC)
## Store results
clustering <- gaus_mm$classification
table(clustering)
nc <- summary(gaus_mm)$G
nc
#### Clustering Evaluation ####
## Diagnostic plot ##
plot(gaus_mm, what="BIC")
plot(gaus_mm, what="classification")
plot(gaus_mm, what="uncertainty")
plot(gaus_mm, what="density")
## Model comparison ##
# ARI
adjustedRandIndex(clustering, category)
# PCA and MDS
pr_mm <- princomp(data)
plot(pr_mm$scores, col = clustering, pch = clusym[clustering])
mds_mm <- mds(as.matrix(dist_obj), ndim = 2)
mds_mm$stress
plot(mds_mm$conf, col = clustering, pch = clusym[clustering])
pairs(data, col = clustering)
#### Results ####
gaus_mm$classification
gaus_mm$parameters # Parameters for each cluster/model (like mean and Covariance matrices)
gaus_mm$z # Posterior probability matrix: probability of observation i belonging to cluster k
T-distribution models
Overview
Model specification
The 4 letters refer to these properties of the covariance matrix plus the degrees of freedom:
Volume
Shape
Orientation
Degrees of freedom
The possible model choices are
U (unconstrained)
C (constrained)
I (unit matrix)
Degrees of freedom
The covariance matrix only exists if $\tau > 2$.
It then equals $\frac{\tau}{\tau - 2} \Sigma$.
Computation
Heavy-tailed Mixture Models are not scale invariant. Scaling will alter the results!
################
#### Modify ####
################
data <- scale(olive)
max_nc <- 15
dist_obj <- dist(data)
category <- oliveoil$region
################
set.seed(12345)
#### Compute clustering ####
# scale = TRUE lets teigen scale the data internally
t_mm <- teigen(data, Gs = 1:max_nc, scale=TRUE)
#### Results ####
## Display results
# Large BIC is best!
summary(t_mm)
t_mm$allbic
str(t_mm)
## Store results
clustering <- t_mm$classification
table(clustering)
nc <- summary(t_mm)$bicgroups
nc
#### Clustering Evaluation ####
## Diagnostic plot
# Displays only marginals, specify via xmarg and ymarg
plot(t_mm, xmarg = 1, ymarg = 2, what = "contour")
plot(t_mm, xmarg = 2, ymarg = 1, what = "uncertainty")
## PCA and MDS
pr_t_mm <- princomp(data)
plot(pr_t_mm$scores, col = clustering, pch = clusym[clustering])
mds_t_mm <- mds(as.matrix(dist_obj), ndim = 2)
mds_t_mm$stress
plot(mds_t_mm$conf, col = clustering, pch = clusym[clustering])
pairs(data, col = clustering)
## Model comparison ##
adjustedRandIndex(clustering, category)
Skew-normal Mixture Models
Numerical instability
This algorithm is often numerically unstable. In order to still make it run, follow the steps below in the following order.
Multinomial Mixture Model for categorical data
Flexmixedruns
PoLCA
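The code for this section is missing in this copy; a minimal hedged sketch with poLCA, assuming cat_data is a data frame of factors and V1-V3 are hypothetical column names:
set.seed(12345)
lca <- poLCA(cbind(V1, V2, V3) ~ 1, data = cat_data, nclass = 3, nrep = 20) # nrep restarts guard against local optima
lca$bic # model selection via BIC (compare different nclass values)
clustering <- lca$predclass # hard assignment to latent classes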
Functional Data
PCA
Data should be in the form: observations (curves) in rows, evaluation points (e.g. time points) in columns.
Below, the data is then transposed automatically, because the fda package needs it exactly the other way round.
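The corresponding code did not survive in this copy; a hedged sketch of a functional PCA with the fda package, assuming curves is an n x T matrix (observations in rows, evaluation points in columns):
basis <- create.bspline.basis(rangeval = c(1, ncol(curves)), nbasis = 20)
fd_obj <- smooth.basis(argvals = 1:ncol(curves), y = t(curves), fdParobj = basis)$fd # fda wants T x n, hence t()
fpca <- pca.fd(fd_obj, nharm = 4)
fpca$varprop # proportion of variance explained by each harmonic
plot(fpca$scores[, 1:2]) # observations in the space of the first two harmonics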
FEM
This method finds a low-dimensional subspace that gives optimal
separation of clusters.
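A hedged sketch, reusing the functional data object fd_obj from the fda sketch above (funFEM chooses K by a model selection criterion, BIC by default):
set.seed(12345)
fem <- funFEM(fd_obj, K = 2:6)
fem$K # chosen number of clusters
clustering <- fem$cls
plot(fd_obj, col = clustering) # curves coloured by cluster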
Robust statistics
Robust Standardization
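The code is missing in this copy; a hedged sketch of a robust standardisation, using a Huber M-estimate of location and the MAD as scale per variable:
rob_center <- apply(olive, 2, function(x) huberM(x)$mu) # robust location per variable
rob_scale <- apply(olive, 2, mad) # robust scale per variable
rob_data <- scale(olive, center = rob_center, scale = rob_scale)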
Parameter estimation
One variable
Whole dataset
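A hedged sketch (huberM for a single variable, covMcd for the whole dataset):
x <- olive[, 1]
huberM(x)$mu # robust location for one variable (Huber M-estimate)
mad(x) # robust scale for one variable
mcd <- covMcd(olive) # robust location and covariance for the whole dataset (MCD)
mcd$center
mcd$cov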
Outlier Identification
Outliers flagged by the classical Mahalanobis distance can be masked, because the mean and covariance it is based on are themselves affected by the outliers.
It is better to identify outliers from robust statistics, which are not affected by outliers.
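A hedged sketch: compare classical and robust Mahalanobis distances, flagging observations with large robust distances (mcd from the MCD sketch above):
md_class <- mahalanobis(olive, colMeans(olive), cov(olive)) # classical; can be masked by the outliers themselves
md_rob <- mahalanobis(olive, mcd$center, mcd$cov) # robust, based on the MCD estimates
which(md_rob > qchisq(0.99, df = ncol(olive))) # candidate outliers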
Important
MCD gives different solutions depending on the set of variables it is computed on. This is a main difference to the classical covariance matrix.
Outliers in regression
The weights for each observation indicate how much it influences the fit.
If the weight of an observation is (close to) zero, it is almost certainly an outlier.
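A hedged sketch with robustbase's lmrob, assuming reg_df is a hypothetical data frame with response y:
fit <- lmrob(y ~ ., data = reg_df)
summary(fit)
fit$rweights # robustness weights per observation
which(fit$rweights < 0.01) # observations with (almost) zero weight -> likely outliers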