Here is a comprehensive collection of all the code from the lecture slides, unified in one place. I made only minor changes to the code; the comments etc. are left in. I figure this can be useful for debugging and faster lookup.
Feel free to make comments or suggestions on how this could be improved!
library(MASS)
data(geyser)
# You can also load geyser.dat from Moodle and read it with
# geyser <- read.table("geyser.dat",header=TRUE)
# Here is how it looks:
str(geyser)
#'data.frame': 299 obs. of 2 variables:
# $ waiting : num 80 71 57 80 75 77 60 86 77 56 ...
# $ duration: num 4.02 2.15 4 4 4 ...
plot(geyser)
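Several later sections (gap statistic, hierarchical clustering) use sgeyser, which is never created in the code shown here. Presumably it is simply the standardised geyser data, analogous to solive <- scale(olive) further below; a minimal sketch under that assumption:

```r
# Assumption: sgeyser is the geyser data standardised to mean 0 and
# standard deviation 1 per variable (waiting and duration live on
# quite different scales, so clustering the raw data is dominated
# by the waiting times).
sgeyser <- scale(geyser)
```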
German bundestag election pair plot
Slide 12 (plot slide 11)
library(flexclust)
p05 <- bundestag(2005)
# You can also load bundestag.dat from Moodle and read it with
# bundestagdata <- read.table("bundestag.dat",header=TRUE)
# The variables in p05 are bundestag[,1:5];
# the variables state and ewb below are variables 6 and 7
# in bundestagdata.
str(p05)
# How the object looks:
# num [1:299, 1:5] 0.391 0.362 0.363 0.376 0.415 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:299] "Flensburg - Schleswig" "Nordfriesland -
# Dithmarschen Nord" "Steinburg - Dithmarschen Sued"
# "Rendsburg-Eckernfoerde" ...
# ..$ : chr [1:5] "SPD" "UNION" "GRUENE" "FDP" ...
# federal states:
state <- bundestag(2005, state=TRUE)
# ewb takes values "East", "West", "Berlin" as explained:
ewb <- rep("West",299)
ewb[state=="Berlin"] <- "Berlin"
ewb[state %in% c("Brandenburg","Mecklenburg-Vorpommern","Sachsen",
"Sachsen-Anhalt","Thueringen")] <- "East"
# Plotting
pairs(p05,xlim=c(0,0.6),ylim=c(0,0.6),cex=0.5)
# Note the xlim, ylim parameters. All values of all variables
# have the same meaning, so fixing the x and y value range for
# all variables simultaneously will show which parties are
# big and which are small.
# cex=0.5 makes plot symbols smaller.
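The next plot uses olive and oliveoil, which are not loaded anywhere in this excerpt. Presumably olive contains the eight fatty acid variables of the oliveoil data; a sketch, assuming the data come from the pdfCluster package (both the package name and the column positions are assumptions):

```r
# Assumption: oliveoil has the classification variables macro.area and
# region in columns 1-2 and the eight fatty acid measurements in
# columns 3-10, as used throughout the code below.
library(pdfCluster)  # assumed source of the oliveoil data
data(oliveoil)
olive <- oliveoil[, 3:10]
```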
pairs(olive,cex=0.3,col=oliveoil[,1])
# This works despite oliveoil[,1] being a factor!
Principal Component Analysis
Olive oil (non-standardized)
Slide 22 (plot 23 and 24)
prolive <- princomp(olive)
# The function prcomp also does PCA; it is preferred by some sources.
summary(prolive)
# Importance of components:
# Comp.1 Comp.2 Comp.3 Comp.4
# Standard deviation 479.7299024 150.82827868 45.394449751 27.522646558
# Proportion of Variance 0.8970072 0.08866821 0.008031707 0.002952451
# Cumulative Proportion 0.8970072 0.98567544 0.993707152 0.996659603
# Comp.5 Comp.6 Comp.7 Comp.8
# Standard deviation 24.78169442 1.196956e+01 7.1390744088 6.9756965249
# Proportion of Variance 0.00239367 5.584168e-04 0.0001986489 0.0001896608
# Cumulative Proportion 0.99905327 9.996117e-01 0.9998103392 1.0000000000
plot(prolive,main="Olive oil principal components") # Variance captured
biplot(prolive,cex=0.7) # Biplot with variable axes
Olive oil (standardized)
Slide 27 (plots 28, 29 and 30)
library(fpc)
# This is needed for the use of clusym in plot later
solive <- scale(olive) # Standardise data
sprolive <- princomp(solive) # PCA
summary(sprolive)
# Importance of components:
# Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
# Standard deviation 1.9274086 1.3276711 1.0072629 0.88966996 0.57726430
# Proportion of Variance 0.4651763 0.2207247 0.1270444 0.09911235 0.04172721
# Cumulative Proportion 0.4651763 0.6859009 0.8129454 0.91205772 0.95378493
# Comp.6 Comp.7 Comp.8
# Standard deviation 0.49838105 0.34440148 0.0455864323
# Proportion of Variance 0.03110233 0.01485251 0.0002602203
# Cumulative Proportion 0.98488727 0.99973978 1.0000000000
plot(sprolive,main="Standardised olive oil principal components")
biplot(sprolive,cex=0.7) # Biplot with variable axes
plot(sprolive$scores,col=oliveoil[,2],pch=clusym[oliveoil[,2]]) # First two PCs with regions
More example datasets
Veronica plants
Slide 34 (plot 33)
veronica <- read.table("data/veronica.dat")
# This assumes that the data set is in the directory in which R is run;
# otherwise you need to prepend the path to where you have put the data set.
str(veronica)
#'data.frame': 207 obs. of 583 variables:
# $ V1 : int 0 0 0 1 0 0 0 0 0 1 ...
# $ V2 : int 0 0 0 0 0 1 0 0 0 0 ...
# $ V3 : int 1 1 0 1 0 0 0 0 0 0 ...
# $ V4 : int 0 0 0 0 0 0 0 0 0 0 ...
# (...)
# Some things can better be done on matrices:
veronicam <- as.matrix(veronica)
heatmap(veronicam,Rowv=NA,Colv=NA,col=grey(seq(1,0,-0.01)),scale="none")
# rows are plants, columns are variables
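Later sections use jveronica, the Jaccard dissimilarity between plants, which is not computed in this excerpt. Presumably it mirrors the gene-wise computation further below, where dist with method="binary" gives the Jaccard distance for 0/1 data:

```r
# Assumption: jveronica holds Jaccard dissimilarities between the
# plants (rows of veronicam); method="binary" treats the 0/1 AFLP
# bands as presence/absence and computes the Jaccard distance,
# ignoring joint absences.
jveronica <- dist(veronicam, method="binary")
```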
Artificial data set 1
Slide 36 (plot 35)
# This is how I generated the data set.
# You actually don’t need to do this, use file "clusterdata1.dat"
# from Virtuale.
library(sn)
set.seed(665544)
v1 <- c(rnorm(50,0,1), rsn(70,5,1,8), rnorm(30,6,1))
v2 <- c(rnorm(50,0,1), rsn(70,0,1,8), 8+rt(30,5))
clusterdata1 <- cbind(v1,v2)
# WARNING! The function rsn has changed its precise
# handling of random numbers in an update, which means that
# the code above will produce a slightly different data set
# now, despite the set.seed.
plot(clusterdata1)
K-means
set.seed(665544)
c1k3 <- kmeans(clusterdata1,centers=3,nstart=100)
# nstart indicates the number of random initialisations for
# starting the algorithm. The default value is 1, and this really
# should be changed.
# There’s also a parameter iter.max that gives the maximum number
# of iterations before the algorithm is stopped in any case.
# The default of this is 10, which is too low in my view,
# however for the mostly small datasets presented here it should
# be enough.
plot(clusterdata1,col=c1k3$cluster,pch=clusym[c1k3$cluster])
# clusym is a character vector with numbers "1", "2", etc.
# Show the cluster means:
points(c1k3$centers,pch="M",cex=2,col=4)
xpoints <- seq(-4,4,by=0.01)
norm1 <- dnorm(xpoints,mean=-1,sd=1)
norm2 <- dnorm(xpoints,mean=1,sd=1)
# Plotting densities:
plot(xpoints,norm1,type="l",ylab="density")
points(xpoints,norm2,type="l")
# red line, and means of distributions
lines(c(0,0),c(0,0.4),col=2)
points(1,0,pch="X",col=4)
points(-1,0,pch="X",col=4)
# Computations for truncated distributions.
# See https://en.wikipedia.org/wiki/Truncated_normal_distribution
# for formulae.
# Probability of being below zero for left Gaussian
prob1 <- pnorm(0,-1,1)
# Probability of being below zero for right Gaussian
prob2 <- pnorm(0,1,1)
# Expected value of Gaussian with mean -1, truncated at 0.
te1 <- -1-dnorm(1)/pnorm(1)
# Expected value of Gaussian with mean 1, truncated at 0.
te2 <- 1-dnorm(-1)/pnorm(-1)
# Expected value for left cluster combines the two according to
# their probabilities for their contributions to the left cluster
ecluster1 <- prob1*te1+prob2*te2
# Symmetric for right cluster.
ecluster2 <- -ecluster1
# Plot cluster means
points(ecluster1,0,pch="M",col=3)
points(ecluster2,0,pch="M",col=3)
Number of clusters
Elbow plot
Slide 80 (plot 79)
sk <- numeric(0)
kclusterings <- list()
for (k in 1:10){
  kclusterings[[k]] <- kmeans(clusterdata1,k,nstart=100)
  sk[k] <- kclusterings[[k]]$tot.withinss
  # This is the component that has the S_k
}
plot(1:10,sk,xlab="k",ylab="S_k",type="l")
Gap statistic
Slide 92 and 93 (plots 91 and 94)
library(cluster) # Has the clusGap function
set.seed(123456)
# K.max indicates the maximum number of clusters; B=100 indicates
# 100 simulations from uniform distribution; d.power=2 indicates squared
# Euclidean distances (as in the original paper and in class),
# spaceH0="scaledPCA" specifies the way the uniform distribution is simulated
cg1 <- clusGap(clusterdata1,kmeans,K.max=10,B=100,d.power=2,spaceH0="scaledPCA",nstart=100)
print(cg1,method="globalSEmax",SE.factor=2)
# Clustering Gap statistic ["clusGap"] from call:
# clusGap(x = clusterdata1, FUNcluster = kmeans, K.max = 10, B = 100,
# d.power = 2, spaceH0 = "scaledPCA", nstart = 100)
# B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
# --> Number of clusters (method ’globalSEmax’, SE.factor=2): 3
# logW E.logW gap SE.sim
# [1,] 7.222172 7.518419 0.2962467 0.06073207
# [2,] 6.419890 6.634534 0.2146432 0.05540815
# [3,] 4.658166 6.306516 1.6483500 0.05224121
# [4,] 4.475601 6.007479 1.5318780 0.05353712
# [5,] 4.280162 5.721320 1.4411574 0.04925218
# [6,] 4.159101 5.483455 1.3243545 0.04770478
# [7,] 4.018231 5.298835 1.2806042 0.04809762
# [8,] 3.894625 5.140342 1.2457174 0.05056464
# [9,] 3.768839 5.005861 1.2370216 0.05406971
# [10,] 3.623056 4.880549 1.2574927 0.05641072
plot(cg1)
# Values of gap
plot(1:10,exp(cg1$Tab[,1]),xlab="k",ylab="S_k",type="l")
# Values of S_k; cg1$Tab[,1] has values of log(S_k), therefore the exp.
plot(1:10,cg1$Tab[,1],xlab="k",ylab="log S_k",type="l")
points(1:10,cg1$Tab[,2],type="l",lty=2)
legend(6,6.5,c("log S_k in data","E(log S_k) uniform"),lty=1:2)
# Values of log(S_k) and its expectation under uniform distribution
# The latter are in cg1$Tab[,2].
Gapnc
Function
Slide 96
gapnc <- function(data,FUNcluster=kmeans,
                  K.max=10, B=100, d.power=2,
                  spaceH0="scaledPCA",
                  method="globalSEmax", SE.factor=2, ...){
# As in the original clusGap function, the ... arguments are passed on
# to the clustering method FUNcluster (by default kmeans).
# Run clusGap:
  gap1 <- clusGap(data,FUNcluster,K.max,B,d.power,spaceH0,...)
# Find the optimal number of clusters; note that the method for
# finding the optimum and the SE.factor q need to be specified here.
  nc <- maxSE(gap1$Tab[,3],gap1$Tab[,4],method,SE.factor)
# Re-run the clustering with the optimal nc:
  kmopt <- FUNcluster(data,nc,...)
  out <- list()
  out$gapout <- gap1
  out$nc <- nc
  out$kmopt <- kmopt
  out
}
# The output of clusGap is in component gapout.
# The optimal number of clusters is in component nc.
# The optimal kmeans output is in component kmopt.
Example Artificial data set 1
Slide 97
set.seed(123456)
cgnc1 <- gapnc(clusterdata1,nstart=100)
# Could also specify K.max, B, d.power, spaceH0, method, SE.factor
plot(cgnc1$gapout) # same clusgap plot as before
print(cgnc1$gapout, method="globalSEmax", SE.factor=2)
# Unfortunately need to specify method and SE.factor here
# once more to reproduce earlier output.
cgnc1$nc
# [1] 3
# plot(data,col=cgnc1$kmopt$cluster) # As seen before
Example Geyser data
Slide 99 (plot 98)
set.seed(76543)
# Use the scaled version of the data set!
cgg <- clusGap(sgeyser,kmeans,K.max=10,B=100,d.power=2,spaceH0="scaledPCA",nstart=100)
print(cgg,method="globalSEmax",SE.factor=2)
# --> Number of clusters (method ’globalSEmax’, SE.factor=2): 3
plot(cgg)
# Values of gap
plot(1:10,exp(cgg$Tab[,1]),xlab="k",ylab="S_k",type="l")
# Values of S_k
plot(1:10,cgg$Tab[,1],xlab="k",ylab="log S_k",type="l")
points(1:10,cgg$Tab[,2],type="l",lty=2)
legend(6,5,c("log S_k in data","E(log S_k) uniform"),lty=1:2)
# Values of log(S_k) and its expectation under uniform distribution
# Unfortunately, the clusGap function doesn't return a clustering,
# so this has to be computed again afterwards.
geyser3 <- kmeans(sgeyser,3,nstart=100)
plot(sgeyser,col=geyser3$cluster,pch=clusym[geyser3$cluster])
# Alternatively:
cgg2 <- gapnc(sgeyser,nstart=100)
Example bundestag data
Slide 101 (plots 102 and 103)
set.seed(12345)
cgp05 <- clusGap(p05,kmeans,K.max=10,B=100,d.power=2,spaceH0="scaledPCA",nstart=100)
print(cgp05,method="globalSEmax",SE.factor=2)
# --> Number of clusters (method ’globalSEmax’, SE.factor=2): 9
plot(cgp05)
# Gap values
plot(1:10,exp(cgp05$Tab[,1]),xlab="k",ylab="S_k",type="l")
# S_k values
plot(1:10,cgp05$Tab[,1],xlab="k",ylab="log S_k",type="l")
points(1:10,cgp05$Tab[,2],type="l",lty=2)
legend(6,5,c("log S_k in data","E(log S_k) uniform"),lty=1:2)
# log S_k values and expectation under uniform
# Re-compute 9-means
p059 <- kmeans(p05,9,nstart=100)
pairs(p05,col=p059$cluster,cex=0.5)
# Alternatively:
cgp2 <- gapnc(p05,nstart=100)
Dissimilarities
Continuous variables
Euclidean and Manhattan
Slide 119 (plot 120)
dolive2 <- dist(olive,method="euclidean")
dolive1 <- dist(olive,method="manhattan")
# This shows you how similar they are:
plot(dolive1,dolive2,cex=0.3)
# Actually, it makes much more of a difference to scale the data.
solive <- scale(olive)
dolives2 <- dist(solive,method="euclidean")
dolives1 <- dist(solive,method="manhattan")
plot(dolive2,dolives2,cex=0.3) # For example.
# dist produces an object of class dist.
# This is vector of all distances in one row.
dolivematrix <- as.matrix(dolives2)
# This gives the distance between observations 1 and 2.
dolivematrix[1,2]
# [1] 0.6644688
# Distance matrix can be transformed into a dist-object by as.dist.
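For completeness, the reverse conversion on the example above (not on the slides):

```r
# Round trip: turn the full distance matrix back into a dist object;
# the entries are identical to those of dolives2.
dolivedist <- as.dist(dolivematrix)
```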
Mahalanobis distance
Slide 121
# The mahalanobis command only computes a vector of (squared)
# Mahalanobis distances from a single centre, so producing all
# pairwise distances is more tedious; here is how to make a
# distance matrix:
mahalm <- matrix(0,ncol=572,nrow=572)
olivecov <- cov(olive)
for (i in 1:572)
  mahalm[i,] <- mahalanobis(olive,as.numeric(olive[i,]),olivecov)
# Note that for the Mahalanobis distance it doesn’t make
# a difference whether the data set is scaled or not.
plot(as.dist(mahalm),dolives2,cex=0.3)
Correlation dissimilarity
corp05 <- cor(p05)
# > corp05
# SPD UNION GRUENE FDP LINKE
# SPD 1.00000000 -0.5662267 0.08935976 -0.3300797 -0.1337904
# UNION -0.56622666 1.0000000 -0.15295824 0.3563825 -0.6220824
# GRUENE 0.08935976 -0.1529582 1.00000000 0.3386211 -0.3644767
# FDP -0.33007972 0.3563825 0.33862105 1.0000000 -0.4741279
# LINKE -0.13379043 -0.6220824 -0.36447674 -0.4741279 1.0000000
cordistp05 <- 0.5-corp05/2
# > cordistp05
# SPD UNION GRUENE FDP LINKE
# SPD 0.0000000 0.7831133 0.4553201 0.6650399 0.5668952
# UNION 0.7831133 0.0000000 0.5764791 0.3218088 0.8110412
# GRUENE 0.4553201 0.5764791 0.0000000 0.3306895 0.6822384
# FDP 0.6650399 0.3218088 0.3306895 0.0000000 0.7370640
# LINKE 0.5668952 0.8110412 0.6822384 0.7370640 0.0000000
Gower coefficient
Slides 140 and 142
# The daisy function requires library(cluster)
housing <- read.table("data/Boston.dat",header=TRUE)
housing$rad <- as.factor(housing$rad)
# Gower dissimilarity:
gdhousing <- daisy(housing,metric="gower",type=list(symm=4))
# Specify that binary variable 4 chas is treated as symmetric,
# i.e., simple matching distance.
# gdhousing is a dist object.
# metric="gower" is unnecessary, because this is
# automatically done if factor or ordered variables are in data
# Note also that treating a binary variable as "symm"
# is mathematically equivalent to treating it as numerical.
housing$rad <- as.ordered(housing$rad)
# Gower dissimilarity:
gdhousing2 <- daisy(housing,metric="gower",type=list(asymm=4))
# symm and asymm can also take vectors such as
# type=list(asymm=c(2,4),symm=c(3,6)) if vars 2,3,4,6 are binary.
Multidimensional Scaling
Examples (bundestag and veronica)
Slides 145 and 146 (plot 147)
cordistp05 <- 0.5-corp05/2
# > cordistp05
# SPD UNION GRUENE FDP LINKE
# SPD 0.000 0.783 0.455 0.665 0.566
# UNION 0.783 0.000 0.576 0.321 0.811
# GRUENE 0.455 0.576 0.000 0.330 0.682
# FDP 0.665 0.321 0.330 0.000 0.737
# LINKE 0.566 0.811 0.682 0.737 0.000
library(smacof)
mdsparties <- mds(cordistp05,ndim=2)
# ndim is the number of dimensions here. 2 is default.
plot(mdsparties$conf,type="n",asp=1)
# asp=1 is the "aspect ratio" and means that distances are
# correctly represented,
# because x and y axis scaling are the same.
text(mdsparties,labels=dimnames(p05)[[2]])
mdsparties$stress
# [1] 0.05975537 # 6%, not much information missing
mdsveronica <- mds(jveronica)
plot(mdsveronica$conf,asp=1)
mdsveronica$stress
# [1] 0.2577986 # quite a bit of information not represented.
# Could try ndim higher than 2, but standard images are 2-d.
Similarity between clusters
Adjusted Rand index
Slide 155
library(mclust) # This has the adjustedRandIndex command.
solive <- scale(olive)
olive3 <- kmeans(olive,3,nstart=100)
olive3s <- kmeans(solive,3,nstart=100)
adjustedRandIndex(olive3$cluster,olive3s$cluster)
# 0.4587804, these are somewhat different.
adjustedRandIndex(olive3$cluster,oliveoil$macro.area)
# 0.3182057
adjustedRandIndex(olive3s$cluster,oliveoil$macro.area)
# 0.448355, both OK but not great, with scaling clearly better
# The gap statistic suggests a higher number of clusters.
cgolive <- clusGap(solive,kmeans,20,B=100,d.power=2,spaceH0="scaledPCA",nstart=100)
print(cgolive,method="globalSEmax",SE.factor=2)
# B=100 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
# --> Number of clusters (method ’globalSEmax’, SE.factor=2): 19
kmolive19 <- kmeans(solive,19,nstart=100)
adjustedRandIndex(kmolive19$cluster,oliveoil$region)
# [1] 0.4228399
adjustedRandIndex(kmolive19$cluster,oliveoil$macro.area)
# [1] 0.1793979
# Not better.
Hierarchical clustering
Standard methods
Bundestag distance
Slide 166 (plot 165)
cordistp05
# SPD UNION GRUENE FDP LINKE
# SPD 0.00 0.78 0.45 0.66 0.56
# UNION 0.78 0.00 0.57 0.32 0.81
# GRUENE 0.45 0.57 0.00 0.33 0.68
# FDP 0.66 0.32 0.33 0.00 0.73
# LINKE 0.56 0.81 0.68 0.73 0.00
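The slide presumably shows dendrograms built from this dissimilarity matrix; a minimal sketch (the linkage choice is an assumption), using as.dist to turn the matrix into a dist object:

```r
# Hierarchical clustering of the five parties from the correlation
# dissimilarity; average linkage here is an assumption, the slides
# may show other linkages as well.
partyclust <- hclust(as.dist(cordistp05), method="average")
plot(partyclust)
```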
Veronica data
Cluster observations
Slide 170 (plot 171)
# Average Linkage
plantclust <- hclust(jveronica,method="average")
# Plot the dendrogram:
plot(plantclust)
# 8 looks like a good number of clusters.
plantclust8 <- cutree(plantclust,8)
# "Estimating" the number of clusters from
# looking at the dendrogram is somewhat questionable though.
# Visualisation using MDS:
plot(mdsveronica$conf,col=plantclust8,pch=clusym[plantclust8])
# This looks good.
# Single Linkage
plantclusts <- hclust(jveronica,method="single")
plot(plantclusts) # 8 looks still OK
plantclusts8 <- cutree(plantclusts,8)
plot(mdsveronica$conf,col=plantclusts8,pch=clusym[plantclusts8])
# The same as before.
# Complete Linkage
plantclustc <- hclust(jveronica,method="complete")
plot(plantclustc) # Could use 8 or 9.
plantclustc8 <- cutree(plantclustc,8)
plot(mdsveronica$conf,col=plantclustc8,pch=clusym[plantclustc8])
# The same as before.
Cluster genes
Slide 172 (plot 173)
# I decided to use the Jaccard distance
# also for AFLP bands/genes:
vveronica <- dist(t(veronicam),method="binary")
varclust <- hclust(vveronica,method="average")
# As a clustering this is pretty messy, but it can
# still be used to impose an ordering of the genes.
# This submits the two dendrograms to the heatmap:
heatmap(veronicam,Rowv=as.dendrogram(plantclust),
Colv=as.dendrogram(varclust),
col=grey(seq(1,0,-0.01)))
Old Faithful Geyser data
Slide 174 (plots 175, 176 and 177)
# Average Linkage
geyclust <- hclust(dist(sgeyser),method="average")
plot(geyclust)
# 4 looks like a good K, will isolate one outlier
geyclust4 <- cutree(geyclust,4)
plot(sgeyser,col=geyclust4,pch=clusym[geyclust4])
# Single Linkage
geyclusts <- hclust(dist(sgeyser),method="single")
plot(geyclusts)
# 7 looks like a good K
geyclusts7 <- cutree(geyclusts,7)
plot(sgeyser,col=geyclusts7,pch=clusym[geyclusts7])
# Complete Linkage
geyclustc <- hclust(dist(sgeyser),method="complete")
plot(geyclustc)
# 5 looks like a good K
geyclustc5 <- cutree(geyclustc,5)
plot(sgeyser,col=geyclustc5,pch=clusym[geyclustc5])
Ward's method
Bundestag data
Slide 184 and 185
# kmeans with 5 clusters
set.seed(12345)
kmbundestag5 <- kmeans(p05,5,nstart=100)
# Ward’s method
wbundestag <- hclust(dist(p05),method="ward.D2")
# plot(wbundestag, labels=FALSE) # Dendrogram not shown
# Run with labels=FALSE because otherwise constituency names
# will dominate the plot.
# Same as wbundestag <- agnes(p05,method="ward")
# With K=5:
wbundestag5 <- cutree(wbundestag,5)
table(kmbundestag5$cluster,wbundestag5)
# wbundestag5
# 1 2 3 4 5
# 1 27 0 0 55 0
# 2 72 0 0 0 0
# 3 0 0 0 16 22
# 4 8 0 36 0 0
# 5 0 63 0 0 0
# Fairly different.
adjustedRandIndex(kmbundestag5$cluster,wbundestag5)
# [1] 0.6362054
library(fpc)
# This can be used to compute S for any clustering:
kmb <- cluster.stats(dist(p05),kmbundestag5$cluster)
kmb$within.cluster.ss
# 1.319956
# This is the same as kmbundestag5$tot.withinss
wmb <- cluster.stats(dist(p05),wbundestag5)
wmb$within.cluster.ss
# 1.534854
# Quite a bit worse.
PAM
Bundestag data
Slide 192 (plot 193)
Needed: p05
library(cluster)
bundestagk5 <- kmeans(p05,5,nstart=100)
# By default, if pam is called with a data set that is not a
# dist-object, the Euclidean distance is used:
bundestagp5 <- pam(p05,5)
adjustedRandIndex(bundestagk5$cluster,bundestagp5$cluster)
# [1] 0.7249983
# Alternatively, pam can be started with a dist-object, which allows
# computing it with the Manhattan-distance:
p05manhattan <- dist(p05,method="manhattan")
bundestagp5m <- pam(p05manhattan,5)
adjustedRandIndex(bundestagp5$cluster,bundestagp5m$cluster)
# [1] 0.6607979
adjustedRandIndex(bundestagk5$cluster,bundestagp5m$cluster)
# [1] 0.8863854
mdsp05m <- mds(p05manhattan)
plot(mdsp05m$conf,col=bundestagk5$cluster,pch=clusym[bundestagk5$cluster])
plot(mdsp05m$conf,col=bundestagp5$cluster,pch=clusym[bundestagp5$cluster])
plot(mdsp05m$conf,col=bundestagp5m$cluster,pch=clusym[bundestagp5m$cluster])
# Could also use PCA to visualise this.
prp05 <- princomp(p05)
plot(prp05$scores,col=bundestagk5$cluster,pch=clusym[bundestagk5$cluster])
plot(prp05$scores,col=bundestagp5$cluster,pch=clusym[bundestagp5$cluster])
plot(prp05$scores,col=bundestagp5m$cluster,pch=clusym[bundestagp5m$cluster])
Average Silhouette width
Different cluster sizes
Slide 198 (plot 199 and 200)
Needed: p05manhattan
pasw <- NA
pclusk <- list()
psil <- list()
# Look at K between 2 and 30:
for (k in 2:30){
  # PAM clustering:
  pclusk[[k]] <- pam(p05manhattan,k)
  # Computation of silhouettes:
  psil[[k]] <- silhouette(pclusk[[k]],dist=p05manhattan)
  # ASW needs to be extracted:
  pasw[k] <- summary(psil[[k]])$avg.width
}
# Plot the ASW-values against K:
plot(1:30,pasw,type="l",xlab="Number of clusters",ylab="ASW")
# Result in numbers:
# > pasw
#  [1]        NA 0.4867459 0.4228157 0.4265560 0.3779154 0.3345530 0.3131574
#  [8] 0.3254449 0.3177023 0.3185734 0.3036558 0.3130183 0.3039691 0.3157744
# (...)
# Best value at K=2.
# MDS-plot:
plot(mdsp05m$conf,col=pclusk[[2]]$cluster,pch=clusym[pclusk[[2]]$cluster])
# Silhouette plots for 2 and 5 clusters:
plot(psil[[2]])
plot(psil[[5]])
Comparison of cluster methods
Slide 201 and 204 (plots 202, 203 and 204)
Needed: jveronica
# PAM:
pasw <- NA
pclusk <- list()
psil <- list()
for (k in 2:30){
  pclusk[[k]] <- pam(jveronica,k)
  psil[[k]] <- silhouette(pclusk[[k]])
  pasw[k] <- summary(psil[[k]])$avg.width
}
plot(1:30,pasw,type="l",xlab="Number of clusters",ylab="ASW",ylim=c(0.15,0.6))
# Maximum at K=7, ASW=0.5386146
# Average Linkage
plantclust <- hclust(jveronica,method="average")
tasw <- NA
tclusk <- list()
tsil <- list()
for (k in 2:30){
  tclusk[[k]] <- cutree(plantclust,k)
  tsil[[k]] <- silhouette(tclusk[[k]],dist=jveronica)
  tasw[k] <- summary(tsil[[k]])$avg.width
}
points(1:30,tasw,type="l",col=3,lty=2,lwd=2)
# Maximum is at K=8, ASW=0.5524769
# ...plots...
legend(20,0.55,legend=c("PAM","ave.linkage"),col=c(1,3),lwd=c(1,2),lty=c(1,2))
plot(psil[[7]])
plot(psil[[8]])
plot(tsil[[8]])
Mixture Models
Introduction
Example
Slide 241 and 242 (plot 243, 244 and 245)
Needed: olive, oliveoil
library(mclust)
molive <- Mclust(olive,G=1:15)
summary(molive)
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust VVE (ellipsoidal, equal orientation) model with 10 components:
#
# log.likelihood n df BIC ICL
# -20448.03 572 197 -42146.84 -42193.07
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9 10
# 30 96 110 37 50 64 34 49 45 57
summary(molive$BIC)
# Best BIC values:
# VVE,10 VVV,6 VVE,12
# BIC -42146.84 -42158.49981 -42176.74491
# BIC diff 0.00 -11.65605 -29.90115
adjustedRandIndex(molive$classification,oliveoil$region)
# [1] 0.5914548
# Fairly good
# Diagnostic plots
plot(molive)
# Could try model with 3 clusters and compare with Macro areas
molive3 <- Mclust(olive,G=3)
adjustedRandIndex(molive3$classification,oliveoil$macro.area)
# [1] 0.5349465
# Also OKish
# Set of diagnostic plots offered by mclust
plot(molive3)
# PCA can be clearer for not too low dimensional data
sprolive <- princomp(solive)
plot(sprolive$scores,col=molive$classification,
pch=clusym[molive$classification])
Specify certain models
Slide 246
Needed: olive
# If you want one or more specific covariance matrix models,
# you could specify modelNames, e.g.,
moliveVVV <- Mclust(olive,G=1:10,modelNames="VVV")
molivex <- Mclust(olive,G=1:10,modelNames=c("EEE","VVV"))
# Information available in output object, e.g.
molive$classification
# Clustering vector
molive$parameters
# Estimated parameters
molive$z
# Matrix of posterior probabilities p_ik that point i was generated
# by mixture component k
Mixtures of skew and heavy-tailed distributions
Example of multivariate t-mixture (olive oil)
Computation
Slide 260
Needed: olive
library(teigen)
set.seed(776655)
tolive <- teigen(olive, Gs=1:12)
# Gs: numbers of clusters, optimum decided by BIC
# teigen has a parameter scale with default scale=TRUE.
# This standardises data before clustering, so result should be
# same as teigen(solive, Gs=1:12, scale=FALSE)
# See teigen help page for more options.
summary(tolive)
#------------- Summary for teigen -------------
# ------ RESULTS ------
# Loglik: -1817.584
# BIC: -5063.723
# ICL: -5072.88
# Model: UUUC
# Groups: 5
# Clustering Table:
#
# 1 2 3 4 5
# 60 100 87 223 102
Assessment
Slides 262, 263, 264 and 265 (plots 264 and 265)
Needed: tolive, sprolive, oliveoil
tolive$allbic # large is good.
# G=1 G=2 G=3 G=4 G=5 G=6 G=7
#UUUU -7889.168 -5720.133 -5680.123 -5275.945 -5075.159 -5222.871 -5383.248
#UUUC -Inf -5724.228 -5669.654 -5262.440 -5063.723 -5211.953 -5342.531
#CUCU -Inf -6277.511 -6112.269 -5819.137 -5566.531 -5693.455 -5564.644
#CUCC -Inf -6275.323 -6105.870 -5807.910 -5552.134 -5673.826 -5577.603
#CUUU -Inf -5783.868 -5950.907 -5449.103 -5269.403 -5387.851 -5159.616
#CUUC -Inf -5777.513 -5946.490 -5430.813 -5254.264 -5363.843 -5155.408
#CCCU -Inf -7371.333 -6877.942 -6613.939 -6217.014 -6051.247 -6013.761
#CCCC -Inf -7365.670 -6882.688 -6623.697 -6209.228 -6031.467 -6000.251
#CIUU -12967.869 -9969.186 -9568.104 -8457.330 -7873.963 -7582.871 -7409.848
#CIUC -Inf -10025.944 -9593.945 -8444.805 -7900.001 -7593.757 -7413.229
#CICU -Inf -11200.063 -10260.297 -9461.043 -8817.974 -8504.407 -8287.299
#CICC -Inf -11194.998 -10261.022 -9463.984 -8808.221 -8479.225 -8274.528
#UIIU -12939.113 -11306.381 -10822.561 -9807.542 -9119.964 -8971.279 -8869.141
#(...)
# G=8 G=9 G=10 G=11 G=12
#UUUU -5164.612 -5259.720 -Inf -Inf -Inf
#UUUC -5130.624 -5258.010 -Inf -Inf -Inf
#(...)
# ...very big list...
str(tolive)
# List of 13
# $ iter : num 32
# $ fuzzy : num [1:572, 1:5] 1.77e-17 5.24e-17 1.09e-15 2.04e-18 3.83e-20 ...
# $ parameters :List of 9
# ..$ df : num [1:5] 7.68 7.68 7.68 7.68 7.68
# ..$ mean : num [1:5, 1:8] -0.971 -0.154 -0.764 0.921 -0.719 ...
# ..$ lambda : num [1:5] 0.0542 0.1141 0.0376 0.1073 0.0368
# ..$ d : num [1:8, 1:8, 1:5] -0.1013 -0.0102 0.9745 -0.1017 0.1085 ...
# ..$ a : num [1:8, 1:8, 1:5] 19.5 0 0 0 0 ...
# ..$ weights: num [1:572, 1:5] 0.00645 0.00653 0.00685 0.00434 0.00251 ...
# ..$ sigma : num [1:8, 1:8, 1:5] 0.3136 0.0163 -0.0513 -0.0631 -0.1265 ...
# ..$ pig : num [1:5] 0.105 0.176 0.153 0.389 0.177
# ..$ conv : logi TRUE
# $ allbic : num [1:28, 1:12] -7889 -Inf -Inf -Inf -Inf ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:28] "UUUU" "UUUC" "CUCU" "CUCC" ...
# .. ..$ : chr [1:12] "G=1" "G=2" "G=3" "G=4" ...
# $ bic : num -5064
# $ bestmodel : chr "The best model (BIC of -5063.72) is UUUC with G=5"
# $ modelname : chr "UUUC"
# $ classification: int [1:572] 2 2 2 2 2 2 2 2 2 2 ...
# $ G : int 5
# (...and more...)
plot(tolive,xmarg=1,ymarg=5,what="contour")
plot(tolive,xmarg=1,ymarg=5,what="uncertainty")
# The plot command for teigen allows for two kinds of plots.
# Both are done on two variables, and xmarg, ymarg specify
# the numbers of the variables used.
plot(sprolive$scores,col=tolive$classification,pch=clusym[tolive$classification])
# Plot on PCs
adjustedRandIndex(tolive$classification,oliveoil$region)
# [1] 0.7727825
adjustedRandIndex(tolive$classification,oliveoil$macro.area)
# [1] 0.6189345 # These are clearly better than from mclust.
Mixture of skew-normal distributions
Slide 266, 267 and 268 (plot 269)
Needed: solive, oliveoil
set.seed(67543)
library(mixsmsn) # smsn.search is in package mixsmsn
estg <- smsn.search(solive,nu=1,g.min=1,g.max=12,family="Skew.normal")
# WARNING! This is not fast!
# g.min, g.max min. and max. number of clusters.
# nu is not needed for a skew normal mixture but has to be specified.
# family="Skew.t" fits a skew t-distribution,
# then nu is degrees of freedom.
# For family="Skew.t" note that nu must be provided and will *not* be
# estimated! Also it looks like only one value of nu can be specified
# that will be used for all mixture components.
# Can also specify uni.Gama=TRUE, which constrains
# Sigma - delta^* (delta^*)^T to be constant over clusters.
str(estg) # (shortened)
# List of 2 #
# $ criteria : Named num [1:12] 7931 5920 5798 5507 5316 ...
# (BIC-values)
# $ best.model:List of 14
# ..$ mu :List of 5 # means, I call them a.
# .. ..$ : num [1:8] -0.778 -0.464 0.129 -0.225 1.166 ...
# .. ..$ : num [1:8] -0.919 -0.554 0.853 0.84 -0.285 ...
# (...)
# ..$ Sigma :List of 5
# .. ..$ : num [1:8, 1:8] 0.22921 -0.02283 0.01077 -0.088 0.00567 ...
# .. ..$ : num [1:8, 1:8] 0.6224 0.0129 -0.0344 -0.1809 -0.1114 ...
# (...)
# ..$ shape :List of 5 # This is lambda.
# .. ..$ : num [1:8, 1] 0.818 -1.005 -0.684 -0.535 -2.261 ...
# .. ..$ : num [1:8, 1] -5.27 7.27 -3.34 3.89 -7.9 ...
# (...)
# (Component proportions)
# ..$ pii : num [1:5] 0.173 0.11 0.389 0.152 0.176
# ..$ logLik : num -1820
# ..$ nu : num 1
# ..$ aic : num 4168
# ..$ bic : num 5316
# ..$ edc : num 4903
# ..$ icl : num 5321
# ..$ iter : num 89
# ..$ n : int 572
# ..$ group : int [1:572] 5 5 5 5 5 5 5 5 5 5 ...
# ..$ uni.Gama: logi FALSE
adjustedRandIndex(estg$best.model$group,oliveoil$region)
# [1] 0.7796783
adjustedRandIndex(estg$best.model$group,oliveoil$macro.area)
# [1] 0.6251346
# About the same as from teigen, but slightly better.
plot(1:12,estg$criteria,type="l",ylab="BIC",xlab="Number of clusters")
plot(sprolive$scores,col=estg$best.model$group,pch=clusym[estg$best.model$group])
A Mixture model for categorical data
Example Veronica data
Computation
Slide 283 (plots 284 and 285)
Needed: veronica
set.seed(887766)
# flexmixedruns is in package fpc (loaded earlier).
veronicabernm <- flexmixedruns(veronica,continuous=0,discrete=583,n.cluster=1:10)
# This uses by default 20 random initialisations.
# This will also fit mixtures with mixed type continuous and categorical variables.
# Default assumption: continuous variables come first, therefore
# continuous=0,discrete=583 - all variables are categorical.
# With xvarsorted=FALSE, can specify freely what kind of variable is where.
plot(1:10,veronicabernm$bicvals,type="l",
xlab="Number of clusters",ylab="BIC")
plot(mdsveronica$conf,col=veronicabernm$flexout[[6]]@cluster,pch=clusym[veronicabernm$flexout[[6]]@cluster])
Result breakdown
Slides 286 and 287
Needed: veronicabernm
# Can access parameter estimators
str(veronicabernm$flexout[[6]],max.level=2)
# Formal class 'flexmix' [package "flexmix"] with 18 slots
# ..@ posterior :List of 2
# ..@ weights : NULL
# ..@ iter : int 4
# ..@ cluster : int [1:207] 4 4 4 3 2 5 6 1 4 3 ...
# ..@ logLik : num -13361
# ..@ df : num 3467
# ..@ control :Formal class 'FLXcontrol' [package "flexmix"] with 6 slots
# ..@ group : Factor w/ 0 levels:
# ..@ size : Named int [1:6] 29 13 22 48 64 31
# .. ..- attr(*, "names")= chr [1:6] "1" "2" "3" "4" ...
# ..@ converged : logi TRUE
# ..@ k0 : int 6
# ..@ model :List of 1
# ..@ prior : num [1:6] 0.1401 0.0628 0.1063 0.2319 0.3092 ...
# ..@ components :List of 6
# ..@ concomitant:Formal class ’FLXP’ [package "flexmix"] with 7 slots
# ..@ formula :Class 'formula' language x ~ 1
# .. .. ..- attr(*, ".Environment")=<environment: 0x555f82bae2c8>
# ..@ call : language flexmix(formula = x ~ 1, k = k,
# cluster = initial.cluster, model = lcmixed(continuous = continuous,
# discrete| __truncated__
# ..@ k : int 6
# E.g., this is for component 5, variables 1-4,
# just as an illustration (the "[[1]]"
# is required to find the parameters, but otherwise
# doesn't mean anything)
veronicabernm$flexout[[6]]@components[[5]][[1]]@parameters$pp
# [[1]]
# [1] 1 0
#
# [[2]]
# [1] 0.03125 0.96875
#
# [[3]]
# [1] 1 0
#
# [[4]]
# [1] 1 0
# etc.
Functional data
Introduction
Preliminary + plot of data
Slide 305
Needed: "covid2021.dat"
library(fda) # Functional data analysis
covid21 <- read.table("data/covid2021.dat")
covid21v <- as.matrix(covid21[,5:559])
# Raw data plot:
plot(1:555,covid21v[1,],type="l",ylim=c(0,25),ylab="New cases over one week per 1000 inhabitants",
xlab="Day (1 April 2020-7 October 2021)",
main="Covid weekly new cases for 179 countries")
for(i in 2:179) points(1:555,covid21v[i,],type="l")
# Single country
i <- 79 # Italy (can print covid21[,1] to find numbers for countries)
i <- 69 # Haiti
plot(1:555,covid21v[i,],cex=0.5,ylab="New cases over one week per 1000 inhabitants",
xlab="Day (1 April 2020-7 October 2021)",
main=covid21[i,1])
# Constructing B-spline basis
bbasis10 <- create.bspline.basis(c(1,555),nbasis=10) # with p=10
# Splines approximating data as linear combinations of B-spline basis
fdcovid10 <- Data2fd(1:555,y=t(as.matrix(covid21v)),basisobj=bbasis10)
bbasis <- create.bspline.basis(c(1,555),nbasis=100) # same with p=100
fdcovid <- Data2fd(1:555,y=t(as.matrix(covid21v)),basisobj=bbasis)
# Plot basis
plot(bbasis10)
plot(bbasis)
# Smooth splines for data with smooth mean function:
plot(fdcovid)
mcovid <- mean.fd(fdcovid)
lines(mcovid,col=2,lwd=5)
# Show smooth fit of individual countries
plotfit.fd(t(covid21v),1:555,fdcovid10,index=79,cex.pch=0.5)
plotfit.fd(t(covid21v),1:555,fdcovid10,index=69,cex.pch=0.5)
plotfit.fd(t(covid21v),1:555,fdcovid,index=79,cex.pch=0.5)
plotfit.fd(t(covid21v),1:555,fdcovid,index=69,cex.pch=0.5)
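What Data2fd does, roughly, is a least-squares fit of each observed curve onto the B-spline basis, x(t) ≈ Σ_j c_j B_j(t). A small sketch of that idea with base R's splines package (the sine curve and all settings here are made up for illustration, not taken from the slides):

```r
library(splines)  # ships with base R
set.seed(1)
tgrid <- seq(0, 1, length.out = 101)
B <- bs(tgrid, df = 10, intercept = TRUE)        # 10 B-spline basis functions
y <- sin(2 * pi * tgrid) + rnorm(101, sd = 0.1)  # noisy curve observations
# Least-squares coefficients, as in an ordinary linear regression on the basis:
chat <- solve(crossprod(B), crossprod(B, y))
fitted <- as.vector(B %*% chat)                  # smooth spline approximation
# B-spline basis functions form a partition of unity: rows sum to 1
range(rowSums(B))
```

With more basis functions (p=100 above) the fit follows the data more closely; with fewer (p=10) it is smoother, which is exactly the difference between fdcovid and fdcovid10.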
Functional PCA
PCA scores and scores for continents
Slide 311 (plots 312 and 313)
Needed: fdcovid, covid21
Approximations via PCA projections
Slide 315
Needed: covidpca, mcovid
# Create functional data object of PCA approximations
covidpcaapprox <- covidpca$harmonics
i <- 1
pcacoefi <- covidpca$harmonics$coefs %*% covidpca$scores[i,]+mcovid$coefs
covidpcaapprox$coefs <- pcacoefi
for (i in 2:179){
pcacoefi <- covidpca$harmonics$coefs %*% covidpca$scores[i,]+mcovid$coefs
covidpcaapprox$coefs <- cbind(covidpcaapprox$coefs, pcacoefi)
}
dimnames(covidpcaapprox$coefs)[[2]] <- covid21[,1]
plotfit.fd(t(covid21v),1:555,covidpcaapprox,index=79,cex.pch=0.5)
plotfit.fd(t(covid21v),1:555,covidpcaapprox,index=69,cex.pch=0.5)
plotfit.fd(t(covid21v),1:555,covidpcaapprox,index=164,cex.pch=0.5)
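The loop above builds, in basis-coefficient space, the usual PCA reconstruction x_i ≈ mean + Σ_j score_ij · harmonic_j. The same identity for ordinary multivariate PCA, as a quick sanity check on toy data (prcomp from base R):

```r
set.seed(3)
X <- matrix(rnorm(50 * 4), 50, 4)
pc <- prcomp(X)
# Rank-q approximation: scores times loadings, plus the mean
q <- 2
Xhat <- sweep(pc$x[, 1:q] %*% t(pc$rotation[, 1:q]), 2, pc$center, "+")
# Using all components recovers the data exactly:
Xfull <- sweep(pc$x %*% t(pc$rotation), 2, pc$center, "+")
max(abs(Xfull - X))   # numerically zero
```

For the functional version, the "loadings" are the coefficient vectors of the harmonics, so the reconstruction lives in the same B-spline coefficient space as fdcovid.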
A mixture-based method for functional clustering
Overview models
Slide 320
Needed:
femmodels <- c("DkBk", "DkB", "DBk","DB", "AkjBk",
"AkjB", "AkBk", "AkB", "AjBk", "AjB", "ABk","AB")
# E.g., "AkBk" has Sigma_k spherical but can vary between clusters,
# beta_k can vary between clusters.
# Default model "AkjBk" assumes Sigma_k diagonal,
# Sigma_k and beta_k can vary between clusters.
# The funFEM help page claims that model="all" will compare all by BIC and
# find the best, but this currently doesn't seem to work.
Slide 324 and 325 (plot 326)
Needed: fdcovid
library(funFEM)
# The following (using default model "AkjBk") does not work very well:
covidcluster <- funFEM(fdcovid,K=2:10)
# Shows errors; not all models can be fitted (degenerating likelihood).
# We still get a final result, but only K=2 was fitted without error.
# Can't fit K=1.
# Try out all models and find the best.
# Unfortunately, just submitting model="all" to funFEM doesn't work
# as it should as of funFEM version 1.2.
set.seed(1234567)
femmodels <- c("DkBk", "DkB", "DBk", "DB", "AkjBk", "AkjB", "AkB", "AkBk", "AjBk", "AjB", "ABk", "AB")
nmodels <- length(femmodels)
femresults <- list() # Save output for all models in femmodels
bestk <- bestbic <- numeric(0)
# bestk: vector of best K for each model.
# bestbic: Best BIC value for each model.
K <- 2:10 # Numbers of clusters K to try out.
fembic <- matrix(NA,nrow=nmodels,ncol=max(K))
# fembic will hold all BIC values for models (rows) and K (columns);
# NA for those that cannot be fitted.
for (i in 1:nmodels){ # This takes a long time!!
print(femmodels[i])
femresults[[i]] <- funFEM(fdcovid,model=femmodels[i],K=K)
fembic[i,K] <- femresults[[i]]$allCriterions$bic
bestk[i] <- which(fembic[i,]==max(fembic[i,K],na.rm=TRUE))
bestbic[i] <- max(fembic[i,K],na.rm=TRUE)
}
besti <- which(bestbic==max(bestbic,na.rm=TRUE)) # Best model
besti
# [1] 11
femmodels[besti] # Sigma_k spherical and equal, beta_k can vary with k.
# [1] "ABk"
bestk # K=8 optimal for model 11 "ABk"
# [1] 2 2 2 3 2 2 2 2 7 7 8 10
# Can reproduce results faster (up to random initialisation):
# femresult11 <- funFEM(fdcovid,model="ABk",K=8)
femresult11 <- femresults[[11]]
# Plot BIC values for all models and K:
i <- 1
plot(1:max(K),fembic[i,],col=i,pch=i,ylim=c(min(fembic,na.rm=TRUE),max(fembic,na.rm=TRUE)),type="n")
for(i in 1:nmodels) text(1:max(K),fembic[i,],femmodels[i],col=i)
Clustering results
Slide 327 (plot 328)
Needed: femresult11, covid21
# Clustering is in cls component of output.
table(femresult11$cls,covid21$continent)
# Africa Asia Australia Central America Europe North America South America
# 1 0 3 0 1 23 1 0
# 2 11 15 1 6 2 0 2
# 3 1 5 1 3 6 0 2
# 4 32 10 1 2 0 0 0
# 5 6 9 0 2 7 1 7
# 6 1 1 0 0 0 0 0
# 7 0 0 0 0 6 0 0
# 8 0 4 0 6 0 0 1
adjustedRandIndex(femresult11$cls,covid21$continent) # adjustedRandIndex is in package mclust
# [1] 0.2038502
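For reference, the adjusted Rand index computed above can be obtained directly from the contingency table of the two partitions; a minimal sketch (the function name ari is made up, not a library API):

```r
# Adjusted Rand index from the contingency table of two partitions
ari <- function(a, b) {
  tab <- table(a, b)
  ch2 <- function(m) m * (m - 1) / 2           # "m choose 2"
  sij <- sum(ch2(tab))                          # pairs together in both
  si <- sum(ch2(rowSums(tab)))
  sj <- sum(ch2(colSums(tab)))
  expected <- si * sj / ch2(sum(tab))           # chance-level agreement
  (sij - expected) / ((si + sj) / 2 - expected)
}
# Identical partitions (up to label switching) give ARI = 1:
ari(c(1, 1, 2, 2), c(2, 2, 1, 1))
```

The chance correction is why a value like 0.2 here indicates only weak agreement between clusters and continents, rather than 20% of "perfect".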
library(fpc)
plot(covid21[,4:3],col=femresult11$cls,pch=clusym[femresult11$cls])
# Plot clusters against latitude and longitude
# This prints out the countries in the clusters.
for(i in 1:femresult11$K){
print(i)
print(covid21[femresult11$cls==i,1])
}
Visualize cluster separation
Slide 329 (plots 330 and 331)
Needed: covidpca, femresult11, fdcovid
# Clusters on principal components
pairs(covidpca$scores,col=femresult11$cls,pch=19)
# Visualisation of discriminative subspace U,
# projection of observations on U-space:
fdproj <- t(fdcovid$coefs) %*% femresult11$U
pairs(fdproj,col=femresult11$cls,pch=19)
plot(fdproj,col=femresult11$cls,pch=19,xlab="DC 1",ylab="DC 2")
# femresult11$P has all the probabilities p_ik for the objects
# to belong to the clusters. Here all these are larger than 0.992
# or smaller than 0.008, so all points are very confidently classified.
# Although this is probably correctly computed, I think that
# this is overconfident, due to large degrees of freedom for
# finding the optimally discriminating subspace.
Plot curves with clusters
Slide 332 (plot 333, 334 and 335)
Needed: covid21v, femresult11, fdcovid
# Plot the curves and clusters
plot(1:555,covid21v[1,],type="l",ylim=c(0,25),ylab="New cases over one week per 1000 inhabitants",
xlab="Day (1 April 2020-7 October 2021)",lwd=1.5,col=femresult11$cls[1])
for(i in 2:179) points(1:555,covid21v[i,],type="l",col=femresult11$cls[i],lwd=1.5)
# Plot the cluster mean curves
clmeans <- fdcovid; clmeans$coefs <- t(femresult11$prms$my)
plot(clmeans,lwd=3) # Neither col nor lwd seems to work here.
legend(100,10,legend=1:8,col=c(1:6,1:2),lty=c(1:5,1:3))
# Plot individual clusters and mean curves
par(ask=TRUE)
for (k in 1:femresult11$K){
plot(1:555,covid21v[1,],type="l",ylim=c(0,25),
ylab="New cases over one week per 1000 inhabitants",
xlab="Day (1 April 2020-7 October 2021)",col=as.integer(femresult11$cls[1]==k))
for(i in 2:179)
points(1:555,covid21v[i,],type="l",col=as.integer(femresult11$cls[i]==k))
meank <- colMeans(covid21v[femresult11$cls==k,])
points(1:555,meank,type="l",lwd=5,col=2)
}
par(ask=FALSE)
Robust Statistics
Introduction
Outlier example
Slides 4, 5, 6
Needed: Dortmund_msbd.dat
dortmund <- read.table("data/Dortmund_msbd.dat",header=TRUE,row.names=1)
str(dortmund)
# 'data.frame': 170 obs. of 7 variables:
# $ unemployed : num 1.688 1.065 1.134 1.045 0.644 ...
# $ buildings : num 1.375 0.823 1.448 1.083 1.267 ...
# $ birthdeath : num -0.0187 -0.8871 -0.0697 -0.2197 -0.0973 ...
# $ cars : num 9.21 11.55 6.82 7.9 5.29 ...
# $ households : int 160 62 201 132 514 203 410 318 635 190 ...
# $ social_insurance: num 6.14 3.5 6.52 4.83 4.94 ...
# $ benefits : num 1.556 1.065 0.687 1.303 0.307 ...
mean(dortmund$birthdeath)
# [1] -0.2316692
sum(dortmund$birthdeath>-0.2317)
# [1] 165
boxplot(dortmund$birthdeath,main="dortmund$birthdeath")
boxplot(dortmund$birthdeath,main="dortmund$birthdeath",ylim=c(-0.9,0.135))
abline(mean(dortmund$birthdeath),0,col=2)
mean(dortmund$birthdeath)
# [1] -0.2316692
median(dortmund$birthdeath)
# [1] -0.01918741
which.min(dortmund$birthdeath)
# [1] 133 # Outlier is "Rombergpark", a park district with 4 households and a clinic!
mean(dortmund$birthdeath[-133])
# [1] -0.03481514
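The effect seen here, one extreme district dragging the mean while the median stays put, in its simplest possible form (toy numbers):

```r
x <- c(1, 2, 3, 4, 5)
xout <- c(1, 2, 3, 4, 1000)   # replace one value by a gross outlier
mean(x); mean(xout)           # 3 vs 202: the mean follows the outlier
median(x); median(xout)       # 3 vs 3: the median is unaffected
```

This is the breakdown-point idea: a single observation can move the mean arbitrarily far, while the median tolerates up to half the data being contaminated.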
Robust scale estimators
Example
Slide 34
Needed: dortmund
mean(dortmund$birthdeath)
# [1] -0.2316692
median(dortmund$birthdeath)
# [1] -0.01918741
sd(dortmund$birthdeath)
# [1] 2.568608
mad(dortmund$birthdeath, constant = 1.4826) # constant = 1.4826 is the default.
# [1] 0.04708085
library(robustbase)
huberM(dortmund$birthdeath,k=1.5) # k=1.5 is default; ca. 95% efficiency in normal model
# $mu
# [1] -0.02259647
# $s
# [1] 0.04708085 # Uses the MAD by default and gives it out.
# $it
# [1] 8 # Number of iterations of algorithm
# $SE
# [1] NA # If se=TRUE, huberM estimates its own standard error,
# which can be used for tests and confidence intervals.
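Conceptually, huberM solves Σ ψ_k((x_i − μ)/s) = 0 with the Huber ψ and the scale s fixed at the MAD. A minimal sketch via iterative reweighting (the function name huber_loc is made up, not the robustbase API):

```r
# Huber location M-estimator by iteratively reweighted averaging;
# scale is held fixed at the MAD, as huberM does by default.
huber_loc <- function(x, k = 1.5, tol = 1e-8) {
  s <- mad(x)
  mu <- median(x)                        # robust starting value
  repeat {
    w <- pmin(1, k / abs((x - mu) / s))  # Huber weights psi(r)/r
    mu_new <- sum(w * x) / sum(w)
    if (abs(mu_new - mu) < tol) return(mu_new)
    mu <- mu_new
  }
}
# One gross outlier barely moves the estimate:
huber_loc(c(1:9, 100))   # stays near the bulk of the data; mean(c(1:9, 100)) = 14.5
```

Observations within k MADs of the current center get full weight 1; more distant ones are downweighted proportionally to their distance, which is what gives the estimator its compromise between the efficiency of the mean and the robustness of the median.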
MCD
Slides 40, 41, 42, 44 and 45 (plot 43)
Needed: dortmund
cd <- cov(dortmund)
# covMcd function is in package robustbase.
mcdd <- covMcd(dortmund)
# h is specified through the parameter alpha with default 0.5;
# h is chosen so that h/n is approximately alpha, with automatic rounding.
# BP is about equal to 1-alpha (equal for n to infinity).
mcdd75 <- covMcd(dortmund,alpha=0.75)
colMeans(dortmund)
# unemployed buildings birthdeath cars
# 0.5596628 1.6569043 -0.2316692 4.2945776
# households social_insurance benefits
# 363.1000000 3.0091202 0.4554566
mcdd$center
# unemployed buildings birthdeath cars
# 0.45806045 1.69674801 -0.02134013 3.97095994
# households social_insurance benefits
# 346.60344828 2.82726397 0.31878068
mcdd75$center
# unemployed buildings birthdeath cars
# 0.50289076 1.65312025 -0.01893479 3.89633131
# households social_insurance benefits
# 362.81617647 2.80461850 0.39194712
# MCD(alpha=0.75) is some kind of compromise between the classical mean/cov-matrix
# and the MCD with alpha=0.5, but results are not always in between.
# Estimated covariance matrix is in component cov:
str(mcdd$cov)
# num [1:7, 1:7] 0.03583 -0.05986 -0.00225 -0.09547 17.72687 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "unemployed" "buildings" "birthdeath" "cars" ...
# ..$ : chr [1:7] "unemployed" "buildings" "birthdeath" "cars" ...
# The covMcd output also has components raw.cov and raw.center.
# These give the MCD as it was defined above, whereas cov and center
# give somewhat improved versions with better efficiency but the same robustness.
# These are really not very different here:
str(mcdd$raw.cov)
# num [1:7, 1:7] 0.03162 -0.07117 -0.00195 -0.12078 18.29471 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "unemployed" "buildings" "birthdeath" "cars" ...
# ..$ : chr [1:7] "unemployed" "buildings" "birthdeath" "cars" ...
mcdd$raw.center
# unemployed buildings birthdeath cars
# 0.44453445 1.71049846 -0.02140164 3.99712387
# households social_insurance benefits
# 353.38202247 2.84980928 0.29372514
mcdd$center
# unemployed buildings birthdeath cars
# 0.45806045 1.69674801 -0.02134013 3.97095994
# households social_insurance benefits
# 346.60344828 2.82726397 0.31878068
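The definition behind covMcd, among all subsets of size h take the one whose sample covariance matrix has the smallest determinant, can be checked by brute force on a toy data set (sizes chosen small enough that full enumeration is feasible; real covMcd uses a fast approximate algorithm instead):

```r
set.seed(1)
x <- rbind(matrix(rnorm(18), 9, 2),  # 9 well-behaved points
           c(10, 10))                # plus one gross outlier
n <- 10; h <- 7
subsets <- combn(n, h)               # all choose(10, 7) = 120 subsets of size h
dets <- apply(subsets, 2, function(s) det(cov(x[s, ])))
best <- subsets[, which.min(dets)]   # the MCD subset
colMeans(x[best, ])                  # raw MCD location
cov(x[best, ])                       # raw MCD scatter (before reweighting)
best                                 # the outlier, observation 10, is excluded
```

Any subset containing the outlier has a hugely inflated covariance determinant, so the minimising subset automatically avoids it; that is where the robustness comes from.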
# Visualise differences on 2 variables:
library(mixtools) # Used for showing the ellipses defined by the cov-matrices
v1 <- 2 # Looking at variables 2 and 4.
v2 <- 4
plot(dortmund[,c(v1,v2)])
ellipse(colMeans(dortmund[,c(v1,v2)]),cd[c(v1,v2),c(v1,v2)],alpha=0.01)
# alpha=0.01 here means that if data are normally distributed, 99% are expected
# to be inside the ellipse.
ellipse(mcdd$center[c(v1,v2)],mcdd$cov[c(v1,v2),c(v1,v2)],col=2,alpha=0.01)
ellipse(mcdd75$center[c(v1,v2)],mcdd75$cov[c(v1,v2),c(v1,v2)],col=4,alpha=0.01)
text(3,10.5,"standard cov")
text(4.2,7.5,"MCD h/n=0.75",col=4)
text(3.7,6,"MCD h/n=0.5",col=2)
cor(dortmund[,v1],dortmund[,v2])
# [1] 0.6477843
# The following computes "robust correlation" for cov-entries:
mcdd$cov[v1,v2]/sqrt(mcdd$cov[v1,v1]*mcdd$cov[v2,v2])
# [1] 0.8747173
mcdd75$cov[v1,v2]/sqrt(mcdd75$cov[v1,v1]*mcdd75$cov[v2,v2])
# [1] 0.8840861
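The element-wise formula above is what base R's cov2cor does for a whole matrix at once; a quick check on a toy covariance matrix:

```r
S <- matrix(c(4, 2, 2, 9), 2, 2)  # toy covariance matrix
cov2cor(S)[1, 2]                  # 2 / sqrt(4 * 9) = 1/3
```

So cov2cor(mcdd$cov) would give the robust correlations for all variable pairs in one step.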
v1 <- 3
v2 <- 7
plot(dortmund[,c(v1,v2)])
ellipse(colMeans(dortmund[,c(v1,v2)]),cd[c(v1,v2),c(v1,v2)],alpha=0.01)
ellipse(mcdd$center[c(v1,v2)],mcdd$cov[c(v1,v2),c(v1,v2)],col=2,alpha=0.01)
ellipse(mcdd75$center[c(v1,v2)],mcdd75$cov[c(v1,v2),c(v1,v2)],col=4,alpha=0.01)
text(-12,1,"standard cov")
text(-5,1.3,"MCD h/n=0.75",col=4)
text(-5,0.5,"MCD h/n=0.5",col=2)
cov(dortmund[,c(v1,v2)])
# birthdeath benefits
# birthdeath 6.5977450 -0.2056158
# benefits -0.2056158 0.1520452
mcdd$cov[c(v1,v2),c(v1,v2)]
# birthdeath benefits
# birthdeath 0.003033218 -0.002274133
# benefits -0.002274133 0.069849984
mcdd75$cov[c(v1,v2),c(v1,v2)]
# birthdeath benefits
# birthdeath 0.002990605 0.002533904
# benefits 0.002533904 0.129704708
v1 <- 1
v2 <- 6
plot(dortmund[,c(v1,v2)])
ellipse(colMeans(dortmund[,c(v1,v2)]),cd[c(v1,v2),c(v1,v2)],alpha=0.01)
ellipse(mcdd$center[c(v1,v2)],mcdd$cov[c(v1,v2),c(v1,v2)],col=2,alpha=0.01)
ellipse(mcdd75$center[c(v1,v2)],mcdd75$cov[c(v1,v2),c(v1,v2)],col=4,alpha=0.01)
text(1.3,5,"standard cov")
text(1.3,3.5,"MCD h/n=0.75",col=4)
text(0.55,1.7,"MCD h/n=0.5",col=2)
cov(dortmund[,c(v1,v2)])
# unemployed social_insurance
# unemployed 0.07188192 0.04501998
# social_insurance 0.04501998 0.62228084
mcdd$cov[c(v1,v2),c(v1,v2)]
# unemployed social_insurance
# unemployed 0.035830520 0.005097865
# social_insurance 0.005097865 0.308213692
mcdd75$cov[c(v1,v2),c(v1,v2)]
# unemployed social_insurance
# unemployed 0.05527552 -0.01177858
# social_insurance -0.01177858 0.30552307
Outlier identification
Slides 47 and 49 (plots 48 and 50)
Needed: mcdd, dortmund
# Library robustbase has a plot.mcd function, and one could use
# plot(mcdd) for outlier diagnostic plots, but this has some problems:
# one needs to add tol=1e-20, because otherwise it gives an error.
# It also uses the 0.975 quantile, which can't be changed; I prefer 0.99.
# So I use my own code.
plot(1:170,sqrt(mcdd$mah),type="n",xlab="Observation",
ylab="Robust Mahalanobis distance")
text(1:170,sqrt(mcdd$mah),rownames(dortmund),cex=0.7)
abline(sqrt(qchisq(0.99,7)),0,col=2)
# Should look at smaller values to see more precisely what's going on:
plot(1:170,sqrt(mcdd$mah),type="n",ylim=c(0,30),xlab="Observation",
ylab="Robust Mahalanobis distance")
text(1:170,sqrt(mcdd$mah),rownames(dortmund),cex=0.7)
abline(sqrt(qchisq(0.99,7)),0,col=2)
# Compare with Mahalanobis distances based on mean and sample cov-matrix
plot(sqrt(mahalanobis(dortmund,colMeans(dortmund),cd)),sqrt(mcdd$mah),type="n",
xlab="Standard Mahalanobis distance",
ylab="Robust Mahalanobis distance")
text(sqrt(mahalanobis(dortmund,colMeans(dortmund),cd)),sqrt(mcdd$mah),
rownames(dortmund),cex=0.7)
abline(sqrt(qchisq(0.99,7)),0,col=2)
abline(v=sqrt(qchisq(0.99,7)),col=2)
plot(sqrt(mahalanobis(dortmund,colMeans(dortmund),cd)),sqrt(mcdd$mah),type="n",
xlim=c(0,10),ylim=c(0,30),xlab="Standard Mahalanobis distance",
ylab="Robust Mahalanobis distance")
text(sqrt(mahalanobis(dortmund,colMeans(dortmund),cd)),sqrt(mcdd$mah),
rownames(dortmund),cex=0.7)
abline(sqrt(qchisq(0.99,7)),0,col=2)
abline(v=sqrt(qchisq(0.99,7)),col=2)
# Compare with alpha=0.75
plot(sqrt(mcdd75$mah),sqrt(mcdd$mah),xlim=c(0,30),ylim=c(0,30),
xlab="Robust Mahalanobis distance (alpha=0.75)",
ylab="Robust Mahalanobis distance (alpha=0.5)")
abline(sqrt(qchisq(0.99,7)),0,col=2)
abline(v=sqrt(qchisq(0.99,7)),col=2)
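The cutoff sqrt(qchisq(0.99,7)) drawn in these plots relies on squared Mahalanobis distances being approximately chi-squared distributed with p degrees of freedom under multivariate normality, so roughly 1% of clean observations exceed it. A quick simulated check (toy data, not the Dortmund set):

```r
set.seed(2)
p <- 7; n <- 10000
z <- matrix(rnorm(n * p), n, p)             # clean multivariate normal data
md2 <- mahalanobis(z, colMeans(z), cov(z))  # squared Mahalanobis distances
mean(md2 > qchisq(0.99, p))                 # flagged fraction, near 0.01
```

With the robust MCD-based distances, outliers no longer inflate the center and scatter estimates, which is why many more points cross the same cutoff on the robust axis than on the standard one.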
Preliminaries and Introduction
Simple plots
Example dataset Old Faithful Geyser
German bundestag election pair plot
Olive oil pairplot
Principal Component Analysis
Olive oil (non-standardized)
Olive oil (standardized)
More example datasets
Veronica plants
Artificial data set 1
Artificial data set 2
K-means
Examples
Artificial data set 1
Old Faithful Geyser (non-standardized)
Old Faithful Geyser (standardized)
Bundestag data (non-standardized)
Visualization
Cluster interpretation
Inconsistent ML-estimator
Number of clusters
Elbow plot
Gap statistic
Gapnc
Function
Example Artificial data set 1
Example Geyser data
Example bundestag data
Dissimilarities
Continuous variables
Euclidean and Manhattan
Mahalanobis distance
Binary and categorical variables
Simple Matching distance
Correlation dissimilarity
Gower coefficient
Multidimensional Scaling
Examples (bundestag and veronica)
Similarity between clusters
Adjusted Rand index
Hierarchical clustering
Standard methods
Bundestag distance
Veronica data
Cluster observations
Cluster genes
Old Faithful Geyser data
Ward's method
Bundestag data
PAM
Bundestag data
Average Silhouette width
Different cluster sizes
Comparison of cluster methods
Mixture Models
Introduction
Example
Specify certain models
Mixtures of skew and heavy-tailed distributions
Example of multivariate t-mixture (olive oil)
Computation
Assessment
Mixture of skew-normal distributions
A Mixture model for categorical data
Example Veronica data
Computation
Result breakdown
Clustering heatmap
Functional data
Introduction
Preliminary + plot of data
B-spline basis
Functional PCA
PCA scores and scores for continents
Approximations via PCA projections
A mixture-based method for functional clustering
Overview models
Clustering results
Visualize cluster separation
Plot curves with clusters
Robust Statistics
Introduction
Outlier example
Robust scale estimators
Example
MCD
Outlier identification
Outliers in linear regression
Without outlier
Regression outlier
Bad leverage point
Good leverage point
Example for masking
Robust Regression estimators
Robust Regression estimators L1
Final robust estimator