New plot methods for `check_outliers` (?) #262

rempsyc · 2023-01-20T20:18:14Z

This is regarding the check_outliers() paper for the journal Mathematics (easystats/performance#544). I wonder if we should add new plot methods to include in the article submission (deadline is Feb 23). I explain in detail below.

Model-based outliers

For model-based outliers, the `see` package already has an awesome plotting method:

library(performance)
library(see)

data <- rbind(mtcars[1:4], 42, 55)
model <- lm(disp ~ mpg * hp, data = data)
x <- check_outliers(model, method = "cook")
plot(x)

^{Created on 2023-01-20 with reprex v2.0.2}

Multiple methods

For multiple methods, we have no choice but to standardize the distance scores if we want to plot them on the same scale, so I think the current solution is pretty satisfactory.

library(performance)
library(see)

data <- rbind(mtcars[1:4], 42, 55)
x <- check_outliers(data, method = c("zscore_robust", "iqr", "mcd", "lof"))
plot(x)

^{Created on 2023-01-20 with reprex v2.0.2}
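To make the standardization point concrete, here is a minimal base R sketch of the idea: min-max rescaling each method's scores to 0-1 so they can share one axis. It is only an illustration, not the code that `see` or `performance` actually runs; the helper name `rescale01` and the two hand-computed scores are assumptions made for the example.

```r
# Toy data from the examples above
data <- rbind(mtcars[1:4], 42, 55)

# Hypothetical helper: min-max rescale a numeric vector to the 0-1 range
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Two distance-like scores that live on very different scales:
# robust z-score of mpg, |x - median| / MAD
z_robust <- abs(data$mpg - median(data$mpg)) / mad(data$mpg)
# squared Mahalanobis distance across all four columns
maha <- mahalanobis(data, colMeans(data), cov(data))

scores <- data.frame(
  zscore_robust = rescale01(z_robust),
  mahalanobis   = rescale01(maha),
  row.names     = rownames(data)
)

# Both columns now span 0-1 and can be drawn on the same axis
sapply(scores, range)
```

The price of rescaling is that each method's original threshold no longer sits on the plotted axis.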

Multivariate methods

For a single multivariate method, I think it is ok-ish. It could be a lot of work to write a custom plotting method for each multivariate method, so I think this is fine. But the x-axis is hard to read because the numbers overlap, and that is with only 34 rows (so imagine with big data sets).

library(performance)
library(see)

data <- rbind(mtcars[1:4], 42, 55)
x <- check_outliers(data, method = "mcd")
plot(x)

^{Created on 2023-01-20 with reprex v2.0.2}

For the Mahalanobis method specifically, one colleague believes their own custom plot is more useful (and would be happy to see it implemented within easystats):

data <- rbind(mtcars[1:4], 42, 55)
data <- cbind(car = row.names(data), data)

mahaout <- function(dataset, vars, idvar) {
  # Keep complete cases only, so IDs and distances stay aligned
  maha <- as.data.frame(na.omit(dataset[, c(idvar, vars)]))
  x <- maha[, vars, drop = FALSE]
  # Squared Mahalanobis distance of each row from the column means
  maha$values <- mahalanobis(x, colMeans(x), cov(x))
  # Chi-square cutoff at p = .001, df = number of variables
  crit <- qchisq(0.999, df = length(vars))
  plot(sort(maha$values),
       xlab = "Observations", ylab = "Mahalanobis values")
  abline(h = crit, col = "darkred")
  # Return the IDs of the observations above the cutoff
  maha[maha$values > crit, idvar]
}

mahaout(data, vars = names(data[-1]), idvar = "car")

#> [1] "34"

^{Created on 2023-01-20 with reprex v2.0.2}

Should we have something like that for `method = "mahalanobis"` and similar ones? The guiding principle could be: plot the distance of each individual observation plus a line at the chosen threshold. If we do this, it might not be that much work, since the actual distances and thresholds are already accessible as attributes, and it would make for very consistent plotting.
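As a rough sketch of that guiding principle, a single generic helper could take any vector of distances and any threshold. The function name `plot_distance` and its arguments are hypothetical, and the distances are computed by hand here rather than read from a `check_outliers()` object, so this is an illustration under those assumptions and not a proposed implementation.

```r
# Hypothetical helper: one point per observation plus a line at the threshold
plot_distance <- function(distance, threshold, ylab = "Distance") {
  is_outlier <- distance > threshold
  plot(distance,
       col = ifelse(is_outlier, "darkred", "grey40"), pch = 19,
       xlab = "Observation", ylab = ylab)
  abline(h = threshold, col = "darkred", lty = 2)
  # Return the flags invisibly so the caller can reuse them
  invisible(is_outlier)
}

data <- rbind(mtcars[1:4], 42, 55)
d2 <- mahalanobis(data, colMeans(data), cov(data))
crit <- qchisq(0.999, df = ncol(data))

flagged <- plot_distance(d2, crit, ylab = "Mahalanobis distance (squared)")
names(which(flagged))
```

Because the helper only needs a distance vector and a cutoff, the same function would work for any method that exposes both.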

Lakens's Method

Edit: I forgot to add this other example. Alternatively, we have the outlier plotting method from Daniel Lakens's outliers paper (Leys et al., 2019).

library(Routliers)

data <- rbind(mtcars[1:4], 42, 55)
res <- outliers_mcd(x = data)
plot_outliers_mcd(res, x = data)

^{Created on 2023-01-20 with reprex v2.0.2}

Univariate methods

Let me give you another example of mine, for univariate outliers. Currently, we get the same boring plot for `method = "zscore_robust"`, for instance.

library(performance)
library(see)

data <- rbind(mtcars[1:4], 42, 55)
x <- check_outliers(data, method = "zscore_robust")
plot(x)

^{Created on 2023-01-20 with reprex v2.0.2}

But I was imagining that it might be useful to use something like this for z-scores:

library(rempsyc)

data <- rbind(mtcars[1:4], 42, 55)
plot_outliers(data, response = "mpg", method = "sd", criteria = 3)

^{Created on 2023-01-20 with reprex v2.0.2}

And something similar for robust z-scores. For several variables, we could also wrap it in a panel:

library(rempsyc)
library(see)
data <- rbind(mtcars[1:4], 42, 55)
plots(lapply(names(data), function(x) {
  plot_outliers(data, response = x, ytitle = x, method = "mad", criteria = 3)
}), n_columns = 2)

^{Created on 2023-01-20 with reprex v2.0.2}

Lakens's Method

Edit: I forgot to add this other example. Alternatively, we have the outlier plotting method from Daniel Lakens's outliers paper (Leys et al., 2019).

library(Routliers)

data <- rbind(mtcars[1:4], 42, 55)
res <- outliers_mad(x = data$mpg)
plot_outliers_mad(res, x = data$mpg)

^{Created on 2023-01-20 with reprex v2.0.2}

Challenges

One possible challenge with univariate methods arises when they are applied to several columns. In that case the proposed solution will not work, since the rescaled score (0-1) is an aggregate of the scores of each column (for a single multivariate method that would not be a problem, by definition). So we could implement this only when a single method and a single column are selected? Unless of course we use lapply with see::plots, like in the panel example above.
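To illustrate the per-column alternative, here is a minimal base R sketch that keeps one score and one set of flags per column instead of a single aggregated score. The helper `robust_z` and the cutoff of 3 are assumptions for the example; this is not how `check_outliers()` computes or stores its results.

```r
data <- rbind(mtcars[1:4], 42, 55)

# Hypothetical helper: robust z-score, |x - median| / MAD
robust_z <- function(v) abs(v - median(v)) / mad(v)

# One result per column, each judged against the same cutoff of 3
per_column <- lapply(data, function(v) {
  z <- robust_z(v)
  data.frame(row = rownames(data), z = z, outlier = z > 3)
})

# Rows flagged in each column
lapply(per_column, function(d) d$row[d$outlier])
```

Each element of `per_column` could then feed one panel, which keeps the original threshold meaningful in every panel.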


DominiqueMakowski · 2023-01-22T02:36:51Z

I agree that the bar chart might not be the prettiest, BUT it has one major advantage over the other suggestions: it shows the row name, which makes it easy to identify which observation is problematic, and that is the goal of check_outliers.

The plots of check_outliers should aim at achieving 3 objectives:

1. Show how much outlying observations "outlie" as compared to the group
2. Explicitly show the threshold used for classification
3. Show the row name of the outliers

I like the idea of a dot-violin plot like

I think it does look better too, but perhaps we could further improve it by labelling the points, for example using pool points if the row names are numerical or gglabels if they are characters?

rempsyc · 2023-01-23T19:06:55Z

> BUT it has one major advantage over the other suggestions: it shows the row name, which makes it easy to identify which observation is problematic, and that is the goal of check_outliers.

True, but check_outliers() already flags them, so I think it is less critical to tag them on the plot than it would be if people only used the plot and not the check_outliers() output. Plus, as mentioned before, even with about 30 observations it's hard to read which observations we are looking at. For 400+ data points, like in my data sets, the current plotting method would simply be unusable.

Edit: here's an example with my current real-world data:

I feel like the point of the plot is more to give an idea of the overall distribution and the edge cases, ideally using the original threshold rather than a rescaling transformation. I agree with the three objectives you mention too. While I believe tagging the observations is slightly less important, I agree it would be a nice addition (either as default or as an option, since with many observations that too can become unwieldy). I just want to confirm with you: would you tag only the outliers (as is more common) or all observations (as I understand from poolpoint)?
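For what tagging only the outliers could look like, here is a small base R sketch on the original (unrescaled) scale. It uses plain z-scores of `mpg` with a cutoff of 3, both chosen just for illustration, and base graphics instead of the ggplot2 machinery that `see` relies on.

```r
data <- rbind(mtcars[1:4], 42, 55)

# Plain z-scores for one column, with a cutoff of 3 (assumed for illustration)
z <- abs(data$mpg - mean(data$mpg)) / sd(data$mpg)
names(z) <- rownames(data)
threshold <- 3
is_outlier <- z > threshold

plot(z, pch = 19, col = ifelse(is_outlier, "darkred", "grey40"),
     xlab = "Observation", ylab = "|z| for mpg")
abline(h = threshold, col = "darkred", lty = 2)

# Label only the flagged points so the plot stays readable with many rows
if (any(is_outlier)) {
  text(which(is_outlier), z[is_outlier],
       labels = names(z)[is_outlier], pos = 2, cex = 0.8)
}
```

With this approach the number of labels grows with the number of outliers, not with the number of rows.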

Finally, I know it looks weird right now to have "All data" as the x-axis title, but this is because outliers sometimes need to be checked by group, in which case they can be plotted by group as well, e.g.,

library(rempsyc)

data <- rbind(mtcars[1:4], 42, 55)
data$cyl[33:34] <- 6
plot_outliers(data, group = "cyl", response = "mpg")

^{Created on 2023-01-23 with reprex v2.0.2}

Ideally our plot method would detect if check_outliers has a grouped attribute and adjust the plot accordingly.
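As a sketch of what checking and plotting by group involves, the scores can be computed within each group before plotting one panel or box per group. This uses base R's `ave()` on the same toy data; it does not read any attribute from a `check_outliers()` object, and how grouping is stored there is left open.

```r
data <- rbind(mtcars[1:4], 42, 55)
data$cyl[33:34] <- 6

# z-score of mpg computed within each cyl group
z_by_group <- ave(data$mpg, data$cyl,
                  FUN = function(v) abs(v - mean(v)) / sd(v))

# Largest within-group z-score per group
tapply(z_by_group, data$cyl, max)

# One box per group, mirroring the grouped plot above
boxplot(mpg ~ cyl, data = data, xlab = "cyl", ylab = "mpg")
```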

If you give me the green light to add the poolpoints to the dot-violin plot and add it as a new plotting method (at least for zscore methods), I'll go ahead and start working on that.

rempsyc · 2023-03-19T17:11:33Z

Reviewer 2 in easystats/performance#544

> It is necessary to change the x-axis labels of all the figures as the numbers overlap; they should be written in a smaller font or otherwise eliminate some of them since they are not readable. I am talking about the figures referring to the histogram of the outliers.

So I guess we have no choice but to change the plotting methods now! So @DominiqueMakowski @IndrajeetPatil, do you approve of the above after all?

strengejacke · 2023-03-19T20:08:48Z

For the paper, we can just add ggplot layers to get the plot we want, unless these are very quick changes. I wouldn't spend too much time on coding if it is not really required for the review.
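A minimal sketch of the "extra layer" idea, using ggplot2 directly: the bar chart below is a stand-in built by hand, not the actual output of `plot()` on a `check_outliers()` object, and the angle and font size are arbitrary choices.

```r
library(ggplot2)

data <- rbind(mtcars[1:4], 42, 55)
d <- data.frame(
  obs = factor(rownames(data), levels = rownames(data)),
  distance = mahalanobis(data, colMeans(data), cov(data))
)

# Stand-in for the existing bar-style plot (not the actual see output)
p <- ggplot(d, aes(x = obs, y = distance)) +
  geom_col()

# Extra layer: rotate and shrink the x-axis labels so they stop overlapping
p_fixed <- p +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 7))
```

Assuming `plot(x)` returns a ggplot object, the same `+ theme(...)` layer could be added to it directly for the paper figures.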

rempsyc added the enhancement 🔥 New feature or request label Jan 20, 2023

rempsyc mentioned this issue Feb 12, 2023

check_outliers paper easystats/performance#544

Merged

rempsyc mentioned this issue Jan 24, 2024

Rotate x-axis labels by default for plot method for check_outliers() #317

Closed
