Don’t overthink ‘outliers’: use a Student’s t-distribution instead

admin

March 30, 2024

A Student’s t-distribution is nothing more than a Gaussian distribution with heavier tails. In other words, we can say that the Gaussian distribution is a special (limiting) case of the Student’s t-distribution. The Gaussian distribution is defined by the mean (μ) and the standard deviation (σ). The Student’s t-distribution, on the other hand, adds a third parameter, the degrees of freedom (df), which controls the “thickness” of the tails. The lower the degrees of freedom, the more probability is assigned to events far from the mean. This feature is especially useful for small sample sizes, such as those common in biomedicine, where the assumption of normality is questionable. Note that as the degrees of freedom increase, the Student’s t-distribution approaches the Gaussian distribution. We can visualize this using density plots:

# Load the required library
library(ggplot2)
# Set seed for reproducibility
set.seed(123)
# Define the distributions
x <- seq(-4, 4, length.out = 200)
y_gaussian <- dnorm(x)
y_t3 <- dt(x, df = 3)
y_t10 <- dt(x, df = 10)
y_t30 <- dt(x, df = 30)
# Create a data frame for plotting
df <- data.frame(x, y_gaussian, y_t3, y_t10, y_t30)
# Plot the distributions
ggplot(df, aes(x)) +
  geom_line(aes(y = y_gaussian, color = "Gaussian")) +
  geom_line(aes(y = y_t3, color = "t, df=3")) +
  geom_line(aes(y = y_t10, color = "t, df=10")) +
  geom_line(aes(y = y_t30, color = "t, df=30")) +
  labs(title = "Comparison of Gaussian and Student t-Distributions",
       x = "Value",
       y = "Density") +
  scale_color_manual(values = c("Gaussian" = "blue", "t, df=3" = "red", "t, df=10" = "green", "t, df=30" = "purple")) +
  theme_classic()

Figure 1: Comparison of Gaussian and Student t-Distributions with different degrees of freedom.

Note in Figure 1 that the peak around the mean gets lower as the degrees of freedom decrease, because probability mass moves to the tails, which become thicker. This property is what gives the Student’s t-distribution its reduced sensitivity to outliers. For more details on this matter, you can check this blog.
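To put numbers on what Figure 1 shows, the following sketch uses base R’s `dt()`, `pt()`, `dnorm()` and `pnorm()` to compare the height of the density at zero and the probability of landing more than three units away from the mean. The cut-off of 3 is an arbitrary value picked for illustration; the degrees of freedom are the ones plotted above.

```r
# Degrees of freedom to compare (same values as in Figure 1)
dfs <- c(3, 10, 30)

# Height of the density at the mean: lower for small df
peak_t <- sapply(dfs, function(nu) dt(0, df = nu))
peak_gaussian <- dnorm(0)

# Probability of a value more than `cutoff` units from the mean (both tails)
cutoff <- 3  # arbitrary illustrative threshold
tail_t <- sapply(dfs, function(nu) 2 * pt(-cutoff, df = nu))
tail_gaussian <- 2 * pnorm(-cutoff)

comparison <- data.frame(
  distribution = c(paste0("t, df=", dfs), "Gaussian"),
  peak_density = c(peak_t, peak_gaussian),
  tail_prob    = c(tail_t, tail_gaussian)
)
print(comparison)
```

With df = 3, the tail probability is roughly twenty times the Gaussian one, and both columns move toward the Gaussian row as the degrees of freedom grow.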

We load the required libraries:

library(ggplot2)
library(brms)
library(ggdist)
library(easystats)
library(dplyr)
library(tibble)
library(ghibli)

So, let’s skip data simulations and get serious. We’ll work with real data I have acquired from mice performing the rotarod test.

First, we load the dataset into the environment and set the corresponding factor levels. The dataset contains IDs for the animals, a grouping variable (Genotype), an indicator for the two different days on which the test was performed (Day), and different trials for the same day. For this article, we model only one of the trials (Trial3). We will save the other trials for a future article on modeling variation.

As the data handling implies, our modeling strategy will be based on Genotype and Day as categorical predictors of the distribution of Trial3.

In biomedical science, categorical predictors, or grouping factors, are more common than continuous predictors. Scientists in this field like to divide their samples into groups or conditions and apply different treatments.

data <- read.csv("Data/Rotarod.csv")
data$Day <- factor(data$Day, levels = c("1", "2"))
data$Genotype <- factor(data$Genotype, levels = c("WT", "KO"))
head(data)
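Before plotting, it can help to confirm that the factor levels were set as intended and to look at per-group summaries. The sketch below wraps that check in a small helper, `summarise_groups()` (a name made up here), and runs it on a tiny made-up data frame, `toy`, whose values are placeholders and not taken from the rotarod dataset. It assumes the columns are named `Genotype`, `Day` and `Trial3`, as in the code above.

```r
# Hypothetical helper: per-group n, mean and SD of Trial3
summarise_groups <- function(d) {
  out <- aggregate(
    Trial3 ~ Genotype + Day,
    data = d,
    FUN = function(v) c(n = length(v), mean = mean(v), sd = sd(v))
  )
  # aggregate() returns a matrix column; flatten it into regular columns
  do.call(data.frame, out)
}

# Placeholder values for illustration only (NOT the real rotarod data)
toy <- data.frame(
  Genotype = factor(rep(c("WT", "KO"), each = 4), levels = c("WT", "KO")),
  Day      = factor(rep(c("1", "2"), times = 4), levels = c("1", "2")),
  Trial3   = c(10, 12, 14, 16, 20, 25, 30, 35)
)

levels(toy$Genotype)  # the reference level comes first
group_summary <- summarise_groups(toy)
print(group_summary)
```

With the real data, `summarise_groups(data)` would return the same kind of table for the four Genotype × Day cells.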

Let’s have an initial look at the data using raincloud plots, as shown by Guilherme A. Franchi, PhD in this great blog post.

edv <- ggplot(data, aes(x = Day, y = Trial3, fill=Genotype)) +
scale_fill_ghibli_d("SpiritedMedium", direction = -1) +
geom_boxplot(width = 0.1,
outlier.color = "red") +
xlab('Day') +
ylab('Time (s)') +
ggtitle("Rotarod performance") +
theme_classic(base_size=18, base_family="serif")+
theme(text = element_text(size=18),
axis.text.x = element_text(angle=0, hjust=.1, vjust = 0.5, color = "black"),
axis.text.y = element_text(color = "black"),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position="bottom")+
scale_y_continuous(breaks = seq(0, 100, by=20), 
limits=c(0,100)) +
# Line below adds dot plots from {ggdist} package 
stat_dots(side = "left", 
justification = 1.12,
binwidth = 1.9) +
# Line below adds half-violin from {ggdist} package
stat_halfeye(adjust = .5, 
width = .6, 
justification = -.2, 
.width = 0, 
point_colour = NA)
edv

Figure 2: Exploratory data visualization.
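The red points in Figure 2 are the observations that `geom_boxplot()` draws as outliers, which by default are values more than 1.5 times the interquartile range beyond the quartiles of their box. The sketch below applies that rule by hand to show how a single flagged value moves the mean far more than the median. `flag_boxplot_outliers()` is a hypothetical helper, and the vector `times` holds made-up numbers, not rotarod measurements.

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences
flag_boxplot_outliers <- function(v, coef = 1.5) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  v < q[1] - coef * iqr | v > q[2] + coef * iqr
}

# Made-up values for illustration only: one extreme observation
times <- c(18, 20, 21, 22, 23, 25, 26, 90)
is_out <- flag_boxplot_outliers(times)
times[is_out]

# The mean shifts strongly when the flagged value is dropped;
# the median barely moves
c(mean_all = mean(times), mean_trimmed = mean(times[!is_out]))
c(median_all = median(times), median_trimmed = median(times[!is_out]))
```

This is only meant to show why the mean is fragile. It is not a recommendation to delete the flagged observations.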

Figure 2 looks different from the original by Guilherme A. Franchi, PhD because we are plotting two factors instead of one. Nevertheless, the nature of the plot is the same. Pay attention to the red dots: these are the ones that could be considered extreme observations that tilt the measures of central tendency (especially the mean) toward one direction. We also observe that the variances differ between groups, so modeling sigma as well may give better estimates. Our task now is to model the output using the brms package.
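As a preview of where this is heading, here is a minimal sketch of how such a model might be specified in brms. It is an assumption about the setup rather than the final model: the interaction `Genotype * Day`, the use of the same predictors for `sigma`, and the helper name `fit_robust()` are illustrative choices, and priors are left at their brms defaults. `bf()` builds a formula with a separate linear predictor for sigma, and `student()` selects the Student’s t likelihood, for which brms also estimates the degrees of freedom (`nu`).

```r
library(brms)

# Distributional formula: predictors for the location and for sigma
robust_formula <- bf(
  Trial3 ~ Genotype * Day,  # assumed structure for the location
  sigma  ~ Genotype * Day   # assumed structure for the scale
)

# Hypothetical wrapper so that nothing is fitted until it is called
fit_robust <- function(d, ...) {
  brm(
    formula = robust_formula,
    data    = d,
    family  = student(),  # heavier tails than gaussian()
    ...
  )
}
```

Calling `fit_robust(data, seed = 123)` would compile and sample the model. Swapping `student()` for `gaussian()` gives the conventional model for comparison.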
