I did an animation demonstrating how an outlier can exert leverage on the regression line in Regression leverage and influence. It was pointed out to me that in such an instance, one should be using the Theil-Sen regression with an outlier in the data. The animation and R code are below. Here is a summary of the difference from Theil-Sen Regression: Programming and Understanding an Outlier-Resistant Alternative to Least Squares (4/28/2023):
** Note: I can’t put this in a quote block due to the formulas. **
One alternative to least-squares regression when estimating a linear relationship between x and y is Theil-Sen regression. In simple linear regression fit by least squares, the coefficient of predictor x (the slope between x and y) and the intercept are chosen such that the value of
(the sum of squared residuals) is minimized. In Theil-Sen regression, the slope is calculated by taking the median of the slopes between each pair of points in the data. (For a pair of points,
the slope is calculated as
An intercept can be calculated to accompany the Theil-Sen slope estimate as well: For each point,
where m is the Theil-Sen slope (i.e., find the intercept of the line with the calculated Theil-Sen slope that passes through each point); then take the median of those intercepts (Young, p. 201).
** End quote block. **
Here is the animation using the same data set as in the original post. I’ll leave it as an exercise as to why the Theil-Sen line moves a little but then stops moving as the outlier point moves down.
If you are teaching a programming course, then it is a good exercise to create code to calculate the slope and intercept of the Theil-Sen line. These computational methods are viable today, with computing power being what it is, but education hasn’t always caught up. Thanks for the comment on LinkedIn for pointing this out to me.
R Code
## Packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(animation)
library(ggpmisc)
library(RobustLinearReg)
## Colors
MyBlue <- "#437fca"
MyRed <- "#be4242"
MyPurple <- "#5B005B"
MyLightP <- "#dfdbdf"
MyLightP2 <- "#f8f4f8"
MyLightP3 <- "#fcfafc"
## create data frame
set.seed(43)
nData <- 40
xSpec <- 30
xTemp <- runif( nData, 0, 15 )
yTemp <- 2 * xTemp + 5 + rnorm( nData, 0, 2 )
ySpec <- seq(2 * xSpec + 5, 20, length.out = 50 )
xData <- c( rep( xTemp, 50), rep( xSpec, 50 ))
yData <- c( rep( yTemp, 50), ySpec )
tData <- c( rep( 1:50, each = nData ), 1:50 )
data <- data.frame( "x" = xData, "y" = yData, "time" = tData)
dataFirst <- data %>% filter( time == 1)
## Defined variables
CaptionData <- ""
## fixed data regression summary
dev.new(width = 1456,height = 936,unit = "px")
result <- lm( y ~ x, data = dataFirst )
Int <- round( result$coefficients[[1]], 2 )
Slope <- round( result$coefficients[[2]], 2 )
Rsq <- round( summary(result)$r.squared, 2 )
p_val <- format( summary(result)$coefficients[2,4], 2)
RegResult <- data.frame("Result" = c("Int", "Slope", "Rsq", "p_val"),
"Fixed" = c(Int, Slope, Rsq, p_val))
## Create Annimation
saveGIF(
{
ani.options(interval = 0.20, nmax = 50)
for (i in 1:50){
DataA <- data %>% filter(time == i)
# table data for animation
result2 <- lm(y~x, data = DataA)
Int2 <- round(result2$coefficients[[1]], 2)
Slope2 <- round(result2$coefficients[[2]], 2)
Rsq2 <- round(summary(result2)$r.squared, 2)
p_val2 <- format(summary(result2)$coefficients[2,4], 2)
RegResult$'Red (Least Squares)' <- c(Int2, Slope2, Rsq2, p_val2)
# Theil-Sen Regression
result3 <- theil_sen_regression(y ~ x, data = DataA)
Int3 <- round(result3$coefficients[[1]], 2)
Slope3 <- round(result3$coefficients[[2]], 2)
Rsq3 <- round(summary(result3)$r.squared, 2)
p_val3 <- format(summary(result3)$coefficients[2,4], 2)
RegResult$'Blue (Theil-Sen)' <- c(Int3, Slope3, Rsq3, p_val3)
# Graph
p <- ggplot(DataA, aes(x, y)) +
geom_point(size = 4) +
stat_smooth(method = "lm", formula = y ~ x, geom = "smooth",
se = FALSE, color = MyRed, linewidth = 1.5) +
geom_point(data = dataFirst, aes(x, y), size = 4) +
stat_smooth(data = dataFirst, aes(x, y), method = "lm",
formula = y ~ x, geom = "smooth", se = FALSE,
color = MyPurple, linewidth = 1.5) +
geom_point(data = DataA[DataA$x == 30, ], aes(x, y), size = 5, color = MyRed)+
geom_segment(aes(x = 0, xend = 30, y = Int3, yend = Int3 + 30*Slope3),
linewidth=1.5, color=MyBlue)+
theme(axis.text = element_text(size = 14),
axis.title = element_text(size = 16),
plot.title = element_text(size = 20),
plot.background = element_rect(fill = MyLightP3),
panel.background = element_rect(fill = MyLightP),
legend.background = element_rect(fill = MyLightP2),
plot.caption = element_text(hjust = c(1),size = c(14),
color = c(MyPurple))) +
labs(title = "Regression Leverage Animation: Least Squares vs. Theil-Sen",
y = NULL, x = NULL,
caption = c("Briefed by Data || Thomas J Pfaff")) +
annotate(geom = "table", x = 3,y = 60, label = list(RegResult), size = 6)
plot(p)
ani.pause()
}
for (k in 1:5){
print(p)
ani.pause() }
}, movie.name = "RegressionLeverage2.gif", ani.width = 1456, ani.height = 936)