Regression leverage and influence
A Probability and Statistics Post: An animation of leverage and an application to global temperatures
A leverage point in a scatter plot is a point whose x-value is far from the other x-values, that is, an outlier along the x-axis. A leverage point may or may not influence the regression coefficients. Figure 1 (code at the end) is an animation of a high-leverage point that starts on the trend of the other points, where it has little influence on the regression line. As it moves down, away from that trend, it pulls the line toward itself: the slope drops from 1.97 to 1.2, while the y-intercept increases from 5.28 to 9.78.
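One way to put numbers on the difference between leverage and influence is with base R's `hatvalues()` (leverage) and `cooks.distance()` (a common influence measure). The sketch below is only an illustration: the data are synthetic and loosely mimic the animation (a cloud of points with x between 0 and 15, plus one point at x = 30), and the helper `diagnose_added_point()` is a made-up name, not part of the animation code.

```r
# Synthetic data: 40 points near y = 2x + 5 with x in [0, 15] (illustrative only)
set.seed(1)
x <- runif(40, 0, 15)
y <- 2 * x + 5 + rnorm(40, 0, 2)

# Hypothetical helper: append one point, refit, and report diagnostics for it
diagnose_added_point <- function(x_new, y_new) {
  fit <- lm(y ~ x, data = data.frame(x = c(x, x_new), y = c(y, y_new)))
  n <- length(x) + 1  # index of the appended point
  c(leverage = unname(hatvalues(fit)[n]),
    cooks_d  = unname(cooks.distance(fit)[n]),
    slope    = unname(coef(fit)[2]))
}

on_line  <- diagnose_added_point(30, 2 * 30 + 5)  # far out in x, but on the trend
off_line <- diagnose_added_point(30, 20)          # same x, far below the trend

print(round(rbind(on_line, off_line), 3))
```

The leverage is the same in both rows because it depends only on the x-values. Cook's distance and the fitted slope change only when the point moves away from the trend, which is the behavior the animation shows.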
Figure 1 is an extreme example, with the last point about twice as far along the x-axis as the next-largest point in the data set. The same idea applies in less extreme cases: the last point in a data set can influence the fit even if it isn’t an outlier. Figure 2 shows the November global temperature anomalies. The last year, 2023, was an El Niño year and a record anomaly for November. That one new data point was influential enough to pull the slope up from 0.028 °F/yr to 0.029 °F/yr. This might not seem like a lot (a 3.6% increase), but if you use the regression line to extrapolate into the future, it can start to matter. Worse, if one were to fit a regression line to only the El Niño years, there would be fewer points in the fit, so 2023 would carry more leverage and be more influential.
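A quick way to see how much one final data point moves a trend is to fit the line with and without it. The sketch below uses synthetic anomalies, not the temperature record behind Figure 2: the trend, the wiggle, the size of the final-year jump, and the 50-year horizon are all assumptions chosen for illustration. The percent change is the same arithmetic as (0.029 − 0.028) / 0.028 ≈ 3.6% above.

```r
# Synthetic anomalies (NOT the real temperature record): a steady trend plus
# a small wiggle, with an unusually high value in the final year.
year      <- 2004:2023
yrs_since <- year - min(year)
anomaly   <- 0.03 * yrs_since + 0.1 * sin(yrs_since)
anomaly[length(anomaly)] <- anomaly[length(anomaly)] + 0.5  # assumed record year

temps <- data.frame(year = year, anomaly = anomaly)

fit_without <- lm(anomaly ~ year, data = head(temps, -1))  # drop the final year
fit_with    <- lm(anomaly ~ year, data = temps)            # keep the final year

slope_without <- coef(fit_without)[["year"]]
slope_with    <- coef(fit_with)[["year"]]
pct_change    <- 100 * (slope_with - slope_without) / slope_without

# Gap between the two lines' predictions 50 years after the last data point
horizon <- data.frame(year = max(year) + 50)
gap <- unname(predict(fit_with, horizon) - predict(fit_without, horizon))

cat(sprintf("slope without final year: %.4f per year\n", slope_without))
cat(sprintf("slope with final year:    %.4f per year\n", slope_with))
cat(sprintf("percent change in slope:  %.1f%%\n", pct_change))
cat(sprintf("prediction gap at %d:   %.3f\n", horizon$year, gap))
```

The further out the extrapolation, the larger the gap between the two lines, which is why a small change in slope can matter.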
As a second example, Figure 3 shows the data for December. The last year, 2023, is not yet officially an El Niño year, but it will be. We see the same dynamic as in November.
Code for animation
## Packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(animation)
library(ggpmisc)
## Colors
MyRed <- "#be4242"
MyPurple <- "#5B005B"
MyLightP <- "#dfdbdf"
MyLightP2 <- "#f8f4f8"
## create data frame
set.seed(43)
nData <- 40
xSpec <- 30
xTemp <- runif( nData, 0, 15 )
yTemp <- 2 * xTemp + 5 + rnorm( nData, 0, 2 )
ySpec <- seq(2 * xSpec + 5, 20, length.out = 50 )
xData <- c( rep( xTemp, 50), rep( xSpec, 50 ))
yData <- c( rep( yTemp, 50), ySpec )
tData <- c( rep( 1:50, each = nData ), 1:50 )
data <- data.frame( "x" = xData, "y" = yData, "time" = tData)
dataFirst <- data %>% filter( time == 1)
## Defined variables
CaptionData <- ""
## fixed data regression summary
result <- lm(y ~ x, data = dataFirst)
Int <- round(result$coefficients[[1]], 2)
Slope <- round(result$coefficients[[2]], 2)
Rsq <- round(summary(result)$r.squared, 2)
# digits must be named: the second positional argument of format() is trim
p_val <- format(summary(result)$coefficients[2, 4], digits = 2)
# p_val is a string, so the whole "Fixed" column is stored as character
RegResult <- data.frame("Result" = c("Int", "Slope", "Rsq", "p_val"),
                        "Fixed" = c(Int, Slope, Rsq, p_val))
## Create Animation
saveGIF(
  {
    ani.options(interval = 0.20, nmax = 50)
    for (i in 1:50) {
      # data for frame i: the fixed cloud plus the moving point at x = 30
      DataA <- data %>% filter(time == i)
      # table data for animation
      result2 <- lm(y ~ x, data = DataA)
      Int2 <- round(result2$coefficients[[1]], 2)
      Slope2 <- round(result2$coefficients[[2]], 2)
      Rsq2 <- round(summary(result2)$r.squared, 2)
      p_val2 <- format(summary(result2)$coefficients[2, 4], digits = 2)
      RegResult$Current <- c(Int2, Slope2, Rsq2, p_val2)
      # Graph: red = current frame's fit, purple = fit from the first frame
      # linewidth needs ggplot2 >= 3.4.0; use size for lines on older versions
      p <- ggplot(DataA, aes(x, y)) +
        geom_point(size = 4) +
        stat_smooth(method = "lm", formula = y ~ x, geom = "smooth",
                    se = FALSE, color = MyRed, linewidth = 1.5) +
        geom_point(data = dataFirst, aes(x, y), size = 4) +
        stat_smooth(data = dataFirst, aes(x, y), method = "lm",
                    formula = y ~ x, geom = "smooth", se = FALSE,
                    color = MyPurple, linewidth = 1.5) +
        geom_point(data = DataA[DataA$x == 30, ], aes(x, y), size = 5, color = MyRed) +
        theme(axis.text = element_text(size = 14),
              axis.title = element_text(size = 16),
              plot.title = element_text(size = 20),
              # MyLightP3 is not defined above; MyLightP2 is substituted here
              plot.background = element_rect(fill = MyLightP2),
              panel.background = element_rect(fill = MyLightP),
              legend.background = element_rect(fill = MyLightP2),
              plot.caption = element_text(hjust = 1, size = 14,
                                          color = MyPurple)) +
        labs(title = "Regression Leverage Animation",
             y = NULL, x = NULL,
             caption = "Briefed by Data || Thomas J Pfaff") +
        annotate(geom = "table", x = 3, y = 60, label = list(RegResult), size = 6)
      plot(p)
      ani.pause()
    }
    # hold the final frame for five extra frames
    for (k in 1:5) {
      print(p)
      ani.pause()
    }
  },
  movie.name = "RegressionLeverage.gif", ani.width = 1456, ani.height = 936
)
Outliers can have an outsized influence on a regression line because least squares minimizes the sum of the squared vertical distances from the observations to the line. If the line were not drawn closer to the outlier, that one point's squared distance would be very large.
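That argument can be checked numerically. The sketch below, again on synthetic data with an assumed outlier at (30, 20), compares the sum of squared errors for two candidate lines over the full data set: the line fitted before the outlier was added, and the least-squares line fitted with it. The helper `sse()` is a made-up name for this illustration.

```r
# Synthetic data near y = 2x + 5, plus one assumed outlier at (30, 20)
set.seed(2)
x <- runif(40, 0, 15)
y <- 2 * x + 5 + rnorm(40, 0, 2)
x_all <- c(x, 30)
y_all <- c(y, 20)

# Hypothetical helper: sum of squared vertical distances to a given line
sse <- function(intercept, slope, x, y) sum((y - (intercept + slope * x))^2)

old_fit <- lm(y ~ x)          # line fitted before the outlier is added
new_fit <- lm(y_all ~ x_all)  # least-squares line with the outlier included

sse_old_line <- sse(coef(old_fit)[[1]], coef(old_fit)[[2]], x_all, y_all)
sse_new_line <- sse(coef(new_fit)[[1]], coef(new_fit)[[2]], x_all, y_all)

# Share of the old line's total squared error due to the outlier alone
outlier_sq_error  <- (20 - (coef(old_fit)[[1]] + coef(old_fit)[[2]] * 30))^2
outlier_share_old <- outlier_sq_error / sse_old_line

cat(sprintf("SSE if the line ignores the outlier: %.1f\n", sse_old_line))
cat(sprintf("SSE of the refitted line:            %.1f\n", sse_new_line))
cat(sprintf("outlier's share of the old SSE:      %.0f%%\n", 100 * outlier_share_old))
```

Because the distance is squared, a single far-off point can account for most of the total error when the line ignores it, so the least-squares fit accepts slightly worse fits to the other points in exchange for moving closer to the outlier.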