Regression leverage and influence

A Probability and Statistics Post: An animation of leverage and an application to global temperatures

Jan 20, 2024

A leverage point in a scatter plot is a point whose x-value lies far from the rest of the data, that is, an outlier along the x-axis. A leverage point may or may not influence the regression coefficients; that depends on how far its y-value falls from the trend of the other points. Figure 1 (code at the end) is an animation of a leverage point that starts in line with the trend of the other points, where it has little effect on the regression line, and then exerts more and more influence on the coefficients as it moves away from that trend. Notice how the slope of the line drops from 1.97 to 1.2, while the y-intercept increases from 5.28 to 9.78.

Figure 1. An animation demonstrating how a point with leverage can exert influence over the regression coefficients. The animation R code is at the end.
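
To put numbers on the difference between leverage and influence, here is a minimal sketch using base R's hatvalues() and cooks.distance(). It borrows the seed and the simulated data from the animation code at the end; the names x_all, y_on, y_off, fit_on, and fit_off are mine and only for illustration.

```r
# Sketch: leverage vs. influence with base R diagnostics.
set.seed(43)                       # same seed as the animation code
n <- 40
x <- runif(n, 0, 15)
y <- 2 * x + 5 + rnorm(n, 0, 2)

x_all <- c(x, 30)                  # one extra point far out on the x-axis
y_on  <- c(y, 2 * 30 + 5)          # extra point sits on the true line
y_off <- c(y, 20)                  # extra point pulled far below the line

fit_on  <- lm(y_on  ~ x_all)
fit_off <- lm(y_off ~ x_all)

# Leverage (hat values) depends only on x, so it is the same in both fits.
lev <- hatvalues(fit_on)
which.max(lev)                     # position of the highest-leverage point

# Influence (Cook's distance) depends on both x and y.
cooks.distance(fit_on)[n + 1]
cooks.distance(fit_off)[n + 1]

# Coefficients with the extra point on and off the line.
rbind(on_line = coef(fit_on), off_line = coef(fit_off))
```

The leverage of the extra point is identical in both fits because hat values depend only on the x-values. Only its Cook's distance, and with it the fitted coefficients, changes as the point moves off the line.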

Figure 1 is an extreme example: the leverage point sits at x = 30, twice the upper limit of 15 used for the x-values of the other points. The same idea applies when the last point influences the fit without being an outlier. Figure 2 shows the November global temperature anomalies. The last year, 2023, was an El Niño year and a record anomaly for November. That one new data point was influential enough to pull the slope up from 0.028 °F/yr to 0.029 °F/yr. This might not seem like a lot (a 3.6% increase), but if you use the regression line to extrapolate into the future, it can start to matter. Worse, if one were to fit a regression line to just the El Niño years, 2023 would have more leverage and would be more influential.
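
The 3.6% figure and the remark about El Niño years can be checked with a short sketch. The two slopes are the values quoted above. The series lengths of 10, 25, and 50 are hypothetical, and evenly spaced years are a simplifying assumption (El Niño years are not evenly spaced). last_leverage is a helper name made up for this example.

```r
# Percent change in the slope after adding 2023.
slope_before <- 0.028   # °F per year, without 2023
slope_after  <- 0.029   # °F per year, with 2023
pct_change <- 100 * (slope_after - slope_before) / slope_before
round(pct_change, 1)

# Leverage of the final year in a series of n evenly spaced years.
# For simple regression, h_i = 1/n + (x_i - mean(x))^2 / sum((x - mean(x))^2).
last_leverage <- function(n) {
  year <- seq_len(n)
  dev  <- year - mean(year)
  1 / n + dev[n]^2 / sum(dev^2)
}

# Hypothetical series lengths: the shorter the series, the more
# leverage its final year carries.
sapply(c(10, 25, 50), last_leverage)
```

Under these assumptions, the leverage of the final year grows as the number of years shrinks, which is why restricting the fit to a small subset of years makes the last point more influential.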

As a second example, Figure 3 shows the data for December. The last year, 2023, is not yet officially an El Niño year, but it will be. We see the same dynamic as in November.

Code for animation


## Packages

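library(dplyr)      # filter() and %>%
library(tidyr)
library(ggplot2)    # plotting
library(animation)  # saveGIF(), ani.options(), ani.pause()
library(ggpmisc)    # provides the "table" geom used in annotate()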

## Colors

MyRed <- "#be4242"
MyPurple <- "#5B005B"
MyLightP <- "#dfdbdf" 
MyLightP2 <-  "#f8f4f8"   

## create data frame

set.seed(43)
nData <- 40
xSpec <- 30
xTemp <- runif( nData, 0, 15 )
yTemp <- 2 * xTemp + 5 + rnorm( nData, 0, 2 )
ySpec <- seq(2 * xSpec + 5, 20, length.out = 50 )
xData <- c( rep( xTemp, 50), rep( xSpec, 50 ))
yData <- c( rep( yTemp, 50), ySpec )
tData <- c( rep( 1:50, each = nData ), 1:50 )

data <- data.frame( "x" = xData, "y" = yData, "time" = tData) 

dataFirst <- data %>% filter( time == 1)

## Defined variables

CaptionData <- ""

## fixed data regression summary


result<-lm( y ~ x, data = dataFirst )

Int <- round( result$coefficients[[1]], 2 )
Slope <- round( result$coefficients[[2]], 2 )
Rsq <- round( summary(result)$r.squared, 2 )
p_val <- format( summary(result)$coefficients[2,4], digits = 2)

RegResult <- data.frame("Result" = c("Int", "Slope", "Rsq", "p_val"), 
                        "Fixed" = c(Int, Slope, Rsq, p_val))

## Create Animation


saveGIF(
{
ani.options(interval = 0.20, nmax = 50)

for (i in 1:50){


DataA <- data %>% filter(time == i)

# table data for animation

result2<-lm(y~x, data = DataA)

Int2 <- round(result2$coefficients[[1]], 2)
Slope2 <- round(result2$coefficients[[2]], 2)
Rsq2 <- round(summary(result2)$r.squared, 2)
p_val2 <- format(summary(result2)$coefficients[2,4], digits = 2)

RegResult$Current <- c(Int2, Slope2, Rsq2, p_val2)

# Graph

p <-  ggplot(DataA, aes(x, y)) + 
  geom_point(size = 4) + 
  stat_smooth(method = "lm", formula = y ~ x, geom = "smooth",
              se = FALSE, color = MyRed, linewidth = 1.5) +
  geom_point(data = dataFirst, aes(x, y), size = 4) +
  stat_smooth(data = dataFirst, aes(x, y), method = "lm",
              formula = y ~ x, geom = "smooth", se = FALSE,
              color = MyPurple, linewidth = 1.5) +
  geom_point(data = DataA[DataA$x == 30, ], aes(x, y), size = 5, color = MyRed)+
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 16),
        plot.title = element_text(size = 20),
        plot.background = element_rect(fill = MyLightP2),  # MyLightP3 is not defined above; MyLightP2 is an assumed stand-in
        panel.background = element_rect(fill = MyLightP), 
        legend.background = element_rect(fill = MyLightP2),
        plot.caption = element_text(hjust = 1, size = 14,
                                    color = MyPurple)) +
  labs(title = "Regression Leverage Animation",
       y = NULL, x = NULL,
       caption = "Briefed by Data || Thomas J Pfaff") +
  annotate(geom = "table", x = 3,y = 60, label = list(RegResult), size = 6)

plot(p)

ani.pause()

}

# Hold the final frame for five extra frames
for (k in 1:5) {
  print(p)
  ani.pause()
}

}, movie.name = "RegressionLeverage.gif", ani.width = 1456, ani.height = 936)