Log Link vs. Log(y)

R
Statistics
Author

Demetri Pananos

Published

October 6, 2020

You wanna see a little gotcha in statistics? Take the following data

set.seed(0)
N = 1000
y = rlnorm(N, 0.5, 0.5)

and explain why glm(y ~ 1, family = gaussian(link = log)) and lm(log(y) ~ 1) produce different estimates of the intercept. In case you don’t have an R session handy, here are the outputs

log_lm = lm(log(y) ~ 1)
summary(log_lm)

Call:
lm(formula = log(y) ~ 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.61028 -0.34631 -0.02152  0.35173  1.64112 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.49209    0.01578   31.18   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.499 on 999 degrees of freedom
glm_mod = glm(y ~ 1 , family = gaussian(link=log))
summary(glm_mod)

Call:
glm(formula = y ~ 1, family = gaussian(link = log))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5282  -0.6981  -0.2541   0.4702   6.5869  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.61791    0.01698    36.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.9918425)

    Null deviance: 990.85  on 999  degrees of freedom
Residual deviance: 990.85  on 999  degrees of freedom
AIC: 2832.7

Number of Fisher Scoring iterations: 5

The answer comes down to the difference between \(E(g(X))\) and \(g(E(X))\), which are not in general the same. Let me explain.
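You can see the gap directly in the simulated data: with \(g = \log\), swapping the order of \(g\) and the expectation gives two different numbers, and each model targets one of them.

```r
set.seed(0)
y = rlnorm(1000, 0.5, 0.5)

log(mean(y))   # g(E(y)) -- the target of glm with a log link (~0.618 here)
mean(log(y))   # E(g(y)) -- the target of lm(log(y) ~ 1)   (~0.492 here)
```

Since log is concave, Jensen's inequality guarantees log(mean(y)) is at least mean(log(y)), and for a skewed variable like this one the gap is substantial.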

First, let’s start with the lognormal random variable. \(y \sim \operatorname{Lognormal}(\mu, \sigma)\) means \(\log(y) \sim \operatorname{Normal}(\mu, \sigma)\), so \(\mu\) and \(\sigma\) are the mean and standard deviation of the underlying normal distribution. When we do lm(log(y) ~ 1), we are modelling \(E(\log(y)) = \beta_0\). So \(\beta_0\) is an estimate of \(\mu\), and \(\exp(\beta_0)\) is an estimate of \(\exp(\mu)\), which is the median of the lognormal. That’s easy to check

median(y)
[1] 1.600898
#Meh, close enough
exp(coef(log_lm))
(Intercept) 
   1.635723 

If I wanted an estimate of the mean of the lognormal, I would need to add \(\sigma^2/2\) to my estimate of \(\mu\), since the mean of a lognormal is \(\exp(\mu + \sigma^2/2)\).

mean(y)
[1] 1.855038
#Meh, close enough
sigma = var(log_lm$residuals)  # note: this is an estimate of sigma^2, not sigma
exp(coef(log_lm) + sigma/2)
(Intercept) 
   1.852594 

Ok, onto the glm now. When we use the glm, we model \(\log(E(y)) = \beta_0\), so \(\exp(\beta_0)\) estimates the mean of the lognormal directly. Case in point

mean(y)
[1] 1.855038
exp(coef(glm_mod))
(Intercept) 
   1.855038 

and if I wanted the median, I would need to divide out the extra factor of \(\exp(\sigma^2/2)\)

median(y)
[1] 1.600898
exp(coef(glm_mod) - sigma/2)
(Intercept) 
   1.637881 

Log link vs. log outcome can be tricky. Just be sure you know what you’re modelling when you use either.
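To put the whole correspondence in one place, here is a short recap sketch, re-running the simulation and both models from above:

```r
set.seed(0)
y = rlnorm(1000, 0.5, 0.5)

log_lm  = lm(log(y) ~ 1)                             # models E(log(y)) = beta_0, so beta_0 estimates mu
glm_mod = glm(y ~ 1, family = gaussian(link = log))  # models log(E(y)) = beta_0
s2 = var(residuals(log_lm))                          # estimate of sigma^2

exp(coef(log_lm))             # ~ median(y): exp(mu)
exp(coef(log_lm) + s2 / 2)    # ~ mean(y):   exp(mu + sigma^2/2)
exp(coef(glm_mod))            # ~ mean(y):   E(y) directly
exp(coef(glm_mod) - s2 / 2)   # ~ median(y): divide out exp(sigma^2/2)
```

Back-transforming the log-outcome model gives you the median; back-transforming the log-link model gives you the mean. The \(\exp(\sigma^2/2)\) factor converts between the two.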