set.seed(0)
N = 1000
y = rlnorm(N, 0.5, 0.5)
You wanna see a little gotcha in statistics? Take the following data
and explain why glm(y ~ 1, family = gaussian(link = log))
and lm(log(y)~1)
produce different estimates of the coefficients. In case you don’t have an R terminal, here are the outputs
log_lm = lm(log(y) ~ 1)
summary(log_lm)
Call:
lm(formula = log(y) ~ 1)
Residuals:
Min 1Q Median 3Q Max
-1.61028 -0.34631 -0.02152 0.35173 1.64112
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.49209 0.01578 31.18 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.499 on 999 degrees of freedom
glm_mod = glm(y ~ 1, family = gaussian(link = log))
summary(glm_mod)
Call:
glm(formula = y ~ 1, family = gaussian(link = log))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5282 -0.6981 -0.2541 0.4702 6.5869
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.61791 0.01698 36.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.9918425)
Null deviance: 990.85 on 999 degrees of freedom
Residual deviance: 990.85 on 999 degrees of freedom
AIC: 2832.7
Number of Fisher Scoring iterations: 5
The answer boils down to the difference between \(E(g(X))\) and \(g(E(X))\), which are not, in general, equal. Let me explain.
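To make that concrete, here's a quick standalone simulation (fresh normal draws, not the y from above) showing that \(\exp(E(X)) \le E(\exp(X))\), i.e. Jensen's inequality for the convex function \(\exp\):

```r
# Jensen's inequality in action: for the convex function g = exp,
# g(E(X)) <= E(g(X)). This is why back-transforming a mean computed
# on the log scale does not give the mean on the original scale.
set.seed(1)
x <- rnorm(1e5, mean = 0.5, sd = 0.5)
exp(mean(x))  # g(E(X)): close to exp(0.5), about 1.65
mean(exp(x))  # E(g(X)): close to exp(0.5 + 0.5^2/2), about 1.87
```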
First, let’s start with the lognormal random variable. \(y \sim \operatorname{Lognormal}(\mu, \sigma)\) means \(\log(y) \sim \operatorname{Normal}(\mu, \sigma)\). So \(\mu, \sigma\) are the parameters of the underlying normal distribution. When we do lm(log(y) ~ 1)
, we are modelling \(E(\log(y)) = \beta_0\). So \(\beta_0\) estimates \(\mu\), and since \(\exp(\mu)\) is the median of the lognormal, \(\exp(\beta_0)\) estimates the median. That's easy to check:
median(y)
[1] 1.600898
# Meh, close enough
exp(coef(log_lm))
(Intercept)
1.635723
If I wanted an estimate of the mean of the lognormal, I would need to add \(\sigma^2/2\) to my estimate of \(\mu\).
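For reference, these are the standard lognormal formulas in play: if \(y \sim \operatorname{Lognormal}(\mu, \sigma)\), then

\[
\operatorname{median}(y) = e^{\mu},
\qquad
E(y) = e^{\mu + \sigma^2/2}.
\]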
mean(y)
[1] 1.855038
# Meh, close enough
sigma = var(log_lm$residuals)  # note: this is the residual *variance*, i.e. an estimate of sigma^2
exp(coef(log_lm) + sigma/2)
(Intercept)
1.852594
Ok, onto the glm now. When we use the glm, we model \(\log(E(y)) = \beta_0\), so we model the mean of the lognormal directly. Case in point
mean(y)
[1] 1.855038
exp(coef(glm_mod))
(Intercept)
1.855038
and if I wanted the median, I would need to divide out that same factor of \(\exp(\sigma^2/2)\):
median(y)
[1] 1.600898
exp(coef(glm_mod) - sigma/2)
(Intercept)
1.637881
A log link and a log-transformed outcome are not the same model. Just be sure you know which quantity you're modelling when you use either.