Log Link vs. Log(y)

2 minute read

Published:

You wanna see a little gotcha in statistics? Take the following data

set.seed(0)
N = 1000
y = rlnorm(N, 0.5, 0.5)

and explain why glm(y ~ 1, family = gaussian(link=log) and lm(log(y)~1) produce different estimates of the coefficients. In case you don’t have an R terminal, here are the outputs


log_lm = lm(log(y) ~ 1)
summary(log_lm)
#> 
#> Call:
#> lm(formula = log(y) ~ 1)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.12328 -0.29604 -0.02781  0.30134  1.20935 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.51133    0.04413   11.59   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4413 on 99 degrees of freedom

glm_mod = glm(y ~ 1 , family = gaussian(link=log))
summary(glm_mod)
#> 
#> Call:
#> glm(formula = y ~ 1, family = gaussian(link = log))
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.2997  -0.6014  -0.2201   0.4120   3.7464  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.61084    0.04851   12.59   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 0.7983696)
#> 
#>     Null deviance: 79.039  on 99  degrees of freedom
#> Residual deviance: 79.039  on 99  degrees of freedom
#> AIC: 264.26
#> 
#> Number of Fisher Scoring iterations: 5

Answer is the same as the difference between $E(g(X))$ and $g(E(X))$ which are not always the same. Let me explain.

First, let’s start with the lognormal random variable. $y \sim \operatorname{Lognormal}(\mu, \sigma)$ means $\log(y) \sim \operatorname{Normal}(\mu, \sigma)$. So $\mu, \sigma$ are the parameters of the underlying normal distribution. When we do lm(log(y) ~ 1), we are modelling $E(\log(y)) = \beta_0$. So $\beta_0$ is an estimate of $\mu$ and $\exp(\mu)$ is an estimate of the median of the lognormal. That is an easy check

median(y)
>>>1.644421

exp(coef(model))
>>>1.626632 

#Meh, close enough

If I wanted an estimate of the mean of the lognormal, I would need to add $\sigma^2/2$ to my estimate of $\mu$.

mean(y)
>>>1.833418

exp(coef(log_lm) + sigma/2)
>>> 1.836362 

#Meh, close enough

Ok, onto the glm now. When we use the glm, we model $\log(E(y)) = \beta_0$, so we model the mean of the lognormal directly. Case in point

exp(coef(glm_mod))
>>> 1.833418 

and if I wanted the median, I would need to consider the extra factor of $\exp(\sigma^2/2)$

exp(coef(glm_mod) - sigma/2)
>>>1.612436 

Log link vs. log outcome can be tricky. Just be sure to know what you’re modelling when you use either.