Log Link vs. Log(y)
Published:
You wanna see a little gotcha in statistics? Take the following data
set.seed(0)
N = 1000
y = rlnorm(N, 0.5, 0.5)
and explain why glm(y ~ 1, family = gaussian(link=log)
and lm(log(y)~1)
produce different estimates of the coefficients. In case you don’t have an R terminal, here are the outputs
log_lm = lm(log(y) ~ 1)
summary(log_lm)
#>
#> Call:
#> lm(formula = log(y) ~ 1)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.12328 -0.29604 -0.02781 0.30134 1.20935
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.51133 0.04413 11.59 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4413 on 99 degrees of freedom
glm_mod = glm(y ~ 1 , family = gaussian(link=log))
summary(glm_mod)
#>
#> Call:
#> glm(formula = y ~ 1, family = gaussian(link = log))
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2997 -0.6014 -0.2201 0.4120 3.7464
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.61084 0.04851 12.59 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.7983696)
#>
#> Null deviance: 79.039 on 99 degrees of freedom
#> Residual deviance: 79.039 on 99 degrees of freedom
#> AIC: 264.26
#>
#> Number of Fisher Scoring iterations: 5
Answer is the same as the difference between $E(g(X))$ and $g(E(X))$ which are not always the same. Let me explain.
First, let’s start with the lognormal random variable. $y \sim \operatorname{Lognormal}(\mu, \sigma)$ means $\log(y) \sim \operatorname{Normal}(\mu, \sigma)$. So $\mu, \sigma$ are the parameters of the underlying normal distribution. When we do lm(log(y) ~ 1)
, we are modelling $E(\log(y)) = \beta_0$. So $\beta_0$ is an estimate of $\mu$ and $\exp(\mu)$ is an estimate of the median of the lognormal. That is an easy check
median(y)
>>>1.644421
exp(coef(model))
>>>1.626632
#Meh, close enough
If I wanted an estimate of the mean of the lognormal, I would need to add $\sigma^2/2$ to my estimate of $\mu$.
mean(y)
>>>1.833418
exp(coef(log_lm) + sigma/2)
>>> 1.836362
#Meh, close enough
Ok, onto the glm now. When we use the glm, we model $\log(E(y)) = \beta_0$, so we model the mean of the lognormal directly. Case in point
exp(coef(glm_mod))
>>> 1.833418
and if I wanted the median, I would need to consider the extra factor of $\exp(\sigma^2/2)$
exp(coef(glm_mod) - sigma/2)
>>>1.612436
Log link vs. log outcome can be tricky. Just be sure to know what you’re modelling when you use either.