R
Statistics
Author

Demetri Pananos

Published

October 6, 2020

You wanna see a little gotcha in statistics? Take the following data

set.seed(0)
N = 1000
y = rlnorm(N, 0.5, 0.5)

and explain why glm(y ~ 1, family = gaussian(link = log)) and lm(log(y) ~ 1) produce different estimates of the intercept. In case you don’t have an R terminal handy, here are the outputs

log_lm = lm(log(y) ~ 1)
summary(log_lm)

Call:
lm(formula = log(y) ~ 1)

Residuals:
Min       1Q   Median       3Q      Max
-1.61028 -0.34631 -0.02152  0.35173  1.64112

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.49209    0.01578   31.18   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.499 on 999 degrees of freedom

glm_mod = glm(y ~ 1, family = gaussian(link = log))
summary(glm_mod)

Call:
glm(formula = y ~ 1, family = gaussian(link = log))

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.5282  -0.6981  -0.2541   0.4702   6.5869

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.61791    0.01698    36.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.9918425)

Null deviance: 990.85  on 999  degrees of freedom
Residual deviance: 990.85  on 999  degrees of freedom
AIC: 2832.7

Number of Fisher Scoring iterations: 5

The answer comes down to the difference between $$E(g(X))$$ and $$g(E(X))$$, which are not, in general, equal. Let me explain.
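Here is the gap in miniature, using the same simulated data. Because $$\exp$$ is convex, Jensen’s inequality says $$E(e^X) \geq e^{E(X)}$$, and the two summaries below disagree accordingly (the numbers match the model outputs above).

```r
set.seed(0)
y <- rlnorm(1000, 0.5, 0.5)

# g(E(X)): exponentiate the mean of the logs -- what lm(log(y) ~ 1) targets
exp(mean(log(y)))  # about 1.64

# E(g(X)): the plain mean of y -- what the log-link glm targets
mean(y)            # about 1.86
```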

First, let’s start with the lognormal random variable. $$y \sim \operatorname{Lognormal}(\mu, \sigma)$$ means $$\log(y) \sim \operatorname{Normal}(\mu, \sigma)$$, so $$\mu$$ and $$\sigma$$ are the parameters of the underlying normal distribution. When we do lm(log(y) ~ 1), we are modelling $$E(\log(y)) = \beta_0$$. So $$\beta_0$$ is an estimate of $$\mu$$, and $$\exp(\mu)$$ is an estimate of the median of the lognormal. That is easy to check

median(y)
 1.600898
#Meh, close enough
exp(coef(log_lm))
(Intercept)
1.635723 
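Why does exponentiating recover the median rather than the mean? Because the log is monotone, quantiles pass straight through it:

$$P(y \le e^{\mu}) = P(\log(y) \le \mu) = \frac{1}{2}, \quad \text{so} \quad \operatorname{median}(y) = e^{\mu}.$$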

If I wanted an estimate of the mean of the lognormal, I would need to add $$\sigma^2/2$$ to my estimate of $$\mu$$.

mean(y)
 1.855038
#Meh, close enough
sigma = var(log_lm$residuals)  # note: this is an estimate of sigma^2, not sigma
exp(coef(log_lm) + sigma/2)
(Intercept)
1.852594 
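Where does the $$\sigma^2/2$$ come from? It is the moment generating function of the underlying normal evaluated at $$t = 1$$. Since $$\log(y) \sim \operatorname{Normal}(\mu, \sigma)$$,

$$E(y) = E\left(e^{\log(y)}\right) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)\Bigg|_{t=1} = \exp\left(\mu + \frac{\sigma^2}{2}\right).$$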

Ok, onto the glm now. When we use the glm, we model $$\log(E(y)) = \beta_0$$, so we model the mean of the lognormal directly. Case in point

mean(y)
 1.855038
exp(coef(glm_mod))
(Intercept)
1.855038 

and if I wanted the median, I would need to divide out that same factor of $$\exp(\sigma^2/2)$$

median(y)
 1.600898
exp(coef(glm_mod) - sigma/2)
(Intercept)
1.637881 
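One more way to see why the glm lands right on the sample mean: for an intercept-only gaussian model, the log link changes the parameterisation but not the criterion, so the fit minimises squared error on the original scale. A small sketch (optimize is just standing in for Fisher scoring here):

```r
set.seed(0)
y <- rlnorm(1000, 0.5, 0.5)

# An intercept-only gaussian GLM with log link maximises the normal
# likelihood in y, i.e. it minimises sum((y - exp(b0))^2)
sse <- function(b0) sum((y - exp(b0))^2)
b0_hat <- optimize(sse, interval = c(-5, 5))$minimum

exp(b0_hat)  # recovers mean(y), about 1.855
```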

Log link vs. log outcome can be tricky. Just be sure to know what you’re modelling when you use either.