set.seed(0)
N = 1000
y = rlnorm(N, 0.5, 0.5)
You wanna see a little gotcha in statistics? Take the following data
and explain why glm(y ~ 1, family = gaussian(link = log))
and lm(log(y)~1)
produce different estimates of the coefficients. In case you don’t have an R terminal, here are the outputs
log_lm = lm(log(y) ~ 1)
summary(log_lm)
Call:
lm(formula = log(y) ~ 1)
Residuals:
Min 1Q Median 3Q Max
-1.61028 -0.34631 -0.02152 0.35173 1.64112
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.49209 0.01578 31.18 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.499 on 999 degrees of freedom
glm_mod = glm(y ~ 1, family = gaussian(link = log))
summary(glm_mod)
Call:
glm(formula = y ~ 1, family = gaussian(link = log))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5282 -0.6981 -0.2541 0.4702 6.5869
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.61791 0.01698 36.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.9918425)
Null deviance: 990.85 on 999 degrees of freedom
Residual deviance: 990.85 on 999 degrees of freedom
AIC: 2832.7
Number of Fisher Scoring iterations: 5
The answer boils down to the difference between \(E(g(X))\) and \(g(E(X))\), which are not, in general, equal. Let me explain.
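To make that concrete, here's a quick standalone simulation (fresh normal draws, not the y from above) showing that \(\exp(E(X)) \le E(\exp(X))\), i.e. Jensen's inequality for the convex function \(\exp\):

```r
# Jensen's inequality in action: for the convex function g = exp,
# g(E(X)) <= E(g(X)). This is why back-transforming a mean computed
# on the log scale does not give the mean on the original scale.
set.seed(1)
x <- rnorm(1e5, mean = 0.5, sd = 0.5)
exp(mean(x))  # g(E(X)): close to exp(0.5), about 1.65
mean(exp(x))  # E(g(X)): close to exp(0.5 + 0.5^2/2), about 1.87
```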
First, let’s start with the lognormal random variable. \(y \sim \operatorname{Lognormal}(\mu, \sigma)\) means \(\log(y) \sim \operatorname{Normal}(\mu, \sigma)\). So \(\mu, \sigma\) are the parameters of the underlying normal distribution. When we do lm(log(y) ~ 1)
, we are modelling \(E(\log(y)) = \beta_0\). So \(\beta_0\) estimates \(\mu\), and since \(\exp(\mu)\) is the median of the lognormal, \(\exp(\beta_0)\) estimates the median. That's easy to check:
median(y)
[1] 1.600898
# Meh, close enough
exp(coef(log_lm))
(Intercept)
1.635723
If I wanted an estimate of the mean of the lognormal, I would need to add \(\sigma^2/2\) to my estimate of \(\mu\).
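For reference, these are the standard lognormal formulas in play: if \(y \sim \operatorname{Lognormal}(\mu, \sigma)\), then

\[
\operatorname{median}(y) = e^{\mu},
\qquad
E(y) = e^{\mu + \sigma^2/2}.
\]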
mean(y)
[1] 1.855038
# Meh, close enough
sigma = var(log_lm$residuals)  # note: this is the residual *variance*, i.e. an estimate of sigma^2
exp(coef(log_lm) + sigma/2)
(Intercept)
1.852594
Ok, onto the glm now. When we use the glm, we model \(\log(E(y)) = \beta_0\), so we model the mean of the lognormal directly. Case in point
mean(y)
[1] 1.855038
exp(coef(glm_mod))
(Intercept)
1.855038
and if I wanted the median, I would need to divide out that same factor of \(\exp(\sigma^2/2)\):
median(y)
[1] 1.600898
exp(coef(glm_mod) - sigma/2)
(Intercept)
1.637881
A log link and a log-transformed outcome are not the same model. Just be sure you know which quantity you're modelling when you use either.