At times it can be advantageous to simulate data and study HLM models and their behavior. In Edpsy 587, some small simulation studies are presented. In the notes on estimation this was done to show the impact of estimation methods on the results (especially the standard errors of fixed effects). Later we will examine a simulation study where we violate assumptions and examine the impact on various results.
The basic steps covered here are setting parameter values and sample sizes, simulating the data using nested loops, fitting a series of models, and comparing the results. The code could be written more efficiently, but it works and is relatively easy to modify.
Using screenreg from the texreg package can save you a lot of typing. To get output from screenreg into Word (and keep the spacing), use a monospaced font such as Courier.
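Another option is texreg's htmlreg(), which writes an HTML table that Word will open with the spacing intact. A minimal sketch (the file name models.doc is just for illustration, and model.0 and model.1 are fit later in these notes):
library(texreg)
# htmlreg() writes an HTML table; Word opens .doc files containing HTML
htmlreg(list(model.0, model.1), file = "models.doc")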
We will use the same setup in terms of packages as previously used this semester:
library(lme4)
library(texreg)
We will simulate data from a model with a level 1 predictor and a level 2 predictor of the random intercept. In particular, \[ \mbox{Level 1:}\hspace{.5in} Y_{ij} = \beta_{0j} + \beta_{1j}x_{ij} + R_{ij}\] where \(R_{ij} \sim N(0,\sigma^2)\) i.i.d. The variable \(x_{ij}\) is our level one variable. \[\mbox{Level 2:} \hspace{.5in} \beta_{0j} = \gamma_{00} + \gamma_{01}z_{j} + U_{0j} \\ \beta_{1j} = \gamma_{10}\] where \(U_{0j} \sim N(0,\tau_0^2)\) and independent of \(R_{ij}\). The level 2 variable \(z_j\) will be age. The linear mixed model is \[ Y_{ij} = \gamma_{00} + \gamma_{10}x_{ij} + \gamma_{01}\mbox{age}_{j} + U_{0j} + R_{ij}.\] Given our model, we need to set up some parameters and sample sizes. These could be based on previous studies, expected outcomes, the purpose of the study, or other considerations.
N <- 100 # number of clusters/groups (macro units)
nj <- 20 # number of observations for group j
Set the fixed effects parameters:
gamma00 <- 5
gamma01 <- 2
gamma10 <- 3
We also need to set the variance of the random intercept and the level 1 residual variance:
tau2 = 2
sigma2 = 4
We need to save the simulated values so that we can later fit models to them; the following empty matrix will do the trick:
R.intercept <- matrix(NA, nrow = N*nj, ncol = 7)
head(R.intercept)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] NA NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA NA
We also need an index to keep track of the rows of the R.intercept object:
index <- 1
We will simulate the random intercept data using nested loops. Think of how nested data are sampled: we first sample the macro units (e.g., schools) and then, within each macro unit, sample the micro units (e.g., students).
# (Outer loop) Sample the macro units
for (macro in (1:N)) {
  U0j = sqrt(tau2) * rnorm(1)
  age = 50 * runif(1) + 20   # this makes age range between 20 - 70
  # (Inner loop) Within a macro unit, take a sample of micro units
  for (micro in (1:nj)) {
    # Create level 1 explanatory/predictor variable (draw from standard normal)
    xij = rnorm(1)
    # Create level 1 residual with variance sigma2 = 4
    z = rnorm(1)
    Rij = sqrt(sigma2)*z
    # Compute response/outcome variable using the linear mixed model
    # and save the data in R.intercept
    yij = gamma00 + gamma10*xij + gamma01*age + U0j + Rij
    R.intercept[index, 1:7] <- c(macro, micro, yij, age, xij, U0j, Rij)
    index <- index + 1   # update index
  } # --- end inner loop, go to next micro unit.
} # --- end outer loop, go to next macro unit;
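As noted at the start, the loops could be written more efficiently. Here is a minimal vectorized sketch of the same simulation (the .v names and R.int.vec are just for illustration; the loop version above is what the rest of these notes use):
# Vectorized version of the same simulation (no explicit loops)
macro.id <- rep(1:N, each = nj)                    # cluster ids
U0j.v    <- rep(sqrt(tau2) * rnorm(N), each = nj)  # one random intercept per cluster
age.v    <- rep(50 * runif(N) + 20, each = nj)     # one age per cluster, range 20 - 70
xij.v    <- rnorm(N * nj)                          # level 1 predictor
Rij.v    <- sqrt(sigma2) * rnorm(N * nj)           # level 1 residuals
yij.v    <- gamma00 + gamma10*xij.v + gamma01*age.v + U0j.v + Rij.v
R.int.vec <- data.frame(macro = macro.id, micro = rep(1:nj, times = N),
                        yij = yij.v, age = age.v, xij = xij.v,
                        U0j = U0j.v, Rij = Rij.v)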
The final step is to change our data to a data frame and make sure we have good names for all columns.
R.intercept <- as.data.frame(R.intercept)
names(R.intercept) <- c("macro","micro","yij","age","xij","U0j","Rij")
head(R.intercept)
## macro micro yij age xij U0j Rij
## 1 1 1 73.13785 29.61643 1.44879064 2.129078 3.8783213
## 2 1 2 68.95338 29.61643 0.06119613 2.129078 2.4690412
## 3 1 3 68.91799 29.61643 1.41622003 2.129078 -0.2763921
## 4 1 4 65.15200 29.61643 -1.48835421 2.129078 1.7667613
## 5 1 5 69.00530 29.61643 1.05524872 2.129078 0.5328597
## 6 1 6 61.82356 29.61643 -0.51149147 2.129078 -3.5153985
We will fit a series of models here. One will be the "correct" model; that is, the one used to simulate the data. We will also fit one that is simpler, which leaves out the variable that I've called age. The last model is one that overfits the data; that is, it is more complex than how we've simulated the data, because it includes a random slope for xij.
The correct model:
summary( model.1 <- lmer(yij ~ xij + age + (1 | macro),data=R.intercept, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + age + (1 | macro)
## Data: R.intercept
##
## AIC BIC logLik deviance df.resid
## 8635.3 8663.3 -4312.7 8625.3 1995
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2353 -0.6656 0.0076 0.7008 3.2610
##
## Random effects:
## Groups Name Variance Std.Dev.
## macro (Intercept) 2.237 1.496
## Residual 3.850 1.962
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.78153 0.52750 9.065
## xij 1.95593 0.04363 44.829
## age 2.00247 0.01068 187.555
##
## Correlation of Fixed Effects:
## (Intr) xij
## xij -0.005
## age -0.955 0.005
Too simple
summary( model.0 <- lmer(yij ~ xij + (1 | macro),data=R.intercept, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + (1 | macro)
## Data: R.intercept
##
## AIC BIC logLik deviance df.resid
## 9219.9 9242.3 -4605.9 9211.9 1996
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1674 -0.6608 -0.0015 0.7055 3.2345
##
## Random effects:
## Groups Name Variance Std.Dev.
## macro (Intercept) 856.76 29.270
## Residual 3.85 1.962
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 99.29962 2.92737 33.92
## xij 1.95783 0.04371 44.79
##
## Correlation of Fixed Effects:
## (Intr)
## xij 0.000
Too complex
summary( model.2 <- lmer(yij ~ xij + age + (1 + xij | macro), data=R.intercept, REML=FALSE) )
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + age + (1 + xij | macro)
## Data: R.intercept
##
## AIC BIC logLik deviance df.resid
## 8639.3 8678.5 -4312.6 8625.3 1993
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2337 -0.6684 0.0052 0.7005 3.2597
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## macro (Intercept) 2.238e+00 1.495835
## xij 5.426e-05 0.007366 -1.00
## Residual 3.850e+00 1.962052
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.78708 0.52754 9.074
## xij 1.95608 0.04364 44.827
## age 2.00235 0.01068 187.528
##
## Correlation of Fixed Effects:
## (Intr) xij
## xij -0.010
## age -0.955 0.005
## optimizer (nloptwrap) convergence code: 0 (OK)
## unable to evaluate scaled gradient
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
We can more easily compare the results of the models with each other, and with the parameters used to simulate the data, using screenreg:
screenreg(list(model.0, model.1, model.2))
##
## ====================================================================
## Model 1 Model 2 Model 3
## --------------------------------------------------------------------
## (Intercept) 99.30 *** 4.78 *** 4.79 ***
## (2.93) (0.53) (0.53)
## xij 1.96 *** 1.96 *** 1.96 ***
## (0.04) (0.04) (0.04)
## age 2.00 *** 2.00 ***
## (0.01) (0.01)
## --------------------------------------------------------------------
## AIC 9219.89 8635.31 8639.28
## BIC 9242.30 8663.32 8678.49
## Log Likelihood -4605.95 -4312.66 -4312.64
## Num. obs. 2000 2000 2000
## Num. groups: macro 100 100 100
## Var: macro (Intercept) 856.76 2.24 2.24
## Var: Residual 3.85 3.85 3.85
## Var: macro xij 0.00
## Cov: macro (Intercept) xij -0.01
## ====================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
We see that the correct model (model.1) is the best in terms of both AIC and BIC (more on these measures later). Note that the model that is overly complex has \(\tau_{1}^2=0.00\), basically zero.
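If you want a formal comparison of the nested fits, anova() on the lmer objects gives likelihood ratio tests; a quick sketch (note that the test of the random slope variance is on the boundary of the parameter space, so its p-value is conservative):
# Likelihood ratio tests of the nested models
anova(model.0, model.1)   # does adding age improve the fit?
anova(model.1, model.2)   # does adding the random slope improve the fit?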
We’ll use the same sample sizes, fixed effects, and variances as above; however, we also need \(\tau_1^2\) and \(\tau_{01}\), which are set below.
varU1j = 1
covU0jU1j= .4
To get the desired covariance matrix for \(U_{0j}\) and \(U_{1j}\), I use the Cholesky root to get weights for a linear combination that will create random effects with the desired variances and covariance.
(Tau.mtx <- matrix(c(tau2, covU0jU1j, covU0jU1j, varU1j),nrow=2,ncol=2) )
## [,1] [,2]
## [1,] 2.0 0.4
## [2,] 0.4 1.0
w <- chol(Tau.mtx)
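As a quick check, the cross-product of the Cholesky factor should reproduce the covariance matrix:
t(w) %*% w   # should equal Tau.mtx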
We set up an empty matrix to save the data
R.intslope <- matrix(NA, nrow = N*nj, ncol = 8)
and need to set the index back to 1
index <- 1
The process is the same, except we sample an extra random effect; a linear combination of two standard normal draws gives the desired covariance structure.
# (Outer loop) Sample the macro units
for (macro in (1:N)) {
  z1  <- rnorm(1)
  U0j <- w[1,1]*z1
  U1j <- w[1,2]*z1 + w[2,2]*rnorm(1)
  zj  <- 50 * runif(1) + 20   # zj ranges between 20 - 70
  # (Inner loop) Within a macro unit, take a sample of micro units
  for (micro in (1:nj)) {
    xij = rnorm(1)
    z = rnorm(1)
    Rij = sqrt(sigma2)*z
    # Compute response/outcome variable of the linear mixed model and save
    yij = gamma00 + gamma10*xij + gamma01*zj + U0j + U1j*xij + Rij
    R.intslope[index, 1:8] <- c(macro, micro, yij, zj, xij, U0j, U1j, Rij)
    index <- index + 1
  } # <--- end inner loop, go to next micro unit
} # <--- end outer loop, go to next macro unit
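An alternative to the hand-coded Cholesky weights is to draw all of the \((U_{0j}, U_{1j})\) pairs at once from a multivariate normal; a sketch assuming the MASS package is available (U, U0j.all, and U1j.all are illustrative names):
library(MASS)
# Draw N pairs of random effects at once; rows have covariance matrix Tau.mtx
U <- mvrnorm(N, mu = c(0, 0), Sigma = Tau.mtx)
U0j.all <- U[, 1]   # random intercepts, one per macro unit
U1j.all <- U[, 2]   # random slopes, one per macro unit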
Again, we turn the simulated data into a data frame with names indicating what's what:
R.intslope <- as.data.frame(R.intslope)
names(R.intslope) <- c("macro","micro","yij","zj","xij","U0j","U1j","Rij")
Every time you randomly sample values, you'll get different results, unless you set the seed for the random number generator (which I didn't). Just to check, let's see what the actual correlation and covariance matrices of the random effects equal and how close they are to the desired ones.
x <- R.intslope[ , c(6,7,8)]
cor(x)
## U0j U1j Rij
## U0j 1.0000000 0.53799402 -0.02346160
## U1j 0.5379940 1.00000000 -0.04098402
## Rij -0.0234616 -0.04098402 1.00000000
cov(x)
## U0j U1j Rij
## U0j 2.53591932 0.99900008 -0.07592333
## U1j 0.99900008 1.35969255 -0.09711468
## Rij -0.07592333 -0.09711468 4.12952325
The covariance matrix between \(U_{0j}\) and \(U_{1j}\) should be approximately equal to
Tau.mtx
## [,1] [,2]
## [1,] 2.0 0.4
## [2,] 0.4 1.0
The covariance between the \(U_{j}\) and \(R_{ij}\) should be very small and close to zero, and \(\sigma^2\) should be close to 4.
Again we fit models that are too simple, too complex, and just right.
Too simple
summary( simple <- lmer(yij ~ xij + zj + (1 |macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + zj + (1 | macro)
## Data: R.intslope
##
## AIC BIC logLik deviance df.resid
## 9319.6 9347.6 -4654.8 9309.6 1995
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9993 -0.6226 0.0217 0.6314 4.0128
##
## Random effects:
## Groups Name Variance Std.Dev.
## macro (Intercept) 2.337 1.529
## Residual 5.498 2.345
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 5.36971 0.55418 9.689
## xij 2.91226 0.05326 54.681
## zj 1.99421 0.01115 178.856
##
## Correlation of Fixed Effects:
## (Intr) xij
## xij 0.007
## zj -0.957 -0.006
The correct model:
summary( correct <- lmer(yij ~ xij + zj + (1 + xij |macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + zj + (1 + xij | macro)
## Data: R.intslope
##
## AIC BIC logLik deviance df.resid
## 8943.5 8982.7 -4464.8 8929.5 1993
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2141 -0.6414 0.0186 0.6346 3.3782
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## macro (Intercept) 2.388 1.545
## xij 1.337 1.156 0.46
## Residual 4.104 2.026
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 5.49558 0.50901 10.80
## xij 2.94038 0.12539 23.45
## zj 1.99173 0.01016 196.09
##
## Correlation of Fixed Effects:
## (Intr) xij
## xij 0.131
## zj -0.948 -0.002
Too complex
summary( complex <- lmer(yij ~ xij + zj + xij*zj + (1 + xij | macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: yij ~ xij + zj + xij * zj + (1 + xij | macro)
## Data: R.intslope
##
## AIC BIC logLik deviance df.resid
## 8945.5 8990.3 -4464.8 8929.5 1992
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2141 -0.6413 0.0179 0.6343 3.3783
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## macro (Intercept) 2.388 1.545
## xij 1.337 1.156 0.46
## Residual 4.104 2.026
## Number of obs: 2000, groups: macro, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 5.4850390 0.5529452 9.920
## xij 2.9203048 0.4301786 6.789
## zj 1.9919519 0.0111262 179.033
## xij:zj 0.0004227 0.0086648 0.049
##
## Correlation of Fixed Effects:
## (Intr) xij zj
## xij 0.409
## zj -0.957 -0.391
## xij:zj -0.391 -0.957 0.408
Lastly, to make comparisons easier:
screenreg(list(simple, correct, complex))
##
## ====================================================================
## Model 1 Model 2 Model 3
## --------------------------------------------------------------------
## (Intercept) 5.37 *** 5.50 *** 5.49 ***
## (0.55) (0.51) (0.55)
## xij 2.91 *** 2.94 *** 2.92 ***
## (0.05) (0.13) (0.43)
## zj 1.99 *** 1.99 *** 1.99 ***
## (0.01) (0.01) (0.01)
## xij:zj 0.00
## (0.01)
## --------------------------------------------------------------------
## AIC 9319.60 8943.52 8945.52
## BIC 9347.61 8982.73 8990.33
## Log Likelihood -4654.80 -4464.76 -4464.76
## Num. obs. 2000 2000 2000
## Num. groups: macro 100 100 100
## Var: macro (Intercept) 2.34 2.39 2.39
## Var: Residual 5.50 4.10 4.10
## Var: macro xij 1.34 1.34
## Cov: macro (Intercept) xij 0.82 0.82
## ====================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
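To compare the estimated covariance matrix of the random effects with the one used to simulate, you can also extract it directly from the fitted model; a sketch (Tau.hat is an illustrative name):
# Estimated Tau matrix from the correct model vs. the simulation values
(Tau.hat <- VarCorr(correct)$macro)
Tau.mtx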
Some questions for you to answer:
In most simulation studies, the process of simulating data, fitting models, and saving desired results is iterated a large number of times. After finishing all iterations, say \(Iter=100\) or \(500\), the fixed effects are averaged (their standard deviation is an estimate of the standard errors), the variances and covariances are averaged, and so on for other output of interest. You should have a clear goal so that you save all the important information. After running a simulation study, which can take a long time, you don't want to think "I should have saved …". It's better to save too much than too little. Also, when saving data and output, it is often good to write results (and possibly data) to your hard drive at the end of the outer loop, just in case your system crashes.
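Here is a minimal sketch of such a replication study, assuming the random intercept simulation above has been wrapped in a hypothetical helper function sim.one.dataset() that returns one simulated data frame:
library(lme4)
Iter <- 100                                   # number of replications
results <- matrix(NA, nrow = Iter, ncol = 3)
colnames(results) <- c("gamma00", "gamma10", "gamma01")
for (iter in 1:Iter) {
  dat <- sim.one.dataset()                    # hypothetical wrapper around the simulation loops above
  fit <- lmer(yij ~ xij + age + (1 | macro), data = dat, REML = FALSE)
  results[iter, ] <- fixef(fit)               # order: (Intercept), xij, age
  # Write results to disk at the end of each outer iteration in case of a crash
  write.csv(results, "sim-results.csv", row.names = FALSE)
}
colMeans(results)                             # average fixed effects estimates
apply(results, 2, sd)                         # empirical standard errors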