Introduction

At times it can be advantagous to simulate data and study the HLM models and their behavior. In Edpsy 587, some small simulation studies are presented. In the notes on estimation this was done to show the impact of estimation methods on the results (especailly standard errors of fixed effets). Later we will examine a simulation study where we violate assumptions and examine impact on various results.

The basic steps covered here are

  1. Simulate data from an a random intercept model with 2 levels
  2. Fit and compare models fit to random intercept data (too simple, correct model and too complex.
  3. Simulate data from a random intercept and slope model
  4. Fit and comparge models fti to intercept and slope model

These could be writen more efficiently but they work and can be relatively easy to modify

Using screenreg from the texreg package can save you a lot of typing. To get output from texreg into word (and keep spacing, use

Set up

We will use the same set up in terms of packages as previously done this semester

library(lme4)
library(texreg)

Simulate Random Intercept Data

We will simulate data from a model with a level 1 predictor and a level 2 predictor of the random intercept. In particular, \[ \mbox{Level 1:}\hspace{.5in} Y_{ij} = \beta_{0j} + \beta_{1j}x_{ij} + R_{ij}\] where \(R_{ij} \sim N(0,\sigma^2) i.i.d\). The variable \(x_{ij}\) is our level one variable. \[\mbox{Level 2:} \hspace{.5in} \beta_{0j} = \gamma_{00} + \gamma_{01}z_{j} + U_{0j} \\ \beta_{1j} = \gamma_{10}\] where \(U_{0j} \sim N(0,\tau_0^2)\) and independent of \(R_{ij}\). The linear mixed model is \[ Y_{ij} = \gamma_{00} + \gamma_{10} + \gamma_{01}\mbox{age}_{j} + U_{0j} + R_{ij}.\] Given our model, we need to set up some parameters and sample sized. These could be based on previous studies, expected outcomes, the purpose of the study, or other considerations.

N <-  100         # number of clusters/groups = 100 * 50
nj <- 20          # number of observations for group j

The linear mixed model we are Set fixed effects parameters,

gamma00 <- 5
gamma01 <- 2
gamma10 <- 3

We also need to set variance of random intercept, slope, covariance, and level 1 residual variances

tau2 = 2
sigma2 = 4

We need to save the simulated values so that we can later fit models to it, the following null matrix will do the trick

R.intercept <- matrix(,nrow=N*nj,ncol=7)
head(R.intercept)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]   NA   NA   NA   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA   NA   NA
## [6,]   NA   NA   NA   NA   NA   NA   NA

We also need and index to keep track of rows of the R.intercept object

index <- 1

We will simulate the random intercept data using multiple nexted loops. Think of the sampling of nested data, we first sample marco (e.g., schools) and then within the macro units, sample the micro units (e.g., students)

# (Outer Loop) Sample the marco units
    for (macro in (1:N)) {
      
     U0j= sqrt(tau2) * rnorm(1) 
     age=50 * runif(1) + 20     # this makes age range between 20 - 70
     
#   (Inner loop) Within a marco unit, take a sample of micro units;
       for (micro in (1:nj)) {

# Create level 1 explanatory/predictor variable (draw from standard normal)
        xij = rnorm(1) 

# Create level 1 residual with sigma2= 2    
        z = rnorm(1)   
        Rij=sqrt(sigma2)*z;  

# Computed response/outcome variable using linear mixed model 
#  and save the data in R.intercept
        yij = gamma00 + gamma01*xij + gamma01*age + U0j + Rij
        R.intercept[index,1:7] <- c(macro, micro, yij, age, xij, U0j, Rij)
          index <- index + 1   # up-data index
     }  # --- end inner loop, go to next micro unit. 

} # --- end outer loop, go to next macro unit;

The final step is to change our data to a data frame and make sure we have good names for all columns.

R.intercept <- as.data.frame(R.intercept)
names(R.intercept) <- c("macro","micro","yij","age","xij","U0j","Rij")
head(R.intercept)
##   macro micro      yij      age         xij      U0j        Rij
## 1     1     1 73.13785 29.61643  1.44879064 2.129078  3.8783213
## 2     1     2 68.95338 29.61643  0.06119613 2.129078  2.4690412
## 3     1     3 68.91799 29.61643  1.41622003 2.129078 -0.2763921
## 4     1     4 65.15200 29.61643 -1.48835421 2.129078  1.7667613
## 5     1     5 69.00530 29.61643  1.05524872 2.129078  0.5328597
## 6     1     6 61.82356 29.61643 -0.51149147 2.129078 -3.5153985

Fit Models to Simulated Random Intercept Data

We will fit a series of models here. One will the “correct” model; that is, the one used to simulate data. We will also fit one that is simpler that levels out out variable that I’ve called age. The last models is on that overfits the data; that is, it is more complex that how we’ve siumlated data. The last models includes a random slope

The correct model:

summary( model.1 <- lmer(yij ~ xij + age + (1 | macro),data=R.intercept, REML=FALSE) )
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + age + (1 | macro)
##    Data: R.intercept
## 
##      AIC      BIC   logLik deviance df.resid 
##   8635.3   8663.3  -4312.7   8625.3     1995 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2353 -0.6656  0.0076  0.7008  3.2610 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  macro    (Intercept) 2.237    1.496   
##  Residual             3.850    1.962   
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  4.78153    0.52750   9.065
## xij          1.95593    0.04363  44.829
## age          2.00247    0.01068 187.555
## 
## Correlation of Fixed Effects:
##     (Intr) xij   
## xij -0.005       
## age -0.955  0.005

Too simple

summary( model.0 <- lmer(yij ~ xij  + (1 | macro),data=R.intercept, REML=FALSE) )
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + (1 | macro)
##    Data: R.intercept
## 
##      AIC      BIC   logLik deviance df.resid 
##   9219.9   9242.3  -4605.9   9211.9     1996 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1674 -0.6608 -0.0015  0.7055  3.2345 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  macro    (Intercept) 856.76   29.270  
##  Residual               3.85    1.962  
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 99.29962    2.92737   33.92
## xij          1.95783    0.04371   44.79
## 
## Correlation of Fixed Effects:
##     (Intr)
## xij 0.000

Too complex

summary( model.2 <- lmer(yij ~ xij + age + (1 + xij | macro), data=R.intercept, REML=FALSE) )
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + age + (1 + xij | macro)
##    Data: R.intercept
## 
##      AIC      BIC   logLik deviance df.resid 
##   8639.3   8678.5  -4312.6   8625.3     1993 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2337 -0.6684  0.0052  0.7005  3.2597 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr 
##  macro    (Intercept) 2.238e+00 1.495835      
##           xij         5.426e-05 0.007366 -1.00
##  Residual             3.850e+00 1.962052      
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  4.78708    0.52754   9.074
## xij          1.95608    0.04364  44.827
## age          2.00235    0.01068 187.528
## 
## Correlation of Fixed Effects:
##     (Intr) xij   
## xij -0.010       
## age -0.955  0.005
## optimizer (nloptwrap) convergence code: 0 (OK)
## unable to evaluate scaled gradient
## Model failed to converge: degenerate  Hessian with 1 negative eigenvalues

We can more easily compare the results of the models using `with each other and with the parameters used to simulate

screenreg(list(model.0, model.1, model.2))
## 
## ====================================================================
##                             Model 1       Model 2       Model 3     
## --------------------------------------------------------------------
## (Intercept)                    99.30 ***      4.78 ***      4.79 ***
##                                (2.93)        (0.53)        (0.53)   
## xij                             1.96 ***      1.96 ***      1.96 ***
##                                (0.04)        (0.04)        (0.04)   
## age                                           2.00 ***      2.00 ***
##                                              (0.01)        (0.01)   
## --------------------------------------------------------------------
## AIC                          9219.89       8635.31       8639.28    
## BIC                          9242.30       8663.32       8678.49    
## Log Likelihood              -4605.95      -4312.66      -4312.64    
## Num. obs.                    2000          2000          2000       
## Num. groups: macro            100           100           100       
## Var: macro (Intercept)        856.76          2.24          2.24    
## Var: Residual                   3.85          3.85          3.85    
## Var: macro xij                                              0.00    
## Cov: macro (Intercept) xij                                 -0.01    
## ====================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

We see that the correct models (model.1) is the best in terms of both AIC and BIC (more on these measures later). Note that the model that is overly complext have \(\tau_{1}^2=0.00\), basically zero.

Simulate Random Intercept and Slope Model

We’ll use the same sample sizes, fixed effects, and variances as above; however, we also need \(\tau_1^2\) and \(\tau_{01}\), which are set below.

varU1j = 1
covU0jU1j= .4

To get desired covariance matrix for U0j and U1j, I find the use cholskey root to get weights for a linear combination that will create random effects that have the desired variance and covariance.

(Tau.mtx <- matrix(c(tau2, covU0jU1j, covU0jU1j, varU1j),nrow=2,ncol=2) )
##      [,1] [,2]
## [1,]  2.0  0.4
## [2,]  0.4  1.0
w <- chol(Tau.mtx)

We set up a null matrix to save data

R.intslope <- matrix( , nrow=N*nj, ncol=8)

and need to set the index back to 1

index <- 1

The process is the same, except we sample an extra random effect and take a linear combination of them to get the desired covariance

# (Outer Loop) Sample the marco units
    for (macro in (1:N)) {
      U0j <- w[1,1]*rnorm(1)
      U1j <- w[1,2]*U0j + w[2,2]*rnorm(1) 
      zj = 50 * runif(1) + 20      

# (Inner loop) Within a marco unit, take a sample of micro units;
       for (micro in (1:nj)) { 
         xij=rnorm(1)
         z=rnorm(1);   
         Rij=sqrt(sigma2)*z;      

# Computed response/outcome variable of the linear mixed model and save     
        yij = gamma00 + gamma10*xij + gamma01*zj + U0j + + U1j*xij + Rij
          R.intslope[index,1:8] <- c(macro,micro,yij,zj,xij,U0j,U1j,Rij)
        index <- index + 1
     }  # <--- end inner loop, go to next micro unit
} # <--- end outer loop, go to next macro unit

Again, we turn the simulated data into a data frame with names of what’s what

R.intslope <- as.data.frame(R.intslope)
names(R.intslope) <- c("macro","micro","yij","zj","xij","U0j","U1j","Rij")

Every time you randomly sample values, you’ll get different results, unless you set the seed for the random number generator (which I didn’t). Just to check on what the actual correlation matrix of random effects equals and how close it is to the desried one.

x <- R.intslope[ , c(6,7,8)]
cor(x)
##            U0j         U1j         Rij
## U0j  1.0000000  0.53799402 -0.02346160
## U1j  0.5379940  1.00000000 -0.04098402
## Rij -0.0234616 -0.04098402  1.00000000
cov(x)
##             U0j         U1j         Rij
## U0j  2.53591932  0.99900008 -0.07592333
## U1j  0.99900008  1.35969255 -0.09711468
## Rij -0.07592333 -0.09711468  4.12952325

The covariance matrix between \(U_{0j}\) and \(U_{1j}\) should be approximately equal to

Tau.mtx
##      [,1] [,2]
## [1,]  2.0  0.4
## [2,]  0.4  1.0

The covariance bewteen the \(U_{j}\) and \(R_{ij}\) should be very small and close to zero, and \(\sigma\) should be close to 4.

Fitting Random Intercept and Slope Models

Again we fit models that are too simple, too complex, and just right.

Too simple

summary( simple <- lmer(yij ~ xij + zj + (1  |macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + zj + (1 | macro)
##    Data: R.intslope
## 
##      AIC      BIC   logLik deviance df.resid 
##   9319.6   9347.6  -4654.8   9309.6     1995 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9993 -0.6226  0.0217  0.6314  4.0128 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  macro    (Intercept) 2.337    1.529   
##  Residual             5.498    2.345   
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  5.36971    0.55418   9.689
## xij          2.91226    0.05326  54.681
## zj           1.99421    0.01115 178.856
## 
## Correlation of Fixed Effects:
##     (Intr) xij   
## xij  0.007       
## zj  -0.957 -0.006

Correct model.extract

summary( correct <- lmer(yij ~ xij + zj + (1 + xij |macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + zj + (1 + xij | macro)
##    Data: R.intslope
## 
##      AIC      BIC   logLik deviance df.resid 
##   8943.5   8982.7  -4464.8   8929.5     1993 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2141 -0.6414  0.0186  0.6346  3.3782 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  macro    (Intercept) 2.388    1.545        
##           xij         1.337    1.156    0.46
##  Residual             4.104    2.026        
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  5.49558    0.50901   10.80
## xij          2.94038    0.12539   23.45
## zj           1.99173    0.01016  196.09
## 
## Correlation of Fixed Effects:
##     (Intr) xij   
## xij  0.131       
## zj  -0.948 -0.002

Too complex,

summary( complex <- lmer(yij ~ xij + zj + xij*zj + (1 + xij  | macro), data=R.intslope, REML=FALSE) )
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: yij ~ xij + zj + xij * zj + (1 + xij | macro)
##    Data: R.intslope
## 
##      AIC      BIC   logLik deviance df.resid 
##   8945.5   8990.3  -4464.8   8929.5     1992 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2141 -0.6413  0.0179  0.6343  3.3783 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  macro    (Intercept) 2.388    1.545        
##           xij         1.337    1.156    0.46
##  Residual             4.104    2.026        
## Number of obs: 2000, groups:  macro, 100
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 5.4850390  0.5529452   9.920
## xij         2.9203048  0.4301786   6.789
## zj          1.9919519  0.0111262 179.033
## xij:zj      0.0004227  0.0086648   0.049
## 
## Correlation of Fixed Effects:
##        (Intr) xij    zj    
## xij     0.409              
## zj     -0.957 -0.391       
## xij:zj -0.391 -0.957  0.408

Lastly to mare comparison easer,

screenreg(list(simple, correct, complex))
## 
## ====================================================================
##                             Model 1       Model 2       Model 3     
## --------------------------------------------------------------------
## (Intercept)                     5.37 ***      5.50 ***      5.49 ***
##                                (0.55)        (0.51)        (0.55)   
## xij                             2.91 ***      2.94 ***      2.92 ***
##                                (0.05)        (0.13)        (0.43)   
## zj                              1.99 ***      1.99 ***      1.99 ***
##                                (0.01)        (0.01)        (0.01)   
## xij:zj                                                      0.00    
##                                                            (0.01)   
## --------------------------------------------------------------------
## AIC                          9319.60       8943.52       8945.52    
## BIC                          9347.61       8982.73       8990.33    
## Log Likelihood              -4654.80      -4464.76      -4464.76    
## Num. obs.                    2000          2000          2000       
## Num. groups: macro            100           100           100       
## Var: macro (Intercept)          2.34          2.39          2.39    
## Var: Residual                   5.50          4.10          4.10    
## Var: macro xij                                1.34          1.34    
## Cov: macro (Intercept) xij                    0.82          0.82    
## ====================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Some questions for you to answer:

Final Remarks

In most stimulation studies, the process of simulating data, fiting models, and saving desired results is iterated a large number of times. After finishing all iterations, say \(Iter=100\) or \(500\), the fixed effects are averaged (their standard deviation is an estimate of the standard errors), varainces and covariances are averaged, and other desirable output. You should have a clear goal so that you saved all the important information. After running a simulation study, which can take a long time, you don’t want to think “I should have saved …”. It’s better to save more than less. Also, when saving data and output, it is often good to write results (and possibly) data to your hard drive at the end of the outer loop just in case your system crashes.