Contents
Overview
The ml_lvm family implements two-level random-intercept latent variable models for clustered data — for example, students nested in schools, employees nested in teams, or repeated measures nested in persons (when temporal ordering is not of interest; for panel models with dynamics, see the DLVM1 family).
The model decomposes each observation into a within-cluster component (how units deviate from their cluster average) and a between-cluster component (how cluster averages differ from each other), each with its own measurement model and latent structure. As everywhere in psychonetrics, every latent and residual covariance matrix on either level can independently be parameterized as a Gaussian Graphical Model.
"y.1") are no longer silently dropped. See What's New.
The Two-Level Model
For unit $p$ in cluster $c$, the observed response vector is decomposed as:
$$\boldsymbol{y}_{pc} = \boldsymbol{\nu} + \boldsymbol{\Lambda}\left(\boldsymbol{\eta}^{(B)}_{c} + \boldsymbol{\eta}^{(W)}_{pc}\right) + \boldsymbol{\varepsilon}^{(B)}_{c} + \boldsymbol{\varepsilon}^{(W)}_{pc}$$
- $\boldsymbol{\Lambda}$ (
lambda): factor loadings, shared across levels. - $\boldsymbol{\eta}^{(W)}_{pc}$, $\boldsymbol{\eta}^{(B)}_{c}$: within- and between-cluster latent variables, with covariance structures controlled by
within_latentandbetween_latent("cov","chol","prec", or"ggm"). - $\boldsymbol{\varepsilon}^{(W)}$, $\boldsymbol{\varepsilon}^{(B)}$: residuals on each level, controlled by
within_residualandbetween_residual. - Structural regressions are possible on both levels via
beta_withinandbeta_between.
Key matrices include sigma_zeta_within / sigma_zeta_between (or omega_zeta_within / omega_zeta_between for GGM parameterizations) and sigma_epsilon_within / sigma_epsilon_between. Estimation handles unbalanced cluster sizes naturally, and as of version 0.16 a fast sufficient-statistics two-level ML estimator is used by default for complete data.
Wrapper Functions
| Function | What it sets | Description |
|---|---|---|
ml_lvm() | user chooses | Base function for two-level latent variable models |
ml_lnm() | between_latent = "ggm" | Multi-level latent network model |
ml_rnm() | between_residual = "ggm" | Multi-level residual network model |
ml_lrnm() | both "ggm" | Combined latent + residual network model |
The cluster membership is supplied via the clusters argument (a column name in the data).
Example: Two-Level CFA on Simulated Data
We simulate data for 60 clusters of 10 units each, with two latent factors measured by six indicators on both levels, and recover the structure with ml_lvm(). The example is fully reproducible and runs in a few seconds:
library("psychonetrics")
library("dplyr")
# --- Simulate two-level data ---
set.seed(2)
nCluster <- 60; perCluster <- 10; nv <- 6
cl <- rep(seq_len(nCluster), each = perCluster)
# True loadings (2 factors, 3 indicators each):
loadings <- matrix(0, nv, 2)
loadings[1:3, 1] <- c(1, .8, .6)
loadings[4:6, 2] <- c(1, .8, .6)
# Between-cluster latents, within latents, and residuals on both levels:
eta_b <- matrix(rnorm(nCluster * 2, 0, sqrt(0.5)), nCluster, 2)
eta_w <- matrix(rnorm(nCluster * perCluster * 2), nCluster * perCluster, 2)
u_b <- matrix(rnorm(nCluster * nv, 0, sqrt(0.3)), nCluster, nv)
Y <- (eta_b[cl, ] + eta_w) %*% t(loadings) + u_b[cl, ] +
matrix(rnorm(nCluster * perCluster * nv, 0, sqrt(0.5)), ncol = nv)
Data <- as.data.frame(Y)
colnames(Data) <- paste0("y", 1:nv)
Data$cluster <- cl
# --- Fit the two-level CFA ---
Lambda <- matrix(0, nv, 2)
Lambda[1:3, 1] <- 1
Lambda[4:6, 2] <- 1
mod_ml <- ml_lvm(Data, lambda = Lambda, clusters = "cluster",
vars = paste0("y", 1:nv)) %>% runmodel
mod_ml %>% fit
# Latent covariances on both levels:
round(getmatrix(mod_ml, "sigma_zeta_between"), 2)
round(getmatrix(mod_ml, "sigma_zeta_within"), 2)
Output (abbreviated)
The model fits well (χ²(20) = 15.92, p = 0.72) and recovers the simulated latent variances on both levels (between ≈ 0.5 scaled by the marker loading, within ≈ 1).
Estimation: two-level ML vs FIML 0.16
Historically ml_lvm() estimated the model by full-information maximum likelihood (FIML) on the wide (one row per cluster) data layout. That is exact, but it scales poorly: each cluster contributes a likelihood term whose dimension grows with its size, so large clusters are expensive. Version 0.16 adds a sufficient-statistics two-level ML estimator (estimator = "ML"; McDonald & Goldstein, 1989; Muthen, 1990) that optimizes the numerically identical objective but works from the within- and between-cluster sufficient statistics, with full C++ standard errors and modification indices.
"ML" and "FIML" here are two ways of computing the same exact two-level maximum-likelihood fit; they converge to identical estimates, standard errors and fit (the objective values coincide to ~1e-13). This is not the approximate “MUML” / pseudo-balanced estimator — the sufficient-statistics evaluation is exact, including for unbalanced cluster sizes. The "ML"/"FIML" labels follow the convention used everywhere in psychonetrics ("ML" fits to summary statistics, "FIML" fits row by row to the raw data); they describe the computational route, not complete-versus-missing data — both routes can handle within-cluster missingness.
The estimator argument now defaults to "default", which picks "ML" for complete data whose largest cluster exceeds 5 units, and "FIML" otherwise. Explicit estimator = "FIML" reproduces previous results exactly.
# Same simulation as above, but with larger clusters (perCluster <- 18):
mod_ml <- ml_lvm(Data, lambda = Lambda, clusters = "cluster",
vars = paste0("y", 1:nv), estimator = "ML") %>% runmodel
mod_fiml <- ml_lvm(Data, lambda = Lambda, clusters = "cluster",
vars = paste0("y", 1:nv), estimator = "FIML") %>% runmodel
mod_ml %>% fit
mod_fiml %>% fit
Output
The two estimators reach the same fit; "ML" is about 3× faster here, and the gap widens sharply as clusters grow (the NEWS reports a 100-cluster model with cluster sizes 18–25 fitting in under a second versus ~90 seconds with FIML).
lavaan::sem(…, missing = "ml")). For other missing-data patterns, use estimator = "FIML". Note that sample-size-dependent measures (RMSEA, BIC) for the two-level ML estimator use the number of clusters J, matching lavaan's two-level SEM, so they can differ slightly from the FIML values, while chi-square, df, CFI and TLI agree.
Network Variants
Replacing a covariance structure with a GGM turns the corresponding level into a network model. For example, ml_lnm() models the between-cluster latent structure as a partial-correlation network:
# Multi-level latent network model:
mod_mlnm <- ml_lnm(Data, lambda = Lambda, clusters = "cluster",
vars = paste0("y", 1:nv)) %>% runmodel
# Within-level latent network (partial correlations):
round(getmatrix(mod_mlnm, "omega_zeta_within"), 2)
# Prune non-significant network edges as usual:
mod_mlnm %>% prune(alpha = 0.05)
The same prune() / stepup() / compare() workflow applies as in all other psychonetrics families.
Summary
ml_lvm()fits two-level random-intercept latent variable models for clustered data; a fast sufficient-statistics two-level ML estimator (default) optimizes the same objective as FIML but much faster for large clusters- Within-cluster and between-cluster structures each get their own latent and residual covariance matrices
- Any of these matrices can be parameterized as a network (GGM):
ml_lnm(),ml_rnm(),ml_lrnm() - Cluster membership is a single column passed via
clusters - For clustered longitudinal data with temporal dynamics, use the DLVM1 / panel family instead