Examples

Real data

Load dataset

The cognitive dataset contains data from a study of the effect of an intervention in school lunches among schools in Kenya, accessed via the R package splmm. The data are longitudinal, with measurements of students' performance on various tests taken at different points in time. We will fit a model with random intercepts and random growth slopes for each student. Note that while this is a low-dimensional example ($p < n$), the algorithm that this package implements was designed and tested with the high-dimensional use case ($p > n$) in mind.

First, we load the data into Julia and form a categorical variable for the treatment in the study, which was the type of lunch served (assigned at the school level).

using CSV
using DataFrames
using CategoricalArrays
cog_df = CSV.read("../data/cognitive.csv", DataFrame)
# form categorical variable for treatment
cog_df.treatment = categorical(cog_df.treatment, levels=["control", "calorie", "meat", "milk"])

Extract model matrices, cluster ids, and response vector

Next we form model matrices with the help of the StatsModels formula syntax:

using StatsModels
f = @formula(ravens ~ 1 + year + treatment + sex + age_at_time0 +
                      height + weight + head_circ + ses + mom_read + mom_write + mom_edu)
model_frame = ModelFrame(f, cog_df)
model_mat = ModelMatrix(model_frame).m

We form two model matrices. One is low dimensional and includes features we do not wish to penalize, and the other is higher dimensional and includes the many features whose effects will be regularized.

X = model_mat[:, 1:2] # Non-penalized, random effect columns (one for intercept, and the other for year)
G = model_mat[:, 3:end] # High dimensional set of covariates whose effects are regularized

Finally, we get the cluster (in this case, student) ids and the response, the students' Ravens test scores.

student_id = cog_df.id
y = cog_df.ravens

Fitting model

The main function in the package is hdmm(), which accepts the two design matrices, the response, and the group id as required positional arguments, and returns the fitted model:

using HighDimMixedModels
fit = hdmm(X, G, y, student_id)
HDMModel fit with 1562 observations
Log-likelihood at convergence: -3769.5
Random effect covariance matrix:
2×2 Matrix{Float64}:
 1.99509  0.0
 0.0      0.359648
Estimated 6 non-zero fixed effects:
─────────────
     Estimate
─────────────
1    8.65358
2    1.07854
4    0.159416
5   -0.119262
10   0.170265
13   0.019581
─────────────
Estimated σ²: 5.7488

The function has a number of keyword arguments that can be specified to modify the defaults; see the documentation. For example, by default, we fit a model with the SCAD penalty, which produces less biased estimates of the large coefficients. To apply the LASSO penalty instead, specify penalty = "lasso" in the call. Also note that the default value of $\lambda$ (the hyperparameter that controls the degree of penalization) is 10. Since the design matrices are standardized before the algorithm runs (unless standardize = false), this is the amount by which the coefficients of the standardized predictors are penalized. In practice, you can and should fit the model for several different choices of $\lambda$ and choose the fit with the lowest value of fit.BIC.
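
For instance, a simple way to do this is to fit the model over a small grid of $\lambda$ values and keep the fit with the smallest BIC. The grid below is arbitrary and only meant as a sketch; a sensible range will depend on your data:

# Fit the model for several penalty strengths and keep the fit with the lowest BIC
λ_grid = [10, 25, 50, 75, 100]  # illustrative values only
fits = [hdmm(X, G, y, student_id; λ = λ) for λ in λ_grid]
best_fit = fits[argmin([f.BIC for f in fits])]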

By default, the features that are assigned random slopes are all those that appear as columns in the matrix X, i.e. those features whose coefficients are not penalized. If you wish to leave a feature's coefficient unpenalized but do not wish to assign that feature a random slope, then specify the argument Z in the call to hdmm to be a matrix whose columns contain only the variables in X that should receive random slopes.
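
For example, to keep both the intercept and year unpenalized but assign a random effect only to the intercept, you could pass a one-column matrix for Z. This is only a sketch; check the hdmm docstring for whether Z is supplied positionally (after the group ids) or as a keyword argument in your version of the package:

# Random intercepts only: Z contains just the column of 1s from X
Z = X[:, 1:1]
fit_ri = hdmm(X, G, y, student_id, Z)
# if Z is instead a keyword argument, use: hdmm(X, G, y, student_id; Z = Z)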

Inspecting model

The object fit, which is returned by hdmm(), is a struct with fields providing all relevant information about the model fit. These can be accessed using dot notation, e.g. fit.fixef to retrieve all the fixed effect estimates (including those set to 0) and fit.log_like to get the log likelihood at the estimates. To print all the fields stored in the object, type fit. followed by the tab key, or check the documentation for the struct.
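
For example, using fields mentioned in this documentation:

fit.log_like          # log-likelihood at the estimates
fit.BIC               # BIC, useful for comparing fits across choices of λ
sum(fit.fixef .!= 0)  # number of fixed effects not set to 0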

Several model inspection functions from StatsBase.jl are also available, such as residuals(fit) and fitted(fit). Note that these fitted values and residuals take into account the random effects by incorporating the best linear unbiased predictions (BLUPs) of the random effects for each student into the predictions.
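
As a quick sanity check (a minimal sketch using only the functions just described), the fitted values and residuals should add back up to the observed responses:

ŷ = fitted(fit)               # predictions incorporating the BLUPs of the random effects
r = residuals(fit)            # observed responses minus these predictions
maximum(abs.(y .- (ŷ .+ r)))  # should be (numerically) zero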

To print a table with only the selected coefficients (i.e. those that are not set to 0), use the function coeftable(). The names of these variables will appear alongside their estimates if you pass them as a second argument:

coeftable(fit, coefnames(model_frame))
────────────────────────────────
                        Estimate
────────────────────────────────
(Intercept)             8.65358
year                    1.07854
treatment: meat         0.159416
treatment: milk        -0.119262
head_circ               0.170265
mom_write               0.019581
────────────────────────────────

Plotting model

Here, we show how to plot the observed scores and our model's predictions for five different students over time:

using Plots
mask = student_id .== 1
plot(cog_df.year[mask], cog_df.ravens[mask], seriestype = :scatter, label = "student 1", color = 1 )
plot!(cog_df.year[mask], fitted(fit)[mask], seriestype = :line, color = 1, linestyle = :dash, linewidth = 3, label = "")
for i in [2,4,5,6]
    mask = student_id .== i
    # add student to plot
    plot!(cog_df.year[mask], cog_df.ravens[mask], seriestype = :scatter, label = "student $i", color = i)
    plot!(cog_df.year[mask], fitted(fit)[mask], seriestype = :line, color = i, linestyle = :dash, linewidth = 3, label = "")
end
plot!(legend=:outerbottom, legendcolumns=3, xlabel = "Year", ylabel = "Ravens Score")
[Figure: observed Ravens scores (points) and model predictions (dashed lines) for five students over time]

Simulated data

We first generate two design matrices. The first, X, is low dimensional and consists of the predictors that will have associated random effects; it includes a column of constant 1s so that the model has random intercepts. The second, G, is high dimensional and includes the predictors whose effects we penalize.

In addition, we generate a sparse vector of regression coefficients, $\beta$, whose non-zero components are centered at 0 and have a standard deviation of 1.

using Random
Random.seed!(420)
g, n, p, q = 100, 5, 1000, 4 # #groups, #samples per group, #features, #features with random effect (including random intercept)
N = n*g  # Total sample size
X = [ones(N) randn(N, q-1)]
G = randn(N, p-q)
XG = [X G]
sd_signal = 1
β = [sd_signal .* randn(10); zeros(p-10)]

Now we generate the response from these design matrices

using Distributions
using LinearAlgebra
# Generate group assignments
gr = string.(vcat([fill(i, n) for i in 1:g]...))

# Generate random effects
ψ = Diagonal(1:4) #random effects covariances
dist_b = MvNormal(zeros(q), ψ)
b = rand(dist_b, g)

# Generate response
y_fixed = XG * β
y = Vector{Float64}(undef, N)
for (i, group) in enumerate(unique(gr))
    group_ind = (gr .== group)
    nᵢ = sum(group_ind)
    Xᵢ = X[group_ind,:]
    bᵢ = b[:,i]
    yᵢ = Xᵢ*bᵢ + randn(nᵢ)
    y[group_ind] = y_fixed[group_ind] + yᵢ
end

We can now fit the model with our package. The default is to use the SCAD penalty:

using HighDimMixedModels
out_scad = hdmm(X, G, y, gr; λ = 50)
HDMModel fit with 500 observations
Log-likelihood at convergence: -1121.1
Random effect covariance matrix:
4×4 Matrix{Float64}:
 0.597691  0.0      0.0      0.0
 0.0       2.63586  0.0      0.0
 0.0       0.0      3.29354  0.0
 0.0       0.0      0.0      2.69272
Estimated 8 non-zero fixed effects:
─────────────
     Estimate
─────────────
1   1.20002
2   1.18598
3   2.02174
4   0.69393
5   0.0698426
6   0.175662
7  -1.35678
8  -0.128023
─────────────
Estimated σ²: 1.2768

We can experiment with the penalty strength $\lambda$. Here, $\lambda = 50$ seems to provide the sparsity we want without setting every penalized coefficient to 0.
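
To see this concretely, one could compare the sparsity and BIC of fits across a few values of $\lambda$ (a rough sketch; the grid is arbitrary):

for λ in [10, 25, 50, 100]
    fit_λ = hdmm(X, G, y, gr; λ = λ)
    println("λ = $λ: ", sum(fit_λ.fixef .!= 0), " non-zero fixed effects, BIC = ", round(fit_λ.BIC, digits = 1))
end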

Similarly, we can fit a model with the LASSO penalty:

out_las = hdmm(X, G, y, gr; λ = 50, penalty = "lasso")
HDMModel fit with 500 observations
Log-likelihood at convergence: -1136.3
Random effect covariance matrix:
4×4 Matrix{Float64}:
 0.599867  0.0      0.0     0.0
 0.0       2.58918  0.0     0.0
 0.0       0.0      3.3212  0.0
 0.0       0.0      0.0     2.69691
Estimated 8 non-zero fixed effects:
─────────────
     Estimate
─────────────
1   1.23173
2   1.17688
3   2.03822
4   0.671117
5   0.0190647
6   0.126668
7  -1.02831
8  -0.0887074
─────────────
Estimated σ²: 1.4707

We can compare the estimation performance of the LASSO and SCAD by printing their fixed effect coefficient estimates, saved at out_scad.fixef and out_las.fixef, side by side with the true non-zero parameter values. Since the initialization of our descent algorithm, saved at out_las.init_coef, is obtained by fitting a (cross-validated) LASSO model that does not take the random effects into account, we also display these estimates to see how much we improve by accounting for the random effects.

using DataFrames
DataFrame(
    :true_coefs => β[1:10],
    :lasso_no_random => out_las.init_coef.βstart[1:10],
    :lasso_with_random => out_las.fixef[1:10],
    :scad_with_random => out_scad.fixef[1:10]
    )
10×4 DataFrame
 Row │ true_coefs  lasso_no_random  lasso_with_random  scad_with_random
     │ Float64     Float64          Float64            Float64
─────┼──────────────────────────────────────────────────────────────────
   1 │  1.2027         1.23856          1.23173            1.20002
   2 │  1.02134        1.13896          1.17688            1.18598
   3 │  1.86524        1.89383          2.03822            2.02174
   4 │  0.754892       0.712687         0.671117           0.69393
   5 │  0.41828        0.144303         0.0190647          0.0698426
   6 │  0.462847       0.0              0.126668           0.175662
   7 │ -1.52023       -0.952557        -1.02831           -1.35678
   8 │ -0.4096        -0.368311        -0.0887074         -0.128023
   9 │ -0.106364       0.0670701        0.0                0.0
  10 │  0.0648396      0.0              0.0                0.0