Chapter 5. Inference and Models: Statistical Tests and Linear Regression
using Pkg; Pkg.add(["HypothesisTests", "CSV", "DataFrames", "RDatasets"])
using HypothesisTests, CSV, DataFrames, RDatasets
  Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
df = dataset("datasets", "iris")
first(df, 5)
5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼───────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa
T-tests
One-sample
mu = 20  # hypothesized population mean under h_0
OneSampleTTest(df.PetalLength, mu)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         20
    point estimate:          3.758
    95% confidence interval: (3.473, 4.043)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   150
    t-statistic:              -112.68524392285155
    degrees of freedom:       149
    empirical standard error: 0.144135997177411
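The returned test object can also be queried programmatically; pvalue and confint are part of the HypothesisTests.jl API:
t = OneSampleTTest(df.PetalLength, mu)
pvalue(t)                # two-sided p-value
pvalue(t, tail = :left)  # one-sided alternative (mean < mu)
confint(t)               # 95% confidence interval for the mean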
Two-sample
UnequalVarianceTTest(df.PetalLength, df.PetalWidth)
Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.249, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-37

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       202.6934059810404
    empirical standard error: 0.1569986011897579
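If the normality assumption is doubtful, HypothesisTests.jl also offers nonparametric alternatives such as the Mann-Whitney U test:
MannWhitneyUTest(df.PetalLength, df.PetalWidth)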
Paired-sample
EqualVarianceTTest(df.PetalLength, df.PetalWidth)
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.25, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-42

Details:
    number of observations:   [150,150]
    t-statistic:              16.297385118572546
    degrees of freedom:       298
    empirical standard error: 0.15699860118975792
Attention: EqualVarianceTTest is in fact a two-sample t-test with pooled variance, not a paired-sample test, and pairing PetalLength with PetalWidth would not be conceptually valid anyway; the call above only illustrates how to use the function.
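For a genuine paired-sample test, HypothesisTests.jl uses OneSampleTTest with two vectors of equal length, which tests whether the mean of the pairwise differences is zero:
# Paired t-test: equivalent to OneSampleTTest(df.PetalLength .- df.PetalWidth)
OneSampleTTest(df.PetalLength, df.PetalWidth)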
ANOVA
# Split PetalLength into setosa vs. non-setosa groups
g1 = df[df.Species .== "setosa", :PetalLength]
g2 = df[df.Species .!= "setosa", :PetalLength]
groups = [g1, g2]
OneWayANOVATest(groups...)
One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest: Means
    value under h_0:       "all equal"
    point estimate:        NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     <1e-65

Details:
    number of observations: [50, 100]
    F statistic:            954.682
    degrees of freedom:     (1, 148)
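The same function extends to more than two groups; a sketch comparing PetalLength across all three species:
# One group vector per species level
species_groups = [df[df.Species .== s, :PetalLength] for s in unique(df.Species)]
OneWayANOVATest(species_groups...)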
Models in Julia
Pkg.add("GLM"); using GLM
  Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
1. Linear Regression
formula = @formula(PetalLength ~ PetalWidth)
lm(formula, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
PetalLength ~ 1 + PetalWidth
Coefficients:
───────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept) 1.08356 0.072967 14.85 <1e-30 0.939366 1.22775
PetalWidth 2.22994 0.0513962 43.39 <1e-85 2.12838 2.33151
───────────────────────────────────────────────────────────────────────
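Additional predictors enter the formula with +, and the fitted model supports the standard StatsAPI accessors. A sketch with a second, arbitrarily chosen predictor:
m = lm(@formula(PetalLength ~ PetalWidth + SepalLength), df)  # two predictors
coef(m)     # coefficient estimates
r2(m)       # coefficient of determination
predict(m)  # fitted values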
2. Fixed-effects Models (fast)
Pkg.add("FixedEffectModels"); using FixedEffectModels
  Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species))
FixedEffectModel
========================================================================
Number of obs: 150 Converged: true
dof (model): 1 dof (residuals): 3
R²: 0.955 R² adjusted: 0.954
F-statistic: 5.65297 P-value: 0.098
R² within: 0.235 Iterations: 1
========================================================================
Estimate Std. Error t-stat Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth 1.01871 0.428463 2.3776 0.0978 -0.344848 2.38227
========================================================================
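With only three clusters, the clustered standard errors above are not very reliable; FixedEffectModels.jl also accepts heteroskedasticity-robust errors via Vcov.robust(). A sketch:
# Same fixed-effects model with heteroskedasticity-robust (White) standard errors
reg(df, @formula(PetalLength ~ PetalWidth + fe(Species)), Vcov.robust())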
3. Mediation
While it's common to run mediation analysis with dedicated packages in R and Python, I have not found an established Julia alternative.
Therefore, we can do it ourselves:
# Rename to the mediation convention: X = predictor, M = mediator, Y = outcome
df_mediation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :M, :Y]) |>
    d -> d[:, [:X, :M, :Y]]
# X -> M
fm_a = @formula(M ~ X)
# M and X -> Y
fm_b_cprime= @formula(Y ~ X + M)
# X -> Y
fm_c = @formula(Y ~ X)
first(df_mediation, 3)
3×3 DataFrame
 Row │ X        M        Y
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     0.2      3.5      1.4
   2 │     0.2      3.0      1.4
   3 │     0.2      3.2      1.3
model_a = lm(fm_a, df_mediation)
a = round(coef(model_a)[2], digits = 3) # Path a (coefficient for X)
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
M ~ 1 + X
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 3.30843 0.0620975 53.28 <1e-97 3.18571 3.43114
X -0.20936 0.04374 -4.79 <1e-05 -0.295795 -0.122924
────────────────────────────────────────────────────────────────────────
model_b_cprime = lm(fm_b_cprime, df_mediation)
b = round(coef(model_b_cprime)[3], digits = 3) # Path b (coefficient for M)
c_prime = round(coef(model_b_cprime)[2], digits = 3) # Path c' (coefficient for X)
model_b_cprime
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + M
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 2.25816 0.313519 7.20 <1e-10 1.63858 2.87775
X 2.15561 0.0528285 40.80 <1e-81 2.05121 2.26001
M -0.355035 0.0923859 -3.84 0.0002 -0.537611 -0.172458
─────────────────────────────────────────────────────────────────────────
model_c = lm(fm_c, df_mediation)
c = round(coef(model_c)[2], digits = 3) # Path c (coefficient for X)
model_c
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
───────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept) 1.08356 0.072967 14.85 <1e-30 0.939366 1.22775
X 2.22994 0.0513962 43.39 <1e-85 2.12838 2.33151
───────────────────────────────────────────────────────────────────────
Calculate direct and indirect effects:
# Indirect effect
indirect_effect = round(a * b, digits=3)
# Total effect
total_effect = c
# Print results
println("Path a (X -> M): $a")
println("Path b (M -> Y controlling for X): $b")
println("Path c' (Direct effect of X on Y): $c_prime")
println("Path c (Total effect of X on Y): $c")
println("Indirect Effect (a * b): $indirect_effect")
Path a (X -> M): -0.209
Path b (M -> Y controlling for X): -0.355
Path c' (Direct effect of X on Y): 2.156
Path c (Total effect of X on Y): 2.23
Indirect Effect (a * b): 0.074
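To attach a significance test to the indirect effect, we can compute a normal-theory Sobel test from the unrounded coefficients and their standard errors. A minimal sketch, using the stderror accessor and the Normal distribution from Distributions.jl:
using Distributions  # Normal, cdf
# Sobel test for the indirect effect a*b
a_hat, se_a = coef(model_a)[2], stderror(model_a)[2]
b_hat, se_b = coef(model_b_cprime)[3], stderror(model_b_cprime)[3]
se_ab = sqrt(b_hat^2 * se_a^2 + a_hat^2 * se_b^2)  # first-order Sobel standard error
z = (a_hat * b_hat) / se_ab
p = 2 * (1 - cdf(Normal(), abs(z)))
println("Sobel z = $(round(z, digits=3)), p = $(round(p, digits=4))")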
(Note: the theoretical assumptions of mediation do not hold for this example; it is shown for illustration only.)
4. Moderation
# Rename to the moderation convention: X = predictor, W = moderator, Y = outcome
df_moderation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :W, :Y]) |>
    d -> d[:, [:X, :W, :Y]]
# X -> Y
fm_a = @formula(Y ~ X)
# W moderating X -> Y
fm_b = @formula(Y ~ X * W)
first(df_moderation, 3)
3×3 DataFrame
 Row │ X        W        Y
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     0.2      3.5      1.4
   2 │     0.2      3.0      1.4
   3 │     0.2      3.2      1.3
model_a = lm(fm_a, df_moderation)
b_1 = round(coef(model_a)[2], digits = 3) # slope of X in the unmoderated model
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
───────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept) 1.08356 0.072967 14.85 <1e-30 0.939366 1.22775
X 2.22994 0.0513962 43.39 <1e-85 2.12838 2.33151
───────────────────────────────────────────────────────────────────────
model_b = lm(fm_b, df_moderation)
b_3 = round(coef(model_b)[4], digits = 3) # coefficient of the X & W interaction
model_b
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + W + X & W
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 2.28181 0.583452 3.91 0.0001 1.12871 3.43492
X 2.13257 0.481634 4.43 <1e-04 1.18069 3.08445
W -0.362079 0.173239 -2.09 0.0383 -0.704459 -0.019698
X & W 0.0071967 0.149522 0.05 0.9617 -0.288311 0.302704
──────────────────────────────────────────────────────────────────────────
println("Moderation effect: $b_3 (no-sig.)")
Moderation effect: 0.007 (no-sig.)
(Note: the theoretical assumptions of moderation do not hold for this example; it is shown for illustration only.)
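Had the interaction been significant, the usual next step is to probe simple slopes of X at representative values of W (e.g., the mean and ±1 SD). A minimal sketch using the fitted coefficients:
using Statistics  # mean, std
# Simple slope of X on Y at a given W: b_X + b_{X&W} * W
bX, bXW = coef(model_b)[2], coef(model_b)[4]
mW, sW = mean(df_moderation.W), std(df_moderation.W)
for w in (mW - sW, mW, mW + sW)
    println("slope of X at W = $(round(w, digits=2)): $(round(bX + bXW * w, digits=3))")
end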
5. Conditional Process
A more complete analysis, therefore, should attempt to model the mechanisms at work linking X to Y while simultaneously allowing those effects to be contingent on context, circumstance, or individual differences. (Andrew F. Hayes, 2018, p. 395)
For instance, we may want to test whether the relation between X and Y is mediated by M, and whether the effect of M on Y is moderated by W. In the current dataset, we assume:
X: PetalWidth
M: SepalLength
W: SepalWidth
Y: PetalLength
# Rename to the conditional-process convention: X = predictor, M = mediator, W = moderator, Y = outcome
df_conditional = rename(df, [:PetalWidth, :SepalLength, :SepalWidth, :PetalLength] .=> [:X, :M, :W, :Y]) |>
    d -> d[:, [:X, :M, :W, :Y]]
first(df_conditional, 3)
3×4 DataFrame
 Row │ X        M        W        Y
     │ Float64  Float64  Float64  Float64
─────┼────────────────────────────────────
   1 │     0.2      5.1      3.5      1.4
   2 │     0.2      4.9      3.0      1.4
   3 │     0.2      4.7      3.2      1.3
Suppose that we expect W to enter both stages, matching the formulas fitted below:
$M=i_M+a_1X+a_2W+a_3XW+\epsilon_M$
$Y=i_Y+c'X+b_1M+b_2W+b_3WM+\epsilon_Y$
# First stage: X -> M, with W moderating X -> M
fm_a = @formula(M ~ X * W)
# Second stage: X and M -> Y, with W moderating M -> Y
fm_b = @formula(Y ~ X + M + W * M)
# Total effect: X -> Y
fm_c = @formula(Y ~ X)
FormulaTerm
Response:
  Y(unknown)
Predictors:
  X(unknown)
model_a = lm(fm_a, df_conditional)
a = round(coef(model_a)[2], digits = 3) # a1: conditional effect of X on M (at W = 0)
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
M ~ 1 + X + W + X & W
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 3.13046 0.5745 5.45 <1e-06 1.99505 4.26587
X 1.29057 0.474244 2.72 0.0073 0.353302 2.22784
W 0.496425 0.170581 2.91 0.0042 0.159298 0.833551
X & W -0.0994638 0.147228 -0.68 0.5004 -0.390437 0.191509
──────────────────────────────────────────────────────────────────────────
model_b = lm(fm_b, df_conditional)
c_prime = round(coef(model_b)[2], digits = 3) # c': direct effect of X on Y
b_1 = round(coef(model_b)[3], digits = 3) # M -> Y
b_2 = round(coef(model_b)[4], digits = 3) # W -> Y
b_3 = round(coef(model_b)[5], digits = 3) # M*W -> Y
model_b
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + M + W + W & M
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 0.714816 1.56623 0.46 0.6488 -2.38078 3.81041
X 1.43584 0.0699066 20.54 <1e-44 1.29767 1.57401
M 0.561752 0.269701 2.08 0.0390 0.0286989 1.09481
W -0.970405 0.514856 -1.88 0.0615 -1.988 0.047187
W & M 0.0564153 0.0887396 0.64 0.5259 -0.118975 0.231806
───────────────────────────────────────────────────────────────────────────
model_c = lm(fm_c, df_conditional)
c = round(coef(model_c)[2], digits = 3) # c: total effect of X on Y
model_c
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
───────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept) 1.08356 0.072967 14.85 <1e-30 0.939366 1.22775
X 2.22994 0.0513962 43.39 <1e-85 2.12838 2.33151
───────────────────────────────────────────────────────────────────────
# Indirect effect
indirect_effect = round(a * b_1, digits=3)
# Total effect
total_effect = c
# Print results
println("Path a (X -> M): $a")
println("Path b1 (M -> Y): $b_1")
println("Path b2 (W -> Y): $b_2")
println("Path b3 (MW -> Y): $b_3")
println("Indirect Effect (a * b_1): $indirect_effect")
Path a (X -> M): 1.291
Path b1 (M -> Y): 0.562
Path b2 (W -> Y): -0.97
Path b3 (MW -> Y): 0.056
Indirect Effect (a * b_1): 0.726
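Because W enters both stages here, the indirect effect is conditional on W: $(a_1+a_3W)(b_1+b_3W)$. A minimal sketch evaluating it at the mean of W and one SD on either side, using the unrounded coefficients:
using Statistics  # mean, std
# Conditional indirect effect of X on Y through M, as a function of W
a1, a3 = coef(model_a)[2], coef(model_a)[4]  # first stage: X, X & W
b1, b3 = coef(model_b)[3], coef(model_b)[5]  # second stage: M, W & M
mW, sW = mean(df_conditional.W), std(df_conditional.W)
for w in (mW - sW, mW, mW + sW)
    ie = (a1 + a3 * w) * (b1 + b3 * w)
    println("conditional indirect effect at W = $(round(w, digits=2)): $(round(ie, digits=3))")
end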
If you need bootstrapping and the other procedures introduced by Hayes, we can call R or Python from within Julia (e.g., via RCall.jl or PyCall.jl) to do so.
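Alternatively, a basic percentile bootstrap for the simple indirect effect is straightforward to write in Julia itself. A minimal sketch over the df_mediation frame from above (n_boot is an assumed tuning choice):
using Random, Statistics
# Percentile-bootstrap 95% CI for the indirect effect a*b
function bootstrap_indirect(d::DataFrame; n_boot::Int = 5_000)
    n = nrow(d)
    ab = Vector{Float64}(undef, n_boot)
    for i in 1:n_boot
        s = d[rand(1:n, n), :]  # resample rows with replacement
        a = coef(lm(@formula(M ~ X), s))[2]
        b = coef(lm(@formula(Y ~ X + M), s))[3]
        ab[i] = a * b
    end
    quantile(ab, [0.025, 0.975])
end
bootstrap_indirect(df_mediation)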