Chapter 5. Statistical Tests and Linear Regression
using Pkg;
Pkg.add("HypothesisTests");
using HypothesisTests, CSV, DataFrames, RDatasets
df = dataset("datasets", "iris")
first(df, 5)
5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼───────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa
T-tests
One-sample
mu = 20
OneSampleTTest(df.PetalLength, mu)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         20
    point estimate:          3.758
    95% confidence interval: (3.473, 4.043)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   150
    t-statistic:              -112.68524392285158
    degrees of freedom:       149
    empirical standard error: 0.14413599717741096
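Rather than reading numbers off the printed report, you can query the test object directly with the HypothesisTests accessors; a minimal sketch:
t = OneSampleTTest(df.PetalLength, mu)
pvalue(t)    # two-sided p-value
confint(t)   # 95% confidence interval for the mean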
Two-sample
UnequalVarianceTTest(df.PetalLength, df.PetalWidth)
Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.249, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-37

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       202.69340598104046
    empirical standard error: 0.1569986011897579
Paired-sample
EqualVarianceTTest(df.PetalLength, df.PetalWidth)
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.25, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-42

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       298
    empirical standard error: 0.1569986011897579
Attention: a paired-sample test is not actually valid for these two columns, and `EqualVarianceTTest` is in fact the equal-variance two-sample test; the example only illustrates how to call the function.
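For a genuine paired test, HypothesisTests performs a one-sample t-test on the pairwise differences when given two vectors of equal length:
# Paired t-test: one-sample test on the differences PetalLength - PetalWidth
OneSampleTTest(df.PetalLength, df.PetalWidth)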
ANOVA
g1 = df[df.Species.=="setosa", :PetalLength]
g2 = df[df.Species.!="setosa", :PetalLength]
groups = [g1, g2]
OneWayANOVATest(groups...)
One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     <1e-62

Details:
    number of observations: [50, 100]
    F statistic:            848.606
    degrees of freedom:     (1, 148)
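The same test extends to any number of groups. For example, comparing PetalLength across all three species (a sketch that builds the groups with a comprehension):
# One group of PetalLength values per species
groups3 = [df[df.Species .== s, :PetalLength] for s in unique(df.Species)]
OneWayANOVATest(groups3...)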
Models in Julia
Pkg.add("GLM");
using GLM;
1. Linear Regression
formula = @formula(PetalLength ~ PetalWidth)
lm(formula, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
PetalLength ~ 1 + PetalWidth
Coefficients:
─────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  1.08356    0.072967   14.85    <1e-30   0.939366    1.22775
PetalWidth   2.22994    0.0513962  43.39    <1e-85   2.12838     2.33151
─────────────────────────────────────────────────────────────────────────
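The fitted model also supports the usual accessor functions, so the estimates can be used programmatically; a short sketch (the new PetalWidth values are made up for illustration):
ols = lm(@formula(PetalLength ~ PetalWidth), df)
coef(ols)                                          # intercept and slope
r2(ols)                                            # coefficient of determination
predict(ols, DataFrame(PetalWidth = [0.5, 1.5]))   # predictions for new data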
2. Fixed-effects Models (fast)
Pkg.add("FixedEffectModels");
using FixedEffectModels;
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species))
FixedEffectModel
========================================================================
Number of obs: 150 Converged: true
dof (model): 1 dof (residuals): 3
RΒ²: 0.955 RΒ² adjusted: 0.954
F-statistic: 5.65297 P-value: 0.098
RΒ² within: 0.235 Iterations: 1
========================================================================
Estimate Std. Error t-stat Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth 1.01871 0.428463 2.3776 0.0978 -0.344848 2.38227
========================================================================
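Besides clustering, the Vcov module used above also provides other variance estimators; for instance (a sketch):
reg(df, formula, Vcov.robust())            # heteroskedasticity-robust standard errors
reg(df, formula, Vcov.cluster(:Species))   # standard errors clustered by Species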
The FixedEffectModels package also supports CUDA, which can speed up estimation on a machine with an Nvidia GPU:
using Pkg
Pkg.add("CUDA");
using CUDA
@assert CUDA.functional()
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species), method=:CUDA)
FixedEffectModel
========================================================================
Number of obs: 150 Converged: true
dof (model): 1 dof (residuals): 3
RΒ²: 0.955 RΒ² adjusted: 0.954
F-statistic: 5.65296 P-value: 0.098
RΒ² within: 0.235 Iterations: 1
========================================================================
Estimate Std. Error t-stat Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth 1.01871 0.428463 2.3776 0.0978 -0.344848 2.38227
========================================================================
3. Mediation
While it is common to run mediation analysis with dedicated packages in R and Python, I have not found a suitable alternative in Julia.
Therefore, we can do it ourselves:
df_mediation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :M, :Y]) |>
d -> d[:, [:X, :M, :Y]]
# X -> M
fm_a = @formula(M ~ X)
# M and X -> Y
fm_b_cprime = @formula(Y ~ X + M)
# X -> Y
fm_c = @formula(Y ~ X)
first(df_mediation, 3)
3×3 DataFrame
 Row │ X        M        Y
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     0.2      3.5      1.4
   2 │     0.2      3.0      1.4
   3 │     0.2      3.2      1.3
model_a = lm(fm_a, df_mediation)
a = round(coef(model_a)[2], digits=3) # Path a (coefficient for X)
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
M ~ 1 + X
Coefficients:
─────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)   3.30843   0.0620975  53.28    <1e-97   3.18571    3.43114
X            -0.20936   0.04374    -4.79    <1e-05  -0.295795  -0.122924
─────────────────────────────────────────────────────────────────────────
model_b_cprime = lm(fm_b_cprime, df_mediation)
b = round(coef(model_b_cprime)[3], digits=3) # Path b (coefficient for M)
c_prime = round(coef(model_b_cprime)[2], digits=3) # Path c' (coefficient for X)
model_b_cprime
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + M
Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   2.25816    0.313519    7.20    <1e-10   1.63858    2.87775
X             2.15561    0.0528285  40.80    <1e-81   2.05121    2.26001
M            -0.355035   0.0923859  -3.84    0.0002  -0.537611  -0.172458
──────────────────────────────────────────────────────────────────────────
model_c = lm(fm_c, df_mediation)
c = round(coef(model_c)[2], digits=3) # Path c (coefficient for X)
model_c
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
─────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  1.08356    0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994    0.0513962  43.39    <1e-85   2.12838     2.33151
─────────────────────────────────────────────────────────────────────────
Calculate direct and indirect effects:
# Indirect effect
indirect_effect = round(a * b, digits=3)
# Total effect
total_effect = c
# Print results
println("Path a (X -> M): $a")
println("Path b (M -> Y controlling for X): $b")
println("Path c' (Direct effect of X on Y): $c_prime")
println("Path c (Total effect of X on Y): $c")
println("Indirect Effect (a * b): $indirect_effect")
Path a (X -> M): -0.209
Path b (M -> Y controlling for X): -0.355
Path c' (Direct effect of X on Y): 2.156
Path c (Total effect of X on Y): 2.23
Indirect Effect (a * b): 0.074
(Attention: the theoretical assumptions of this example do not hold; it is for illustration only.)
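If you want an uncertainty estimate for the indirect effect without leaving Julia, a bare-bones percentile bootstrap can be sketched as follows (the helper function and its defaults are my own, not from any package; it reuses fm_a and fm_b_cprime from above):
using Random, Statistics

function bootstrap_indirect(data; n_boot = 5_000, rng = Random.default_rng())
    n = nrow(data)
    ab = Vector{Float64}(undef, n_boot)
    for i in 1:n_boot
        idx = rand(rng, 1:n, n)                # resample rows with replacement
        d = data[idx, :]
        ab[i] = coef(lm(fm_a, d))[2] *         # path a in the resample
                coef(lm(fm_b_cprime, d))[3]    # path b in the resample
    end
    (estimate = mean(ab), ci = quantile(ab, [0.025, 0.975]))
end

bootstrap_indirect(df_mediation)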
4. Moderation
df_moderation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :W, :Y]) |>
d -> d[:, [:X, :W, :Y]]
# X -> Y
fm_a = @formula(Y ~ X)
# W moderating X -> Y
fm_b = @formula(Y ~ X * W)
first(df_moderation, 3)
3×3 DataFrame
 Row │ X        W        Y
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     0.2      3.5      1.4
   2 │     0.2      3.0      1.4
   3 │     0.2      3.2      1.3
model_a = lm(fm_a, df_moderation)
b_1 = round(coef(model_a)[2], digits=3) # coefficient for X in the baseline model
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
─────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  1.08356    0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994    0.0513962  43.39    <1e-85   2.12838     2.33151
─────────────────────────────────────────────────────────────────────────
model_b = lm(fm_b, df_moderation)
b_3 = round(coef(model_b)[4], digits=3) # coefficient for the X & W interaction
model_b
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + W + X & W
Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   2.28181     0.583452   3.91    0.0001   1.12871     3.43492
X             2.13257     0.481634   4.43    <1e-04   1.18069     3.08445
W            -0.362079    0.173239  -2.09    0.0383  -0.704459   -0.019698
X & W         0.0071967   0.149522   0.05    0.9617  -0.288311    0.302704
───────────────────────────────────────────────────────────────────────────
println("Moderation effect: $b_3 (no-sig.)")
Moderation effect: 0.007 (no-sig.)
(Attention: the theoretical assumptions of this example do not hold; it is for illustration only.)
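Had the interaction been significant, a natural next step would be to probe it with simple slopes, i.e., the conditional effect of X at representative values of W. A sketch (the mean ± 1 SD probe points are a common convention, not part of the original analysis):
using Statistics
bs = coef(model_b)                 # (Intercept), X, W, X & W
mW, sW = mean(df_moderation.W), std(df_moderation.W)
for w in (mW - sW, mW, mW + sW)
    slope = bs[2] + bs[4] * w      # conditional effect of X at this W
    println("Effect of X at W = $(round(w, digits=2)): $(round(slope, digits=3))")
end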
5. Conditional Process
A more complete analysis, therefore, should attempt to model the mechanisms at work linking X to Y while simultaneously allowing those effects to be contingent on context, circumstance, or individual differences. (Andrew F. Hayes, 2018, p. 395)
For instance, we may want to test whether the relation between X and Y is mediated by M, with the paths to and from M moderated by W (in the models fitted below, both the X -> M and the M -> Y paths include W). In the current dataset, we assume:
X: PetalWidth
M: SepalLength
W: SepalWidth
Y: PetalLength
df_conditional = rename(df, [:PetalWidth, :SepalLength, :SepalWidth, :PetalLength] .=> [:X, :M, :W, :Y]) |>
d -> d[:, [:X, :M, :W, :Y]]
first(df_conditional, 3)
3×4 DataFrame
 Row │ X        M        W        Y
     │ Float64  Float64  Float64  Float64
─────┼────────────────────────────────────
   1 │     0.2      5.1      3.5      1.4
   2 │     0.2      4.9      3.0      1.4
   3 │     0.2      4.7      3.2      1.3
Suppose that we expect:
$M=i_M+a_1X+a_2W+a_3XW+\epsilon_M$
$Y=i_Y+c'X+b_1M+b_2W+b_3WM+\epsilon_Y$
# a path: X -> M, moderated by W
fm_a = @formula(M ~ X * W)
# b and c' paths: X and M -> Y, with M moderated by W
fm_b = @formula(Y ~ X + M + W * M)
# c path: X -> Y (total effect)
fm_c = @formula(Y ~ X)
FormulaTerm
Response:
  Y(unknown)
Predictors:
  X(unknown)
model_a = lm(fm_a, df_conditional)
a = round(coef(model_a)[2], digits=3) # X -> M
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
M ~ 1 + X + W + X & W
Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   3.13046     0.5745     5.45    <1e-06   1.99505    4.26587
X             1.29057     0.474244   2.72    0.0073   0.353302   2.22784
W             0.496425    0.170581   2.91    0.0042   0.159298   0.833551
X & W        -0.0994638   0.147228  -0.68    0.5004  -0.390437   0.191509
───────────────────────────────────────────────────────────────────────────
model_b = lm(fm_b, df_conditional)
c_prime = round(coef(model_b)[2], digits=3) # Path c' (direct effect of X on Y)
b_1 = round(coef(model_b)[3], digits=3) # M -> Y
b_2 = round(coef(model_b)[4], digits=3) # W -> Y
b_3 = round(coef(model_b)[5], digits=3) # M*W -> Y
model_b
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + M + W + W & M
Coefficients:
───────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.714816   1.56623     0.46    0.6488  -2.38078    3.81041
X             1.43584    0.0699066  20.54    <1e-44   1.29767    1.57401
M             0.561752   0.269701    2.08    0.0390   0.0286989  1.09481
W            -0.970405   0.514856   -1.88    0.0615  -1.988      0.047187
W & M         0.0564153  0.0887396   0.64    0.5259  -0.118975   0.231806
───────────────────────────────────────────────────────────────────────────
model_c = lm(fm_c, df_conditional)
c = round(coef(model_c)[2], digits=3) # Path c (total effect of X on Y)
model_c
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
─────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  1.08356    0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994    0.0513962  43.39    <1e-85   2.12838     2.33151
─────────────────────────────────────────────────────────────────────────
# Indirect effect
indirect_effect = round(a * b_1, digits=3)
# Total effect
total_effect = c
# Print results
println("Path a (X -> M): $a")
println("Path b1 (M -> Y): $b_1")
println("Path b2 (W -> Y): $b_2")
println("Path b3 (MW -> Y): $b_3")
println("Indirect Effect (a * b_1): $indirect_effect")
Path a (X -> M): 1.291
Path b1 (M -> Y): 0.562
Path b2 (W -> Y): -0.97
Path b3 (MW -> Y): 0.056
Indirect Effect (a * b_1): 0.726
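Because both paths depend on W here, the product a * b_1 is strictly the indirect effect only when W = 0. A sketch of the conditional indirect effect at, say, the sample mean of W (my own combination of the coefficients fitted above):
using Statistics
w = mean(df_conditional.W)
a_w = coef(model_a)[2] + coef(model_a)[4] * w   # a path at W = w (X and X & W terms)
b_w = coef(model_b)[3] + coef(model_b)[5] * w   # b path at W = w (M and W & M terms)
println("Conditional indirect effect at W = $(round(w, digits=2)): $(round(a_w * b_w, digits=3))")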
If you need bootstrapping and the other procedures introduced by Hayes, you can call R or Python from Julia (for example, via the RCall or PyCall packages).