Chapter 5. Statisticcal Tests and Linear Regression

using Pkg;
Pkg.add("HypothesisTests");
using HypothesisTests, CSV, DataFrames, RDatasets

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`

df = dataset("datasets", "iris")
first(df, 5)

5×5 DataFrame

Row

SepalLength

SepalWidth

PetalLength

PetalWidth

Species

Float64

Cat…

5.1

3.5

1.4

0.2

setosa

4.9

3.0

1.4

0.2

setosa

4.7

3.2

1.3

0.2

setosa

4.6

3.1

1.5

0.2

setosa

5.0

3.6

1.4

0.2

setosa

T-tests

One-sample

mu = 20
OneSampleTTest(df.PetalLength, mu)

One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         20
    point estimate:          3.758
    95% confidence interval: (3.473, 4.043)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   150
    t-statistic:              -112.68524392285158
    degrees of freedom:       149
    empirical standard error: 0.14413599717741096

Two-samples

UnequalVarianceTTest(df.PetalLength, df.PetalWidth)

Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.249, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-37

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       202.69340598104046
    empirical standard error: 0.1569986011897579

Paired-samples

EqualVarianceTTest(df.PetalLength, df.PetalWidth)

Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.25, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-42

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       298
    empirical standard error: 0.1569986011897579

Attention: the usage of a paired sample test is not valid here, just for illustating how to use the function.

ANOVA

g1 = df[df.Species.=="setosa", :PetalLength]
g2 = df[df.Species.!="setosa", :PetalLength]

groups = [g1, g2]
OneWayANOVATest(groups...)

One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     <1e-62

Details:
    number of observations: [50, 100]
    F statistic:            848.606
    degrees of freedom:     (1, 148)

Models in Julia

Pkg.add("GLM");
using GLM;

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`

1. Multivariate Linear Regression

formula = @formula(PetalLength ~ PetalWidth)
lm(formula, df)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

PetalLength ~ 1 + PetalWidth

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
PetalWidth   2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────

2. Fixed-effects Models (fast)

Pkg.add("FixedEffectModels");
using FixedEffectModels;

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`

formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species))

                            FixedEffectModel                            
========================================================================
Number of obs:                  150  Converged:                     true
dof (model):                      1  dof (residuals):                  3
R²:                           0.955  R² adjusted:                  0.954
F-statistic:                5.65297  P-value:                      0.098
R² within:                    0.235  Iterations:                       1
========================================================================
            Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth   1.01871    0.428463  2.3776    0.0978  -0.344848    2.38227
========================================================================

The package of FixedEffectModels also supports CUDA, increasing the speed on a computer with Nvidia's graphic toolkit:

using Pkg
Pkg.add("CUDA");
using CUDA
@assert CUDA.functional()
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species), method=:CUDA)

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`


[?25h[2K[1mDemean Variables:[22m[39m [================================>]  2/2




                            FixedEffectModel                            
========================================================================
Number of obs:                  150  Converged:                     true
dof (model):                      1  dof (residuals):                  3
R²:                           0.955  R² adjusted:                  0.954
F-statistic:                5.65296  P-value:                      0.098
R² within:                    0.235  Iterations:                       1
========================================================================
            Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth   1.01871    0.428463  2.3776    0.0978  -0.344848    2.38227
========================================================================

3. Mediation

While it's common to use packages in R and Python to run mediation analysis, I fail to find an appropriate alternative in Julia.

Therefore, we may need to do it by ourselves:

df_mediation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :M, :Y]) |>
               d -> d[:, [:X, :M, :Y]]

# X -> M
fm_a = @formula(M ~ X)

# M and X -> Y
fm_b_cprime = @formula(Y ~ X + M)

# X -> Y
fm_c = @formula(Y ~ X)

first(df_mediation, 3)

3×3 DataFrame

Row

Float64

0.2

3.5

1.4

0.2

3.0

1.4

0.2

3.2

1.3

model_a = lm(fm_a, df_mediation)
a = round(coef(model_a)[2], digits=3)  # Path a (coefficient for X)
model_a

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

M ~ 1 + X

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)   3.30843   0.0620975  53.28    <1e-97   3.18571    3.43114
X            -0.20936   0.04374    -4.79    <1e-05  -0.295795  -0.122924
────────────────────────────────────────────────────────────────────────

model_b_cprime = lm(fm_b_cprime, df_mediation)
b = round(coef(model_b_cprime)[3], digits=3)  # Path b (coefficient for M)
c_prime = round(coef(model_b_cprime)[2], digits=3)  # Path c' (coefficient for X)
model_b_cprime

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + M

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)   2.25816    0.313519    7.20    <1e-10   1.63858    2.87775
X             2.15561    0.0528285  40.80    <1e-81   2.05121    2.26001
M            -0.355035   0.0923859  -3.84    0.0002  -0.537611  -0.172458
─────────────────────────────────────────────────────────────────────────

model_c = lm(fm_c, df_mediation)
c = round(coef(model_c)[2], digits=3)  # Path c (coefficient for X)
model_c

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────

Calculate direct and indirect effects:

# Indirect effect
indirect_effect = round(a * b, digits=3)

# Total effect
total_effect = c

# Print results
println("Path a (X -> M): $a")
println("Path b (M -> Y controlling for X): $b")
println("Path c' (Direct effect of X on Y): $c_prime")
println("Path c (Total effect of X on Y): $c")
println("Indirect Effect (a * b): $indirect_effect")

Path a (X -> M): -0.209
Path b (M -> Y controlling for X): -0.355
Path c' (Direct effect of X on Y): 2.156
Path c (Total effect of X on Y): 2.23
Indirect Effect (a * b): 0.074

(Attention, the theoritical assumption of this example here does not hold, just for illustration)

4. Moderation

df_moderation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :W, :Y]) |>
                d -> d[:, [:X, :W, :Y]]


# X -> Y
fm_a = @formula(Y ~ X)

# W moderating X -> Y
fm_b = @formula(Y ~ X * W)

first(df_moderation, 3)

3×3 DataFrame

Row

Float64

0.2

3.5

1.4

0.2

3.0

1.4

0.2

3.2

1.3

model_a = lm(fm_a, df_moderation)
b_1 = round(coef(model_a)[2], digits=3)  # Original model
model_a

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────

model_b = lm(fm_b, df_moderation)
b_3 = round(coef(model_b)[4], digits=3)  # Original model
model_b

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + W + X & W

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   2.28181      0.583452   3.91    0.0001   1.12871    3.43492
X             2.13257      0.481634   4.43    <1e-04   1.18069    3.08445
W            -0.362079     0.173239  -2.09    0.0383  -0.704459  -0.019698
X & W         0.0071967    0.149522   0.05    0.9617  -0.288311   0.302704
──────────────────────────────────────────────────────────────────────────

println("Moderation effect: $b_3 (no-sig.)")

Moderation effect: 0.007 (no-sig.)

(Attention, the theoritical assumption of this example here does not hold, just for illustration)

5. Conditional Process

A more complete analysis, therefore, should attempt to model the mechanisms at work linking X to Y while simultaneously allowing those effects to be contingent on context, circumstance, or individual differences. (Andrew F. Hayes, 2018, p. 395)

For instance, we want to test if the relation between X and Y, is mediated by M, and the M on Y is moderated by W. In the current dataset, we assume:

X: PetalWidth
M: SepalLength
W: SepalWidth
Y: PetalLength

df_conditional = rename(df, [:PetalWidth, :SepalLength, :SepalWidth, :PetalLength] .=> [:X, :M, :W, :Y]) |>
                 d -> d[:, [:X, :M, :W, :Y]]

first(df_conditional, 3)

3×4 DataFrame

Row

Float64

0.2

5.1

3.5

1.4

0.2

4.9

3.0

1.4

0.2

4.7

3.2

1.3

Suppose that we expect:

$M=i_M+aX+\epsilon_M$
$Y=i_Y+c'X+b_1M+b_2W+b_3WM+\epsilon_Y$

# a: X -> M
fm_a = @formula(M ~ X * W)

# b and c: M -> Y
fm_b = @formula(Y ~ X + M + W * M)

# c: X -> Y
fm_c = @formula(Y ~ X)

FormulaTerm
Response:
  Y(unknown)
Predictors:
  X(unknown)

model_a = lm(fm_a, df_conditional)
a = round(coef(model_a)[2], digits=3)  # X -> M
model_a

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

M ~ 1 + X + W + X & W

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   3.13046      0.5745     5.45    <1e-06   1.99505    4.26587
X             1.29057      0.474244   2.72    0.0073   0.353302   2.22784
W             0.496425     0.170581   2.91    0.0042   0.159298   0.833551
X & W        -0.0994638    0.147228  -0.68    0.5004  -0.390437   0.191509
──────────────────────────────────────────────────────────────────────────

model_b = lm(fm_b, df_conditional)

c_prime = round(coef(model_b)[2], digits=3)  # indirect_effect
b_1 = round(coef(model_b)[3], digits=3)  # M -> Y
b_2 = round(coef(model_b)[4], digits=3)  # W -> Y
b_3 = round(coef(model_b)[5], digits=3)  # M*W -> Y
model_b

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + M + W + W & M

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.714816    1.56623     0.46    0.6488  -2.38078     3.81041
X             1.43584     0.0699066  20.54    <1e-44   1.29767     1.57401
M             0.561752    0.269701    2.08    0.0390   0.0286989   1.09481
W            -0.970405    0.514856   -1.88    0.0615  -1.988       0.047187
W & M         0.0564153   0.0887396   0.64    0.5259  -0.118975    0.231806
───────────────────────────────────────────────────────────────────────────

model_c = lm(fm_c, df_conditional)
c = round(coef(model_c)[2], digits=3)  # Original model
model_c

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────

# Indirect effect
indirect_effect = round(a * b_1, digits=3)

# Total effect
total_effect = c

# Print results
println("Path a (X -> M): $a")
println("Path b1 (M -> Y): $b_1")
println("Path b2 (W -> Y): $b_2")
println("Path b3 (MW -> Y): $b_3")
println("Indirect Effect (a * b_1): $indirect_effect")

Path a (X -> M): 1.291
Path b1 (M -> Y): 0.562
Path b2 (W -> Y): -0.97
Path b3 (MW -> Y): 0.056
Indirect Effect (a * b_1): 0.726

If you need bootstraping and everything else introduces by Hayes, we can using R or Python in Julia to do so.

PreviousChapter 4. Pipeline and Tools NextChatpter 5 (cont.). Advanced Causal Inference in Julia

Last updated 2 months ago

Was this helpful?