Chapter 5. Inference and Models

Statistical Tests and Linear Regression

using Pkg; Pkg.add("HypothesisTests")
using HypothesisTests, CSV, DataFrames, RDatasets
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
df = dataset("datasets", "iris")
first(df, 5)

5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼───────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa

T-tests

One-sample

mu = 20  # hypothesized population mean under h_0
OneSampleTTest(df.PetalLength, mu)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         20
    point estimate:          3.758
    95% confidence interval: (3.473, 4.043)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   150
    t-statistic:              -112.68524392285155
    degrees of freedom:       149
    empirical standard error: 0.144135997177411
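
The printed report is convenient interactively; for programmatic use, the same quantities can be pulled out with HypothesisTests' accessor functions. A minimal sketch (pvalue and confint are part of the package's exported API):

test = OneSampleTTest(df.PetalLength, mu)
pvalue(test)                 # two-sided p-value
confint(test)                # 95% confidence interval as a tuple
pvalue(test, tail = :right)  # one-sided variant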

Two-sample

UnequalVarianceTTest(df.PetalLength, df.PetalWidth)
Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.249, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-37

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       202.6934059810404
    empirical standard error: 0.1569986011897579

Paired-sample

EqualVarianceTTest(df.PetalLength, df.PetalWidth)
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.25, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-42

Details:
    number of observations:   [150,150]
    t-statistic:              16.297385118572546
    degrees of freedom:       298
    empirical standard error: 0.15699860118975792

Attention: using a paired-sample test is not valid here; this is just to illustrate how to use the function. Note, too, that EqualVarianceTTest is in fact the equal-variance two-sample test; HypothesisTests performs a genuine paired t-test via OneSampleTTest on two vectors (or on their differences), as sketched below.
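
A minimal sketch of a proper paired t-test (still only illustrative for these columns):

# Paired t-test: two equal-length vectors passed to OneSampleTTest are treated
# as paired observations, and the mean of their differences is tested against 0.
OneSampleTTest(df.PetalLength, df.PetalWidth)

# Equivalent formulation on the differences themselves:
OneSampleTTest(df.PetalLength .- df.PetalWidth)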

ANOVA

g1 = df[df.Species .== "setosa", :PetalLength]   # setosa only
g2 = df[df.Species .!= "setosa", :PetalLength]   # versicolor and virginica

groups = [g1, g2]
OneWayANOVATest(groups...)
One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     <1e-65

Details:
    number of observations: [50, 100]
    F statistic:            954.682
    degrees of freedom:     (1, 148)
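
The call above compares only two groups (setosa vs. the rest). A sketch of the more typical comparison across all three species levels, built with the same OneWayANOVATest:

# One vector of PetalLength values per species level
species_groups = [df[df.Species .== s, :PetalLength] for s in unique(df.Species)]
OneWayANOVATest(species_groups...)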

Models in Julia

Pkg.add("GLM"); using GLM
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

1. Linear Regression

formula = @formula(PetalLength ~ PetalWidth)
lm(formula, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

PetalLength ~ 1 + PetalWidth

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
PetalWidth   2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
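
The fitted model supports the usual StatsAPI accessors, which GLM re-exports. A minimal sketch:

model = lm(formula, df)
coef(model)       # estimated coefficients
stderror(model)   # standard errors of the coefficients
r2(model)         # coefficient of determination
predict(model)    # fitted values for the training data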

2. Fixed-effects Models (fast)

Pkg.add("FixedEffectModels"); using FixedEffectModels
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species))
                            FixedEffectModel                            
========================================================================
Number of obs:                  150  Converged:                     true
dof (model):                      1  dof (residuals):                  3
R²:                           0.955  R² adjusted:                  0.954
F-statistic:                5.65297  P-value:                      0.098
R² within:                    0.235  Iterations:                       1
========================================================================
            Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth   1.01871    0.428463  2.3776    0.0978  -0.344848    2.38227
========================================================================
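
With only three species there are very few clusters, which is part of why the standard error above is so large. As a sketch, other variance estimators from the same Vcov interface can be swapped in:

# Heteroskedasticity-robust standard errors instead of clustered ones
reg(df, formula, Vcov.robust())

# Default (IID) standard errors
reg(df, formula)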

3. Mediation

While R and Python offer dedicated packages for mediation analysis, I have not found a suitable equivalent in Julia.

We can, however, implement it ourselves:

df_mediation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=>  [:X, :M, :Y] )  |> 
  d -> d[:, [:X, :M, :Y] ]

# X -> M
fm_a = @formula(M ~ X)

# M and X -> Y
fm_b_cprime= @formula(Y ~ X + M)

# X -> Y
fm_c = @formula(Y ~ X)

first(df_mediation, 3)

3×3 DataFrame
 Row │ X        M        Y
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     0.2      3.5      1.4
   2 │     0.2      3.0      1.4
   3 │     0.2      3.2      1.3

model_a = lm(fm_a, df_mediation)
a = round(coef(model_a)[2], digits = 3)  # Path a (coefficient for X)
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

M ~ 1 + X

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)   3.30843   0.0620975  53.28    <1e-97   3.18571    3.43114
X            -0.20936   0.04374    -4.79    <1e-05  -0.295795  -0.122924
────────────────────────────────────────────────────────────────────────
model_b_cprime = lm(fm_b_cprime, df_mediation)
b = round(coef(model_b_cprime)[3], digits = 3)  # Path b (coefficient for M)
c_prime = round(coef(model_b_cprime)[2], digits = 3)  # Path c' (coefficient for X)
model_b_cprime
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + M

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)   2.25816    0.313519    7.20    <1e-10   1.63858    2.87775
X             2.15561    0.0528285  40.80    <1e-81   2.05121    2.26001
M            -0.355035   0.0923859  -3.84    0.0002  -0.537611  -0.172458
─────────────────────────────────────────────────────────────────────────
model_c = lm(fm_c, df_mediation)
c = round(coef(model_c)[2], digits = 3)  # Path c (coefficient for X)
model_c
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────

Calculate direct and indirect effects:

# Indirect effect
indirect_effect = round(a * b, digits=3)

# Total effect
total_effect = c

# Print results
println("Path a (X -> M): $a")
println("Path b (M -> Y controlling for X): $b")
println("Path c' (Direct effect of X on Y): $c_prime")
println("Path c (Total effect of X on Y): $c")
println("Indirect Effect (a * b): $indirect_effect")
Path a (X -> M): -0.209
Path b (M -> Y controlling for X): -0.355
Path c' (Direct effect of X on Y): 2.156
Path c (Total effect of X on Y): 2.23
Indirect Effect (a * b): 0.074

(Attention: the theoretical assumptions of this example do not hold; it is for illustration only.)
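
A common next step is a significance test for the indirect effect. A minimal Sobel-test sketch, computed from the models fitted above (stderror is the StatsAPI accessor re-exported by GLM):

# Sobel test: z = a*b / sqrt(b^2 * se_a^2 + a^2 * se_b^2)
se_a = stderror(model_a)[2]         # standard error of path a
se_b = stderror(model_b_cprime)[3]  # standard error of path b
z = a * b / sqrt(b^2 * se_a^2 + a^2 * se_b^2)
println("Sobel z = $(round(z, digits = 3))")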

4. Moderation

df_moderation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=>  [:X, :W, :Y] )  |> 
  d -> d[:, [:X, :W, :Y] ]


# X -> Y
fm_a = @formula(Y ~ X)

# W moderating X -> Y
fm_b = @formula(Y ~ X * W)

first(df_moderation, 3)

3×3 DataFrame
 Row │ X        W        Y
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     0.2      3.5      1.4
   2 │     0.2      3.0      1.4
   3 │     0.2      3.2      1.3

model_a = lm(fm_a, df_moderation)
b_1 = round(coef(model_a)[2], digits = 3)  # slope of X in the model without the moderator
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
model_b = lm(fm_b, df_moderation)
b_3 = round(coef(model_b)[4], digits = 3)  # coefficient of the X & W interaction
model_b
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + W + X & W

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   2.28181      0.583452   3.91    0.0001   1.12871    3.43492
X             2.13257      0.481634   4.43    <1e-04   1.18069    3.08445
W            -0.362079     0.173239  -2.09    0.0383  -0.704459  -0.019698
X & W         0.0071967    0.149522   0.05    0.9617  -0.288311   0.302704
──────────────────────────────────────────────────────────────────────────
println("Moderation effect: $b_3 (no-sig.)")
Moderation effect: 0.007 (no-sig.)

(Attention: the theoretical assumptions of this example do not hold; it is for illustration only.)
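
Even though the interaction is not significant here, this is how the moderation would normally be probed. A minimal simple-slopes sketch, evaluating the conditional slope of X at low, mean, and high values of W (mean and std come from the standard library's Statistics module):

using Statistics

# Conditional effect of X on Y at a given W: slope(W) = b_X + b_XW * W
b_X, b_XW = coef(model_b)[2], coef(model_b)[4]
w_mean, w_sd = mean(df_moderation.W), std(df_moderation.W)
for w in (w_mean - w_sd, w_mean, w_mean + w_sd)
    println("W = $(round(w, digits = 2)): slope of X = $(round(b_X + b_XW * w, digits = 3))")
end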

5. Conditional Process

A more complete analysis, therefore, should attempt to model the mechanisms at work linking X to Y while simultaneously allowing those effects to be contingent on context, circumstance, or individual differences. (Andrew F. Hayes, 2018, p. 395)

For instance, we may want to test whether the relation between X and Y is mediated by M, and whether the effect of M on Y is moderated by W. In the current dataset, we assume:

  • X: PetalWidth

  • M: SepalLength

  • W: SepalWidth

  • Y: PetalLength

df_conditional = rename(df, [:PetalWidth, :SepalLength, :SepalWidth, :PetalLength] .=>  [:X, :M, :W, :Y] )  |> 
  d -> d[:, [:X, :M, :W, :Y] ]

first(df_conditional, 3)

3×4 DataFrame
 Row │ X        M        W        Y
     │ Float64  Float64  Float64  Float64
─────┼────────────────────────────────────
   1 │     0.2      5.1      3.5      1.4
   2 │     0.2      4.9      3.0      1.4
   3 │     0.2      4.7      3.2      1.3

Suppose that we expect:

  • $M=i_M+aX+\epsilon_M$

  • $Y=i_Y+c'X+b_1M+b_2W+b_3WM+\epsilon_Y$

# a: X -> M (this fit also includes W and the X & W interaction)
fm_a = @formula(M ~ X * W)

# b and c': Y from X and M, with W moderating the effect of M
fm_b = @formula(Y ~ X + M + W*M)

# c: X -> Y (total effect)
fm_c = @formula(Y ~ X)
FormulaTerm
Response:
  Y(unknown)
Predictors:
  X(unknown)
model_a = lm(fm_a, df_conditional)
a = round(coef(model_a)[2], digits = 3)  # X -> M
model_a
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

M ~ 1 + X + W + X & W

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   3.13046      0.5745     5.45    <1e-06   1.99505    4.26587
X             1.29057      0.474244   2.72    0.0073   0.353302   2.22784
W             0.496425     0.170581   2.91    0.0042   0.159298   0.833551
X & W        -0.0994638    0.147228  -0.68    0.5004  -0.390437   0.191509
──────────────────────────────────────────────────────────────────────────
model_b = lm(fm_b, df_conditional)

c_prime = round(coef(model_b)[2], digits = 3)  # Path c' (direct effect of X on Y)
b_1 = round(coef(model_b)[3], digits = 3)  # M -> Y
b_2 = round(coef(model_b)[4], digits = 3)  # W -> Y
b_3 = round(coef(model_b)[5], digits = 3)  # W & M -> Y
model_b
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + M + W + W & M

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.714816    1.56623     0.46    0.6488  -2.38078     3.81041
X             1.43584     0.0699066  20.54    <1e-44   1.29767     1.57401
M             0.561752    0.269701    2.08    0.0390   0.0286989   1.09481
W            -0.970405    0.514856   -1.88    0.0615  -1.988       0.047187
W & M         0.0564153   0.0887396   0.64    0.5259  -0.118975    0.231806
───────────────────────────────────────────────────────────────────────────
model_c = lm(fm_c, df_conditional)
c = round(coef(model_c)[2], digits = 3)  # Path c (total effect of X on Y)
model_c
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
# Indirect effect
indirect_effect = round(a * b_1, digits=3)

# Total effect
total_effect = c

# Print results
println("Path a (X -> M): $a")
println("Path b1 (M -> Y): $b_1")
println("Path b2 (W -> Y): $b_2")
println("Path b3 (MW -> Y): $b_3")
println("Indirect Effect (a * b_1): $indirect_effect")
Path a (X -> M): 1.291
Path b1 (M -> Y): 0.562
Path b2 (W -> Y): -0.97
Path b3 (MW -> Y): 0.056
Indirect Effect (a * b_1): 0.726
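
Because W moderates the M -> Y path, the indirect effect is itself a function of W: a * (b_1 + b_3 * W). A minimal sketch probing this conditional indirect effect at low, mean, and high values of W, treating a as the X coefficient computed above:

using Statistics

w_mean, w_sd = mean(df_conditional.W), std(df_conditional.W)
for w in (w_mean - w_sd, w_mean, w_mean + w_sd)
    cie = round(a * (b_1 + b_3 * w), digits = 3)
    println("W = $(round(w, digits = 2)): conditional indirect effect = $cie")
end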

If you need bootstrapping and the other procedures introduced by Hayes, you can call R or Python from Julia (for example via the RCall or PyCall packages).
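
Alternatively, a percentile bootstrap for the simple indirect effect a*b fits in a few lines of plain Julia. A minimal sketch, reusing df_mediation from the mediation section; the number of draws and the seed are arbitrary choices:

using Random, Statistics

Random.seed!(1234)  # arbitrary seed, for reproducibility
n = nrow(df_mediation)
draws = Float64[]
for _ in 1:5_000
    idx = rand(1:n, n)        # resample rows with replacement
    d = df_mediation[idx, :]
    a_bs = coef(lm(@formula(M ~ X), d))[2]
    b_bs = coef(lm(@formula(Y ~ X + M), d))[3]
    push!(draws, a_bs * b_bs)
end
quantile(draws, [0.025, 0.975])  # 95% percentile confidence interval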
