# Chapter 5. Statisticcal Tests and Linear Regression

```julia
using Pkg;
Pkg.add("HypothesisTests");
using HypothesisTests, CSV, DataFrames, RDatasets
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`
```

```julia
df = dataset("datasets", "iris")
first(df, 5)
```

5×5 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
| --- | ----------- | ---------- | ----------- | ---------- | ------- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | setosa  |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | setosa  |

## T-tests

### One-sample

```julia
mu = 20
OneSampleTTest(df.PetalLength, mu)
```

```
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         20
    point estimate:          3.758
    95% confidence interval: (3.473, 4.043)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   150
    t-statistic:              -112.68524392285158
    degrees of freedom:       149
    empirical standard error: 0.14413599717741096
```

### Two-samples

```julia
UnequalVarianceTTest(df.PetalLength, df.PetalWidth)
```

```
Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.249, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-37

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       202.69340598104046
    empirical standard error: 0.1569986011897579
```

### Paired-samples

```julia
EqualVarianceTTest(df.PetalLength, df.PetalWidth)
```

```
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.25, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-42

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       298
    empirical standard error: 0.1569986011897579
```

Attention: the usage of a paired sample test is not valid here, just for illustating how to use the function.

## ANOVA

```julia
g1 = df[df.Species.=="setosa", :PetalLength]
g2 = df[df.Species.!="setosa", :PetalLength]

groups = [g1, g2]
OneWayANOVATest(groups...)
```

```
One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     <1e-62

Details:
    number of observations: [50, 100]
    F statistic:            848.606
    degrees of freedom:     (1, 148)
```

## Models in Julia

```julia
Pkg.add("GLM");
using GLM;
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`
```

### 1. Multivariate Linear Regression

```julia
formula = @formula(PetalLength ~ PetalWidth)
lm(formula, df)
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

PetalLength ~ 1 + PetalWidth

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
PetalWidth   2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
```

### 2. Fixed-effects Models (fast)

```julia
Pkg.add("FixedEffectModels");
using FixedEffectModels;
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`
```

```julia
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species))
```

```
                            FixedEffectModel                            
========================================================================
Number of obs:                  150  Converged:                     true
dof (model):                      1  dof (residuals):                  3
R²:                           0.955  R² adjusted:                  0.954
F-statistic:                5.65297  P-value:                      0.098
R² within:                    0.235  Iterations:                       1
========================================================================
            Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth   1.01871    0.428463  2.3776    0.0978  -0.344848    2.38227
========================================================================
```

The package of `FixedEffectModels` also supports CUDA, increasing the speed on a computer with Nvidia's graphic toolkit:

```julia
using Pkg
Pkg.add("CUDA");
using CUDA
@assert CUDA.functional()
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species), method=:CUDA)
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`


[?25h[2K[1mDemean Variables:[22m[39m [================================>]  2/2




                            FixedEffectModel                            
========================================================================
Number of obs:                  150  Converged:                     true
dof (model):                      1  dof (residuals):                  3
R²:                           0.955  R² adjusted:                  0.954
F-statistic:                5.65296  P-value:                      0.098
R² within:                    0.235  Iterations:                       1
========================================================================
            Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth   1.01871    0.428463  2.3776    0.0978  -0.344848    2.38227
========================================================================
```

### 3. Mediation

While it's common to use packages in R and Python to run mediation analysis, I fail to find an appropriate alternative in Julia.

Therefore, we may need to do it by ourselves:

```julia
df_mediation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :M, :Y]) |>
               d -> d[:, [:X, :M, :Y]]

# X -> M
fm_a = @formula(M ~ X)

# M and X -> Y
fm_b_cprime = @formula(Y ~ X + M)

# X -> Y
fm_c = @formula(Y ~ X)

first(df_mediation, 3)
```

3×3 DataFrame

| Row | X       | M       | Y       |
| --- | ------- | ------- | ------- |
|     | Float64 | Float64 | Float64 |
| 1   | 0.2     | 3.5     | 1.4     |
| 2   | 0.2     | 3.0     | 1.4     |
| 3   | 0.2     | 3.2     | 1.3     |

```julia
model_a = lm(fm_a, df_mediation)
a = round(coef(model_a)[2], digits=3)  # Path a (coefficient for X)
model_a
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

M ~ 1 + X

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)   3.30843   0.0620975  53.28    <1e-97   3.18571    3.43114
X            -0.20936   0.04374    -4.79    <1e-05  -0.295795  -0.122924
────────────────────────────────────────────────────────────────────────
```

```julia
model_b_cprime = lm(fm_b_cprime, df_mediation)
b = round(coef(model_b_cprime)[3], digits=3)  # Path b (coefficient for M)
c_prime = round(coef(model_b_cprime)[2], digits=3)  # Path c' (coefficient for X)
model_b_cprime
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + M

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)   2.25816    0.313519    7.20    <1e-10   1.63858    2.87775
X             2.15561    0.0528285  40.80    <1e-81   2.05121    2.26001
M            -0.355035   0.0923859  -3.84    0.0002  -0.537611  -0.172458
─────────────────────────────────────────────────────────────────────────
```

```julia
model_c = lm(fm_c, df_mediation)
c = round(coef(model_c)[2], digits=3)  # Path c (coefficient for X)
model_c
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
```

Calculate direct and indirect effects:

```julia
# Indirect effect
indirect_effect = round(a * b, digits=3)

# Total effect
total_effect = c

# Print results
println("Path a (X -> M): $a")
println("Path b (M -> Y controlling for X): $b")
println("Path c' (Direct effect of X on Y): $c_prime")
println("Path c (Total effect of X on Y): $c")
println("Indirect Effect (a * b): $indirect_effect")
```

```
Path a (X -> M): -0.209
Path b (M -> Y controlling for X): -0.355
Path c' (Direct effect of X on Y): 2.156
Path c (Total effect of X on Y): 2.23
Indirect Effect (a * b): 0.074
```

*(Attention, the theoritical assumption of this example here does not hold, just for illustration)*

### 4. Moderation

```julia
df_moderation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :W, :Y]) |>
                d -> d[:, [:X, :W, :Y]]


# X -> Y
fm_a = @formula(Y ~ X)

# W moderating X -> Y
fm_b = @formula(Y ~ X * W)

first(df_moderation, 3)
```

3×3 DataFrame

| Row | X       | W       | Y       |
| --- | ------- | ------- | ------- |
|     | Float64 | Float64 | Float64 |
| 1   | 0.2     | 3.5     | 1.4     |
| 2   | 0.2     | 3.0     | 1.4     |
| 3   | 0.2     | 3.2     | 1.3     |

```julia
model_a = lm(fm_a, df_moderation)
b_1 = round(coef(model_a)[2], digits=3)  # Original model
model_a
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
```

```julia
model_b = lm(fm_b, df_moderation)
b_3 = round(coef(model_b)[4], digits=3)  # Original model
model_b
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + W + X & W

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   2.28181      0.583452   3.91    0.0001   1.12871    3.43492
X             2.13257      0.481634   4.43    <1e-04   1.18069    3.08445
W            -0.362079     0.173239  -2.09    0.0383  -0.704459  -0.019698
X & W         0.0071967    0.149522   0.05    0.9617  -0.288311   0.302704
──────────────────────────────────────────────────────────────────────────
```

```julia
println("Moderation effect: $b_3 (no-sig.)")
```

```
Moderation effect: 0.007 (no-sig.)
```

*(Attention, the theoritical assumption of this example here does not hold, just for illustration)*

### 5. Conditional Process

*A more complete analysis, therefore, should attempt to model the mechanisms at work linking X to Y while simultaneously allowing those effects to be contingent on context, circumstance, or individual differences. (Andrew F. Hayes, 2018, p. 395)*

For instance, we want to test if the relation between X and Y, is mediated by M, and the M on Y is moderated by W. In the current dataset, we assume:

* **X**: PetalWidth
* **M**: SepalLength
* **W**: SepalWidth
* **Y**: PetalLength

```julia
df_conditional = rename(df, [:PetalWidth, :SepalLength, :SepalWidth, :PetalLength] .=> [:X, :M, :W, :Y]) |>
                 d -> d[:, [:X, :M, :W, :Y]]

first(df_conditional, 3)
```

3×4 DataFrame

| Row | X       | M       | W       | Y       |
| --- | ------- | ------- | ------- | ------- |
|     | Float64 | Float64 | Float64 | Float64 |
| 1   | 0.2     | 5.1     | 3.5     | 1.4     |
| 2   | 0.2     | 4.9     | 3.0     | 1.4     |
| 3   | 0.2     | 4.7     | 3.2     | 1.3     |

Suppose that we expect:

* $M=i\_M+aX+\epsilon\_M$
* $Y=i\_Y+c'X+b\_1M+b\_2W+b\_3WM+\epsilon\_Y$

```julia
# a: X -> M
fm_a = @formula(M ~ X * W)

# b and c: M -> Y
fm_b = @formula(Y ~ X + M + W * M)

# c: X -> Y
fm_c = @formula(Y ~ X)
```

```
FormulaTerm
Response:
  Y(unknown)
Predictors:
  X(unknown)
```

```julia
model_a = lm(fm_a, df_conditional)
a = round(coef(model_a)[2], digits=3)  # X -> M
model_a
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

M ~ 1 + X + W + X & W

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   3.13046      0.5745     5.45    <1e-06   1.99505    4.26587
X             1.29057      0.474244   2.72    0.0073   0.353302   2.22784
W             0.496425     0.170581   2.91    0.0042   0.159298   0.833551
X & W        -0.0994638    0.147228  -0.68    0.5004  -0.390437   0.191509
──────────────────────────────────────────────────────────────────────────
```

```julia
model_b = lm(fm_b, df_conditional)

c_prime = round(coef(model_b)[2], digits=3)  # indirect_effect
b_1 = round(coef(model_b)[3], digits=3)  # M -> Y
b_2 = round(coef(model_b)[4], digits=3)  # W -> Y
b_3 = round(coef(model_b)[5], digits=3)  # M*W -> Y
model_b
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + M + W + W & M

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.714816    1.56623     0.46    0.6488  -2.38078     3.81041
X             1.43584     0.0699066  20.54    <1e-44   1.29767     1.57401
M             0.561752    0.269701    2.08    0.0390   0.0286989   1.09481
W            -0.970405    0.514856   -1.88    0.0615  -1.988       0.047187
W & M         0.0564153   0.0887396   0.64    0.5259  -0.118975    0.231806
───────────────────────────────────────────────────────────────────────────
```

```julia
model_c = lm(fm_c, df_conditional)
c = round(coef(model_c)[2], digits=3)  # Original model
model_c
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
```

```julia
# Indirect effect
indirect_effect = round(a * b_1, digits=3)

# Total effect
total_effect = c

# Print results
println("Path a (X -> M): $a")
println("Path b1 (M -> Y): $b_1")
println("Path b2 (W -> Y): $b_2")
println("Path b3 (MW -> Y): $b_3")
println("Indirect Effect (a * b_1): $indirect_effect")
```

```
Path a (X -> M): 1.291
Path b1 (M -> Y): 0.562
Path b2 (W -> Y): -0.97
Path b3 (MW -> Y): 0.056
Indirect Effect (a * b_1): 0.726
```

If you need bootstraping and everything else introduces by Hayes, we can using R or Python in Julia to do so.
