# Chapter 5. Statisticcal Tests and Linear Regression

```julia
using Pkg;
Pkg.add("HypothesisTests");
using HypothesisTests, CSV, DataFrames, RDatasets
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`
```

```julia
df = dataset("datasets", "iris")
first(df, 5)
```

5×5 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
| --- | ----------- | ---------- | ----------- | ---------- | ------- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | setosa  |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | setosa  |

## T-tests

### One-sample

```julia
mu = 20
OneSampleTTest(df.PetalLength, mu)
```

```
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         20
    point estimate:          3.758
    95% confidence interval: (3.473, 4.043)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   150
    t-statistic:              -112.68524392285158
    degrees of freedom:       149
    empirical standard error: 0.14413599717741096
```

### Two-samples

```julia
UnequalVarianceTTest(df.PetalLength, df.PetalWidth)
```

```
Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.249, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-37

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       202.69340598104046
    empirical standard error: 0.1569986011897579
```

### Paired-samples

```julia
EqualVarianceTTest(df.PetalLength, df.PetalWidth)
```

```
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.55867
    95% confidence interval: (2.25, 2.868)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-42

Details:
    number of observations:   [150,150]
    t-statistic:              16.29738511857255
    degrees of freedom:       298
    empirical standard error: 0.1569986011897579
```

Attention: the usage of a paired sample test is not valid here, just for illustating how to use the function.

## ANOVA

```julia
g1 = df[df.Species.=="setosa", :PetalLength]
g2 = df[df.Species.!="setosa", :PetalLength]

groups = [g1, g2]
OneWayANOVATest(groups...)
```

```
One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     <1e-62

Details:
    number of observations: [50, 100]
    F statistic:            848.606
    degrees of freedom:     (1, 148)
```

## Models in Julia

```julia
Pkg.add("GLM");
using GLM;
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`
```

### 1. Multivariate Linear Regression

```julia
formula = @formula(PetalLength ~ PetalWidth)
lm(formula, df)
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

PetalLength ~ 1 + PetalWidth

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
PetalWidth   2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
```

### 2. Fixed-effects Models (fast)

```julia
Pkg.add("FixedEffectModels");
using FixedEffectModels;
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`
```

```julia
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species))
```

```
                            FixedEffectModel                            
========================================================================
Number of obs:                  150  Converged:                     true
dof (model):                      1  dof (residuals):                  3
R²:                           0.955  R² adjusted:                  0.954
F-statistic:                5.65297  P-value:                      0.098
R² within:                    0.235  Iterations:                       1
========================================================================
            Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth   1.01871    0.428463  2.3776    0.0978  -0.344848    2.38227
========================================================================
```

The package of `FixedEffectModels` also supports CUDA, increasing the speed on a computer with Nvidia's graphic toolkit:

```julia
using Pkg
Pkg.add("CUDA");
using CUDA
@assert CUDA.functional()
formula = @formula(PetalLength ~ PetalWidth + fe(Species))
reg(df, formula, Vcov.cluster(:Species), method=:CUDA)
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[92m[1mPrecompiling[22m[39m project...
         [91m  ✗ [39m[90mGtkObservables[39m
         [91m  ✗ [39mProfileView
  0 dependencies successfully precompiled in 4 seconds. 518 already precompiled.
  [91m2[39m dependencies errored.
  For a report of the errors see `julia> err`. To retry use `pkg> precompile`


[?25h[2K[1mDemean Variables:[22m[39m [================================>]  2/2




                            FixedEffectModel                            
========================================================================
Number of obs:                  150  Converged:                     true
dof (model):                      1  dof (residuals):                  3
R²:                           0.955  R² adjusted:                  0.954
F-statistic:                5.65296  P-value:                      0.098
R² within:                    0.235  Iterations:                       1
========================================================================
            Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
PetalWidth   1.01871    0.428463  2.3776    0.0978  -0.344848    2.38227
========================================================================
```

### 3. Mediation

While it's common to use packages in R and Python to run mediation analysis, I fail to find an appropriate alternative in Julia.

Therefore, we may need to do it by ourselves:

```julia
df_mediation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :M, :Y]) |>
               d -> d[:, [:X, :M, :Y]]

# X -> M
fm_a = @formula(M ~ X)

# M and X -> Y
fm_b_cprime = @formula(Y ~ X + M)

# X -> Y
fm_c = @formula(Y ~ X)

first(df_mediation, 3)
```

3×3 DataFrame

| Row | X       | M       | Y       |
| --- | ------- | ------- | ------- |
|     | Float64 | Float64 | Float64 |
| 1   | 0.2     | 3.5     | 1.4     |
| 2   | 0.2     | 3.0     | 1.4     |
| 3   | 0.2     | 3.2     | 1.3     |

```julia
model_a = lm(fm_a, df_mediation)
a = round(coef(model_a)[2], digits=3)  # Path a (coefficient for X)
model_a
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

M ~ 1 + X

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)   3.30843   0.0620975  53.28    <1e-97   3.18571    3.43114
X            -0.20936   0.04374    -4.79    <1e-05  -0.295795  -0.122924
────────────────────────────────────────────────────────────────────────
```

```julia
model_b_cprime = lm(fm_b_cprime, df_mediation)
b = round(coef(model_b_cprime)[3], digits=3)  # Path b (coefficient for M)
c_prime = round(coef(model_b_cprime)[2], digits=3)  # Path c' (coefficient for X)
model_b_cprime
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + M

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)   2.25816    0.313519    7.20    <1e-10   1.63858    2.87775
X             2.15561    0.0528285  40.80    <1e-81   2.05121    2.26001
M            -0.355035   0.0923859  -3.84    0.0002  -0.537611  -0.172458
─────────────────────────────────────────────────────────────────────────
```

```julia
model_c = lm(fm_c, df_mediation)
c = round(coef(model_c)[2], digits=3)  # Path c (coefficient for X)
model_c
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
```

Calculate direct and indirect effects:

```julia
# Indirect effect
indirect_effect = round(a * b, digits=3)

# Total effect
total_effect = c

# Print results
println("Path a (X -> M): $a")
println("Path b (M -> Y controlling for X): $b")
println("Path c' (Direct effect of X on Y): $c_prime")
println("Path c (Total effect of X on Y): $c")
println("Indirect Effect (a * b): $indirect_effect")
```

```
Path a (X -> M): -0.209
Path b (M -> Y controlling for X): -0.355
Path c' (Direct effect of X on Y): 2.156
Path c (Total effect of X on Y): 2.23
Indirect Effect (a * b): 0.074
```

*(Attention, the theoritical assumption of this example here does not hold, just for illustration)*

### 4. Moderation

```julia
df_moderation = rename(df, [:PetalWidth, :SepalWidth, :PetalLength] .=> [:X, :W, :Y]) |>
                d -> d[:, [:X, :W, :Y]]


# X -> Y
fm_a = @formula(Y ~ X)

# W moderating X -> Y
fm_b = @formula(Y ~ X * W)

first(df_moderation, 3)
```

3×3 DataFrame

| Row | X       | W       | Y       |
| --- | ------- | ------- | ------- |
|     | Float64 | Float64 | Float64 |
| 1   | 0.2     | 3.5     | 1.4     |
| 2   | 0.2     | 3.0     | 1.4     |
| 3   | 0.2     | 3.2     | 1.3     |

```julia
model_a = lm(fm_a, df_moderation)
b_1 = round(coef(model_a)[2], digits=3)  # Original model
model_a
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
```

```julia
model_b = lm(fm_b, df_moderation)
b_3 = round(coef(model_b)[4], digits=3)  # Original model
model_b
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + W + X & W

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   2.28181      0.583452   3.91    0.0001   1.12871    3.43492
X             2.13257      0.481634   4.43    <1e-04   1.18069    3.08445
W            -0.362079     0.173239  -2.09    0.0383  -0.704459  -0.019698
X & W         0.0071967    0.149522   0.05    0.9617  -0.288311   0.302704
──────────────────────────────────────────────────────────────────────────
```

```julia
println("Moderation effect: $b_3 (no-sig.)")
```

```
Moderation effect: 0.007 (no-sig.)
```

*(Attention, the theoritical assumption of this example here does not hold, just for illustration)*

### 5. Conditional Process

*A more complete analysis, therefore, should attempt to model the mechanisms at work linking X to Y while simultaneously allowing those effects to be contingent on context, circumstance, or individual differences. (Andrew F. Hayes, 2018, p. 395)*

For instance, we want to test if the relation between X and Y, is mediated by M, and the M on Y is moderated by W. In the current dataset, we assume:

* **X**: PetalWidth
* **M**: SepalLength
* **W**: SepalWidth
* **Y**: PetalLength

```julia
df_conditional = rename(df, [:PetalWidth, :SepalLength, :SepalWidth, :PetalLength] .=> [:X, :M, :W, :Y]) |>
                 d -> d[:, [:X, :M, :W, :Y]]

first(df_conditional, 3)
```

3×4 DataFrame

| Row | X       | M       | W       | Y       |
| --- | ------- | ------- | ------- | ------- |
|     | Float64 | Float64 | Float64 | Float64 |
| 1   | 0.2     | 5.1     | 3.5     | 1.4     |
| 2   | 0.2     | 4.9     | 3.0     | 1.4     |
| 3   | 0.2     | 4.7     | 3.2     | 1.3     |

Suppose that we expect:

* $M=i\_M+aX+\epsilon\_M$
* $Y=i\_Y+c'X+b\_1M+b\_2W+b\_3WM+\epsilon\_Y$

```julia
# a: X -> M
fm_a = @formula(M ~ X * W)

# b and c: M -> Y
fm_b = @formula(Y ~ X + M + W * M)

# c: X -> Y
fm_c = @formula(Y ~ X)
```

```
FormulaTerm
Response:
  Y(unknown)
Predictors:
  X(unknown)
```

```julia
model_a = lm(fm_a, df_conditional)
a = round(coef(model_a)[2], digits=3)  # X -> M
model_a
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

M ~ 1 + X + W + X & W

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   3.13046      0.5745     5.45    <1e-06   1.99505    4.26587
X             1.29057      0.474244   2.72    0.0073   0.353302   2.22784
W             0.496425     0.170581   2.91    0.0042   0.159298   0.833551
X & W        -0.0994638    0.147228  -0.68    0.5004  -0.390437   0.191509
──────────────────────────────────────────────────────────────────────────
```

```julia
model_b = lm(fm_b, df_conditional)

c_prime = round(coef(model_b)[2], digits=3)  # indirect_effect
b_1 = round(coef(model_b)[3], digits=3)  # M -> Y
b_2 = round(coef(model_b)[4], digits=3)  # W -> Y
b_3 = round(coef(model_b)[5], digits=3)  # M*W -> Y
model_b
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + M + W + W & M

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.714816    1.56623     0.46    0.6488  -2.38078     3.81041
X             1.43584     0.0699066  20.54    <1e-44   1.29767     1.57401
M             0.561752    0.269701    2.08    0.0390   0.0286989   1.09481
W            -0.970405    0.514856   -1.88    0.0615  -1.988       0.047187
W & M         0.0564153   0.0887396   0.64    0.5259  -0.118975    0.231806
───────────────────────────────────────────────────────────────────────────
```

```julia
model_c = lm(fm_c, df_conditional)
c = round(coef(model_c)[2], digits=3)  # Original model
model_c
```

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.08356   0.072967   14.85    <1e-30   0.939366    1.22775
X            2.22994   0.0513962  43.39    <1e-85   2.12838     2.33151
───────────────────────────────────────────────────────────────────────
```

```julia
# Indirect effect
indirect_effect = round(a * b_1, digits=3)

# Total effect
total_effect = c

# Print results
println("Path a (X -> M): $a")
println("Path b1 (M -> Y): $b_1")
println("Path b2 (W -> Y): $b_2")
println("Path b3 (MW -> Y): $b_3")
println("Indirect Effect (a * b_1): $indirect_effect")
```

```
Path a (X -> M): 1.291
Path b1 (M -> Y): 0.562
Path b2 (W -> Y): -0.97
Path b3 (MW -> Y): 0.056
Indirect Effect (a * b_1): 0.726
```

If you need bootstraping and everything else introduces by Hayes, we can using R or Python in Julia to do so.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://data-julia.rongxin.me/5.1.models.jl.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
