> For the complete documentation index, see [llms.txt](https://data-julia.rongxin.me/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://data-julia.rongxin.me/3.transform.calculate.jl.md).

# Chapter 3. Dataframe Transformation

```julia
using CSV, DataFrames, RDatasets, Statistics
```

First, we load the iris data again to illustrate how to manipulate a dataframe in Julia:

```julia
df = dataset("datasets", "iris")
first(df, 5)
```

5×5 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
| --- | ----------- | ---------- | ----------- | ---------- | ------- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | setosa  |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | setosa  |

## 1. Select a Subset (review)

In the last chapter, we knew how to select a subset. Now, let's do it again.\
In this example, we do not need `SepalWidth` column. For this purpose, we can either use:

```julia
df[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]
```

150×4 DataFrame125 rows omitted

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth |
| --- | ----------- | ---------- | ----------- | ---------- |
|     | Float64     | Float64    | Float64     | Float64    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        |
| 6   | 5.4         | 3.9        | 1.7         | 0.4        |
| 7   | 4.6         | 3.4        | 1.4         | 0.3        |
| 8   | 5.0         | 3.4        | 1.5         | 0.2        |
| 9   | 4.4         | 2.9        | 1.4         | 0.2        |
| 10  | 4.9         | 3.1        | 1.5         | 0.1        |
| 11  | 5.4         | 3.7        | 1.5         | 0.2        |
| 12  | 4.8         | 3.4        | 1.6         | 0.2        |
| 13  | 4.8         | 3.0        | 1.4         | 0.1        |
| ⋮   | ⋮           | ⋮          | ⋮           | ⋮          |
| 139 | 6.0         | 3.0        | 4.8         | 1.8        |
| 140 | 6.9         | 3.1        | 5.4         | 2.1        |
| 141 | 6.7         | 3.1        | 5.6         | 2.4        |
| 142 | 6.9         | 3.1        | 5.1         | 2.3        |
| 143 | 5.8         | 2.7        | 5.1         | 1.9        |
| 144 | 6.8         | 3.2        | 5.9         | 2.3        |
| 145 | 6.7         | 3.3        | 5.7         | 2.5        |
| 146 | 6.7         | 3.0        | 5.2         | 2.3        |
| 147 | 6.3         | 2.5        | 5.0         | 1.9        |
| 148 | 6.5         | 3.0        | 5.2         | 2.0        |
| 149 | 6.2         | 3.4        | 5.4         | 2.3        |
| 150 | 5.9         | 3.0        | 5.1         | 1.8        |

Or a logical calculation based selection:

```julia
df[:, Not(:SepalWidth)]
```

150×4 DataFrame125 rows omitted

| Row | SepalLength | PetalLength | PetalWidth | Species   |
| --- | ----------- | ----------- | ---------- | --------- |
|     | Float64     | Float64     | Float64    | Cat…      |
| 1   | 5.1         | 1.4         | 0.2        | setosa    |
| 2   | 4.9         | 1.4         | 0.2        | setosa    |
| 3   | 4.7         | 1.3         | 0.2        | setosa    |
| 4   | 4.6         | 1.5         | 0.2        | setosa    |
| 5   | 5.0         | 1.4         | 0.2        | setosa    |
| 6   | 5.4         | 1.7         | 0.4        | setosa    |
| 7   | 4.6         | 1.4         | 0.3        | setosa    |
| 8   | 5.0         | 1.5         | 0.2        | setosa    |
| 9   | 4.4         | 1.4         | 0.2        | setosa    |
| 10  | 4.9         | 1.5         | 0.1        | setosa    |
| 11  | 5.4         | 1.5         | 0.2        | setosa    |
| 12  | 4.8         | 1.6         | 0.2        | setosa    |
| 13  | 4.8         | 1.4         | 0.1        | setosa    |
| ⋮   | ⋮           | ⋮           | ⋮          | ⋮         |
| 139 | 6.0         | 4.8         | 1.8        | virginica |
| 140 | 6.9         | 5.4         | 2.1        | virginica |
| 141 | 6.7         | 5.6         | 2.4        | virginica |
| 142 | 6.9         | 5.1         | 2.3        | virginica |
| 143 | 5.8         | 5.1         | 1.9        | virginica |
| 144 | 6.8         | 5.9         | 2.3        | virginica |
| 145 | 6.7         | 5.7         | 2.5        | virginica |
| 146 | 6.7         | 5.2         | 2.3        | virginica |
| 147 | 6.3         | 5.0         | 1.9        | virginica |
| 148 | 6.5         | 5.2         | 2.0        | virginica |
| 149 | 6.2         | 5.4         | 2.3        | virginica |
| 150 | 5.9         | 5.1         | 1.8        | virginica |

## 2. Split and Combine

### By Row

In R, you can use `rbind()` to combine two dataframes by appending new rows to the origional dataframe. Similarly in Julia, you can:

![](/files/IJOZAzRsLJj8CKUeKMs8)

```julia
df_a = DataFrame(col_a=[1, 2, 3], col_b=[4, 5, 6])
df_b = DataFrame(col_a=[2, 3, 3], col_b=[2, 3, 4])
df_bind = [df_a; df_b]
df_bind
```

6×2 DataFrame

| Row | col\_a | col\_b |
| --- | ------ | ------ |
|     | Int64  | Int64  |
| 1   | 1      | 4      |
| 2   | 2      | 5      |
| 3   | 3      | 6      |
| 4   | 2      | 2      |
| 5   | 3      | 3      |
| 6   | 3      | 4      |

Or, because the concatenation is vertical:

```julia
df_bind = vcat(df_a, df_b) # vertical, concat...
df_bind
```

6×2 DataFrame

| Row | col\_a | col\_b |
| --- | ------ | ------ |
|     | Int64  | Int64  |
| 1   | 1      | 4      |
| 2   | 2      | 5      |
| 3   | 3      | 6      |
| 4   | 2      | 2      |
| 5   | 3      | 3      |
| 6   | 3      | 4      |

However, if you want to make it faster on large datasets, `append!` is more efficient:

```julia
df_bind = append!(df_a, df_b)
df_bind
```

6×2 DataFrame

| Row | col\_a | col\_b |
| --- | ------ | ------ |
|     | Int64  | Int64  |
| 1   | 1      | 4      |
| 2   | 2      | 5      |
| 3   | 3      | 6      |
| 4   | 2      | 2      |
| 5   | 3      | 3      |
| 6   | 3      | 4      |

### By Column

![img](/files/L2tcL0ZijHCXRPxzHyfc)

**1. Bind horizontally** Similar to the vertical concatenation, `vcat()`; we can use `hcat()` for horizontal concatenation:

```julia
df_a = DataFrame(col_a=[1, 2, 3], col_b=[4, 5, 6])
df_b = DataFrame(col_c=[2, 3, 3], col_d=[2, 3, 4])
df_bind = hcat(df_a, df_b) # horizontal, concat...
df_bind
```

3×4 DataFrame

| Row | col\_a | col\_b | col\_c | col\_d |
| --- | ------ | ------ | ------ | ------ |
|     | Int64  | Int64  | Int64  | Int64  |
| 1   | 1      | 4      | 2      | 2      |
| 2   | 2      | 5      | 3      | 3      |
| 3   | 3      | 6      | 3      | 4      |

**2. Merge/join horizontally**

While concatenation is convenient, sometime, we have missing values and want to *safely* concat dataframes horizontally by `id` or values of a certain column. In any data pipelinee, there are `join` operations:

(1) Only id(s) existing in both dataframes

```julia
df_a = DataFrame(id=[1, 2, 3], col_a=[4, 5, 6])
df_b = DataFrame(id=[2, 3, 3], col_b=[2, 3, 4])
df_bind = innerjoin(df_a, df_b, on=:id)
df_bind
```

3×3 DataFrame

| Row | id    | col\_a | col\_b |
| --- | ----- | ------ | ------ |
|     | Int64 | Int64  | Int64  |
| 1   | 2     | 5      | 2      |
| 2   | 3     | 6      | 3      |
| 3   | 3     | 6      | 4      |

(2) All id(s) existing in either dataframe

```julia
df_bind = outerjoin(df_a, df_b, on=:id)
df_bind
```

4×3 DataFrame

| Row | id    | col\_a | col\_b  |
| --- | ----- | ------ | ------- |
|     | Int64 | Int64? | Int64?  |
| 1   | 2     | 5      | 2       |
| 2   | 3     | 6      | 3       |
| 3   | 3     | 6      | 4       |
| 4   | 1     | 4      | missing |

(3) Only id(s) existing in the dataframe on the left

```julia
df_bind = leftjoin(df_a, df_b, on=:id)
df_bind
```

4×3 DataFrame

| Row | id    | col\_a | col\_b  |
| --- | ----- | ------ | ------- |
|     | Int64 | Int64  | Int64?  |
| 1   | 2     | 5      | 2       |
| 2   | 3     | 6      | 3       |
| 3   | 3     | 6      | 4       |
| 4   | 1     | 4      | missing |

(4) Only id(s) existing in the dataframe on the right

```julia
df_bind = rightjoin(df_a, df_b, on=:id)
df_bind
```

3×3 DataFrame

| Row | id    | col\_a | col\_b |
| --- | ----- | ------ | ------ |
|     | Int64 | Int64? | Int64  |
| 1   | 2     | 5      | 2      |
| 2   | 3     | 6      | 3      |
| 3   | 3     | 6      | 4      |

## 3. Group by

To see the dataset by group, or to prepare for a sub-group calculation, we use

```julia
gdf = groupby(df, :Species)
first(gdf)
```

50×5 SubDataFrame25 rows omitted

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
| --- | ----------- | ---------- | ----------- | ---------- | ------- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | setosa  |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | setosa  |
| 6   | 5.4         | 3.9        | 1.7         | 0.4        | setosa  |
| 7   | 4.6         | 3.4        | 1.4         | 0.3        | setosa  |
| 8   | 5.0         | 3.4        | 1.5         | 0.2        | setosa  |
| 9   | 4.4         | 2.9        | 1.4         | 0.2        | setosa  |
| 10  | 4.9         | 3.1        | 1.5         | 0.1        | setosa  |
| 11  | 5.4         | 3.7        | 1.5         | 0.2        | setosa  |
| 12  | 4.8         | 3.4        | 1.6         | 0.2        | setosa  |
| 13  | 4.8         | 3.0        | 1.4         | 0.1        | setosa  |
| ⋮   | ⋮           | ⋮          | ⋮           | ⋮          | ⋮       |
| 39  | 4.4         | 3.0        | 1.3         | 0.2        | setosa  |
| 40  | 5.1         | 3.4        | 1.5         | 0.2        | setosa  |
| 41  | 5.0         | 3.5        | 1.3         | 0.3        | setosa  |
| 42  | 4.5         | 2.3        | 1.3         | 0.3        | setosa  |
| 43  | 4.4         | 3.2        | 1.3         | 0.2        | setosa  |
| 44  | 5.0         | 3.5        | 1.6         | 0.6        | setosa  |
| 45  | 5.1         | 3.8        | 1.9         | 0.4        | setosa  |
| 46  | 4.8         | 3.0        | 1.4         | 0.3        | setosa  |
| 47  | 5.1         | 3.8        | 1.6         | 0.2        | setosa  |
| 48  | 4.6         | 3.2        | 1.4         | 0.2        | setosa  |
| 49  | 5.3         | 3.7        | 1.5         | 0.2        | setosa  |
| 50  | 5.0         | 3.3        | 1.4         | 0.2        | setosa  |

Based on this grouped dataframe (`gdf`), we can calculate the mean values of `SepalLength`:

```julia
combine(gdf, :SepalLength => mean)
```

3×2 DataFrame

| Row | Species    | SepalLength\_mean |
| --- | ---------- | ----------------- |
|     | Cat…       | Float64           |
| 1   | setosa     | 5.006             |
| 2   | versicolor | 5.936             |
| 3   | virginica  | 6.588             |

What if we wanna know more than one columns?

```julia
combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean)
```

3×3 DataFrame

| Row | Species    | SepalLength\_mean | SepalWidth\_mean |
| --- | ---------- | ----------------- | ---------------- |
|     | Cat…       | Float64           | Float64          |
| 1   | setosa     | 5.006             | 3.428            |
| 2   | versicolor | 5.936             | 2.77             |
| 3   | virginica  | 6.588             | 2.974            |

## 4. Sorting

In many cases, after calculating the means of groups, we want to see which one is the highest. One way to finish the task is to sort the result:

```julia
sort(combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean),
    :SepalWidth_mean) # sort by SepalWidth_mean

```

3×3 DataFrame

| Row | Species    | SepalLength\_mean | SepalWidth\_mean |
| --- | ---------- | ----------------- | ---------------- |
|     | Cat…       | Float64           | Float64          |
| 1   | versicolor | 5.936             | 2.77             |
| 2   | virginica  | 6.588             | 2.974            |
| 3   | setosa     | 5.006             | 3.428            |

Reversely:

```julia
sort(combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean),
    :SepalWidth_mean,
    rev=true) # sort by SepalWidth_mean

```

3×3 DataFrame

| Row | Species    | SepalLength\_mean | SepalWidth\_mean |
| --- | ---------- | ----------------- | ---------------- |
|     | Cat…       | Float64           | Float64          |
| 1   | setosa     | 5.006             | 3.428            |
| 2   | virginica  | 6.588             | 2.974            |
| 3   | versicolor | 5.936             | 2.77             |

## 5. Transforming between long and wide tables

**(1) From a wide table to a long one:**

Let's look at the iris dataset again

```julia
first(df, 3)
```

3×5 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
| --- | ----------- | ---------- | ----------- | ---------- | ------- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |

Assign an id column:

```julia
df.id = 1:size(df, 1)
first(df, 3)
```

3×6 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species | id    |
| --- | ----------- | ---------- | ----------- | ---------- | ------- | ----- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    | Int64 |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  | 1     |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  | 2     |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  | 3     |

```julia
df_long = stack(df, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth], variable_name=:variable, value_name=:value)
df_long

```

600×4 DataFrame575 rows omitted

| Row | Species   | id    | variable    | value   |
| --- | --------- | ----- | ----------- | ------- |
|     | Cat…      | Int64 | String      | Float64 |
| 1   | setosa    | 1     | SepalLength | 5.1     |
| 2   | setosa    | 2     | SepalLength | 4.9     |
| 3   | setosa    | 3     | SepalLength | 4.7     |
| 4   | setosa    | 4     | SepalLength | 4.6     |
| 5   | setosa    | 5     | SepalLength | 5.0     |
| 6   | setosa    | 6     | SepalLength | 5.4     |
| 7   | setosa    | 7     | SepalLength | 4.6     |
| 8   | setosa    | 8     | SepalLength | 5.0     |
| 9   | setosa    | 9     | SepalLength | 4.4     |
| 10  | setosa    | 10    | SepalLength | 4.9     |
| 11  | setosa    | 11    | SepalLength | 5.4     |
| 12  | setosa    | 12    | SepalLength | 4.8     |
| 13  | setosa    | 13    | SepalLength | 4.8     |
| ⋮   | ⋮         | ⋮     | ⋮           | ⋮       |
| 589 | virginica | 139   | PetalWidth  | 1.8     |
| 590 | virginica | 140   | PetalWidth  | 2.1     |
| 591 | virginica | 141   | PetalWidth  | 2.4     |
| 592 | virginica | 142   | PetalWidth  | 2.3     |
| 593 | virginica | 143   | PetalWidth  | 1.9     |
| 594 | virginica | 144   | PetalWidth  | 2.3     |
| 595 | virginica | 145   | PetalWidth  | 2.5     |
| 596 | virginica | 146   | PetalWidth  | 2.3     |
| 597 | virginica | 147   | PetalWidth  | 1.9     |
| 598 | virginica | 148   | PetalWidth  | 2.0     |
| 599 | virginica | 149   | PetalWidth  | 2.3     |
| 600 | virginica | 150   | PetalWidth  | 1.8     |

**(2) From a long table to a wide one:**

```julia
df_wide = unstack(df_long, [:id, :Species], :variable, :value)
first(df_wide, 3)
```

3×6 DataFrame

| Row | id    | Species | SepalLength | SepalWidth | PetalLength | PetalWidth |
| --- | ----- | ------- | ----------- | ---------- | ----------- | ---------- |
|     | Int64 | Cat…    | Float64?    | Float64?   | Float64?    | Float64?   |
| 1   | 1     | setosa  | 5.1         | 3.5        | 1.4         | 0.2        |
| 2   | 2     | setosa  | 4.9         | 3.0        | 1.4         | 0.2        |
| 3   | 3     | setosa  | 4.7         | 3.2        | 1.3         | 0.2        |

Basically, the same to the original `df`:

```julia
first(df, 3)
```

3×6 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species | id    |
| --- | ----------- | ---------- | ----------- | ---------- | ------- | ----- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    | Int64 |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  | 1     |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  | 2     |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  | 3     |

## 6. Missing values

Another important issue in data analysis is how to find and fill NA values.

```julia
df_bind = outerjoin(df_a, df_b, on=:id)
df_bind
```

4×3 DataFrame

| Row | id    | col\_a | col\_b  |
| --- | ----- | ------ | ------- |
|     | Int64 | Int64? | Int64?  |
| 1   | 2     | 5      | 2       |
| 2   | 3     | 6      | 3       |
| 3   | 3     | 6      | 4       |
| 4   | 1     | 4      | missing |

### Fina NA values

```julia
filter(:col_b => ismissing, df_bind)
```

1×3 DataFrame

| Row | id    | col\_a | col\_b  |
| --- | ----- | ------ | ------- |
|     | Int64 | Int64? | Int64?  |
| 1   | 1     | 4      | missing |

### Fill NA values

```julia
coalesce.(df_bind, 0)
```

4×3 DataFrame

| Row | id    | col\_a | col\_b |
| --- | ----- | ------ | ------ |
|     | Int64 | Int64  | Int64  |
| 1   | 2     | 5      | 2      |
| 2   | 3     | 6      | 3      |
| 3   | 3     | 6      | 4      |
| 4   | 1     | 4      | 0      |

Please pay attention to the dot, `.` after `coalesce`, which means that we are casting the function, `coalesce`, to each element (**cell**) within the dataframe.

### Drop NA values

```julia
dropmissing(df_bind, :col_b)
```

3×3 DataFrame

| Row | id    | col\_a | col\_b |
| --- | ----- | ------ | ------ |
|     | Int64 | Int64? | Int64  |
| 1   | 2     | 5      | 2      |
| 2   | 3     | 6      | 3      |
| 3   | 3     | 6      | 4      |

For how to simplift the process in a long pipeline, it's time to [enter the next chapter](https://data-julia.rongxin.me/data-analysis-in-julia/4.pipeline.tools.jl).


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://data-julia.rongxin.me/3.transform.calculate.jl.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
