Chapter 3. Dataframe Transformation

using CSV, DataFrames, RDatasets, Statistics

First, we load the iris data again to illustrate how to manipulate a dataframe in Julia:

df = dataset("datasets", "iris")
first(df, 5)

5×5 DataFrame

Row

SepalLength

SepalWidth

PetalLength

PetalWidth

Species

Float64

Cat…

5.1

3.5

1.4

0.2

setosa

4.9

3.0

1.4

0.2

setosa

4.7

3.2

1.3

0.2

setosa

4.6

3.1

1.5

0.2

setosa

5.0

3.6

1.4

0.2

setosa

1. Select a Subset (review)

In the last chapter, we knew how to select a subset. Now, let's do it again. In this example, we do not need SepalWidth column. For this purpose, we can either use:

df[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]

150×4 DataFrame125 rows omitted

Row

SepalLength

SepalWidth

PetalLength

PetalWidth

Float64

5.1

3.5

1.4

0.2

4.9

3.0

1.4

0.2

4.7

3.2

1.3

0.2

4.6

3.1

1.5

0.2

5.0

3.6

1.4

0.2

5.4

3.9

1.7

0.4

4.6

3.4

1.4

0.3

5.0

3.4

1.5

0.2

4.4

2.9

1.4

0.2

4.9

3.1

1.5

0.1

5.4

3.7

1.5

0.2

4.8

3.4

1.6

0.2

4.8

3.0

1.4

0.1

⋮

139

6.0

3.0

4.8

1.8

140

6.9

3.1

5.4

2.1

141

6.7

3.1

5.6

2.4

142

6.9

3.1

5.1

2.3

143

5.8

2.7

5.1

1.9

144

6.8

3.2

5.9

2.3

145

6.7

3.3

5.7

2.5

146

6.7

3.0

5.2

2.3

147

6.3

2.5

5.0

1.9

148

6.5

3.0

5.2

2.0

149

6.2

3.4

5.4

2.3

150

5.9

3.0

5.1

1.8

Or a logical calculation based selection:

df[:, Not(:SepalWidth)]

150×4 DataFrame125 rows omitted

Row

SepalLength

PetalLength

PetalWidth

Species

Float64

Cat…

5.1

1.4

0.2

setosa

4.9

1.4

0.2

setosa

4.7

1.3

0.2

setosa

4.6

1.5

0.2

setosa

5.0

1.4

0.2

setosa

5.4

1.7

0.4

setosa

4.6

1.4

0.3

setosa

5.0

1.5

0.2

setosa

4.4

1.4

0.2

setosa

4.9

1.5

0.1

setosa

5.4

1.5

0.2

setosa

4.8

1.6

0.2

setosa

4.8

1.4

0.1

setosa

⋮

139

6.0

4.8

1.8

virginica

140

6.9

5.4

2.1

virginica

141

6.7

5.6

2.4

virginica

142

6.9

5.1

2.3

virginica

143

5.8

5.1

1.9

virginica

144

6.8

5.9

2.3

virginica

145

6.7

5.7

2.5

virginica

146

6.7

5.2

2.3

virginica

147

6.3

5.0

1.9

virginica

148

6.5

5.2

2.0

virginica

149

6.2

5.4

2.3

virginica

150

5.9

5.1

1.8

virginica

2. Splition and Combination

By Row

In R, you can use rbind() to combine two dataframes by appending new rows to the origional dataframe. Similarly in Julia, you can:

df_a = DataFrame(col_a=[1, 2, 3], col_b=[4, 5, 6])
df_b = DataFrame(col_a=[2, 3, 3], col_b=[2, 3, 4])
df_bind = [df_a; df_b]
df_bind

6×2 DataFrame

Row

col_a

col_b

Int64

Or, because the concatenation is vertical:

df_bind = vcat(df_a, df_b) # vertical, concat...
df_bind

6×2 DataFrame

Row

col_a

col_b

Int64

However, if you want to make it faster on large datasets, append! is more efficient:

df_bind = append!(df_a, df_b)
df_bind

6×2 DataFrame

Row

col_a

col_b

Int64

By Column

1. Bind horizontally Similar to the vertical concatenation, vcat(); we can use hcat() for horizontal concatenation:

df_a = DataFrame(col_a=[1, 2, 3], col_b=[4, 5, 6])
df_b = DataFrame(col_c=[2, 3, 3], col_d=[2, 3, 4])
df_bind = hcat(df_a, df_b) # horizontal, concat...
df_bind

3×4 DataFrame

Row

col_a

col_b

col_c

col_d

Int64

2. Merge/join horizontally

While concatenation is convenient, sometime, we have missing values and want to safely concat dataframes horizontally by id or values of a certain column. In any data pipelinee, there are join operations:

(1) Only id(s) existing in both dataframes

df_a = DataFrame(id=[1, 2, 3], col_a=[4, 5, 6])
df_b = DataFrame(id=[2, 3, 3], col_b=[2, 3, 4])
df_bind = innerjoin(df_a, df_b, on=:id)
df_bind

3×3 DataFrame

Row

col_a

col_b

Int64

(2) All id(s) existing in either dataframe

df_bind = outerjoin(df_a, df_b, on=:id)
df_bind

4×3 DataFrame

Row

col_a

col_b

Int64

Int64?

missing

(3) Only id(s) existing in the dataframe on the left

df_bind = leftjoin(df_a, df_b, on=:id)
df_bind

4×3 DataFrame

Row

col_a

col_b

Int64

Int64?

missing

(4) Only id(s) existing in the dataframe on the right

df_bind = rightjoin(df_a, df_b, on=:id)
df_bind

3×3 DataFrame

Row

col_a

col_b

Int64

Int64?

Int64

3. Group by

To see the dataset by group, or to prepare for a sub-group calculation, we use

gdf = groupby(df, :Species)
first(gdf)

50×5 SubDataFrame25 rows omitted

Row

SepalLength

SepalWidth

PetalLength

PetalWidth

Species

Float64

Cat…

5.1

3.5

1.4

0.2

setosa

4.9

3.0

1.4

0.2

setosa

4.7

3.2

1.3

0.2

setosa

4.6

3.1

1.5

0.2

setosa

5.0

3.6

1.4

0.2

setosa

5.4

3.9

1.7

0.4

setosa

4.6

3.4

1.4

0.3

setosa

5.0

3.4

1.5

0.2

setosa

4.4

2.9

1.4

0.2

setosa

4.9

3.1

1.5

0.1

setosa

5.4

3.7

1.5

0.2

setosa

4.8

3.4

1.6

0.2

setosa

4.8

3.0

1.4

0.1

setosa

⋮

4.4

3.0

1.3

0.2

setosa

5.1

3.4

1.5

0.2

setosa

5.0

3.5

1.3

0.3

setosa

4.5

2.3

1.3

0.3

setosa

4.4

3.2

1.3

0.2

setosa

5.0

3.5

1.6

0.6

setosa

5.1

3.8

1.9

0.4

setosa

4.8

3.0

1.4

0.3

setosa

5.1

3.8

1.6

0.2

setosa

4.6

3.2

1.4

0.2

setosa

5.3

3.7

1.5

0.2

setosa

5.0

3.3

1.4

0.2

setosa

Based on this grouped dataframe (gdf), we can calculate the mean values of SepalLength:

combine(gdf, :SepalLength => mean)

3×2 DataFrame

Row

Species

SepalLength_mean

Cat…

Float64

setosa

5.006

versicolor

5.936

virginica

6.588

What if we wanna know more than one columns?

combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean)

3×3 DataFrame

Row

Species

SepalLength_mean

SepalWidth_mean

Cat…

Float64

setosa

5.006

3.428

versicolor

5.936

2.77

virginica

6.588

2.974

4. Sorting

In many cases, after calculating the means of groups, we want to see which one is the highest. One way to finish the task is to sort the result:

sort(combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean),
    :SepalWidth_mean) # sort by SepalWidth_mean

3×3 DataFrame

Row

Species

SepalLength_mean

SepalWidth_mean

Cat…

Float64

versicolor

5.936

2.77

virginica

6.588

2.974

setosa

5.006

3.428

Reversely:

sort(combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean),
    :SepalWidth_mean,
    rev=true) # sort by SepalWidth_mean

3×3 DataFrame

Row

Species

SepalLength_mean

SepalWidth_mean

Cat…

Float64

setosa

5.006

3.428

virginica

6.588

2.974

versicolor

5.936

2.77

5. Transforming between long and wide tables

(1) From a wide table to a long one:

Let's look at the iris dataset again

first(df, 3)

3×5 DataFrame

Row

SepalLength

SepalWidth

PetalLength

PetalWidth

Species

Float64

Cat…

5.1

3.5

1.4

0.2

setosa

4.9

3.0

1.4

0.2

setosa

4.7

3.2

1.3

0.2

setosa

Assign an id column:

df.id = 1:size(df, 1)
first(df, 3)

3×6 DataFrame

Row

SepalLength

SepalWidth

PetalLength

PetalWidth

Species

Float64

Cat…

Int64

5.1

3.5

1.4

0.2

setosa

4.9

3.0

1.4

0.2

setosa

4.7

3.2

1.3

0.2

setosa

df_long = stack(df, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth], variable_name=:variable, value_name=:value)
df_long

600×4 DataFrame575 rows omitted

Row

Species

variable

value

Cat…

Int64

String

Float64

setosa

SepalLength

5.1

setosa

SepalLength

4.9

setosa

SepalLength

4.7

setosa

SepalLength

4.6

setosa

SepalLength

5.0

setosa

SepalLength

5.4

setosa

SepalLength

4.6

setosa

SepalLength

5.0

setosa

SepalLength

4.4

setosa

SepalLength

4.9

setosa

SepalLength

5.4

setosa

SepalLength

4.8

setosa

SepalLength

4.8

⋮

589

virginica

139

PetalWidth

1.8

590

virginica

140

PetalWidth

2.1

591

virginica

141

PetalWidth

2.4

592

virginica

142

PetalWidth

2.3

593

virginica

143

PetalWidth

1.9

594

virginica

144

PetalWidth

2.3

595

virginica

145

PetalWidth

2.5

596

virginica

146

PetalWidth

2.3

597

virginica

147

PetalWidth

1.9

598

virginica

148

PetalWidth

2.0

599

virginica

149

PetalWidth

2.3

600

virginica

150

PetalWidth

1.8

(2) From a long table to a wide one:

df_wide = unstack(df_long, [:id, :Species], :variable, :value)
first(df_wide, 3)

3×6 DataFrame

Row

Species

SepalLength

SepalWidth

PetalLength

PetalWidth

Int64

Cat…

Float64?

setosa

5.1

3.5

1.4

0.2

setosa

4.9

3.0

1.4

0.2

setosa

4.7

3.2

1.3

0.2

Basically, the same to the original df:

first(df, 3)

3×6 DataFrame

Row

SepalLength

SepalWidth

PetalLength

PetalWidth

Species

Float64

Cat…

Int64

5.1

3.5

1.4

0.2

setosa

4.9

3.0

1.4

0.2

setosa

4.7

3.2

1.3

0.2

setosa

6. Missing values

Another important issue in data analysis is how to find and fill NA values.

df_bind = outerjoin(df_a, df_b, on=:id)
df_bind

4×3 DataFrame

Row

col_a

col_b

Int64

Int64?

missing

Fina NA values

filter(:col_b => ismissing, df_bind)

1×3 DataFrame

Row

col_a

col_b

Int64

Int64?

missing

Fill NA values

coalesce.(df_bind, 0)

4×3 DataFrame

Row

col_a

col_b

Int64

Please pay attention to the dot, . after coalesce, which means that we are casting the function, coalesce, to each element (cell) within the dataframe.

Drop NA values

dropmissing(df_bind, :col_b)

3×3 DataFrame

Row

col_a

col_b

Int64

Int64?

Int64

For how to simplift the process in a long pipeline, it's time to enter the next chapter.

PreviousChapter 2. Loading Data in Julia NextChapter 4. Pipeline and Tools

Last updated 12 hours ago