Chapter 3. Dataframe Transformation

using CSV, DataFrames, RDatasets, Statistics

First, we load the iris data again to illustrate how to manipulate a dataframe in Julia:

df = dataset("datasets", "iris")
first(df, 5)

5×5 DataFrame

Row
SepalLength
SepalWidth
PetalLength
PetalWidth
Species

Float64

Float64

Float64

Float64

Cat…

1

5.1

3.5

1.4

0.2

setosa

2

4.9

3.0

1.4

0.2

setosa

3

4.7

3.2

1.3

0.2

setosa

4

4.6

3.1

1.5

0.2

setosa

5

5.0

3.6

1.4

0.2

setosa

1. Select a Subset (review)

In the last chapter, we knew how to select a subset. Now, let's do it again. In this example, we do not need SepalWidth column. For this purpose, we can either use:

df[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]

150×4 DataFrame125 rows omitted

Row
SepalLength
SepalWidth
PetalLength
PetalWidth

Float64

Float64

Float64

Float64

1

5.1

3.5

1.4

0.2

2

4.9

3.0

1.4

0.2

3

4.7

3.2

1.3

0.2

4

4.6

3.1

1.5

0.2

5

5.0

3.6

1.4

0.2

6

5.4

3.9

1.7

0.4

7

4.6

3.4

1.4

0.3

8

5.0

3.4

1.5

0.2

9

4.4

2.9

1.4

0.2

10

4.9

3.1

1.5

0.1

11

5.4

3.7

1.5

0.2

12

4.8

3.4

1.6

0.2

13

4.8

3.0

1.4

0.1

139

6.0

3.0

4.8

1.8

140

6.9

3.1

5.4

2.1

141

6.7

3.1

5.6

2.4

142

6.9

3.1

5.1

2.3

143

5.8

2.7

5.1

1.9

144

6.8

3.2

5.9

2.3

145

6.7

3.3

5.7

2.5

146

6.7

3.0

5.2

2.3

147

6.3

2.5

5.0

1.9

148

6.5

3.0

5.2

2.0

149

6.2

3.4

5.4

2.3

150

5.9

3.0

5.1

1.8

Or a logical calculation based selection:

df[:, Not(:SepalWidth)]

150×4 DataFrame125 rows omitted

Row
SepalLength
PetalLength
PetalWidth
Species

Float64

Float64

Float64

Cat…

1

5.1

1.4

0.2

setosa

2

4.9

1.4

0.2

setosa

3

4.7

1.3

0.2

setosa

4

4.6

1.5

0.2

setosa

5

5.0

1.4

0.2

setosa

6

5.4

1.7

0.4

setosa

7

4.6

1.4

0.3

setosa

8

5.0

1.5

0.2

setosa

9

4.4

1.4

0.2

setosa

10

4.9

1.5

0.1

setosa

11

5.4

1.5

0.2

setosa

12

4.8

1.6

0.2

setosa

13

4.8

1.4

0.1

setosa

139

6.0

4.8

1.8

virginica

140

6.9

5.4

2.1

virginica

141

6.7

5.6

2.4

virginica

142

6.9

5.1

2.3

virginica

143

5.8

5.1

1.9

virginica

144

6.8

5.9

2.3

virginica

145

6.7

5.7

2.5

virginica

146

6.7

5.2

2.3

virginica

147

6.3

5.0

1.9

virginica

148

6.5

5.2

2.0

virginica

149

6.2

5.4

2.3

virginica

150

5.9

5.1

1.8

virginica

2. Splition and Combination

By Row

In R, you can use rbind() to combine two dataframes by appending new rows to the origional dataframe. Similarly in Julia, you can:

df_a = DataFrame(col_a=[1, 2, 3], col_b=[4, 5, 6])
df_b = DataFrame(col_a=[2, 3, 3], col_b=[2, 3, 4])
df_bind = [df_a; df_b]
df_bind

6×2 DataFrame

Row
col_a
col_b

Int64

Int64

1

1

4

2

2

5

3

3

6

4

2

2

5

3

3

6

3

4

Or, because the concatenation is vertical:

df_bind = vcat(df_a, df_b) # vertical, concat...
df_bind

6×2 DataFrame

Row
col_a
col_b

Int64

Int64

1

1

4

2

2

5

3

3

6

4

2

2

5

3

3

6

3

4

However, if you want to make it faster on large datasets, append! is more efficient:

df_bind = append!(df_a, df_b)
df_bind

6×2 DataFrame

Row
col_a
col_b

Int64

Int64

1

1

4

2

2

5

3

3

6

4

2

2

5

3

3

6

3

4

By Column

1. Bind horizontally Similar to the vertical concatenation, vcat(); we can use hcat() for horizontal concatenation:

df_a = DataFrame(col_a=[1, 2, 3], col_b=[4, 5, 6])
df_b = DataFrame(col_c=[2, 3, 3], col_d=[2, 3, 4])
df_bind = hcat(df_a, df_b) # horizontal, concat...
df_bind

3×4 DataFrame

Row
col_a
col_b
col_c
col_d

Int64

Int64

Int64

Int64

1

1

4

2

2

2

2

5

3

3

3

3

6

3

4

2. Merge/join horizontally

While concatenation is convenient, sometime, we have missing values and want to safely concat dataframes horizontally by id or values of a certain column. In any data pipelinee, there are join operations:

(1) Only id(s) existing in both dataframes

df_a = DataFrame(id=[1, 2, 3], col_a=[4, 5, 6])
df_b = DataFrame(id=[2, 3, 3], col_b=[2, 3, 4])
df_bind = innerjoin(df_a, df_b, on=:id)
df_bind

3×3 DataFrame

Row
id
col_a
col_b

Int64

Int64

Int64

1

2

5

2

2

3

6

3

3

3

6

4

(2) All id(s) existing in either dataframe

df_bind = outerjoin(df_a, df_b, on=:id)
df_bind

4×3 DataFrame

Row
id
col_a
col_b

Int64

Int64?

Int64?

1

2

5

2

2

3

6

3

3

3

6

4

4

1

4

missing

(3) Only id(s) existing in the dataframe on the left

df_bind = leftjoin(df_a, df_b, on=:id)
df_bind

4×3 DataFrame

Row
id
col_a
col_b

Int64

Int64

Int64?

1

2

5

2

2

3

6

3

3

3

6

4

4

1

4

missing

(4) Only id(s) existing in the dataframe on the right

df_bind = rightjoin(df_a, df_b, on=:id)
df_bind

3×3 DataFrame

Row
id
col_a
col_b

Int64

Int64?

Int64

1

2

5

2

2

3

6

3

3

3

6

4

3. Group by

To see the dataset by group, or to prepare for a sub-group calculation, we use

gdf = groupby(df, :Species)
first(gdf)

50×5 SubDataFrame25 rows omitted

Row
SepalLength
SepalWidth
PetalLength
PetalWidth
Species

Float64

Float64

Float64

Float64

Cat…

1

5.1

3.5

1.4

0.2

setosa

2

4.9

3.0

1.4

0.2

setosa

3

4.7

3.2

1.3

0.2

setosa

4

4.6

3.1

1.5

0.2

setosa

5

5.0

3.6

1.4

0.2

setosa

6

5.4

3.9

1.7

0.4

setosa

7

4.6

3.4

1.4

0.3

setosa

8

5.0

3.4

1.5

0.2

setosa

9

4.4

2.9

1.4

0.2

setosa

10

4.9

3.1

1.5

0.1

setosa

11

5.4

3.7

1.5

0.2

setosa

12

4.8

3.4

1.6

0.2

setosa

13

4.8

3.0

1.4

0.1

setosa

39

4.4

3.0

1.3

0.2

setosa

40

5.1

3.4

1.5

0.2

setosa

41

5.0

3.5

1.3

0.3

setosa

42

4.5

2.3

1.3

0.3

setosa

43

4.4

3.2

1.3

0.2

setosa

44

5.0

3.5

1.6

0.6

setosa

45

5.1

3.8

1.9

0.4

setosa

46

4.8

3.0

1.4

0.3

setosa

47

5.1

3.8

1.6

0.2

setosa

48

4.6

3.2

1.4

0.2

setosa

49

5.3

3.7

1.5

0.2

setosa

50

5.0

3.3

1.4

0.2

setosa

Based on this grouped dataframe (gdf), we can calculate the mean values of SepalLength:

combine(gdf, :SepalLength => mean)

3×2 DataFrame

Row
Species
SepalLength_mean

Cat…

Float64

1

setosa

5.006

2

versicolor

5.936

3

virginica

6.588

What if we wanna know more than one columns?

combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean)

3×3 DataFrame

Row
Species
SepalLength_mean
SepalWidth_mean

Cat…

Float64

Float64

1

setosa

5.006

3.428

2

versicolor

5.936

2.77

3

virginica

6.588

2.974

4. Sorting

In many cases, after calculating the means of groups, we want to see which one is the highest. One way to finish the task is to sort the result:

sort(combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean),
    :SepalWidth_mean) # sort by SepalWidth_mean

3×3 DataFrame

Row
Species
SepalLength_mean
SepalWidth_mean

Cat…

Float64

Float64

1

versicolor

5.936

2.77

2

virginica

6.588

2.974

3

setosa

5.006

3.428

Reversely:

sort(combine(groupby(df, :Species), [:SepalLength, :SepalWidth] .=> mean),
    :SepalWidth_mean,
    rev=true) # sort by SepalWidth_mean

3×3 DataFrame

Row
Species
SepalLength_mean
SepalWidth_mean

Cat…

Float64

Float64

1

setosa

5.006

3.428

2

virginica

6.588

2.974

3

versicolor

5.936

2.77

5. Transforming between long and wide tables

(1) From a wide table to a long one:

Let's look at the iris dataset again

first(df, 3)

3×5 DataFrame

Row
SepalLength
SepalWidth
PetalLength
PetalWidth
Species

Float64

Float64

Float64

Float64

Cat…

1

5.1

3.5

1.4

0.2

setosa

2

4.9

3.0

1.4

0.2

setosa

3

4.7

3.2

1.3

0.2

setosa

Assign an id column:

df.id = 1:size(df, 1)
first(df, 3)

3×6 DataFrame

Row
SepalLength
SepalWidth
PetalLength
PetalWidth
Species
id

Float64

Float64

Float64

Float64

Cat…

Int64

1

5.1

3.5

1.4

0.2

setosa

1

2

4.9

3.0

1.4

0.2

setosa

2

3

4.7

3.2

1.3

0.2

setosa

3

df_long = stack(df, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth], variable_name=:variable, value_name=:value)
df_long

600×4 DataFrame575 rows omitted

Row
Species
id
variable
value

Cat…

Int64

String

Float64

1

setosa

1

SepalLength

5.1

2

setosa

2

SepalLength

4.9

3

setosa

3

SepalLength

4.7

4

setosa

4

SepalLength

4.6

5

setosa

5

SepalLength

5.0

6

setosa

6

SepalLength

5.4

7

setosa

7

SepalLength

4.6

8

setosa

8

SepalLength

5.0

9

setosa

9

SepalLength

4.4

10

setosa

10

SepalLength

4.9

11

setosa

11

SepalLength

5.4

12

setosa

12

SepalLength

4.8

13

setosa

13

SepalLength

4.8

589

virginica

139

PetalWidth

1.8

590

virginica

140

PetalWidth

2.1

591

virginica

141

PetalWidth

2.4

592

virginica

142

PetalWidth

2.3

593

virginica

143

PetalWidth

1.9

594

virginica

144

PetalWidth

2.3

595

virginica

145

PetalWidth

2.5

596

virginica

146

PetalWidth

2.3

597

virginica

147

PetalWidth

1.9

598

virginica

148

PetalWidth

2.0

599

virginica

149

PetalWidth

2.3

600

virginica

150

PetalWidth

1.8

(2) From a long table to a wide one:

df_wide = unstack(df_long, [:id, :Species], :variable, :value)
first(df_wide, 3)

3×6 DataFrame

Row
id
Species
SepalLength
SepalWidth
PetalLength
PetalWidth

Int64

Cat…

Float64?

Float64?

Float64?

Float64?

1

1

setosa

5.1

3.5

1.4

0.2

2

2

setosa

4.9

3.0

1.4

0.2

3

3

setosa

4.7

3.2

1.3

0.2

Basically, the same to the original df:

first(df, 3)

3×6 DataFrame

Row
SepalLength
SepalWidth
PetalLength
PetalWidth
Species
id

Float64

Float64

Float64

Float64

Cat…

Int64

1

5.1

3.5

1.4

0.2

setosa

1

2

4.9

3.0

1.4

0.2

setosa

2

3

4.7

3.2

1.3

0.2

setosa

3

6. Missing values

Another important issue in data analysis is how to find and fill NA values.

df_bind = outerjoin(df_a, df_b, on=:id)
df_bind

4×3 DataFrame

Row
id
col_a
col_b

Int64

Int64?

Int64?

1

2

5

2

2

3

6

3

3

3

6

4

4

1

4

missing

Fina NA values

filter(:col_b => ismissing, df_bind)

1×3 DataFrame

Row
id
col_a
col_b

Int64

Int64?

Int64?

1

1

4

missing

Fill NA values

coalesce.(df_bind, 0)

4×3 DataFrame

Row
id
col_a
col_b

Int64

Int64

Int64

1

2

5

2

2

3

6

3

3

3

6

4

4

1

4

0

Please pay attention to the dot, . after coalesce, which means that we are casting the function, coalesce, to each element (cell) within the dataframe.

Drop NA values

dropmissing(df_bind, :col_b)

3×3 DataFrame

Row
id
col_a
col_b

Int64

Int64?

Int64

1

2

5

2

2

3

6

3

3

3

6

4

For how to simplift the process in a long pipeline, it's time to enter the next chapter.

Last updated