Chapter 3. Dataframe Transformation
First, we load the iris data again to illustrate how to manipulate a dataframe in Julia:
5×5 DataFrame
Float64
Float64
Float64
Float64
Cat…
1
5.1
3.5
1.4
0.2
setosa
2
4.9
3.0
1.4
0.2
setosa
3
4.7
3.2
1.3
0.2
setosa
4
4.6
3.1
1.5
0.2
setosa
5
5.0
3.6
1.4
0.2
setosa
1. Select a Subset (review)
In the last chapter, we knew how to select a subset. Now, let's do it again.
In this example, we do not need SepalWidth
column. For this purpose, we can either use:
150×4 DataFrame125 rows omitted
Float64
Float64
Float64
Float64
1
5.1
3.5
1.4
0.2
2
4.9
3.0
1.4
0.2
3
4.7
3.2
1.3
0.2
4
4.6
3.1
1.5
0.2
5
5.0
3.6
1.4
0.2
6
5.4
3.9
1.7
0.4
7
4.6
3.4
1.4
0.3
8
5.0
3.4
1.5
0.2
9
4.4
2.9
1.4
0.2
10
4.9
3.1
1.5
0.1
11
5.4
3.7
1.5
0.2
12
4.8
3.4
1.6
0.2
13
4.8
3.0
1.4
0.1
⋮
⋮
⋮
⋮
⋮
139
6.0
3.0
4.8
1.8
140
6.9
3.1
5.4
2.1
141
6.7
3.1
5.6
2.4
142
6.9
3.1
5.1
2.3
143
5.8
2.7
5.1
1.9
144
6.8
3.2
5.9
2.3
145
6.7
3.3
5.7
2.5
146
6.7
3.0
5.2
2.3
147
6.3
2.5
5.0
1.9
148
6.5
3.0
5.2
2.0
149
6.2
3.4
5.4
2.3
150
5.9
3.0
5.1
1.8
Or a logical calculation based selection:
150×4 DataFrame125 rows omitted
Float64
Float64
Float64
Cat…
1
5.1
1.4
0.2
setosa
2
4.9
1.4
0.2
setosa
3
4.7
1.3
0.2
setosa
4
4.6
1.5
0.2
setosa
5
5.0
1.4
0.2
setosa
6
5.4
1.7
0.4
setosa
7
4.6
1.4
0.3
setosa
8
5.0
1.5
0.2
setosa
9
4.4
1.4
0.2
setosa
10
4.9
1.5
0.1
setosa
11
5.4
1.5
0.2
setosa
12
4.8
1.6
0.2
setosa
13
4.8
1.4
0.1
setosa
⋮
⋮
⋮
⋮
⋮
139
6.0
4.8
1.8
virginica
140
6.9
5.4
2.1
virginica
141
6.7
5.6
2.4
virginica
142
6.9
5.1
2.3
virginica
143
5.8
5.1
1.9
virginica
144
6.8
5.9
2.3
virginica
145
6.7
5.7
2.5
virginica
146
6.7
5.2
2.3
virginica
147
6.3
5.0
1.9
virginica
148
6.5
5.2
2.0
virginica
149
6.2
5.4
2.3
virginica
150
5.9
5.1
1.8
virginica
2. Splition and Combination
By Row
In R, you can use rbind()
to combine two dataframes by appending new rows to the origional dataframe. Similarly in Julia, you can:
6×2 DataFrame
Int64
Int64
1
1
4
2
2
5
3
3
6
4
2
2
5
3
3
6
3
4
Or, because the concatenation is vertical:
6×2 DataFrame
Int64
Int64
1
1
4
2
2
5
3
3
6
4
2
2
5
3
3
6
3
4
However, if you want to make it faster on large datasets, append!
is more efficient:
6×2 DataFrame
Int64
Int64
1
1
4
2
2
5
3
3
6
4
2
2
5
3
3
6
3
4
By Column
1. Bind horizontally Similar to the vertical concatenation, vcat()
; we can use hcat()
for horizontal concatenation:
3×4 DataFrame
Int64
Int64
Int64
Int64
1
1
4
2
2
2
2
5
3
3
3
3
6
3
4
2. Merge/join horizontally
While concatenation is convenient, sometime, we have missing values and want to safely concat dataframes horizontally by id
or values of a certain column. In any data pipelinee, there are join
operations:
(1) Only id(s) existing in both dataframes
3×3 DataFrame
Int64
Int64
Int64
1
2
5
2
2
3
6
3
3
3
6
4
(2) All id(s) existing in either dataframe
4×3 DataFrame
Int64
Int64?
Int64?
1
2
5
2
2
3
6
3
3
3
6
4
4
1
4
missing
(3) Only id(s) existing in the dataframe on the left
4×3 DataFrame
Int64
Int64
Int64?
1
2
5
2
2
3
6
3
3
3
6
4
4
1
4
missing
(4) Only id(s) existing in the dataframe on the right
3×3 DataFrame
Int64
Int64?
Int64
1
2
5
2
2
3
6
3
3
3
6
4
3. Group by
To see the dataset by group, or to prepare for a sub-group calculation, we use
50×5 SubDataFrame25 rows omitted
Float64
Float64
Float64
Float64
Cat…
1
5.1
3.5
1.4
0.2
setosa
2
4.9
3.0
1.4
0.2
setosa
3
4.7
3.2
1.3
0.2
setosa
4
4.6
3.1
1.5
0.2
setosa
5
5.0
3.6
1.4
0.2
setosa
6
5.4
3.9
1.7
0.4
setosa
7
4.6
3.4
1.4
0.3
setosa
8
5.0
3.4
1.5
0.2
setosa
9
4.4
2.9
1.4
0.2
setosa
10
4.9
3.1
1.5
0.1
setosa
11
5.4
3.7
1.5
0.2
setosa
12
4.8
3.4
1.6
0.2
setosa
13
4.8
3.0
1.4
0.1
setosa
⋮
⋮
⋮
⋮
⋮
⋮
39
4.4
3.0
1.3
0.2
setosa
40
5.1
3.4
1.5
0.2
setosa
41
5.0
3.5
1.3
0.3
setosa
42
4.5
2.3
1.3
0.3
setosa
43
4.4
3.2
1.3
0.2
setosa
44
5.0
3.5
1.6
0.6
setosa
45
5.1
3.8
1.9
0.4
setosa
46
4.8
3.0
1.4
0.3
setosa
47
5.1
3.8
1.6
0.2
setosa
48
4.6
3.2
1.4
0.2
setosa
49
5.3
3.7
1.5
0.2
setosa
50
5.0
3.3
1.4
0.2
setosa
Based on this grouped dataframe (gdf
), we can calculate the mean values of SepalLength
:
3×2 DataFrame
Cat…
Float64
1
setosa
5.006
2
versicolor
5.936
3
virginica
6.588
What if we wanna know more than one columns?
3×3 DataFrame
Cat…
Float64
Float64
1
setosa
5.006
3.428
2
versicolor
5.936
2.77
3
virginica
6.588
2.974
4. Sorting
In many cases, after calculating the means of groups, we want to see which one is the highest. One way to finish the task is to sort the result:
3×3 DataFrame
Cat…
Float64
Float64
1
versicolor
5.936
2.77
2
virginica
6.588
2.974
3
setosa
5.006
3.428
Reversely:
3×3 DataFrame
Cat…
Float64
Float64
1
setosa
5.006
3.428
2
virginica
6.588
2.974
3
versicolor
5.936
2.77
5. Transforming between long and wide tables
(1) From a wide table to a long one:
Let's look at the iris dataset again
3×5 DataFrame
Float64
Float64
Float64
Float64
Cat…
1
5.1
3.5
1.4
0.2
setosa
2
4.9
3.0
1.4
0.2
setosa
3
4.7
3.2
1.3
0.2
setosa
Assign an id column:
3×6 DataFrame
Float64
Float64
Float64
Float64
Cat…
Int64
1
5.1
3.5
1.4
0.2
setosa
1
2
4.9
3.0
1.4
0.2
setosa
2
3
4.7
3.2
1.3
0.2
setosa
3
600×4 DataFrame575 rows omitted
Cat…
Int64
String
Float64
1
setosa
1
SepalLength
5.1
2
setosa
2
SepalLength
4.9
3
setosa
3
SepalLength
4.7
4
setosa
4
SepalLength
4.6
5
setosa
5
SepalLength
5.0
6
setosa
6
SepalLength
5.4
7
setosa
7
SepalLength
4.6
8
setosa
8
SepalLength
5.0
9
setosa
9
SepalLength
4.4
10
setosa
10
SepalLength
4.9
11
setosa
11
SepalLength
5.4
12
setosa
12
SepalLength
4.8
13
setosa
13
SepalLength
4.8
⋮
⋮
⋮
⋮
⋮
589
virginica
139
PetalWidth
1.8
590
virginica
140
PetalWidth
2.1
591
virginica
141
PetalWidth
2.4
592
virginica
142
PetalWidth
2.3
593
virginica
143
PetalWidth
1.9
594
virginica
144
PetalWidth
2.3
595
virginica
145
PetalWidth
2.5
596
virginica
146
PetalWidth
2.3
597
virginica
147
PetalWidth
1.9
598
virginica
148
PetalWidth
2.0
599
virginica
149
PetalWidth
2.3
600
virginica
150
PetalWidth
1.8
(2) From a long table to a wide one:
3×6 DataFrame
Int64
Cat…
Float64?
Float64?
Float64?
Float64?
1
1
setosa
5.1
3.5
1.4
0.2
2
2
setosa
4.9
3.0
1.4
0.2
3
3
setosa
4.7
3.2
1.3
0.2
Basically, the same to the original df
:
3×6 DataFrame
Float64
Float64
Float64
Float64
Cat…
Int64
1
5.1
3.5
1.4
0.2
setosa
1
2
4.9
3.0
1.4
0.2
setosa
2
3
4.7
3.2
1.3
0.2
setosa
3
6. Missing values
Another important issue in data analysis is how to find and fill NA values.
4×3 DataFrame
Int64
Int64?
Int64?
1
2
5
2
2
3
6
3
3
3
6
4
4
1
4
missing
Fina NA values
1×3 DataFrame
Int64
Int64?
Int64?
1
1
4
missing
Fill NA values
4×3 DataFrame
Int64
Int64
Int64
1
2
5
2
2
3
6
3
3
3
6
4
4
1
4
0
Please pay attention to the dot, .
after coalesce
, which means that we are casting the function, coalesce
, to each element (cell) within the dataframe.
Drop NA values
3×3 DataFrame
Int64
Int64?
Int64
1
2
5
2
2
3
6
3
3
3
6
4
For how to simplift the process in a long pipeline, it's time to enter the next chapter.
Last updated