Chapter 4. Pipeline and Tools
Data analysis is time-consuming because it involves many steps. To chain those steps into a pipeline, R has a well-known package, tidyr.
In Julia, a package named Tidier.jl does the same job.
Load Data
Load a sample dataset, iris:
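A minimal sketch of this step, assuming iris is taken from the RDatasets package and that the first five rows are shown with @slice. The variable name iris_head is only illustrative.

```julia
using Tidier      # re-exports the Tidier macros such as @chain and @slice
using RDatasets   # assumed source of the iris dataset

# Load iris as a DataFrame (150 rows, 5 columns).
iris = dataset("datasets", "iris")

# Peek at the first five rows.
iris_head = @chain iris begin
    @slice(1:5)
end

println(iris_head)
```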
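5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa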
Data Pipeline
In the following pipeline, we want to:
create a new column named SepalLengthMax and fill it with the maximum value of SepalLength;
keep only the rows of the dataframe whose SepalWidth is no less than 3.0;
select the columns SepalLength, SepalLengthMax, and PetalLength;
keep only the first five rows for testing.
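The four steps above map onto Tidier.jl macros roughly as follows. This is a sketch, assuming iris is loaded from RDatasets; the variable name result is hypothetical. Tidier.jl applies maximum to the whole column rather than element by element, so every row receives the same value.

```julia
using Tidier
using RDatasets   # assumed source of the iris dataset

iris = dataset("datasets", "iris")

result = @chain iris begin
    # 1. New column holding the maximum of SepalLength (same value in every row)
    @mutate(SepalLengthMax = maximum(SepalLength))
    # 2. Keep rows whose SepalWidth is no less than 3.0
    @filter(SepalWidth >= 3.0)
    # 3. Keep only the three columns of interest
    @select(SepalLength, SepalLengthMax, PetalLength)
    # 4. Keep the first five rows for testing
    @slice(1:5)
end

println(result)
```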
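5×3 DataFrame
 Row │ SepalLength  SepalLengthMax  PetalLength
     │ Float64      Float64         Float64
─────┼─────────────────────────────────────────
   1 │         5.1             7.9          1.4
   2 │         4.9             7.9          1.4
   3 │         4.7             7.9          1.3
   4 │         4.6             7.9          1.5
   5 │         5.0             7.9          1.4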
This is equivalent to the following R code:
A full list of the tidyr functions from R that have been implemented in Julia can be found in the Tidier.jl documentation.
String Manipulation
In communication studies and other social sciences that work with text, we often handle large volumes of string data and need to perform operations such as detection and replacement on these strings.
We can use TidierStrings.jl:
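3×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa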
(1) Detection
For example, we can keep only the rows whose Species value matches the regular expression r"set.*":
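A sketch of the detection step using str_detect from TidierStrings.jl (re-exported by Tidier). It assumes iris is loaded from RDatasets and that Species is first converted from a categorical column to plain strings; the variable name detected is hypothetical.

```julia
using Tidier      # re-exports TidierStrings, including str_detect
using RDatasets   # assumed source of the iris dataset

iris = dataset("datasets", "iris")

detected = @chain iris begin
    # Convert the categorical Species column to plain strings
    @mutate(Species = string(Species))
    # Keep rows whose Species matches the regular expression
    @filter(str_detect(Species, r"set.*"))
    @slice(1:5)
end

println(detected)
```

Because the pattern is not anchored, it matches "set" anywhere in the value; use r"^set" to require that the value starts with "set".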
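5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa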
(2) Replacing
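A sketch of a replacement with str_replace from TidierStrings.jl. The pattern "set" and the replacement "setAAA" are assumptions chosen to reproduce values such as setAAAosa; the variable name replaced is hypothetical, and iris is assumed to come from RDatasets.

```julia
using Tidier      # re-exports TidierStrings, including str_replace
using RDatasets   # assumed source of the iris dataset

iris = dataset("datasets", "iris")

replaced = @chain iris begin
    # Convert the categorical Species column to plain strings
    @mutate(Species = string(Species))
    # Replace the first occurrence of "set" in each value (assumed pattern)
    @mutate(Species = str_replace(Species, "set", "setAAA"))
    @slice(1:5)
end

println(replaced)
```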
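5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String
─────┼────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setAAAosa
   2 │         4.9         3.0          1.4         0.2  setAAAosa
   3 │         4.7         3.2          1.3         0.2  setAAAosa
   4 │         4.6         3.1          1.5         0.2  setAAAosa
   5 │         5.0         3.6          1.4         0.2  setAAAosa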
(3) Equivalence test
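A sketch of an equality test, assuming str_equal from TidierStrings.jl is used to build the Bool column. The column name IsSetosa and the variable name compared are hypothetical, and iris is assumed to come from RDatasets.

```julia
using Tidier      # re-exports TidierStrings, including str_equal
using RDatasets   # assumed source of the iris dataset

iris = dataset("datasets", "iris")

compared = @chain iris begin
    # Convert the categorical Species column to plain strings
    @mutate(Species = string(Species))
    # true where Species is exactly "setosa" (IsSetosa is a hypothetical name)
    @mutate(IsSetosa = str_equal(Species, "setosa"))
    @slice(1:5)
end

println(compared)
```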
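5×6 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String   Bool
─────┼────────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa   true
   2 │         4.9         3.0          1.4         0.2  setosa   true
   3 │         4.7         3.2          1.3         0.2  setosa   true
   4 │         4.6         3.1          1.5         0.2  setosa   true
   5 │         5.0         3.6          1.4         0.2  setosa   true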
These three operations are frequently used in research. If you need more, please refer to the TidierStrings.jl documentation.
Network
How do we call a web API when we want to download something? Similar to requests in Python, we can use the HTTP and JSON3 packages in Julia.
For example, if we want to have a look at the structure of the GitHub API:
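A minimal sketch of such a request, assuming the root endpoint https://api.github.com is the target and that it is reachable without authentication. The names resp and api are only illustrative.

```julia
using HTTP
using JSON3

# Send a GET request; HTTP.get throws an error for non-2xx responses by default.
resp = HTTP.get("https://api.github.com")

# Parse the response body (raw bytes) into a JSON3 object with Symbol keys.
api = JSON3.read(resp.body)

# List each top-level field and its value to see how the API is organized.
for (name, value) in pairs(api)
    println(name, " => ", value)
end
```

Individual fields can then be read as properties, for example api.current_user_url.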
How, then, can we run regressions and other models? Here's the guide.