Chapter 4. Pipeline and Tools

Data analysis is time-consuming because it involves many steps. To chain those steps into a pipeline, R offers the tidyverse packages, such as dplyr and tidyr.

In Julia, the Tidier.jl package provides the same functionality.

import Pkg
Pkg.add(["Tidier", "TidierStrings", "RDatasets"])
using Tidier, RDatasets, TidierStrings
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

Load Data

Load a sample dataset, iris:

df = dataset("datasets", "iris")
first(df, 5)

5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa

Data Pipeline

In the following pipeline, we want to:

    1. create a new column named SepalLengthMax and fill it with the maximum value of SepalLength;

    2. keep only the rows whose SepalWidth is at least 3.0;

    3. select the columns SepalLength, SepalLengthMax, and PetalLength;

    4. keep only the first five rows for testing.

@chain df begin
    @mutate(SepalLengthMax = maximum(SepalLength))
    @filter(SepalWidth >= 3.0)
    @select(SepalLength, SepalLengthMax, PetalLength)
    @slice(1:5)
end

5×3 DataFrame
 Row │ SepalLength  SepalLengthMax  PetalLength
     │ Float64      Float64         Float64
─────┼─────────────────────────────────────────
   1 │         5.1             7.9          1.4
   2 │         4.9             7.9          1.4
   3 │         4.7             7.9          1.3
   4 │         4.6             7.9          1.5
   5 │         5.0             7.9          1.4

This pipeline is equivalent to the following R code:

library(dplyr)
library(tidyr)

df <- df %>%
  mutate(SepalLengthMax = max(SepalLength)) %>%         # create a new column with the max value
  filter(SepalWidth >= 3.0) %>%                         # filter rows
  select(SepalLength, SepalLengthMax, PetalLength) %>%  # select specific columns
  slice(1:5)                                            # take the first 5 rows

A full list of the tidyverse functions that Tidier.jl implements in Julia can be found here.
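
For comparison, the same four steps can be sketched with plain DataFrames.jl functions, one intermediate variable per step. This is only an illustration, not how Tidier.jl works internally. The table df_toy and the variable names are assumptions made for the example: the first five rows are the iris rows shown earlier, and the sixth row is hypothetical, added so that the filter has something to remove.

```julia
using DataFrames

# Toy table: the five iris rows shown earlier plus one hypothetical
# sixth row (not taken from iris) that the filter will drop.
df_toy = DataFrame(
    SepalLength = [5.1, 4.9, 4.7, 4.6, 5.0, 6.0],
    SepalWidth  = [3.5, 3.0, 3.2, 3.1, 3.6, 2.2],
    PetalLength = [1.4, 1.4, 1.3, 1.5, 1.4, 4.0],
)

# 1. Add a column holding the maximum of SepalLength
#    (the single value is repeated in every row).
with_max = transform(df_toy, :SepalLength => maximum => :SepalLengthMax)

# 2. Keep only the rows whose SepalWidth is at least 3.0.
wide = subset(with_max, :SepalWidth => ByRow(>=(3.0)))

# 3. Keep three columns, in this order.
picked = select(wide, :SepalLength, :SepalLengthMax, :PetalLength)

# 4. Keep the first five rows.
result = first(picked, 5)
```

Because the maximum is computed before the filter, SepalLengthMax is 6.0 in every remaining row even though the row that holds 6.0 is dropped. Swapping steps 1 and 2 would give 5.1 instead, so the order of the steps in a pipeline matters.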

String Manipulation

In communication studies and many social sciences relevant to textual representations, we often handle large volumes of string data and may need to perform operations like detection and replacement on these strings.

We can use TidierStrings.jl:

first(df, 3)

3×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa

(1) Detection

For example, we can keep only the rows whose Species value starts with "set", using a regular expression:

@chain df begin
    @mutate(Species = String(Species)) # convert categories into strings
    @filter(str_detect(Species, r"^set")) # ^ anchors the match to the start of the string
    @slice(1:5) # head
end

5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa

(2) Replacing

@chain df begin
    @mutate(Species = String(Species))
    @mutate(Species = str_replace(Species, "set", "setAAA"))
    @slice(1:5) # head
end

5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String
─────┼────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setAAAosa
   2 │         4.9         3.0          1.4         0.2  setAAAosa
   3 │         4.7         3.2          1.3         0.2  setAAAosa
   4 │         4.6         3.1          1.5         0.2  setAAAosa
   5 │         5.0         3.6          1.4         0.2  setAAAosa

(3) Equivalence test

@chain df begin
    @mutate(Species = String(Species))
    @mutate(IsSetosa = str_equal(Species, "setosa"))
    @slice(1:5) # head
end

5×6 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species  IsSetosa
     │ Float64      Float64     Float64      Float64     String   Bool
─────┼────────────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa       true
   2 │         4.9         3.0          1.4         0.2  setosa       true
   3 │         4.7         3.2          1.3         0.2  setosa       true
   4 │         4.6         3.1          1.5         0.2  setosa       true
   5 │         5.0         3.6          1.4         0.2  setosa       true

These three operations are frequently used in research. If you need more, please refer to the documentation.
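
As a point of comparison, the same three operations can be sketched in Base Julia on a plain vector of strings, which may help when checking what a pattern matches before putting it into a pipeline. The vector species and the test string "reset" are assumptions made for this example; TidierStrings.jl may differ in details such as how many matches a replacement touches.

```julia
# Three species names as plain strings.
species = ["setosa", "versicolor", "virginica"]

# Detection: does each value start with "set"?
# The ^ anchors the pattern to the start of the string.
starts_with_set = occursin.(r"^set", species)

# Without the anchor, a pattern matches anywhere in the string.
unanchored = occursin(r"set.*", "reset")  # matches the "set" inside "reset"
anchored   = occursin(r"^set", "reset")   # no match: "reset" starts with "re"

# Replacement: swap "set" for "setAAA" in every element.
replaced = replace.(species, "set" => "setAAA")

# Equivalence test: element-wise comparison with a fixed string.
is_setosa = species .== "setosa"
```

The anchored and unanchored patterns give the same answer on the iris species names, but they differ on a string such as "reset", so the anchor is worth adding whenever "starts with" is what you mean.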

Network

How do we query a web API when we want to download data? Similar to requests in Python, we can use the HTTP.jl and JSON3.jl packages in Julia.

import Pkg; Pkg.add(["HTTP", "JSON3"])
using HTTP
using JSON3
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

For example, if we want to have a look at the structure of GitHub APIs:

url = "https://api.github.com"

resp = HTTP.get(url)          # send a GET request
data = JSON3.read(resp.body)  # parse the JSON body of the response
data
JSON3.Object{Vector{UInt8}, Vector{UInt64}} with 33 entries:
  :current_user_url                     => "https://api.github.com/user"
  :current_user_authorizations_html_url => "https://github.com/settings/connect…
  :authorizations_url                   => "https://api.github.com/authorizatio…
  :code_search_url                      => "https://api.github.com/search/code?…
  :commit_search_url                    => "https://api.github.com/search/commi…
  :emails_url                           => "https://api.github.com/user/emails"
  :emojis_url                           => "https://api.github.com/emojis"
  :events_url                           => "https://api.github.com/events"
  :feeds_url                            => "https://api.github.com/feeds"
  :followers_url                        => "https://api.github.com/user/followe…
  :following_url                        => "https://api.github.com/user/followi…
  :gists_url                            => "https://api.github.com/gists{/gist_…
  :hub_url                              => "https://api.github.com/hub"
  :issue_search_url                     => "https://api.github.com/search/issue…
  :issues_url                           => "https://api.github.com/issues"
  :keys_url                             => "https://api.github.com/user/keys"
  :label_search_url                     => "https://api.github.com/search/label…
  :notifications_url                    => "https://api.github.com/notification…
  :organization_url                     => "https://api.github.com/orgs/{org}"
  :organization_repositories_url        => "https://api.github.com/orgs/{org}/r…
  :organization_teams_url               => "https://api.github.com/orgs/{org}/t…
  :public_gists_url                     => "https://api.github.com/gists/public"
  :rate_limit_url                       => "https://api.github.com/rate_limit"
  :repository_url                       => "https://api.github.com/repos/{owner…
  :repository_search_url                => "https://api.github.com/search/repos…
  ⋮                                     => ⋮
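
Once the response is parsed, individual fields can be read from the resulting object. The sketch below avoids the network by parsing a small JSON string instead. The string body is an assumption made for this example: it copies two of the entries from the output above, and the variable names are illustrative.

```julia
using JSON3

# A two-field excerpt of the response shown above, written as a JSON string.
body = """
{
  "emojis_url": "https://api.github.com/emojis",
  "events_url": "https://api.github.com/events"
}
"""

obj = JSON3.read(body)

emojis = obj.emojis_url            # property-style access
events = obj[:events_url]          # indexing with a Symbol key
field_names = collect(keys(obj))   # the keys, as Symbols
```

The same accessors should work on the object parsed from the live response, so a follow-up request could pass a field such as emojis_url to HTTP.get.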

Next, how can we run regressions and other models? Here's the guide.
