Chapter 2. Loading Data in Julia

1. Load common datasets

First, we need some sample data. The RDatasets package bundles many common datasets, so we install it for convenience:

using Pkg
Pkg.add("RDatasets")
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
using RDatasets
df = dataset("datasets", "iris")
first(df, 5)

5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa

Here, first(df, 5) returns the first five rows of the data frame.
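
As a small self-contained illustration, the sketch below builds a toy data frame by hand (the column names and values are made up for this example, not part of the iris data) and compares first with its counterparts last and size.

```julia
using DataFrames

# A hypothetical toy table, used only for illustration
toy = DataFrame(id = 1:10, value = (1:10) .^ 2)

top    = first(toy, 3)   # first three rows
bottom = last(toy, 2)    # last two rows
dims   = size(toy)       # (number of rows, number of columns)

println(top)
println(bottom)
println(dims)
```

Checking size before printing is a cheap way to confirm that a dataset loaded with the number of rows and columns you expect.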

2. Load *.csv files locally

Pkg.add("CSV")
using CSV

df = CSV.read("./res/data/iris.csv", DataFrame)
first(df, 3)
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

3×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species
     │ Float64       Float64      Float64       Float64      String15
─────┼───────────────────────────────────────────────────────────────
   1 │          5.1          3.5           1.4          0.2  setosa
   2 │          4.9          3.0           1.4          0.2  setosa
   3 │          4.7          3.2           1.3          0.2  setosa
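
The local-file workflow can be tried without the iris file. The sketch below writes a small, made-up CSV to a temporary directory and reads it back; the file name and its contents are assumptions for illustration only. It also shows the stringtype keyword of CSV.read, which, as far as I know, lets you request plain String columns instead of the compact inline string types (such as the String15 seen above) that CSV.jl picks by default.

```julia
using CSV, DataFrames

# Hypothetical sample data written to a temporary file
path = joinpath(mktempdir(), "sample.csv")
write(path, "name,score\nalice,9.5\nbob,7.0\n")

# The second argument is the sink type that receives the parsed columns
scores = CSV.read(path, DataFrame)

# Ask for ordinary String columns instead of inline string types
plain = CSV.read(path, DataFrame; stringtype = String)

println(scores)
println(eltype(plain.name))
```

Column types are inferred from the file contents, so it is worth inspecting them after loading, especially for columns that mix numbers and text.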

3. Load datasets online

Pkg.add("HTTP")
using HTTP

url = "https://github.com/mwaskom/seaborn-data/raw/master/iris.csv"
response = HTTP.get(url)
df = CSV.read(IOBuffer(response.body), DataFrame)
first(df, 3)
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

3×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species
     │ Float64       Float64      Float64       Float64      String15
─────┼───────────────────────────────────────────────────────────────
   1 │          5.1          3.5           1.4          0.2  setosa
   2 │          4.9          3.0           1.4          0.2  setosa
   3 │          4.7          3.2           1.3          0.2  setosa
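
The key step above is that the response body is a vector of raw bytes, and IOBuffer turns those bytes into a stream that CSV.read can parse. The sketch below isolates that step without any network access: the byte vector is a made-up stand-in for response.body, not real downloaded data.

```julia
using CSV, DataFrames

# Stand-in for `response.body`: raw bytes of a small, made-up CSV document
body = Vector{UInt8}("x,y\n1,2.5\n3,4.5\n")

# IOBuffer wraps the bytes in a readable stream, so nothing is written to disk
parsed = CSV.read(IOBuffer(body), DataFrame)

println(parsed)
```

Parsing from memory like this avoids a temporary file, which is convenient when the downloaded data is only needed once.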

4. Create a data frame from scratch

Pkg.add("DataFrames")
using DataFrames

df2 = DataFrame(
  title = ["A", "B", "C"],
  published = [1, 2, 3], 
  author = "Rongxin"
)
first(df2, 3)
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

3×3 DataFrame
 Row │ title   published  author
     │ String  Int64      String
─────┼───────────────────────────
   1 │ A               1  Rongxin
   2 │ B               2  Rongxin
   3 │ C               3  Rongxin
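
Note that the scalar "Rongxin" was repeated for every row. The sketch below, with hypothetical column names and values, shows the same scalar expansion and two common follow-up steps: adding a computed column and appending a row.

```julia
using DataFrames

# Hypothetical book list; the scalar `publisher` is repeated for every row
books = DataFrame(title = ["A", "B", "C"],
                  year = [2001, 2002, 2003],
                  publisher = "Acme")

# Add a column computed from an existing one (2025 is an arbitrary reference year)
books.age = 2025 .- books.year

# Append one row; the values must follow the column order and types
push!(books, ("D", 2004, "Acme", 21))

println(books)
```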

Selecting Data in Julia

1. Indexing a subset

We can select a subset with a pair of row and column indexes. For example, to select the first two rows with all columns, we can write:

df[1:2, :]

2×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species
     │ Float64       Float64      Float64       Float64      String15
─────┼───────────────────────────────────────────────────────────────
   1 │          5.1          3.5           1.4          0.2  setosa
   2 │          4.9          3.0           1.4          0.2  setosa
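
The row and column positions accept several kinds of index. The sketch below uses a small made-up table (its columns a, b and c are assumptions for illustration) to show ranges, explicit lists, a single cell, and the difference between : and ! in the row position.

```julia
using DataFrames

# Hypothetical toy table
toy = DataFrame(a = 1:4, b = ["w", "x", "y", "z"], c = [0.1, 0.2, 0.3, 0.4])

rows12   = toy[1:2, :]        # rows 1 to 2, all columns (a new DataFrame)
picked   = toy[[1, 4], 2:3]   # rows 1 and 4, columns 2 to 3
cell     = toy[3, :b]         # a single value
col_copy = toy[:, :c]         # a copy of column c as a vector
col_ref  = toy[!, :c]         # the stored column itself, without copying

println(rows12)
println(picked)
println(cell)
```

Using : in the row position returns a copy, so changing col_copy leaves toy untouched, while changes made through col_ref are visible in toy.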

2. Select by column names

df[:, [:sepal_width, :petal_length]]

150×2 DataFrame
 Row │ sepal_width  petal_length
     │ Float64      Float64
─────┼──────────────────────────
   1 │         3.5           1.4
   2 │         3.0           1.4
   3 │         3.2           1.3
   4 │         3.1           1.5
   5 │         3.6           1.4
   6 │         3.9           1.7
   7 │         3.4           1.4
   8 │         3.4           1.5
   9 │         2.9           1.4
  10 │         3.1           1.5
  11 │         3.7           1.5
  12 │         3.4           1.6
  13 │         3.0           1.4
  ⋮  │      ⋮            ⋮
 139 │         3.0           4.8
 140 │         3.1           5.4
 141 │         3.1           5.6
 142 │         3.1           5.1
 143 │         2.7           5.1
 144 │         3.2           5.9
 145 │         3.3           5.7
 146 │         3.0           5.2
 147 │         2.5           5.0
 148 │         3.0           5.2
 149 │         3.4           5.4
 150 │         3.0           5.1
                 125 rows omitted

A powerful feature is that we can use a regular expression directly to select columns.

For instance, if we only care about the columns whose names end with "length", we can write:

df[:, r".*length$"]

150×2 DataFrame
 Row │ sepal_length  petal_length
     │ Float64       Float64
─────┼───────────────────────────
   1 │          5.1           1.4
   2 │          4.9           1.4
   3 │          4.7           1.3
   4 │          4.6           1.5
   5 │          5.0           1.4
   6 │          5.4           1.7
   7 │          4.6           1.4
   8 │          5.0           1.5
   9 │          4.4           1.4
  10 │          4.9           1.5
  11 │          5.4           1.5
  12 │          4.8           1.6
  13 │          4.8           1.4
  ⋮  │      ⋮             ⋮
 139 │          6.0           4.8
 140 │          6.9           5.4
 141 │          6.7           5.6
 142 │          6.9           5.1
 143 │          5.8           5.1
 144 │          6.8           5.9
 145 │          6.7           5.7
 146 │          6.7           5.2
 147 │          6.3           5.0
 148 │          6.5           5.2
 149 │          6.2           5.4
 150 │          5.9           5.1
                  125 rows omitted
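
Regex selectors also work outside of indexing. The sketch below uses a one-row, made-up table that borrows the iris column names to show two related tools from DataFrames: names with a selector, which lists the matching column names, and Not, which inverts a selector.

```julia
using DataFrames

# One-row toy table reusing the iris column names; the values are illustrative
toy = DataFrame(sepal_length = [5.1], sepal_width = [3.5],
                petal_length = [1.4], petal_width = [0.2])

# `names` accepts the same selectors as indexing and returns the matching names
length_cols = names(toy, r"length$")
println(length_cols)

# `Not` inverts a selector: every column except those ending in "length"
others = toy[:, Not(r"length$")]
println(names(others))
```

Calling names with the regex first is a quick way to check which columns a pattern matches before using it to subset a large table.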

3. Conditional filtering

In data analysis, we often want to subset a data frame according to a condition.

To do so, we first define the condition as a Boolean vector with one element per row, for example marking the rows whose species is virginica, and then use it as the row index:

condition = df.species .== "virginica"
df[condition, :]

50×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species
     │ Float64       Float64      Float64       Float64      String15
─────┼───────────────────────────────────────────────────────────────
   1 │          6.3          3.3           6.0          2.5  virginica
   2 │          5.8          2.7           5.1          1.9  virginica
   3 │          7.1          3.0           5.9          2.1  virginica
   4 │          6.3          2.9           5.6          1.8  virginica
   5 │          6.5          3.0           5.8          2.2  virginica
   6 │          7.6          3.0           6.6          2.1  virginica
   7 │          4.9          2.5           4.5          1.7  virginica
   8 │          7.3          2.9           6.3          1.8  virginica
   9 │          6.7          2.5           5.8          1.8  virginica
  10 │          7.2          3.6           6.1          2.5  virginica
  11 │          6.5          3.2           5.1          2.0  virginica
  12 │          6.4          2.7           5.3          1.9  virginica
  13 │          6.8          3.0           5.5          2.1  virginica
  ⋮  │      ⋮             ⋮            ⋮             ⋮           ⋮
  39 │          6.0          3.0           4.8          1.8  virginica
  40 │          6.9          3.1           5.4          2.1  virginica
  41 │          6.7          3.1           5.6          2.4  virginica
  42 │          6.9          3.1           5.1          2.3  virginica
  43 │          5.8          2.7           5.1          1.9  virginica
  44 │          6.8          3.2           5.9          2.3  virginica
  45 │          6.7          3.3           5.7          2.5  virginica
  46 │          6.7          3.0           5.2          2.3  virginica
  47 │          6.3          2.5           5.0          1.9  virginica
  48 │          6.5          3.0           5.2          2.0  virginica
  49 │          6.2          3.4           5.4          2.3  virginica
  50 │          5.9          3.0           5.1          1.8  virginica
                                                      25 rows omitted
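
Conditions can be combined, and the same filter can be written in more than one way. The sketch below uses a four-row, made-up table (the values are illustrative, not taken from iris) to show a Boolean mask, two conditions joined element-wise, and the filter function from DataFrames as an alternative spelling.

```julia
using DataFrames

# Hypothetical toy table
toy = DataFrame(species = ["setosa", "virginica", "virginica", "versicolor"],
                petal_length = [1.4, 6.0, 4.5, 4.7])

# Boolean mask with one element per row
mask = toy.species .== "virginica"
by_mask = toy[mask, :]

# Combine conditions element-wise with .& (and) or .| (or);
# the parentheses around each comparison are required
long_virginica = toy[(toy.species .== "virginica") .& (toy.petal_length .> 5.0), :]

# The same single-condition filter written with `filter`
by_filter = filter(:species => ==("virginica"), toy)

println(by_mask)
println(long_virginica)
```

Storing the mask in a variable, as the condition example above does, makes it easy to reuse or to count matches with sum(mask) before subsetting.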

Now that you know how to load data frames and select the parts you are interested in, it's time to learn how to transform your data and calculate new variables.
