Chapter 2. Loading Data in Julia

1. Load common datasets

First, we need some sample data. For convenience, we can install RDatasets, a package that bundles many common datasets:

using Pkg
Pkg.add("RDatasets")

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

using RDatasets
df = dataset("datasets", "iris")
first(df, 5)

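5×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼───────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa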

Here, first(df, 5) returns the first five rows of the data frame.
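
Besides first(), a few other functions are handy for a quick look at a freshly loaded table. The sketch below uses a small stand-in table named toy (a hypothetical name with made-up rows in the style of iris) so that it runs without RDatasets; it assumes the DataFrames package is installed.

```julia
using DataFrames

# A small stand-in table (hypothetical values) so the example runs on its own.
toy = DataFrame(
    SepalLength = [5.1, 4.9, 4.7, 4.6],
    Species = ["setosa", "setosa", "setosa", "setosa"],
)

dims = size(toy)               # (number of rows, number of columns)
cols = names(toy)              # column names as a vector of strings
tail = last(toy, 2)            # the last two rows, mirroring first(toy, 2)
summary_table = describe(toy)  # one summary row per column
```

The same calls should work unchanged on the df loaded from RDatasets.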

2. Load *.csv files locally

Pkg.add(["CSV", "DataFrames"])
using CSV, DataFrames  # DataFrame is the sink type passed to CSV.read

df = CSV.read("./res/data/iris.csv", DataFrame)
first(df, 3)

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

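3×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species
     │ Float64       Float64      Float64       Float64      String15
─────┼────────────────────────────────────────────────────────────────
   1 │          5.1          3.5           1.4          0.2  setosa
   2 │          4.9          3.0           1.4          0.2  setosa
   3 │          4.7          3.2           1.3          0.2  setosa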
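
If you do not have an iris.csv file at hand, you can try the same reading step on a file you write yourself. The following is a minimal sketch, assuming CSV and DataFrames are installed; the file name toy.csv and its contents are made up for illustration. It also shows two optional keyword arguments of CSV.read, select and limit, which restrict the columns and the number of rows that are read.

```julia
using CSV, DataFrames

# Write a tiny table to a temporary file so the example does not depend on ./res/data/iris.csv.
path = joinpath(mktempdir(), "toy.csv")
CSV.write(path, DataFrame(sepal_length = [5.1, 4.9, 4.7],
                          species = ["setosa", "setosa", "setosa"]))

# Read the whole file back into a data frame.
full = CSV.read(path, DataFrame)

# Read only one column and at most two rows.
partial = CSV.read(path, DataFrame; select = [:sepal_length], limit = 2)
```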

3. Load datasets online

Pkg.add("HTTP")
using HTTP

url = "https://github.com/mwaskom/seaborn-data/raw/master/iris.csv"
response = HTTP.get(url)
df = CSV.read(IOBuffer(response.body), DataFrame)
first(df, 3)

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

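3×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species
     │ Float64       Float64      Float64       Float64      String15
─────┼────────────────────────────────────────────────────────────────
   1 │          5.1          3.5           1.4          0.2  setosa
   2 │          4.9          3.0           1.4          0.2  setosa
   3 │          4.7          3.2           1.3          0.2  setosa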
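
The IOBuffer(response.body) step may look opaque: response.body is a vector of raw bytes, and IOBuffer wraps it in an in-memory stream that CSV.read can parse like a file. The sketch below reproduces that step offline; the variable body is a hypothetical stand-in for response.body, filled with made-up CSV text.

```julia
using CSV, DataFrames

# Stand-in for response.body: the raw bytes of a small CSV document (hypothetical data).
body = Vector{UInt8}("sepal_length,species\n5.1,setosa\n4.9,setosa\n")

# Wrap the bytes in an in-memory stream and parse them as CSV.
parsed = CSV.read(IOBuffer(body), DataFrame)
```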

4. Create a data frame from scratch

Pkg.add("DataFrames")
using DataFrames

df2 = DataFrame(
  title = ["A", "B", "C"],
  published = [1, 2, 3], 
  author = "Rongxin"
)
first(df2, 3)

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

3×3 DataFrame
 Row │ title   published  author
     │ String  Int64      String
─────┼────────────────────────────
   1 │ A               1  Rongxin
   2 │ B               2  Rongxin
   3 │ C               3  Rongxin
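
Note that the single string given for author is repeated for every row. Once a data frame exists, you can also grow it. The sketch below, which assumes DataFrames is installed, rebuilds the same table under the hypothetical name books, appends a row, and adds a column; the extra row and the is_recent column are made up for illustration.

```julia
using DataFrames

# Same layout as df2: the scalar "Rongxin" is repeated to match the vector columns.
books = DataFrame(title = ["A", "B", "C"], published = [1, 2, 3], author = "Rongxin")

# Append one row; the values are given in column order.
push!(books, ("D", 4, "Rongxin"))

# Add a new column computed element-wise from an existing one.
books.is_recent = books.published .>= 3
```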

Selecting Data in Julia

1. Indexing a subset

We can select a subset with a pair of indexes: rows first, then columns. For example, to select the first two rows with all columns:

df[1:2, :]

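2×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species
     │ Float64       Float64      Float64       Float64      String15
─────┼────────────────────────────────────────────────────────────────
   1 │          5.1          3.5           1.4          0.2  setosa
   2 │          4.9          3.0           1.4          0.2  setosa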
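
The same row-column pattern covers a few other common cases. The following sketch assumes DataFrames is installed and uses a small stand-in table named toy (hypothetical, with three made-up rows) rather than the full iris data.

```julia
using DataFrames

toy = DataFrame(
    sepal_length = [5.1, 4.9, 4.7],
    sepal_width = [3.5, 3.0, 3.2],
    species = ["setosa", "setosa", "setosa"],
)

first_two = toy[1:2, :]           # rows 1 to 2, all columns (a new DataFrame)
corner = toy[1:2, 1:2]            # rows 1 to 2, columns 1 to 2
one_row = toy[end, :]             # a single row; `end` refers to the last row
one_cell = toy[2, :sepal_width]   # a single value
```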

2. Select by column names

df[:, [:sepal_width, :petal_length]]

150×2 DataFrame
 Row │ sepal_width  petal_length
     │ Float64      Float64
─────┼───────────────────────────
   1 │         3.5           1.4
   2 │         3.0           1.4
   3 │         3.2           1.3
   4 │         3.1           1.5
   5 │         3.6           1.4
   6 │         3.9           1.7
   7 │         3.4           1.4
   8 │         3.4           1.5
   9 │         2.9           1.4
  10 │         3.1           1.5
  11 │         3.7           1.5
  12 │         3.4           1.6
  13 │         3.0           1.4
  ⋮  │      ⋮            ⋮
 139 │         3.0           4.8
 140 │         3.1           5.4
 141 │         3.1           5.6
 142 │         3.1           5.1
 143 │         2.7           5.1
 144 │         3.2           5.9
 145 │         3.3           5.7
 146 │         3.0           5.2
 147 │         2.5           5.0
 148 │         3.0           5.2
 149 │         3.4           5.4
 150 │         3.0           5.1
                 125 rows omitted

Even better, we can select columns directly with a regular expression.

For instance, if we only care about the columns whose names end with length, we can write:

df[:, r".*length$"]

150×2 DataFrame
 Row │ sepal_length  petal_length
     │ Float64       Float64
─────┼────────────────────────────
   1 │          5.1           1.4
   2 │          4.9           1.4
   3 │          4.7           1.3
   4 │          4.6           1.5
   5 │          5.0           1.4
   6 │          5.4           1.7
   7 │          4.6           1.4
   8 │          5.0           1.5
   9 │          4.4           1.4
  10 │          4.9           1.5
  11 │          5.4           1.5
  12 │          4.8           1.6
  13 │          4.8           1.4
  ⋮  │      ⋮             ⋮
 139 │          6.0           4.8
 140 │          6.9           5.4
 141 │          6.7           5.6
 142 │          6.9           5.1
 143 │          5.8           5.1
 144 │          6.8           5.9
 145 │          6.7           5.7
 146 │          6.7           5.2
 147 │          6.3           5.0
 148 │          6.5           5.2
 149 │          6.2           5.4
 150 │          5.9           5.1
                  125 rows omitted
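
Regular expressions are one of several column selectors that DataFrames accepts in the column position. The sketch below shows a few others on a small stand-in table named toy (hypothetical, with made-up rows); it assumes DataFrames is installed.

```julia
using DataFrames

toy = DataFrame(
    sepal_length = [5.1, 4.9],
    sepal_width = [3.5, 3.0],
    petal_length = [1.4, 1.4],
    species = ["setosa", "setosa"],
)

lengths = toy[:, r"length$"]                               # names ending in "length"
no_species = toy[:, Not(:species)]                         # every column except species
sepal_part = toy[:, Between(:sepal_length, :sepal_width)]  # a contiguous range of columns
width_vec = toy.sepal_width                                # a single column as a vector
```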

3. Conditional filtering

A common task in data analysis is to subset a data frame by a condition.

We can build the condition as a Boolean vector, for example one that marks the rows whose species is virginica, and pass it as the row index:

condition = df.species .== "virginica"
df[condition, :]

50×5 DataFrame
 Row │ sepal_length  sepal_width  petal_length  petal_width  species
     │ Float64       Float64      Float64       Float64      String15
─────┼─────────────────────────────────────────────────────────────────
   1 │          6.3          3.3           6.0          2.5  virginica
   2 │          5.8          2.7           5.1          1.9  virginica
   3 │          7.1          3.0           5.9          2.1  virginica
   4 │          6.3          2.9           5.6          1.8  virginica
   5 │          6.5          3.0           5.8          2.2  virginica
   6 │          7.6          3.0           6.6          2.1  virginica
   7 │          4.9          2.5           4.5          1.7  virginica
   8 │          7.3          2.9           6.3          1.8  virginica
   9 │          6.7          2.5           5.8          1.8  virginica
  10 │          7.2          3.6           6.1          2.5  virginica
  11 │          6.5          3.2           5.1          2.0  virginica
  12 │          6.4          2.7           5.3          1.9  virginica
  13 │          6.8          3.0           5.5          2.1  virginica
  ⋮  │      ⋮             ⋮            ⋮             ⋮           ⋮
  39 │          6.0          3.0           4.8          1.8  virginica
  40 │          6.9          3.1           5.4          2.1  virginica
  41 │          6.7          3.1           5.6          2.4  virginica
  42 │          6.9          3.1           5.1          2.3  virginica
  43 │          5.8          2.7           5.1          1.9  virginica
  44 │          6.8          3.2           5.9          2.3  virginica
  45 │          6.7          3.3           5.7          2.5  virginica
  46 │          6.7          3.0           5.2          2.3  virginica
  47 │          6.3          2.5           5.0          1.9  virginica
  48 │          6.5          3.0           5.2          2.0  virginica
  49 │          6.2          3.4           5.4          2.3  virginica
  50 │          5.9          3.0           5.1          1.8  virginica
                                                        25 rows omitted
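
Conditions can also be combined, and DataFrames offers filter and subset as alternatives to indexing with a Boolean vector. The following sketch assumes DataFrames is installed and uses a small stand-in table named toy (hypothetical, with made-up rows); the 6.0 threshold is arbitrary.

```julia
using DataFrames

toy = DataFrame(
    sepal_length = [5.1, 6.3, 5.8, 7.1],
    species = ["setosa", "virginica", "virginica", "virginica"],
)

# Combine two element-wise conditions with .& (the parentheses are required).
mask = (toy.species .== "virginica") .& (toy.sepal_length .> 6.0)
by_mask = toy[mask, :]

# The same selection written with filter (row by row) and with subset.
by_filter = filter(row -> row.species == "virginica" && row.sepal_length > 6.0, toy)
by_subset = subset(toy, :species => ByRow(==("virginica")),
                        :sepal_length => ByRow(>(6.0)))
```

All three forms return a new data frame and leave toy unchanged; which one reads best is largely a matter of taste.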

Now that you know how to load data frames and select the parts you are interested in, it's time to learn how to transform your data and calculate your variables.
