# Chapter 2. Loading Data in Julia

### 1. Load common datasets

First, we need some sample data. The `RDatasets` package bundles many common datasets, so we install it for convenience:

```julia
using Pkg
Pkg.add("RDatasets")
```

```
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
```

```julia
using RDatasets
df = dataset("datasets", "iris")
first(df, 5)
```

5×5 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
| --- | ----------- | ---------- | ----------- | ---------- | ------- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | setosa  |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | setosa  |

Here, `first(df, 5)` returns the first five rows of the data frame.
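Beyond `first()`, a few other helpers are handy for a quick look at a freshly loaded table. The sketch below uses a small hand-built data frame (the `toy` table and its values are made up for illustration, not taken from iris) so that it runs without downloading anything:

```julia
using DataFrames

# A tiny stand-in table; the values are illustrative only.
toy = DataFrame(x = [1.0, 2.0, 3.0, 4.0], label = ["a", "b", "c", "d"])

first(toy, 2)    # first two rows
last(toy, 2)     # last two rows
size(toy)        # (number of rows, number of columns)
names(toy)       # column names as a vector of strings
describe(toy)    # per-column summary statistics
```

The same calls should work unchanged on the `df` loaded above.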

### 2. Load \*.csv files locally

```julia
Pkg.add("CSV")
using CSV

df = CSV.read("./res/data/iris.csv", DataFrame)
first(df, 3)
```

```
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
```

3×5 DataFrame

| Row | sepal\_length | sepal\_width | petal\_length | petal\_width | species  |
| --- | ------------- | ------------ | ------------- | ------------ | -------- |
|     | Float64       | Float64      | Float64       | Float64      | String15 |
| 1   | 5.1           | 3.5          | 1.4           | 0.2          | setosa   |
| 2   | 4.9           | 3.0          | 1.4           | 0.2          | setosa   |
| 3   | 4.7           | 3.2          | 1.3           | 0.2          | setosa   |
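Note that a relative path such as `./res/data/iris.csv` is resolved against the current working directory. Real-world files also often use a different delimiter or a special marker for missing values. As a minimal sketch, assuming a semicolon-separated file in which `NA` means "missing" (the file name and contents below are made up for illustration), `CSV.read` can be told about both through keyword arguments:

```julia
using CSV, DataFrames

# Write a small CSV into a temporary directory so the example is self-contained.
# The file name and contents are hypothetical.
dir = mktempdir()
path = joinpath(dir, "toy.csv")
write(path, "name;score\nann;1.5\nbob;NA\n")

# `delim` sets the field separator; `missingstring` says which text means "missing".
toy_csv = CSV.read(path, DataFrame; delim = ';', missingstring = "NA")
```

With these options, the `score` column is parsed as numbers with a `missing` entry instead of falling back to strings.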

### 3. Load datasets online

```julia
Pkg.add("HTTP")
using HTTP

url = "https://github.com/mwaskom/seaborn-data/raw/master/iris.csv"
response = HTTP.get(url)
df = CSV.read(IOBuffer(response.body), DataFrame)
first(df, 3)
```

```
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
```

3×5 DataFrame

| Row | sepal\_length | sepal\_width | petal\_length | petal\_width | species  |
| --- | ------------- | ------------ | ------------- | ------------ | -------- |
|     | Float64       | Float64      | Float64       | Float64      | String15 |
| 1   | 5.1           | 3.5          | 1.4           | 0.2          | setosa   |
| 2   | 4.9           | 3.0          | 1.4           | 0.2          | setosa   |
| 3   | 4.7           | 3.2          | 1.3           | 0.2          | setosa   |
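The key step above is `IOBuffer(response.body)`: the response body is a vector of raw bytes, and `IOBuffer` wraps it as an in-memory stream that `CSV.read` can parse like a file. The sketch below imitates that step without touching the network; the `body` bytes are a hand-written stand-in for a real response, not data fetched from the URL above:

```julia
using CSV, DataFrames

# Stand-in for `response.body`: the raw bytes of a CSV document (made-up content).
body = Vector{UInt8}("sepal_length,species\n5.1,setosa\n4.9,setosa\n")

# IOBuffer turns the byte vector into an in-memory stream that CSV.read can parse.
df_mem = CSV.read(IOBuffer(body), DataFrame)
```

Because nothing is written to disk, this pattern suits small remote files that only need to be read once.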

### 4. Create a data frame from scratch

```julia
Pkg.add("DataFrames")
using DataFrames

df2 = DataFrame(
  title = ["A", "B", "C"],
  published = [1, 2, 3], 
  author = "Rongxin"
)
first(df2, 3)
```

```
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
```

3×3 DataFrame

| Row | title  | published | author  |
| --- | ------ | --------- | ------- |
|     | String | Int64     | String  |
| 1   | A      | 1         | Rongxin |
| 2   | B      | 2         | Rongxin |
| 3   | C      | 3         | Rongxin |
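Notice that the scalar `author = "Rongxin"` was repeated to fill every row. Once a data frame exists, it can also grow. The following is a small sketch that rebuilds the same table under the hypothetical name `books`, then appends a row and a column (the `pages` values are invented for illustration):

```julia
using DataFrames

books = DataFrame(
  title = ["A", "B", "C"],
  published = [1, 2, 3],
  author = "Rongxin"   # a scalar is repeated to match the other columns
)

# Append one row; the named tuple must supply every column.
push!(books, (title = "D", published = 4, author = "Rongxin"))

# Add a new column by assignment; its length must equal the number of rows.
books.pages = [100, 200, 300, 400]
```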

## Selecting Data in Julia

### 1. Indexing a subset

![](https://805018807-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FL1MD6Rchxcl7KPeAtsi1%2Fuploads%2Fgit-blob-9d5b8329999b686430c6f7ba699ce32253939a37%2Findex.png?alt=media)

We can select a subset with a pair of indexes: rows first, then columns. For example, to select rows 1 through 2 with all columns (`:` means "everything"), we can write:

```julia
df[1:2, :]
```

2×5 DataFrame

| Row | sepal\_length | sepal\_width | petal\_length | petal\_width | species  |
| --- | ------------- | ------------ | ------------- | ------------ | -------- |
|     | Float64       | Float64      | Float64       | Float64      | String15 |
| 1   | 5.1           | 3.5          | 1.4           | 0.2          | setosa   |
| 2   | 4.9           | 3.0          | 1.4           | 0.2          | setosa   |
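The same `df[rows, columns]` pattern covers single cells, single rows, and arbitrary row lists. One detail worth knowing is the difference between `:` and `!` in the row position: `:` copies the column, while `!` returns the stored column itself. A minimal sketch on a made-up table (`toy` and its values are illustrative):

```julia
using DataFrames

toy = DataFrame(a = [10, 20, 30], b = ["x", "y", "z"])

toy[2, :b]        # a single cell
toy[end, :]       # the last row, returned as a DataFrameRow
toy[[1, 3], :]    # rows 1 and 3, all columns

col_copy = toy[:, :a]   # `:` in the row position copies the column
col_ref  = toy[!, :a]   # `!` returns the stored column without copying
col_copy[1] = 99        # does not touch `toy`
col_ref[1]  = -1        # changes `toy` itself
```

Using `:` is the safer default; `!` avoids a copy when the table is large and in-place changes are intended.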

### 2. Select by column names

![](https://805018807-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FL1MD6Rchxcl7KPeAtsi1%2Fuploads%2Fgit-blob-1c99ca43378890169b959f96d48fa273b697601a%2Fcolumns.png?alt=media)

```julia
df[:, [:sepal_width, :petal_length]]
```

150×2 DataFrame (125 rows omitted)

| Row | sepal\_width | petal\_length |
| --- | ------------ | ------------- |
|     | Float64      | Float64       |
| 1   | 3.5          | 1.4           |
| 2   | 3.0          | 1.4           |
| 3   | 3.2          | 1.3           |
| 4   | 3.1          | 1.5           |
| 5   | 3.6          | 1.4           |
| 6   | 3.9          | 1.7           |
| 7   | 3.4          | 1.4           |
| 8   | 3.4          | 1.5           |
| 9   | 2.9          | 1.4           |
| 10  | 3.1          | 1.5           |
| 11  | 3.7          | 1.5           |
| 12  | 3.4          | 1.6           |
| 13  | 3.0          | 1.4           |
| ⋮   | ⋮            | ⋮             |
| 139 | 3.0          | 4.8           |
| 140 | 3.1          | 5.4           |
| 141 | 3.1          | 5.6           |
| 142 | 3.1          | 5.1           |
| 143 | 2.7          | 5.1           |
| 144 | 3.2          | 5.9           |
| 145 | 3.3          | 5.7           |
| 146 | 3.0          | 5.2           |
| 147 | 2.5          | 5.0           |
| 148 | 3.0          | 5.2           |
| 149 | 3.4          | 5.4           |
| 150 | 3.0          | 5.1           |
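Column names can be given as symbols, as in the example above, or as strings, and `select` offers a function form of the same operation. A short sketch on a made-up two-row table (the `toy` values are illustrative):

```julia
using DataFrames

toy = DataFrame(sepal_width = [3.5, 3.0], petal_length = [1.4, 1.4],
                species = ["setosa", "setosa"])

by_symbol = toy[:, [:sepal_width, :petal_length]]
by_string = toy[:, ["sepal_width", "petal_length"]]   # strings work as well
by_select = select(toy, :sepal_width, :petal_length)  # function form, returns a new data frame
single    = toy.sepal_width                           # one column as a vector
```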

Even better, we can select columns directly with a regular expression.

For instance, if we only care about the columns whose names end with `length`, we can write:

![](https://805018807-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FL1MD6Rchxcl7KPeAtsi1%2Fuploads%2Fgit-blob-ed428fd20e4e91f003f8dd92f5f3e0871b8432f7%2Fcol.reg.png?alt=media)

```julia
df[:, r".*length$"]
```

150×2 DataFrame (125 rows omitted)

| Row | sepal\_length | petal\_length |
| --- | ------------- | ------------- |
|     | Float64       | Float64       |
| 1   | 5.1           | 1.4           |
| 2   | 4.9           | 1.4           |
| 3   | 4.7           | 1.3           |
| 4   | 4.6           | 1.5           |
| 5   | 5.0           | 1.4           |
| 6   | 5.4           | 1.7           |
| 7   | 4.6           | 1.4           |
| 8   | 5.0           | 1.5           |
| 9   | 4.4           | 1.4           |
| 10  | 4.9           | 1.5           |
| 11  | 5.4           | 1.5           |
| 12  | 4.8           | 1.6           |
| 13  | 4.8           | 1.4           |
| ⋮   | ⋮             | ⋮             |
| 139 | 6.0           | 4.8           |
| 140 | 6.9           | 5.4           |
| 141 | 6.7           | 5.6           |
| 142 | 6.9           | 5.1           |
| 143 | 5.8           | 5.1           |
| 144 | 6.8           | 5.9           |
| 145 | 6.7           | 5.7           |
| 146 | 6.7           | 5.2           |
| 147 | 6.3           | 5.0           |
| 148 | 6.5           | 5.2           |
| 149 | 6.2           | 5.4           |
| 150 | 5.9           | 5.1           |
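A regex selects every column whose name contains a match, so the leading `.*` is optional and `r"length$"` behaves the same way. DataFrames.jl also provides other column selectors, such as `Not`, `Between`, and `Cols`. The sketch below tries them on a one-row stand-in table whose column names mirror the iris CSV (the table itself is made up):

```julia
using DataFrames

toy = DataFrame(sepal_length = [5.1], sepal_width = [3.5],
                petal_length = [1.4], petal_width = [0.2], species = ["setosa"])

ends_length  = names(toy[:, r"length$"])                           # names ending in "length"
no_species   = names(toy[:, Not(:species)])                        # everything except `species`
middle_cols  = names(toy[:, Between(:sepal_width, :petal_width)])  # a contiguous run of columns
sepal_plus   = names(toy[:, Cols(r"^sepal", :species)])            # union of several selectors
```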

### 3. Conditional filtering

In data analysis, we often want to keep only the rows of a data frame that satisfy a condition.

To do so, we first build a Boolean vector with a broadcast comparison (`.==`), for example one that marks the rows whose `species` is `virginica`, and then use it as the row index:

![img](https://805018807-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FL1MD6Rchxcl7KPeAtsi1%2Fuploads%2Fgit-blob-805510775dc67cdbf28f961742b206c85fd4e1c0%2Fcol.if.png?alt=media)

```julia
condition = df.species .== "virginica"
df[condition, :]
```

50×5 DataFrame (25 rows omitted)

| Row | sepal\_length | sepal\_width | petal\_length | petal\_width | species   |
| --- | ------------- | ------------ | ------------- | ------------ | --------- |
|     | Float64       | Float64      | Float64       | Float64      | String15  |
| 1   | 6.3           | 3.3          | 6.0           | 2.5          | virginica |
| 2   | 5.8           | 2.7          | 5.1           | 1.9          | virginica |
| 3   | 7.1           | 3.0          | 5.9           | 2.1          | virginica |
| 4   | 6.3           | 2.9          | 5.6           | 1.8          | virginica |
| 5   | 6.5           | 3.0          | 5.8           | 2.2          | virginica |
| 6   | 7.6           | 3.0          | 6.6           | 2.1          | virginica |
| 7   | 4.9           | 2.5          | 4.5           | 1.7          | virginica |
| 8   | 7.3           | 2.9          | 6.3           | 1.8          | virginica |
| 9   | 6.7           | 2.5          | 5.8           | 1.8          | virginica |
| 10  | 7.2           | 3.6          | 6.1           | 2.5          | virginica |
| 11  | 6.5           | 3.2          | 5.1           | 2.0          | virginica |
| 12  | 6.4           | 2.7          | 5.3           | 1.9          | virginica |
| 13  | 6.8           | 3.0          | 5.5           | 2.1          | virginica |
| ⋮   | ⋮             | ⋮            | ⋮             | ⋮            | ⋮         |
| 39  | 6.0           | 3.0          | 4.8           | 1.8          | virginica |
| 40  | 6.9           | 3.1          | 5.4           | 2.1          | virginica |
| 41  | 6.7           | 3.1          | 5.6           | 2.4          | virginica |
| 42  | 6.9           | 3.1          | 5.1           | 2.3          | virginica |
| 43  | 5.8           | 2.7          | 5.1           | 1.9          | virginica |
| 44  | 6.8           | 3.2          | 5.9           | 2.3          | virginica |
| 45  | 6.7           | 3.3          | 5.7           | 2.5          | virginica |
| 46  | 6.7           | 3.0          | 5.2           | 2.3          | virginica |
| 47  | 6.3           | 2.5          | 5.0           | 1.9          | virginica |
| 48  | 6.5           | 3.0          | 5.2           | 2.0          | virginica |
| 49  | 6.2           | 3.4          | 5.4           | 2.3          | virginica |
| 50  | 5.9           | 3.0          | 5.1           | 1.8          | virginica |
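Several conditions can be combined element-wise, and DataFrames.jl also offers `filter` and `subset` as function-style alternatives to a Boolean mask. The following sketch shows the three forms side by side on a made-up four-row table (the `toy` values and the `5.5` threshold are illustrative):

```julia
using DataFrames

toy = DataFrame(petal_length = [1.4, 5.1, 6.0, 4.5],
                species = ["setosa", "virginica", "virginica", "versicolor"])

# Combine Boolean masks element-wise with `.&` (and) or `.|` (or).
mask = (toy.species .== "virginica") .& (toy.petal_length .> 5.5)
by_mask = toy[mask, :]

# `filter` takes a function of one row.
by_filter = filter(row -> row.species == "virginica" && row.petal_length > 5.5, toy)

# `subset` takes column => predicate pairs; `ByRow` applies the predicate per element.
by_subset = subset(toy, :species => ByRow(==("virginica")),
                        :petal_length => ByRow(>(5.5)))
```

All three return the same rows here; which one reads best is largely a matter of taste.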

Now that you know how to load data frames and select the parts you are interested in, it's time to learn how to [transform your data and calculate your variables](https://data-julia.rongxin.me/data-analysis-in-julia/3.transform.calculate.jl).
