# Chapter 2. Loading Data in Julia

### 1. Load common datasets

First, we need some sample data. For convenience, we can install RDatasets, a package that provides many of the standard datasets available in R:

```julia
using Pkg
Pkg.add("RDatasets")
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
```

```julia
using RDatasets
df = dataset("datasets", "iris")
first(df, 5)
```

5×5 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
| --- | ----------- | ---------- | ----------- | ---------- | ------- |
|     | Float64     | Float64    | Float64     | Float64    | Cat…    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | setosa  |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | setosa  |

Here, `first(df, 5)` returns the first five rows of the data frame.
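
A few related calls are handy for a quick look at a freshly loaded table. The sketch below uses a small stand-in table with made-up values (the names `demo`, `top_rows`, and `bottom_rows` are illustrative) so that it runs without RDatasets:

```julia
using DataFrames

# A small stand-in table (hypothetical values) so the example runs on its own.
demo = DataFrame(id = 1:10, score = collect(0.5:0.5:5.0))

top_rows    = first(demo, 3)   # first 3 rows
bottom_rows = last(demo, 2)    # last 2 rows
dims        = size(demo)       # (number of rows, number of columns)
cols        = names(demo)      # column names as strings
```

The same calls should work unchanged on the iris table loaded above.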

### 2. Load \*.csv files locally

```julia
Pkg.add("CSV")
using CSV

df = CSV.read("./res/data/iris.csv", DataFrame)
first(df, 3)
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
```

3×5 DataFrame

| Row | sepal\_length | sepal\_width | petal\_length | petal\_width | species  |
| --- | ------------- | ------------ | ------------- | ------------ | -------- |
|     | Float64       | Float64      | Float64       | Float64      | String15 |
| 1   | 5.1           | 3.5          | 1.4           | 0.2          | setosa   |
| 2   | 4.9           | 3.0          | 1.4           | 0.2          | setosa   |
| 3   | 4.7           | 3.2          | 1.3           | 0.2          | setosa   |
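
Real files are not always comma-separated. As a minimal sketch, assuming a semicolon-delimited file that marks missing values with `NA`, the example below writes a tiny temporary file (made-up data) and reads it back. The keyword names `delim` and `missingstring` are the ones used by recent CSV.jl releases; older releases may differ.

```julia
using CSV, DataFrames

# Write a tiny semicolon-delimited file to a temporary path (hypothetical data),
# using "NA" as the missing-value marker.
path = tempname() * ".csv"
write(path, "name;height\nann;1.62\nbob;NA\n")

# `delim` sets the field separator; `missingstring` says which text becomes `missing`.
people = CSV.read(path, DataFrame; delim = ';', missingstring = "NA")
```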

### 3. Load datasets online

```julia
Pkg.add("HTTP")
using HTTP

url = "https://github.com/mwaskom/seaborn-data/raw/master/iris.csv"
response = HTTP.get(url)
df = CSV.read(IOBuffer(response.body), DataFrame)
first(df, 3)
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
```

3×5 DataFrame

| Row | sepal\_length | sepal\_width | petal\_length | petal\_width | species  |
| --- | ------------- | ------------ | ------------- | ------------ | -------- |
|     | Float64       | Float64      | Float64       | Float64      | String15 |
| 1   | 5.1           | 3.5          | 1.4           | 0.2          | setosa   |
| 2   | 4.9           | 3.0          | 1.4           | 0.2          | setosa   |
| 3   | 4.7           | 3.2          | 1.3           | 0.2          | setosa   |
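
If installing HTTP is not desirable, one alternative is the `Downloads` standard library, which saves the remote file to a temporary path that `CSV.read` can then open. This is a sketch rather than the approach used above: `load_csv_from_url` is a hypothetical helper name, and the in-memory part uses made-up data to show that parsing bytes through an `IOBuffer` needs no network access.

```julia
using CSV, DataFrames
using Downloads  # standard library, no installation needed

# Hypothetical helper: download a CSV to a temporary file, then parse it.
function load_csv_from_url(url::AbstractString)
    path = Downloads.download(url)   # returns the path of the downloaded file
    return CSV.read(path, DataFrame)
end

# Parsing from memory works the same way as parsing an HTTP response body.
bytes = Vector{UInt8}("x,y\n1,2\n3,4\n")
in_memory = CSV.read(IOBuffer(bytes), DataFrame)
```

Calling `load_csv_from_url(url)` with the URL from the example above should give the same table as the HTTP version.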

### 4. Create a data frame from scratch

```julia
Pkg.add("DataFrames")
using DataFrames

df2 = DataFrame(
  title = ["A", "B", "C"],
  published = [1, 2, 3],
  author = "Rongxin"  # a scalar is repeated for every row
)
first(df2, 3)
```

```
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
```

3×3 DataFrame

| Row | title  | published | author  |
| --- | ------ | --------- | ------- |
|     | String | Int64     | String  |
| 1   | A      | 1         | Rongxin |
| 2   | B      | 2         | Rongxin |
| 3   | C      | 3         | Rongxin |
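
A table can also be built row by row, which helps when records arrive one at a time. The sketch below starts from typed, empty columns and appends rows with `push!`; the names `books` and `books2` are illustrative.

```julia
using DataFrames

# Start with typed, empty columns.
books = DataFrame(title = String[], published = Int[])

push!(books, ("A", 1))                       # positional tuple, in column order
push!(books, (title = "B", published = 2))   # named tuple, matched by column name

# A scalar passed next to vector columns is repeated for every row.
books2 = DataFrame(title = ["A", "B"], author = "Rongxin")
```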

## Selecting Data in Julia

### 1. Indexing a subset

![](/files/mwN3c8rdp2A2FkxHfV9E)

We can select a subset with a pair of indexes: rows first, then columns. For example, to select rows 1 through 2 with all columns:

```julia
df[1:2, :]
```

2×5 DataFrame

| Row | sepal\_length | sepal\_width | petal\_length | petal\_width | species  |
| --- | ------------- | ------------ | ------------- | ------------ | -------- |
|     | Float64       | Float64      | Float64       | Float64      | String15 |
| 1   | 5.1           | 3.5          | 1.4           | 0.2          | setosa   |
| 2   | 4.9           | 3.0          | 1.4           | 0.2          | setosa   |
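
The row position also accepts a single integer or a vector of integers, and for a single column the row selector decides whether you get a copy (`:`) or the stored column itself (`!`). A minimal sketch on a made-up table `t`:

```julia
using DataFrames

t = DataFrame(a = [10, 20, 30], b = ["x", "y", "z"])

cell   = t[2, :a]        # single value: row 2 of column a
rows   = t[[1, 3], :]    # rows 1 and 3, all columns (a new DataFrame)
copied = t[:, :a]        # `:` returns a copy of the column vector
shared = t[!, :a]        # `!` returns the stored column itself, without copying

copied[1] = 99           # changes only the copy, not `t`
```

Prefer `:` when you plan to modify the result and want the original table left untouched.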

### 2. Select by column names

![](/files/8nISdAcjFLsF4oZ8EmfZ)

```julia
df[:, [:sepal_width, :petal_length]]
```

150×2 DataFrame (125 rows omitted)

| Row | sepal\_width | petal\_length |
| --- | ------------ | ------------- |
|     | Float64      | Float64       |
| 1   | 3.5          | 1.4           |
| 2   | 3.0          | 1.4           |
| 3   | 3.2          | 1.3           |
| 4   | 3.1          | 1.5           |
| 5   | 3.6          | 1.4           |
| 6   | 3.9          | 1.7           |
| 7   | 3.4          | 1.4           |
| 8   | 3.4          | 1.5           |
| 9   | 2.9          | 1.4           |
| 10  | 3.1          | 1.5           |
| 11  | 3.7          | 1.5           |
| 12  | 3.4          | 1.6           |
| 13  | 3.0          | 1.4           |
| ⋮   | ⋮            | ⋮             |
| 139 | 3.0          | 4.8           |
| 140 | 3.1          | 5.4           |
| 141 | 3.1          | 5.6           |
| 142 | 3.1          | 5.1           |
| 143 | 2.7          | 5.1           |
| 144 | 3.2          | 5.9           |
| 145 | 3.3          | 5.7           |
| 146 | 3.0          | 5.2           |
| 147 | 2.5          | 5.0           |
| 148 | 3.0          | 5.2           |
| 149 | 3.4          | 5.4           |
| 150 | 3.0          | 5.1           |
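
Column names can be given as strings as well as symbols, and DataFrames.jl offers selectors such as `Not` and `Between` for describing columns by exclusion or by range. The sketch below uses a one-row stand-in table `m` with made-up contents:

```julia
using DataFrames

m = DataFrame(sepal_length = [5.1], sepal_width = [3.5],
              petal_length = [1.4], species = ["setosa"])

by_strings = m[:, ["sepal_width", "petal_length"]]   # strings work like symbols
without    = select(m, Not(:species))                # everything except one column
span       = select(m, Between(:sepal_length, :sepal_width))  # a contiguous range
```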

Column selection also accepts a regular expression directly, which is especially powerful.

For instance, if we only care about the columns whose names end with `length`:

![](/files/f4C1j9hElYnyLETTvGhj)

```julia
df[:, r".*length$"]
```

150×2 DataFrame (125 rows omitted)

| Row | sepal\_length | petal\_length |
| --- | ------------- | ------------- |
|     | Float64       | Float64       |
| 1   | 5.1           | 1.4           |
| 2   | 4.9           | 1.4           |
| 3   | 4.7           | 1.3           |
| 4   | 4.6           | 1.5           |
| 5   | 5.0           | 1.4           |
| 6   | 5.4           | 1.7           |
| 7   | 4.6           | 1.4           |
| 8   | 5.0           | 1.5           |
| 9   | 4.4           | 1.4           |
| 10  | 4.9           | 1.5           |
| 11  | 5.4           | 1.5           |
| 12  | 4.8           | 1.6           |
| 13  | 4.8           | 1.4           |
| ⋮   | ⋮             | ⋮             |
| 139 | 6.0           | 4.8           |
| 140 | 6.9           | 5.4           |
| 141 | 6.7           | 5.6           |
| 142 | 6.9           | 5.1           |
| 143 | 5.8           | 5.1           |
| 144 | 6.8           | 5.9           |
| 145 | 6.7           | 5.7           |
| 146 | 6.7           | 5.2           |
| 147 | 6.3           | 5.0           |
| 148 | 6.5           | 5.2           |
| 149 | 6.2           | 5.4           |
| 150 | 5.9           | 5.1           |
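
To check which columns a pattern matches before selecting them, `names` accepts the same regex. A regex can also be combined with explicit column names through `Cols`. A small sketch on a one-row stand-in table `m` (made-up contents):

```julia
using DataFrames

m = DataFrame(sepal_length = [5.1], sepal_width = [3.5],
              petal_length = [1.4], species = ["setosa"])

matched = names(m, r"length$")                   # which columns does the regex hit?
mixed   = select(m, Cols(r"^sepal", :species))   # regex plus an explicit column
```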

### 3. Conditional filtering

In data analysis, we often want to subset a data frame according to a condition.

To do so, we build a Boolean vector with a broadcast comparison and use it as the row index. For example, to keep only the rows whose `species` is `virginica`:

![img](/files/SAEgD3xxI1fQfMsJ67i2)

```julia
condition = df.species .== "virginica"
df[condition, :]
```

50×5 DataFrame (25 rows omitted)

| Row | sepal\_length | sepal\_width | petal\_length | petal\_width | species   |
| --- | ------------- | ------------ | ------------- | ------------ | --------- |
|     | Float64       | Float64      | Float64       | Float64      | String15  |
| 1   | 6.3           | 3.3          | 6.0           | 2.5          | virginica |
| 2   | 5.8           | 2.7          | 5.1           | 1.9          | virginica |
| 3   | 7.1           | 3.0          | 5.9           | 2.1          | virginica |
| 4   | 6.3           | 2.9          | 5.6           | 1.8          | virginica |
| 5   | 6.5           | 3.0          | 5.8           | 2.2          | virginica |
| 6   | 7.6           | 3.0          | 6.6           | 2.1          | virginica |
| 7   | 4.9           | 2.5          | 4.5           | 1.7          | virginica |
| 8   | 7.3           | 2.9          | 6.3           | 1.8          | virginica |
| 9   | 6.7           | 2.5          | 5.8           | 1.8          | virginica |
| 10  | 7.2           | 3.6          | 6.1           | 2.5          | virginica |
| 11  | 6.5           | 3.2          | 5.1           | 2.0          | virginica |
| 12  | 6.4           | 2.7          | 5.3           | 1.9          | virginica |
| 13  | 6.8           | 3.0          | 5.5           | 2.1          | virginica |
| ⋮   | ⋮             | ⋮            | ⋮             | ⋮            | ⋮         |
| 39  | 6.0           | 3.0          | 4.8           | 1.8          | virginica |
| 40  | 6.9           | 3.1          | 5.4           | 2.1          | virginica |
| 41  | 6.7           | 3.1          | 5.6           | 2.4          | virginica |
| 42  | 6.9           | 3.1          | 5.1           | 2.3          | virginica |
| 43  | 5.8           | 2.7          | 5.1           | 1.9          | virginica |
| 44  | 6.8           | 3.2          | 5.9           | 2.3          | virginica |
| 45  | 6.7           | 3.3          | 5.7           | 2.5          | virginica |
| 46  | 6.7           | 3.0          | 5.2           | 2.3          | virginica |
| 47  | 6.3           | 2.5          | 5.0           | 1.9          | virginica |
| 48  | 6.5           | 3.0          | 5.2           | 2.0          | virginica |
| 49  | 6.2           | 3.4          | 5.4           | 2.3          | virginica |
| 50  | 5.9           | 3.0          | 5.1           | 1.8          | virginica |
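
Conditions can be combined element-wise, and DataFrames.jl also provides `filter` and `subset` as alternatives to a Boolean mask. The sketch below uses a three-row stand-in table `flowers` with made-up rows; the threshold `5.0` is arbitrary.

```julia
using DataFrames

flowers = DataFrame(species = ["setosa", "virginica", "virginica"],
                    petal_length = [1.4, 6.0, 4.5])

# Combine element-wise conditions with `.&` (and) or `.|` (or).
mask = (flowers.species .== "virginica") .& (flowers.petal_length .> 5.0)
by_mask = flowers[mask, :]

# The same species filter written with `filter` (column => predicate) and `subset`.
by_filter = filter(:species => ==("virginica"), flowers)
by_subset = subset(flowers, :species => ByRow(==("virginica")))
```

Note the parentheses around each comparison: without them, `.&` binds more tightly than `.==` and the expression means something else.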

Now that you know how to load data frames and select the parts you are interested in, it's time to learn how to [transform your data and calculate your variables](https://data-julia.rongxin.me/data-analysis-in-julia/3.transform.calculate.jl).

