This week, we’re going to look at three different common data formats in R. You’ll find these all the time when you search for data online.
Rather than downloading the files manually, we’re going to use R to download them for us. This is a good way to automate the process, and also makes it easier to share the code with others.
For this, we’ll use the download.file() function. This takes two
arguments:
Make sure the file type matches the file type you’re downloading. If you’re downloading a .jpg file, the file name should end in .jpg.
For example, if we wanted to download a picture from Wikipedia, we could use:
1 download.file(2 "https://upload.wikimedia.org/wikipedia/commons/d/d8/Panthera_tigris_corbetti_(Tierpark_Berlin)_832-714-(118).jpg",
3 "picture_of_a_tiger.jpg"4 )Then you’ll have a cool picture of a tiger on your computer.
First, we’re going to look at a CSV file. CSV stands for “comma-separated values”. These are two-dimensional tables, where each row is a line in the file, and each column is separated by a comma. Here’s an example:
"name","age","married" "Gunther",42,TRUE "Gerhard",38,TRUE "Heidi",29,FALSE
This evaluates to a simple table, like this:
name age married 1 Gunther 42 TRUE 2 Gerhard 38 TRUE 3 Heidi 29 FALSE
These are great because you can open them in a text editor and read them, and are simple enough to edit. They’re also easy to read into R.
Despite the name, CSV files don’t always use commas to separate the columns. Sometimes they use semicolons, or tabs, or other characters; the Swiss government really likes semicolons for some reason.
Let’s take a look at a real-world example. We’re going to use the Swiss government’s Bundesamt für Statistik (BFS) website to download some data, about incomes for every commune in Switzerland, originally from here:
https://www.atlas.bfs.admin.ch/maps/13/de/15830_9164_8282_8281/24776.html
We find the download link, and use download.file() to download it:
1 download.file(2 "https://www.atlas.bfs.admin.ch/core/projects/13/xshared/csv/24776_131.csv",
3 "input_data/income.csv"4 )Once again, let’s take a look at the raw data. Open it in a text editor, and it should look something like this:
"GEO_ID";"GEO_NAME";"VARIABLE";VALUE;"UNIT";"STATUS";"STATUS_DESC";"DESC_VAL";"PERIOD_REF";"SOURCE";"LAST_UPDATE";"GEOM_CODE";"GEOM";"GEOM_PERIOD";"MAP_ID";"MAP_URL" "1";"Aeugst am Albis";"Steuerbares Einkommen, in Mio. Franken";98;"Franken";"A";"Normaler Wert";"";"2017-01-01/2017-12-31";"ESTV";"2021-01-07";"polg";"Politische Gemeinden";"2017-01-01";"24776";"https://www.atlas.bfs.admin.ch/maps/13/map/mapIdOnly/24776_de.html" "1";"Aeugst am Albis";"Steuerbares Einkommen pro Einwohner/-in, in Franken";50443;"Franken";"A";"Normaler Wert";"";"2017-01-01/2017-12-31";"ESTV";"2021-01-07";"polg";"Politische Gemeinden";"2017-01-01";"24776";"https://www.atlas.bfs.admin.ch/maps/13/map/mapIdOnly/24776_de.html" "2";"Affoltern am Albis";"Steuerbares Einkommen, in Mio. Franken";391;"Franken";"A";"Normaler Wert";"";"2017-01-01/2017-12-31";"ESTV";"2021-01-07";"polg";"Politische Gemeinden";"2017-01-01";"24776";"https://www.atlas.bfs.admin.ch/maps/13/map/mapIdOnly/24776_de.html"
We can see the following:
In the same way as we did last week, we can use Import Dataset to import the data into RStudio. You can see complete instructions in the previous chapter. The code that we get back should look something like this:
1 income_per_gemeinde <- read_delim("input_data/income.csv",
2 delim = ";", escape_double = FALSE, trim_ws = TRUE
3 )Another option is to use use read_delim() on the URL itself. This reads the data directly from the URL, without downloading it to your computer:
1 income_per_gemeinde <- read_delim("https://www.atlas.bfs.admin.ch/core/projects/13/xshared/csv/24776_131.csv", delim = ";", escape_double = FALSE, trim_ws = TRUE)
This can be a little dangerous, however, as the data might change, or the website could go down, and your data is lost forever.
This data has a lot of columns, and isn’t always the easiest to read. One convenient way to glimpse at the data is the glimpse() function, which shows us the first few rows of each column:
1 income_per_gemeinde |> glimpse()
Rows: 4,510 Columns: 16 $ GEO_ID <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, … $ GEO_NAME <chr> "Aeugst am Albis", "Aeugst am Albis", "Affoltern am Albis"… $ VARIABLE <chr> "Steuerbares Einkommen, in Mio. Franken", "Steuerbares Ein… $ VALUE <dbl> 98, 50443, 391, 32180, 224, 40564, 148, 40398, 155, 41909,… $ UNIT <chr> "Franken", "Franken", "Franken", "Franken", "Franken", "Fr… $ STATUS <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"… $ STATUS_DESC <chr> "Normaler Wert", "Normaler Wert", "Normaler Wert", "Normal… $ DESC_VAL <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… $ PERIOD_REF <chr> "2017-01-01/2017-12-31", "2017-01-01/2017-12-31", "2017-01… $ SOURCE <chr> "ESTV", "ESTV", "ESTV", "ESTV", "ESTV", "ESTV", "ESTV", "E… $ LAST_UPDATE <date> 2021-01-07, 2021-01-07, 2021-01-07, 2021-01-07, 2021-01-0… $ GEOM_CODE <chr> "polg", "polg", "polg", "polg", "polg", "polg", "polg", "p… $ GEOM <chr> "Politische Gemeinden", "Politische Gemeinden", "Politisch… $ GEOM_PERIOD <date> 2017-01-01, 2017-01-01, 2017-01-01, 2017-01-01, 2017-01-0… $ MAP_ID <dbl> 24776, 24776, 24776, 24776, 24776, 24776, 24776, 24776, 24… $ MAP_URL <chr> "https://www.atlas.bfs.admin.ch/maps/13/map/mapIdOnly/2477…
This flips the data frame on its side, so that the columns are now rows, and the rows are now columns. This makes it easier to see the data types, but is really only useful for taking a peek at our data.
For this example, we’ll want the GEO_NAME, VARIABLE, and VALUE columns. We can use the select() function to select only those columns:
1 income_per_gemeinde <- income_per_gemeinde |>
2 select(GEO_NAME, VARIABLE, VALUE)
We can now easily look at the data that we’re interested in:
1 income_per_gemeinde |> head() |> clark_save_table(100)
# A tibble: 6 × 3 GEO_NAME VARIABLE VALUE <chr> <chr> <dbl> 1 Aeugst am Albis Steuerbares Einkommen, in Mio. Franken 98 2 Aeugst am Albis Steuerbares Einkommen pro Einwohner/-in, in Franken 50443 3 Affoltern am Albis Steuerbares Einkommen, in Mio. Franken 391 4 Affoltern am Albis Steuerbares Einkommen pro Einwohner/-in, in Franken 32180 5 Bonstetten Steuerbares Einkommen, in Mio. Franken 224 6 Bonstetten Steuerbares Einkommen pro Einwohner/-in, in Franken 40564
However, we can see this data still has a pretty big problem: the VARIABLE column contains the name of the variable, and the VALUE column contains the value of the variable. This means that the VALUE column actually represents two things at the same time: The total income of the commune, and the per-capita income of the commune.
This is a common problem in data analysis. Recalling Wickham’s paper, we want every column to represent a single variable, and every row to represent a single observation, which he calls “tidy data”.
We can fix this by using the pivot_wider() function, which takes the
values in one column, and turns them into columns. We’ll use the
VARIABLE column as the column names, and the VALUE column as the values.
To do this, we’ll use two arguments for pivot_wider(): names_from,
which is the column that we want to use as the column names, and
values_from, which is the column that we want to use as the values.
1 income_per_gemeinde <- income_per_gemeinde |>
2 pivot_wider(names_from = VARIABLE, values_from = VALUE)
This can be hard to get your brain around, so let’s take a look at the data before and after:
# A tibble: 4,510 × 3 GEO_NAME VARIABLE VALUE <chr> <chr> <dbl> 1 Aeugst am Albis Steuerbares Einkommen, in Mio. 98 2 Aeugst am Albis Steuerbares Einkommen pro Einw 50443 3 Affoltern am Albis Steuerbares Einkommen, in Mio. 391 4 Affoltern am Albis Steuerbares Einkommen pro Einw 32180 5 Bonstetten Steuerbares Einkommen, in Mio. 224 6 Bonstetten Steuerbares Einkommen pro Einw 40564 7 Hausen am Albis Steuerbares Einkommen, in Mio. 148 8 Hausen am Albis Steuerbares Einkommen pro Einw 40398 9 Hedingen Steuerbares Einkommen, in Mio. 155 10 Hedingen Steuerbares Einkommen pro Einw 41909 # ℹ 4,500 more rows
# A tibble: 2,255 × 3 GEO_NAME `St. Eink, in Mio.` `St. Eink. pro Einw` <chr> <dbl> <dbl> 1 Aeugst am Albis 98 50443 2 Affoltern am Albis 391 32180 3 Bonstetten 224 40564 4 Hausen am Albis 148 40398 5 Hedingen 155 41909 6 Kappel am Albis 50 44353 7 Knonau 84 36395 8 Maschwanden 20 31437 9 Mettmenstetten 200 41023 10 Obfelden 177 33089 # ℹ 2,245 more rows
(Column names were abbreviated to fit on the screen)
The opposite of pivot_wider() is pivot_longer(), which takes columns
and turns them into rows. You can really only understand this from
practice, so you’ll get more exposure to it next week.
This data is now in the shape we want it, but the column names are still
an absolute mess. I really don’t want to type
Steuerbares Einkommen pro Einwohner/-in, in Franken every time I want
to refer to the per-capita income column. We can rename all the columns
by just assigning a vector of names to the colnames() function:
1 colnames(income_per_gemeinde) <- c("name", "total_income", "per_capita_income")
2 income_per_gemeinde |> head()
# A tibble: 6 × 3 name total_income per_capita_income <chr> <dbl> <dbl> 1 Aeugst am Albis 98 50443 2 Affoltern am Albis 391 32180 3 Bonstetten 224 40564 4 Hausen am Albis 148 40398 5 Hedingen 155 41909 6 Kappel am Albis 50 44353
Note that if we only wanted to rename one column, it might be easier to use the rename() function:
With the rename() function, remember that the new name comes first,
and the old name comes second.
1 income_per_gemeinde <- income_per_gemeinde |>
2 rename(gemeinde_name = name)
3 income_per_gemeinde |> head()
# A tibble: 6 × 3 gemeinde_name total_income per_capita_income <chr> <dbl> <dbl> 1 Aeugst am Albis 98 50443 2 Affoltern am Albis 391 32180 3 Bonstetten 224 40564 4 Hausen am Albis 148 40398 5 Hedingen 155 41909 6 Kappel am Albis 50 44353
A little housecleaning: The total income is in millions of francs, so we’ll multiply it by 1,000,000 to get the actual value. This will save some confusion later on.
To change a column, we can just assign a new value to it using
mutate():
We can sort the data by using the arrange() function. This takes the
column that we want to sort by, and the direction that we want to sort
in. We can use desc() to sort in descending order, or asc() to sort
in ascending order. For example, to sort by per-capita income, we can
use:
1 income_per_gemeinde <- income_per_gemeinde |>
2 arrange(desc(per_capita_income))
3 4 income_per_gemeinde |> head(10)
# A tibble: 10 × 3 gemeinde_name total_income per_capita_income <chr> <dbl> <dbl> 1 Vaux-sur-Morges 65000000 324181 2 Mies 334000000 162965 3 Anières 388000000 158061 4 Feusisberg 824000000 156325 5 Wollerau 985000000 138662 6 Crésuz 47000000 137880 7 Cologny 587000000 106112 8 Montricher 104000000 105971 9 Buchillon 67000000 104523 10 Vandoeuvres 258000000 103113
This gives us the 10 communes with the highest per-capita income.
Use this data set to answer the following questions:
Our next data format is JSON. JSON stands for “JavaScript Object Notation”, as it was originally designed to be used in JavaScript. It’s a very flexible format, and is used in pretty much every programming language.
Let’s download and take a look at some JSON, originally from here:
This is a list of names given to babies in Basel, by year. We can download it using:
1 download.file(2 "https://data.bs.ch/api/v2/catalog/datasets/100192/exports/json",
3 "input_data/basel_babies.json"4 )When we look at the raw data, we can see that it’s a list of key-value pairs, where the keys are the column names, and the values are the values. This is a very flexible format, and can be used to represent pretty much any data structure. This is a huge dataset
[{"jahr": "2012", "geschlecht": "M", "vorname": "Jacob", "anzahl": 1},
{"jahr": "2012", "geschlecht": "W", "vorname": "Ja\u00ebl", "anzahl": 1},
{"jahr": "2012", "geschlecht": "M", "vorname": "Jai", "anzahl": 1},
...
...
...
{"jahr": "2019", "geschlecht": "W", "vorname": "Tara", "anzahl": 2},
{"jahr": "2019", "geschlecht": "W", "vorname": "Tatjana", "anzahl": 1},
{"jahr": "2019", "geschlecht": "W", "vorname": "Tenzin", "anzahl": 1}
]However, R doesn’t really have the ability to read JSON on it’s own, so we’ll need to use a package to read it. We’ll use the jsonlite package, which has a function called read_json() that reads JSON files into R. Install and load the library in the usual way:
1 install.packages("jsonlite")
1 library(jsonlite)
Now you can use the function read_json() to read the file into R like
so:
1 basel_babies <-2 read_json("input_data/basel_babies.json", simplifyVector = TRUE)
simplifyVector is a parameter that tells R to simplify the data
structure, assuming that it is in a tabular format. You’ll almost always
want to use this option, unless you’re working with a very complex JSON
file.
Let’s look at the result:
1 basel_babies |> head()
jahr geschlecht vorname anzahl 1 2012 W Neyla 2 2 2012 M Niccolo 1 3 2012 W Nikki 1 4 2012 M Nikola 2 5 2012 M Nils 3 6 2012 W Nina 3
As an English-language class, let’s rename the columns to English:
1 basel_babies <- basel_babies |>
2 rename(3 name = vorname,
4 year = jahr,
5 sex = geschlecht,
6 total = anzahl,
7 )8 9 basel_babies |> head()
year sex name total 1 2012 W Neyla 2 2 2012 M Niccolo 1 3 2012 W Nikki 1 4 2012 M Nikola 2 5 2012 M Nils 3 6 2012 W Nina 3
This is a pretty big data set! We can see the number of rows using the nrow() function:
1 nrow(basel_babies)
[1] 23676
That’s a lot of babies. But sometimes we need to condense this information into a single number.
For this, we can use the group_by() and summarize() [^1] functions.
These are a little tricky to understand, so let’s take a look at an
example. Let’s say we want to know how many babies were born in Basel
per year. We can use group_by() to group the data by year, and then
summarize() to summarize the data.
1 basel_babies |>
2 group_by(year) |>
3 summarise(total_by_year = sum(total))
# A tibble: 19 × 2 year total_by_year <chr> <int> 1 2006 1662 2 2007 1667 3 2008 1695 4 2009 1775 5 2010 1910 6 2011 1868 7 2012 1930 8 2013 1962 9 2014 1957 10 2015 2065 11 2016 2172 12 2017 2083 13 2018 2079 14 2019 2067 15 2020 2000 16 2021 2066 17 2022 1791 18 2023 1878 19 2024 1656
We first grouped the data by year, and then summarized the data by
summing the total column. You can use quite a few different functions in
summarize(), including sum(), mean(), median(), min(),
max(), and many more.
.RDS files are a special format that R uses to save data. They’re a binary format, so you can’t open them in a text editor, but they’re very fast to read and write. They’re also very easy to use, because they save all the metadata about the data frame, including the column names, data types, and more.
These are often used for your intermediary data sets, to just save
something quickly and share it with a colleague. you can simply write
them with the write_rds() function:
1 basel_babies |> write_rds("babies.rds")
Likewise, you can read them with the read_rds() function:
1 read_rds("babies.rds")
However, there are two problems with RDS files:
Let’s say we want to know how many Basel babies have names for each letter of the alphabet.
mutate() to make a new column with the first letter of each
name. One function you can use inside mutate is str_sub().`str_sub()` takes a string, and returns a part of that string. For
example, `str_sub("hello", 1, 4)` returns “hell”, from the first to
the 4th letters of hello.group_by() and summarize() to count the number of babies
with each first letter.3: Bonus: download the package stringi, which has the function
stri_trans_general(). Look up how it works using
?stri_trans_general. Use this to get rid of all the ãccènt märks ìn
thë nàměs.
Your resulting table should look like this:
# A tibble: 10 × 2 first_letter total <chr> <int> 1 A 4907 2 B 816 3 C 1133 4 D 1276 5 E 3200 6 F 962 7 G 766 8 H 774 9 I 778 10 J 2349
Our last data format for the day is XLSX. This is a proprietary format, and is used by Microsoft Excel. I’d discourage your form using this unless you have to, but sometimes you’ll find it in the wild, and you might have less gifted colleagues who insist on using it.
Let’s download and take a look at some XLSX data, originally from the US Census Bureau:
1 download.file(2 "https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment-2020-table02.xlsx",
3 "input_data/state_population.xlsx"4 )Of course, you can always open them in Excel, but that’s not very reproducible. Instead, we’ll use the readxl package to read the data into R.
Load the library in the usual way:
1 library(readxl)
Now, you can click on your downloaded file in the file editor, and import it just like you did with the CSV file. You can see complete instructions in the last chapter.
The code that we get back should look something like this:
1 state_population <- read_excel("input_data/state_population.xlsx",
2 skip = 3
3 )Let’s take a look at the data frame we get back:
1 state_population |> head()
# A tibble: 6 × 3 AREA `RESIDENT POPULATION (APRIL 1, 2020)` This cell is intentionally …¹ <chr> <dbl> <lgl> 1 Alabama 5024279 NA 2 Alaska 733391 NA 3 Arizona 7151502 NA 4 Arkansas 3011524 NA 5 California 39538223 NA 6 Colorado 5773714 NA # ℹ abbreviated name: ¹`This cell is intentionally blank.`
We have three columns:
First, let’s rename the columns to something a little more sensible:
1 colnames(state_population) <- c("state_or_territory", "population", "blank")
Next, we can get rid of the blank column. A quick way to do this is to
use the select() function with a minus sign in front of the column
name that we don’t want:
1 state_population <- state_population |>
2 select(-blank)
3 4 state_population
# A tibble: 55 × 2 state_or_territory population <chr> <dbl> 1 Alabama 5024279 2 Alaska 733391 3 Arizona 7151502 4 Arkansas 3011524 5 California 39538223 6 Colorado 5773714 7 Connecticut 3605944 8 Delaware 989948 9 District of Columbia 689545 10 Florida 21538187 # ℹ 45 more rows
When we look at the data frame, we can see that the last few rows should be removed, but maybe Puerto Rico should be included in our calculations. [^2]
# A tibble: 10 × 2 state_or_territory population <chr> <dbl> 1 "Vermont" 643077 2 "Virginia" 8631393 3 "Washington" 7705281 4 "West Virginia" 1793716 5 "Wisconsin" 5893718 6 "Wyoming" 576851 7 "TOTAL RESIDENT POPULATION1" 331449281 8 "Puerto Rico" 3285874 9 "TOTAL RESIDENT POPULATION, INCLUDING PUERTO RICO" 334735155 10 "Footnote: 1 Includes the resident population for the 50 stat… NA
There are a couple ways we could do this, but for now let’s:
First, we use filter() to make a 1-row data frame with just Puerto
Rico:
1 puerto_rico_temp <- state_population |>
2 filter(state_or_territory == "Puerto Rico")
3 puerto_rico_temp
# A tibble: 1 × 2 state_or_territory population <chr> <dbl> 1 Puerto Rico 3285874
Second, we can use head() to select the first 51 rows of the data
frame:
1 state_population <- state_population |>
2 head(51)
3 state_population
# A tibble: 51 × 2 state_or_territory population <chr> <dbl> 1 Alabama 5024279 2 Alaska 733391 3 Arizona 7151502 4 Arkansas 3011524 5 California 39538223 6 Colorado 5773714 7 Connecticut 3605944 8 Delaware 989948 9 District of Columbia 689545 10 Florida 21538187 # ℹ 41 more rows
Third, we row-bind the two data frames together:
1 state_population <- state_population |>
2 bind_rows(puerto_rico_temp)
3 4 state_population
# A tibble: 52 × 2 state_or_territory population <chr> <dbl> 1 Alabama 5024279 2 Alaska 733391 3 Arizona 7151502 4 Arkansas 3011524 5 California 39538223 6 Colorado 5773714 7 Connecticut 3605944 8 Delaware 989948 9 District of Columbia 689545 10 Florida 21538187 # ℹ 42 more rows
When we look at the tail of the data frame, we can see that Puerto Rico is now included.
Finally, we remove the temporary data frame from memory using rm(),
which is short for “remove”:
1 rm(puerto_rico_temp)
Review the functions we’ve learned so far. What do each of these do?
`week_3_homework_(your_name).R`. We will present these in the next class.
[^1]: R is friendly to both Brits and Americans, so it has both the
`summarise()` and `summarize()` functions, which do the exact same thing.