This week is all about practice. I want to make sure you understand the basics before we start making charts, maps and websites.
Let’s clean some data together.
You can get the download link, usually, by finding the button, right clicking, and hitting “Copy link address”. This will be slightly different depending on your browser.

1 download.file("https://dam-api.bfs.admin.ch/hub/api/dam/assets/32007752/master", "input_data/births.csv")
Note: On Windows, you need to make an adjustment. You need to add
mode = "wb" to the download.file() function.
1 download.file("https://dam-api.bfs.admin.ch/hub/api/dam/assets/32007752/master", "input_data/births.csv", mode = "wb")
1 births <- read_csv("input_data/births.csv")
1 births |> head()
# A tibble: 6 × 5 YEAR CANTON AGE_MOTHER SEX_CHILD OBS_VALUE <dbl> <chr> <chr> <chr> <dbl> 1 1969 CH _T T 102520 2 1969 CH _T F 49990 3 1969 CH _T M 52530 4 1969 CH Y10T14 T 6 5 1969 CH Y10T14 F 3 6 1969 CH Y10T14 M 3
1 births |> tail()
# A tibble: 6 × 5 YEAR CANTON AGE_MOTHER SEX_CHILD OBS_VALUE <dbl> <chr> <chr> <chr> <dbl> 1 2023 26 Y60T64 T 0 2 2023 26 Y60T64 F 0 3 2023 26 Y60T64 M 0 4 2023 26 Y65T69 T 0 5 2023 26 Y65T69 F 0 6 2023 26 Y65T69 M 0
Can we guess what every column is?
To review, the filter() function will sort out the data you want from the stuff you don’t. For example, if you only wanted a dataset for boys in Zürich in 1970, you could use the following code:
1 births |>
2 filter(YEAR == 1970) |>
3 filter(CANTON == 1) |>
4 filter(SEX_CHILD == "M")
# A tibble: 13 × 5 YEAR CANTON AGE_MOTHER SEX_CHILD OBS_VALUE <dbl> <chr> <chr> <chr> <dbl> 1 1970 1 _T M 8330 2 1970 1 Y10T14 M 0 3 1970 1 Y15T19 M 279 4 1970 1 Y20T24 M 2274 5 1970 1 Y25T29 M 3119 6 1970 1 Y30T34 M 1745 7 1970 1 Y35T39 M 768 8 1970 1 Y40T44 M 133 9 1970 1 Y45T49 M 12 10 1970 1 Y50T54 M 0 11 1970 1 Y55T59 M 0 12 1970 1 Y60T64 M 0 13 1970 1 Y65T69 M 0
Please answer the following questions: Your answer should be a code block that uses the filter() function to find the answer.
Let’s clean up the data set.
year, canton, age_of_mother, sex_of_child, total_born. You can use`rename()` or `colnames() <- c()`. It should look like this:
# A tibble: 10 × 5 year canton age_of_mother sex_of_child total_born <dbl> <chr> <chr> <chr> <dbl> 1 1969 CH _T T 102520 2 1969 CH _T F 49990 3 1969 CH _T M 52530 4 1969 CH Y10T14 T 6 5 1969 CH Y10T14 F 3 6 1969 CH Y10T14 M 3 7 1969 CH Y15T19 T 3648 8 1969 CH Y15T19 F 1780 9 1969 CH Y15T19 M 1868 10 1969 CH Y20T24 T 30230
ages. **Filter** only the total number of children born, and discard the boy and girl counts. It should look like this:
# A tibble: 10 × 5 year canton age_of_mother sex_of_child total_born <dbl> <chr> <chr> <chr> <dbl> 1 1969 CH _T T 102520 2 1969 CH Y10T14 T 6 3 1969 CH Y15T19 T 3648 4 1969 CH Y20T24 T 30230 5 1969 CH Y25T29 T 36206 6 1969 CH Y30T34 T 20479 7 1969 CH Y35T39 T 9077 8 1969 CH Y40T44 T 2633 9 1969 CH Y45T49 T 240 10 1969 CH Y50T54 T 1
sex_of_child column is pretty useless, isn’t it? Let’s
select() only the columns that we care about. It should looklike this:
# A tibble: 10 × 4 year canton age_of_mother total_born <dbl> <chr> <chr> <dbl> 1 1969 CH _T 102520 2 1969 CH Y10T14 6 3 1969 CH Y15T19 3648 4 1969 CH Y20T24 30230 5 1969 CH Y25T29 36206 6 1969 CH Y30T34 20479 7 1969 CH Y35T39 9077 8 1969 CH Y40T44 2633 9 1969 CH Y45T49 240 10 1969 CH Y50T54 1
age_of_mother column is a bit of a mess. Let’s clean it up. A
function called str_sub() can help us with this. Learn how itworks by typing `?str_sub` into the console. You can also just experiment with it with a test string.
1 str_sub("Maybe sub_str stands for submarine_string.", 1, 5)
2 str_sub("What about substitute_string?", 12, -2)
3 str_sub("Nah, maybe I'm overthinking it.", 16, 27)
[1] "Maybe"
[1] "substitute_string"
[1] "overthinking"
Now that you understand the function, you can use it to mutate() the
age_of_mother column, and make two new columns, called
mother_age_from and mother_age_to. It should look like this:
# A tibble: 19,305 × 6 year canton age_of_mother total_born mother_age_from mother_age_to <dbl> <chr> <chr> <dbl> <chr> <chr> 1 1969 CH _T 102520 T "" 2 1969 CH Y10T14 6 10 "14" 3 1969 CH Y15T19 3648 15 "19" 4 1969 CH Y20T24 30230 20 "24" 5 1969 CH Y25T29 36206 25 "29" 6 1969 CH Y30T34 20479 30 "34" 7 1969 CH Y35T39 9077 35 "39" 8 1969 CH Y40T44 2633 40 "44" 9 1969 CH Y45T49 240 45 "49" 10 1969 CH Y50T54 1 50 "54" # ℹ 19,295 more rows
canton column also
includes the total number of births in Switzerland. We don’t wantthat. Let’s use **filter()** to remove the rows where the `canton` column is equal to “CH”. We also want to get rid of the rows where the age_of_mother is “\_T”. It should look like this:
# A tibble: 17,160 × 6 year canton age_of_mother total_born mother_age_from mother_age_to <dbl> <chr> <chr> <dbl> <chr> <chr> 1 1969 1 Y10T14 0 10 14 2 1969 1 Y15T19 532 15 19 3 1969 1 Y20T24 4608 20 24 4 1969 1 Y25T29 6149 25 29 5 1969 1 Y30T34 3535 30 34 6 1969 1 Y35T39 1423 35 39 7 1969 1 Y40T44 366 40 44 8 1969 1 Y45T49 25 45 49 9 1969 1 Y50T54 0 50 54 10 1969 1 Y55T59 0 55 59 # ℹ 17,150 more rows
mother_age_from and mother_age_to
columns are still characters. It’s still just a string that lookslike a number. We can convert them to integers with the `as.integer()` function. You should mutate over the column again, using the as.integer() function. It should look like this:
# A tibble: 17,160 × 6 year canton age_of_mother total_born mother_age_from mother_age_to <dbl> <chr> <chr> <dbl> <int> <int> 1 1969 1 Y10T14 0 10 14 2 1969 1 Y15T19 532 15 19 3 1969 1 Y20T24 4608 20 24 4 1969 1 Y25T29 6149 25 29 5 1969 1 Y30T34 3535 30 34 6 1969 1 Y35T39 1423 35 39 7 1969 1 Y40T44 366 40 44 8 1969 1 Y45T49 25 45 49 9 1969 1 Y50T54 0 50 54 10 1969 1 Y55T59 0 55 59 # ℹ 17,150 more rows
age_of_mother column anymore. Let’s
select() only the columns that we care about. It should looklike this:
# A tibble: 17,160 × 5 year canton total_born mother_age_from mother_age_to <dbl> <chr> <dbl> <int> <int> 1 1969 1 0 10 14 2 1969 1 532 15 19 3 1969 1 4608 20 24 4 1969 1 6149 25 29 5 1969 1 3535 30 34 6 1969 1 1423 35 39 7 1969 1 366 40 44 8 1969 1 25 45 49 9 1969 1 0 50 54 10 1969 1 0 55 59 # ℹ 17,150 more rows
overwrite the original value, so we simply use `->` to give it the same name as before. It should look like this:
Before
1 births |>
2 rename(3 year = YEAR,
4 ...5 ...6 ...After
1 births <- births |>
2 rename(3 year = YEAR,
4 ...5 ...6 ...There’s one last annoying thing about this data set: the canton numbers. It would be really annoying to have to remember that Zurich is canton number 1, and so on. We can fix this by joining the data set with another data set that has the canton names.
We can find the canton names and numbers here:
https://www.bfs.admin.ch/asset/de/453856
Can you find the download link for the canton names and numbers? Use
download.file() to download it into your project folder.
Now import the data set into your R session. This one is especially messy, so I wrote some code to help you out. You can just copy and paste this code into your own file if you like.
1 library(readxl)
2 3 canton_names <- read_excel("input_data/canton_nums.xlsx")
4 5 canton_names <- canton_names |>
6 select(1:2) |> # I select the first two columns because the rest are filled with junk.
7 tail(-4) |> # I delete the first four rows, because the first four are metadata.
8 head(26) # I select only the first 26 rows, because the rest are metadata.
9 10 # Now I rename the columns to something more useful.11 colnames(canton_names) <- c("bfs_canton_number", "canton_name")
12 13 canton_names
# A tibble: 26 × 2 bfs_canton_number canton_name <chr> <chr> 1 1 Zürich 2 2 Bern 3 3 Luzern 4 4 Uri 5 5 Schwyz 6 6 Obwalden 7 7 Nidwalden 8 8 Glarus 9 9 Zug 10 10 Freiburg # ℹ 16 more rows
Now the tricky part: joining. there’s several different functions to
join things, but the one we’ll use is left_join(). This function takes
two data sets, and joins them together into one. It’s called “left join”
because the data set on the left side of the function is the one that
will be kept, and the data set on the right side will be joined to it.
Now we use left_join() to join the two data sets together. This
function takes two arguments: the first is the data set you want to
join, the second is by=, in which you put the names of the columns you
want to join by. In this case, we want to join by the canton_number
column in the births data set, and the bfs_canton_number column in
the canton_names data set.
When we do this, it will match all the rows in the births data set
with the corresponding row in the canton_names data set. If there’s no
match, it will put NA in the column, meaning that there is no data
there.
1 births |>
2 left_join(canton_names, by = c("canton" = "bfs_canton_number"))
# A tibble: 17,160 × 6 year canton total_born mother_age_from mother_age_to canton_name <dbl> <chr> <dbl> <int> <int> <chr> 1 1969 1 0 10 14 Zürich 2 1969 1 532 15 19 Zürich 3 1969 1 4608 20 24 Zürich 4 1969 1 6149 25 29 Zürich 5 1969 1 3535 30 34 Zürich 6 1969 1 1423 35 39 Zürich 7 1969 1 366 40 44 Zürich 8 1969 1 25 45 49 Zürich 9 1969 1 0 50 54 Zürich 10 1969 1 0 55 59 Zürich # ℹ 17,150 more rows
When this looks good, save it to a variable.
1 births <- births |>
2 left_join(canton_names, by = c("canton" = "bfs_canton_number"))
Make sure that the data types are the same! If they’re not, you need to
use a function to convert them using mutate(). Some functions that can
do this are:
as.integer() 0, 1, 2, 3as.numeric() 0.0, 1.0, 2.0, 3.0as.logical() FALSE, TRUE, TRUE, TRUEas.character() “0”, “one”, “2”, “Zürich”as.roman() I, II, III, IVFor example, here are two data sets that we want to join together. The first data set is a list of students, and the second data set is a list of grades. We want to join them together by the student number, to see which students got which grades. But it won’t work!
Table 1: Two tables
1 # table on the left2 students
3 # table on the right4 grades
# A tibble: 7 × 2 name student_number <chr> <chr> 1 Urs 101 2 Rebekka 102 3 Dario 103 4 Jörg 104 5 Maude 105 6 Daniel 206 7 Mark 207
1 # table on the right2 grades
# A tibble: 10 × 2 student_number grade <dbl> <dbl> 1 101 95 2 102 85 3 103 90 4 104 100 5 105 90 6 106 90 7 107 85 8 108 70 9 109 60 10 110 55
In the first table, the student_number column is a character, and in
the second table, the student_number column is a number. In RStudio,
you can see the data types when you print to the console. We need to
convert one of them so that they match.
1 students |>
2 mutate(student_number = as.numeric(student_number))
# A tibble: 7 × 2 name student_number <chr> <dbl> 1 Urs 101 2 Rebekka 102 3 Dario 103 4 Jörg 104 5 Maude 105 6 Daniel 206 7 Mark 207
Now, we can join them together. Here, the column names are the same, so
we can simplify the by= argument.
1 students |>
2 mutate(student_number = as.numeric(student_number)) |>
3 left_join(grades, by = "student_number")
# A tibble: 7 × 3 name student_number grade <chr> <dbl> <dbl> 1 Urs 101 95 2 Rebekka 102 85 3 Dario 103 90 4 Jörg 104 100 5 Maude 105 90 6 Daniel 206 NA 7 Mark 207 NA
In this example, you’ll notice that there are some students who don’t
have grades. This is because they’re not in the grades data set.
Additionally, there were some grades that weren’t associated with a
student, because they’re not in the students data set.
left_join() is called a “left join” because the data on the left (the
first argument) will be kept, but the stuff on the right will be dropped
if there isn’t a match. However, there are some other kinds of joins
that you can use.
inner_join() will only keep the rows that have a match in both
data sets. If there’s no match, it will be dropped.1 students |>
2 mutate(student_number = as.numeric(student_number)) |>
3 inner_join(grades, by = "student_number")
# A tibble: 5 × 3 name student_number grade <chr> <dbl> <dbl> 1 Urs 101 95 2 Rebekka 102 85 3 Dario 103 90 4 Jörg 104 100 5 Maude 105 90
right_join() is the opposite of left_join(). It will keep the
data on the right, and drop the data on the left if there isn’t amatch.
1 students |>
2 mutate(student_number = as.numeric(student_number)) |>
3 right_join(grades, by = "student_number")
# A tibble: 10 × 3 name student_number grade <chr> <dbl> <dbl> 1 Urs 101 95 2 Rebekka 102 85 3 Dario 103 90 4 Jörg 104 100 5 Maude 105 90 6 <NA> 106 90 7 <NA> 107 85 8 <NA> 108 70 9 <NA> 109 60 10 <NA> 110 55
full_join() will keep all the data, even if there isn’t a match.
If there isn’t a match, it will put NA in the column.1 students |>
2 mutate(student_number = as.numeric(student_number)) |>
3 full_join(grades, by = "student_number")
# A tibble: 12 × 3 name student_number grade <chr> <dbl> <dbl> 1 Urs 101 95 2 Rebekka 102 85 3 Dario 103 90 4 Jörg 104 100 5 Maude 105 90 6 Daniel 206 NA 7 Mark 207 NA 8 <NA> 106 90 9 <NA> 107 85 10 <NA> 108 70 11 <NA> 109 60 12 <NA> 110 55
I find myself using left_join() probably 95% of the time, but it’s
good to know that there are other options.
The new data we’ve created has some missing data. For example, some
students don’t have grades, and some grades don’t have students. Missing
data in R is represented by NA, and can create some problems for you.
There are two ways to deal with this: you can either replace the
missing data with something else, or you can drop the rows with
missing data.
To replace missing data, you can use the replace_na() function. This
function takes two arguments: the first is the data set you want to
replace the missing data in, and the second is the value you want to
replace the missing data with. For example, if you want to replace all
the missing data in the name column with “NO NAME FOUND”, you can use
the following code:
1 students |>
2 mutate(student_number = as.numeric(student_number)) |>
3 full_join(grades, by = "student_number") |>
4 mutate(name = replace_na(name, "NO NAME FOUND"))
# A tibble: 12 × 3 name student_number grade <chr> <dbl> <dbl> 1 Urs 101 95 2 Rebekka 102 85 3 Dario 103 90 4 Jörg 104 100 5 Maude 105 90 6 Daniel 206 NA 7 Mark 207 NA 8 NO NAME FOUND 106 90 9 NO NAME FOUND 107 85 10 NO NAME FOUND 108 70 11 NO NAME FOUND 109 60 12 NO NAME FOUND 110 55
Second, you can use the drop_na() function to drop the rows with
missing data. This function takes one argument: the data set you want to
drop the missing data from. For example, if you want to drop all the
rows with missing data in the grade column, you can use the following
code:
1 students |>
2 mutate(student_number = as.numeric(student_number)) |>
3 full_join(grades, by = "student_number") |>
4 mutate(name = replace_na(name, "NO NAME FOUND")) |>
5 drop_na(grade)
# A tibble: 10 × 3 name student_number grade <chr> <dbl> <dbl> 1 Urs 101 95 2 Rebekka 102 85 3 Dario 103 90 4 Jörg 104 100 5 Maude 105 90 6 NO NAME FOUND 106 90 7 NO NAME FOUND 107 85 8 NO NAME FOUND 108 70 9 NO NAME FOUND 109 60 10 NO NAME FOUND 110 55
Let’s go back to our cleaned birth data set and do some analysis.
For example, suppose we want to know how many children were born each
year to women over 45 years old. First, we need to filter() only the
rows where the mother_age_to column is greater than 45.
1 births |>
2 filter(mother_age_to > 45)
# A tibble: 7,150 × 6 year canton total_born mother_age_from mother_age_to canton_name <dbl> <chr> <dbl> <int> <int> <chr> 1 1969 1 25 45 49 Zürich 2 1969 1 0 50 54 Zürich 3 1969 1 0 55 59 Zürich 4 1969 1 0 60 64 Zürich 5 1969 1 0 65 69 Zürich 6 1969 2 26 45 49 Bern 7 1969 2 0 50 54 Bern 8 1969 2 0 55 59 Bern 9 1969 2 0 60 64 Bern 10 1969 2 0 65 69 Bern # ℹ 7,140 more rows
Now let’s think. We want to know how many children were born each year.
We need to group_by() the year column.
1 births |>
2 filter(mother_age_to > 45) |>
3 group_by(year)
# A tibble: 7,150 × 6 # Groups: year [55] year canton total_born mother_age_from mother_age_to canton_name <dbl> <chr> <dbl> <int> <int> <chr> 1 1969 1 25 45 49 Zürich 2 1969 1 0 50 54 Zürich 3 1969 1 0 55 59 Zürich 4 1969 1 0 60 64 Zürich 5 1969 1 0 65 69 Zürich 6 1969 2 26 45 49 Bern 7 1969 2 0 50 54 Bern 8 1969 2 0 55 59 Bern 9 1969 2 0 60 64 Bern 10 1969 2 0 65 69 Bern # ℹ 7,140 more rows
Now, we want to summarize() the data. We want to know the total
number of children born each year. We can use the sum() function to do
this.
1 births |>
2 filter(mother_age_to > 45) |>
3 group_by(year) |>
4 summarize(total_born = sum(total_born))
# A tibble: 55 × 2 year total_born <dbl> <dbl> 1 1969 241 2 1970 226 3 1971 196 4 1972 180 5 1973 153 6 1974 131 7 1975 102 8 1976 96 9 1977 83 10 1978 79 # ℹ 45 more rows
Finally, use this data to answer the following questions:
Please email me the code you used to find the answers in a document
named week_3_homework_(your_name).R by Tuesday, March 12th.
For people with some experience working with data, these are a bit easy. If you’d like more practice, here are some bonus questions:
These aren’t part of the homework, but might be a good challenge for you.
Can you download and clean this one?
population change in Zürich in 2000? (Excluding immigration and emmigration)
1 # 1. How many teenage births were there in Zürich between 2000 and 2020?2 births |>
3 filter(mother_age_to < 20, canton_name == "Zürich", year >= 2000, year <= 2020) |>
4 summarize(total_born = sum(total_born))
5 6 # 2. What is the approximate age of oldest woman to ever give birth in each canton?7 births |>
8 group_by(canton_name) |>
9 filter(total_born > 0) |>
10 summarize(oldest_mother = max(mother_age_to))
11 12 # 3. How many children were born in Zurich, Bern and Geneva in 2019?13 births |>
14 filter(year == 2019) |>
15 group_by(canton_name) |>
16 summarize(total_born = sum(total_born)) |>
17 filter(canton_name %in% c("Zürich", "Bern", "Genf"))
18 19 # 4. How many children were born in each Canton in 1980?20 births |>
21 filter(year == 1980) |>
22 group_by(canton_name) |>
23 summarize(total_born = sum(total_born))
24 25 # 5. In each canton, what was the most common age range for mothers to give birth in 1970, 1990, and 2010?26 births |>
27 filter(year %in% c(1970, 1990, 2010)) |>
28 group_by(canton_name, year) |>
29 filter(total_born == max(total_born)) |>
30 select(year, canton_name, mother_age_from, mother_age_to) |>
31 arrange(canton_name, year)
32 33 # 6. Between 2010 and 2020, what were the average number of children born each year in each canton?34 35 births |>
36 filter(year >= 2010, year <= 2020) |>
37 group_by(canton_name, year) |>
38 summarise(total_births_per_year = sum(total_born)) |>
39 summarize(avg_born = mean(total_births_per_year))
40 41 # 7. Building off this, which years had an above average birth rate for that decade?42 43 births |>
44 filter(year >= 2010, year <= 2020) |>
45 group_by(canton_name, year) |>
46 summarise(total_births_per_year = sum(total_born)) |>
47 summarize(year = year, total_births_per_year = total_births_per_year, avg_born = mean(total_births_per_year)) |>
48 mutate(above_average = total_births_per_year > avg_born) |>
49 filter(above_average) |>
50 select(canton_name, year)
51 52 # 8. Here's some data about the number of deaths in each canton. <https://opendata.swiss/de/dataset/todesfalle-nach-funf-jahres-altersgruppe-geschlecht-und-kanton-1969-2023> Can you download and clean this one?53 54 download.file("https://dam-api.bfs.admin.ch/hub/api/dam/assets/32007751/master", "input_data/deaths.csv")
55 56 deaths <- read_csv("input_data/deaths.csv")
57 colnames(deaths) <- c("year", "canton", "age", "sex", "total_deaths")
58 59 deaths <- deaths |>
60 filter(age == "_T") |>
61 filter(canton != "CH") |>
62 filter(sex == "T") |>
63 select(-age, -sex)
64 65 deaths
66 67 # 9. Simplify the births data into just number of births by year. Then join this data with the deaths data. What was the total population change in each canton?68 69 births <- births |>
70 group_by(year, canton_number, canton_name) |>
71 summarize(total_births = sum(total_born))
72 73 deaths <- deaths |>
74 rename(canton_number = canton)
75 76 birth_and_deaths <- births |>
77 left_join(deaths)
78 79 birth_and_deaths <- birth_and_deaths |>
80 mutate(total_population_change = total_births - total_deaths)
81 82 birth_and_deaths |>
83 filter(year == 2000) |>
84 filter(canton_name == "Zürich") |>
85 select(total_population_change)
86 87 # 10. Make a plot of the total births and deaths in Basel-Stadt between 1970 and 2000.88 89 birth_and_deaths |>
90 filter(canton_name == "Basel-Stadt") |>
91 select(year, canton_name, total_births, total_deaths) |>
92 pivot_longer(cols = c(total_births, total_deaths), names_to = "event", values_to = "total") |>
93 ggplot(aes(x = year, y = total, color = event)) +
94 geom_line()