Tidyverse 3: Data tips & tricks

Morley J Weston

Review: loading data, head(), tail()

This week is all about practice. I want to make sure you understand the basics before we start making charts, maps and websites.

Let’s clean some data together.

  1. Make a new file
  2. Save your file
  3. Load the Tidyverse
  4. … and let’s get started. Here’s a link to a data set:

You can get the download link, usually, by finding the button, right clicking, and hitting “Copy link address”. This will be slightly different depending on your browser.

You need to make sure you have a link to the file you
want.

  1. Download it into your project folder.
R
1
download.file("https://dam-api.bfs.admin.ch/hub/api/dam/assets/32007752/master", "input_data/births.csv")

Note: On Windows, you need to make an adjustment. You need to add mode = "wb" to the download.file() function.

R
1
download.file("https://dam-api.bfs.admin.ch/hub/api/dam/assets/32007752/master", "input_data/births.csv", mode = "wb") 
  1. Load this data set into your R session
R
1
births <- read_csv("input_data/births.csv")
  1. Take a look at the first 20 rows of the data set. Can you figure out what each column is?
R
1
births |> head()
# A tibble: 6 × 5
YEAR CANTON AGE_MOTHER SEX_CHILD OBS_VALUE
<dbl> <chr>  <chr>      <chr>         <dbl>
1  1969 CH     _T         T            102520
2  1969 CH     _T         F             49990
3  1969 CH     _T         M             52530
4  1969 CH     Y10T14     T                 6
5  1969 CH     Y10T14     F                 3
6  1969 CH     Y10T14     M                 3
  1. Now look at the last 20 rows. What are some of the steps we’re going to have to take?
R
1
births |> tail()
# A tibble: 6 × 5
YEAR CANTON AGE_MOTHER SEX_CHILD OBS_VALUE
<dbl> <chr>  <chr>      <chr>         <dbl>
1  2023 26     Y60T64     T                 0
2  2023 26     Y60T64     F                 0
3  2023 26     Y60T64     M                 0
4  2023 26     Y65T69     T                 0
5  2023 26     Y65T69     F                 0
6  2023 26     Y65T69     M                 0

Can we guess what every column is?

Classwork: filter()

To review, the filter() function will sort out the data you want from the stuff you don’t. For example, if you only wanted a dataset for boys in Zürich in 1970, you could use the following code:

R
1
births |>
2
  filter(YEAR == 1970) |> 
3
  filter(CANTON == 1) |>
4
  filter(SEX_CHILD == "M")
# A tibble: 13 × 5
YEAR CANTON AGE_MOTHER SEX_CHILD OBS_VALUE
<dbl> <chr>  <chr>      <chr>         <dbl>
1  1970 1      _T         M              8330
2  1970 1      Y10T14     M                 0
3  1970 1      Y15T19     M               279
4  1970 1      Y20T24     M              2274
5  1970 1      Y25T29     M              3119
6  1970 1      Y30T34     M              1745
7  1970 1      Y35T39     M               768
8  1970 1      Y40T44     M               133
9  1970 1      Y45T49     M                12
10  1970 1      Y50T54     M                 0
11  1970 1      Y55T59     M                 0
12  1970 1      Y60T64     M                 0
13  1970 1      Y65T69     M                 0

Please answer the following questions: Your answer should be a code block that uses the filter() function to find the answer.

  1. How many babies were born in Vaud (Canton number 22) in 2020?
  2. How many girls were born in Zurich (Canton number 1) to mothers aged 30-34 in 2019?
  3. Between 2000 and 2020, How many years did Ticino (Canton number 21) have fewer than 1400 boys born?
  4. How many boys and girls were born in Switzerland in 2015?

Classwork: select(), rename(), mutate()

Let’s clean up the data set.

  1. TYPING IN ALL CAPS IS ANNOYING. Rename the columns to year, canton, age_of_mother, sex_of_child, total_born. You can use
`rename()` or `colnames() <- c()`. It should look like this:
# A tibble: 10 × 5
year canton age_of_mother sex_of_child total_born
<dbl> <chr>  <chr>         <chr>             <dbl>
1  1969 CH     _T            T                102520
2  1969 CH     _T            F                 49990
3  1969 CH     _T            M                 52530
4  1969 CH     Y10T14        T                     6
5  1969 CH     Y10T14        F                     3
6  1969 CH     Y10T14        M                     3
7  1969 CH     Y15T19        T                  3648
8  1969 CH     Y15T19        F                  1780
9  1969 CH     Y15T19        M                  1868
10  1969 CH     Y20T24        T                 30230
  1. Let’s say I don’t care about gender. I want to know the total number of children born in each Canton each year to mothers of different
ages. **Filter** only the total number of children born, and discard
the boy and girl counts. It should look like this:
# A tibble: 10 × 5
year canton age_of_mother sex_of_child total_born
<dbl> <chr>  <chr>         <chr>             <dbl>
1  1969 CH     _T            T                102520
2  1969 CH     Y10T14        T                     6
3  1969 CH     Y15T19        T                  3648
4  1969 CH     Y20T24        T                 30230
5  1969 CH     Y25T29        T                 36206
6  1969 CH     Y30T34        T                 20479
7  1969 CH     Y35T39        T                  9077
8  1969 CH     Y40T44        T                  2633
9  1969 CH     Y45T49        T                   240
10  1969 CH     Y50T54        T                     1
  1. Now that sex_of_child column is pretty useless, isn’t it? Let’s select() only the columns that we care about. It should look
like this:
# A tibble: 10 × 4
year canton age_of_mother total_born
<dbl> <chr>  <chr>              <dbl>
1  1969 CH     _T                102520
2  1969 CH     Y10T14                 6
3  1969 CH     Y15T19              3648
4  1969 CH     Y20T24             30230
5  1969 CH     Y25T29             36206
6  1969 CH     Y30T34             20479
7  1969 CH     Y35T39              9077
8  1969 CH     Y40T44              2633
9  1969 CH     Y45T49               240
10  1969 CH     Y50T54                 1
  1. The age_of_mother column is a bit of a mess. Let’s clean it up. A function called str_sub() can help us with this. Learn how it
works by typing `?str_sub` into the console. You can also just
experiment with it with a test string.
R
1
str_sub("Maybe sub_str stands for submarine_string.", 1, 5)
2
str_sub("What about substitute_string?", 12, -2)
3
str_sub("Nah, maybe I'm overthinking it.", 16, 27)
[1] "Maybe"
[1] "substitute_string"
[1] "overthinking"

Now that you understand the function, you can use it to mutate() the age_of_mother column, and make two new columns, called mother_age_from and mother_age_to. It should look like this:

# A tibble: 19,305 × 6
year canton age_of_mother total_born mother_age_from mother_age_to
<dbl> <chr>  <chr>              <dbl> <chr>           <chr>        
1  1969 CH     _T                102520 T               ""           
2  1969 CH     Y10T14                 6 10              "14"         
3  1969 CH     Y15T19              3648 15              "19"         
4  1969 CH     Y20T24             30230 20              "24"         
5  1969 CH     Y25T29             36206 25              "29"         
6  1969 CH     Y30T34             20479 30              "34"         
7  1969 CH     Y35T39              9077 35              "39"         
8  1969 CH     Y40T44              2633 40              "44"         
9  1969 CH     Y45T49               240 45              "49"         
10  1969 CH     Y50T54                 1 50              "54"         
# ℹ 19,295 more rows
  1. It looks like we have more filtering to do. the canton column also includes the total number of births in Switzerland. We don’t want
that. Let’s use **filter()** to remove the rows where the `canton`
column is equal to “CH”. We also want to get rid of the rows where
the age_of_mother is “\_T”. It should look like this:
# A tibble: 17,160 × 6
year canton age_of_mother total_born mother_age_from mother_age_to
<dbl> <chr>  <chr>              <dbl> <chr>           <chr>        
1  1969 1      Y10T14                 0 10              14           
2  1969 1      Y15T19               532 15              19           
3  1969 1      Y20T24              4608 20              24           
4  1969 1      Y25T29              6149 25              29           
5  1969 1      Y30T34              3535 30              34           
6  1969 1      Y35T39              1423 35              39           
7  1969 1      Y40T44               366 40              44           
8  1969 1      Y45T49                25 45              49           
9  1969 1      Y50T54                 0 50              54           
10  1969 1      Y55T59                 0 55              59           
# ℹ 17,150 more rows
  1. But wait! Notice that the mother_age_from and mother_age_to columns are still characters. It’s still just a string that looks
like a number. We can convert them to integers with the
`as.integer()` function. You should mutate over the column again,
using the as.integer() function. It should look like this:
# A tibble: 17,160 × 6
year canton age_of_mother total_born mother_age_from mother_age_to
<dbl> <chr>  <chr>              <dbl>           <int>         <int>
1  1969 1      Y10T14                 0              10            14
2  1969 1      Y15T19               532              15            19
3  1969 1      Y20T24              4608              20            24
4  1969 1      Y25T29              6149              25            29
5  1969 1      Y30T34              3535              30            34
6  1969 1      Y35T39              1423              35            39
7  1969 1      Y40T44               366              40            44
8  1969 1      Y45T49                25              45            49
9  1969 1      Y50T54                 0              50            54
10  1969 1      Y55T59                 0              55            59
# ℹ 17,150 more rows
  1. Now we don’t need the age_of_mother column anymore. Let’s select() only the columns that we care about. It should look
like this:
# A tibble: 17,160 × 5
year canton total_born mother_age_from mother_age_to
<dbl> <chr>       <dbl>           <int>         <int>
1  1969 1               0              10            14
2  1969 1             532              15            19
3  1969 1            4608              20            24
4  1969 1            6149              25            29
5  1969 1            3535              30            34
6  1969 1            1423              35            39
7  1969 1             366              40            44
8  1969 1              25              45            49
9  1969 1               0              50            54
10  1969 1               0              55            59
# ℹ 17,150 more rows
  1. Now that we have a clean data set, let’s save it to a variable, if you haven’t been doing that already. This time, I think it’s safe to
overwrite the original value, so we simply use `->` to give it the
same name as before. It should look like this:

Before

R
1
births |>
2
    rename(
3
    year = YEAR,
4
    ...
5
    ...
6
    ...

After

R
1
births <- births |>
2
    rename(
3
    year = YEAR,
4
    ...
5
    ...
6
    ...

Joining two datasets together with `left_join()`

There’s one last annoying thing about this data set: the canton numbers. It would be really annoying to have to remember that Zurich is canton number 1, and so on. We can fix this by joining the data set with another data set that has the canton names.

We can find the canton names and numbers here:

https://www.bfs.admin.ch/asset/de/453856

Can you find the download link for the canton names and numbers? Use download.file() to download it into your project folder.

Now import the data set into your R session. This one is especially messy, so I wrote some code to help you out. You can just copy and paste this code into your own file if you like.

R
1
library(readxl)
2
3
canton_names <- read_excel("input_data/canton_nums.xlsx")
4
5
canton_names <- canton_names |>
6
  select(1:2) |> # I select the first two columns because the rest are filled with junk.
7
  tail(-4) |> # I delete the first four rows, because the first four are metadata.
8
  head(26) # I select only the first 26 rows, because the rest are metadata.
9
10
# Now I rename the columns to something more useful.
11
colnames(canton_names) <- c("bfs_canton_number", "canton_name")
12
13
canton_names
# A tibble: 26 × 2
bfs_canton_number canton_name
<chr>             <chr>      
1 1                 Zürich     
2 2                 Bern       
3 3                 Luzern     
4 4                 Uri        
5 5                 Schwyz     
6 6                 Obwalden   
7 7                 Nidwalden  
8 8                 Glarus     
9 9                 Zug        
10 10                Freiburg   
# ℹ 16 more rows

Now the tricky part: joining. there’s several different functions to join things, but the one we’ll use is left_join(). This function takes two data sets, and joins them together into one. It’s called “left join” because the data set on the left side of the function is the one that will be kept, and the data set on the right side will be joined to it.

Now we use left_join() to join the two data sets together. This function takes two arguments: the first is the data set you want to join, the second is by=, in which you put the names of the columns you want to join by. In this case, we want to join by the canton_number column in the births data set, and the bfs_canton_number column in the canton_names data set.

When we do this, it will match all the rows in the births data set with the corresponding row in the canton_names data set. If there’s no match, it will put NA in the column, meaning that there is no data there.

R
1
births |> 
2
  left_join(canton_names, by = c("canton" = "bfs_canton_number"))
# A tibble: 17,160 × 6
year canton total_born mother_age_from mother_age_to canton_name
<dbl> <chr>       <dbl>           <int>         <int> <chr>      
1  1969 1               0              10            14 Zürich     
2  1969 1             532              15            19 Zürich     
3  1969 1            4608              20            24 Zürich     
4  1969 1            6149              25            29 Zürich     
5  1969 1            3535              30            34 Zürich     
6  1969 1            1423              35            39 Zürich     
7  1969 1             366              40            44 Zürich     
8  1969 1              25              45            49 Zürich     
9  1969 1               0              50            54 Zürich     
10  1969 1               0              55            59 Zürich     
# ℹ 17,150 more rows

When this looks good, save it to a variable.

R
1
births <- births |> 
2
  left_join(canton_names, by = c("canton" = "bfs_canton_number"))

Make sure that the data types are the same! If they’re not, you need to use a function to convert them using mutate(). Some functions that can do this are:

  1. as.integer() 0, 1, 2, 3
  2. as.numeric() 0.0, 1.0, 2.0, 3.0
  3. as.logical() FALSE, TRUE, TRUE, TRUE
  4. as.character() “0”, “one”, “2”, “Zürich”
  5. as.roman() I, II, III, IV

For example, here are two data sets that we want to join together. The first data set is a list of students, and the second data set is a list of grades. We want to join them together by the student number, to see which students got which grades. But it won’t work!

Table 1: Two tables

R
1
# table on the left
2
students
3
# table on the right
4
grades
# A tibble: 7 × 2
name    student_number
<chr>   <chr>         
1 Urs     101           
2 Rebekka 102           
3 Dario   103           
4 Jörg    104           
5 Maude   105           
6 Daniel  206           
7 Mark    207           
R
1
# table on the right
2
grades
# A tibble: 10 × 2
student_number grade
<dbl> <dbl>
1            101    95
2            102    85
3            103    90
4            104   100
5            105    90
6            106    90
7            107    85
8            108    70
9            109    60
10            110    55

In the first table, the student_number column is a character, and in the second table, the student_number column is a number. In RStudio, you can see the data types when you print to the console. We need to convert one of them so that they match.

R
1
students |> 
2
  mutate(student_number = as.numeric(student_number))
# A tibble: 7 × 2
name    student_number
<chr>            <dbl>
1 Urs                101
2 Rebekka            102
3 Dario              103
4 Jörg               104
5 Maude              105
6 Daniel             206
7 Mark               207

Now, we can join them together. Here, the column names are the same, so we can simplify the by= argument.

R
1
students |> 
2
  mutate(student_number = as.numeric(student_number)) |> 
3
  left_join(grades, by = "student_number")
# A tibble: 7 × 3
name    student_number grade
<chr>            <dbl> <dbl>
1 Urs                101    95
2 Rebekka            102    85
3 Dario              103    90
4 Jörg               104   100
5 Maude              105    90
6 Daniel             206    NA
7 Mark               207    NA

Other types of join: `inner_join()`, `right_join()` and `full_join()`

In this example, you’ll notice that there are some students who don’t have grades. This is because they’re not in the grades data set. Additionally, there were some grades that weren’t associated with a student, because they’re not in the students data set.

left_join() is called a “left join” because the data on the left (the first argument) will be kept, but the stuff on the right will be dropped if there isn’t a match. However, there are some other kinds of joins that you can use.

  1. inner_join() will only keep the rows that have a match in both data sets. If there’s no match, it will be dropped.
R
1
students |> 
2
  mutate(student_number = as.numeric(student_number)) |> 
3
  inner_join(grades, by = "student_number")
# A tibble: 5 × 3
name    student_number grade
<chr>            <dbl> <dbl>
1 Urs                101    95
2 Rebekka            102    85
3 Dario              103    90
4 Jörg               104   100
5 Maude              105    90
  1. right_join() is the opposite of left_join(). It will keep the data on the right, and drop the data on the left if there isn’t a
match.
R
1
students |> 
2
  mutate(student_number = as.numeric(student_number)) |> 
3
  right_join(grades, by = "student_number")
# A tibble: 10 × 3
name    student_number grade
<chr>            <dbl> <dbl>
1 Urs                101    95
2 Rebekka            102    85
3 Dario              103    90
4 Jörg               104   100
5 Maude              105    90
6 <NA>               106    90
7 <NA>               107    85
8 <NA>               108    70
9 <NA>               109    60
10 <NA>               110    55
  1. full_join() will keep all the data, even if there isn’t a match. If there isn’t a match, it will put NA in the column.
R
1
students |> 
2
  mutate(student_number = as.numeric(student_number)) |> 
3
  full_join(grades, by = "student_number")
# A tibble: 12 × 3
name    student_number grade
<chr>            <dbl> <dbl>
1 Urs                101    95
2 Rebekka            102    85
3 Dario              103    90
4 Jörg               104   100
5 Maude              105    90
6 Daniel             206    NA
7 Mark               207    NA
8 <NA>               106    90
9 <NA>               107    85
10 <NA>               108    70
11 <NA>               109    60
12 <NA>               110    55

I find myself using left_join() probably 95% of the time, but it’s good to know that there are other options.

Dealing with missing data with `replace_na()` and `drop_na()`

The new data we’ve created has some missing data. For example, some students don’t have grades, and some grades don’t have students. Missing data in R is represented by NA, and can create some problems for you. There are two ways to deal with this: you can either replace the missing data with something else, or you can drop the rows with missing data.

To replace missing data, you can use the replace_na() function. This function takes two arguments: the first is the data set you want to replace the missing data in, and the second is the value you want to replace the missing data with. For example, if you want to replace all the missing data in the name column with “NO NAME FOUND”, you can use the following code:

R
1
students |> 
2
  mutate(student_number = as.numeric(student_number)) |> 
3
  full_join(grades, by = "student_number") |> 
4
  mutate(name = replace_na(name, "NO NAME FOUND"))
# A tibble: 12 × 3
name          student_number grade
<chr>                  <dbl> <dbl>
1 Urs                      101    95
2 Rebekka                  102    85
3 Dario                    103    90
4 Jörg                     104   100
5 Maude                    105    90
6 Daniel                   206    NA
7 Mark                     207    NA
8 NO NAME FOUND            106    90
9 NO NAME FOUND            107    85
10 NO NAME FOUND            108    70
11 NO NAME FOUND            109    60
12 NO NAME FOUND            110    55

Second, you can use the drop_na() function to drop the rows with missing data. This function takes one argument: the data set you want to drop the missing data from. For example, if you want to drop all the rows with missing data in the grade column, you can use the following code:

R
1
students |> 
2
  mutate(student_number = as.numeric(student_number)) |> 
3
  full_join(grades, by = "student_number") |> 
4
  mutate(name = replace_na(name, "NO NAME FOUND")) |> 
5
  drop_na(grade)
# A tibble: 10 × 3
name          student_number grade
<chr>                  <dbl> <dbl>
1 Urs                      101    95
2 Rebekka                  102    85
3 Dario                    103    90
4 Jörg                     104   100
5 Maude                    105    90
6 NO NAME FOUND            106    90
7 NO NAME FOUND            107    85
8 NO NAME FOUND            108    70
9 NO NAME FOUND            109    60
10 NO NAME FOUND            110    55

Review together: group_by(), summarize()

Let’s go back to our cleaned birth data set and do some analysis.

For example, suppose we want to know how many children were born each year to women over 45 years old. First, we need to filter() only the rows where the mother_age_to column is greater than 45.

R
1
births |>
2
  filter(mother_age_to > 45)
# A tibble: 7,150 × 6
year canton total_born mother_age_from mother_age_to canton_name
<dbl> <chr>       <dbl>           <int>         <int> <chr>      
1  1969 1              25              45            49 Zürich     
2  1969 1               0              50            54 Zürich     
3  1969 1               0              55            59 Zürich     
4  1969 1               0              60            64 Zürich     
5  1969 1               0              65            69 Zürich     
6  1969 2              26              45            49 Bern       
7  1969 2               0              50            54 Bern       
8  1969 2               0              55            59 Bern       
9  1969 2               0              60            64 Bern       
10  1969 2               0              65            69 Bern       
# ℹ 7,140 more rows

Now let’s think. We want to know how many children were born each year. We need to group_by() the year column.

R
1
births |>
2
  filter(mother_age_to > 45) |>
3
  group_by(year)
# A tibble: 7,150 × 6
# Groups:   year [55]
year canton total_born mother_age_from mother_age_to canton_name
<dbl> <chr>       <dbl>           <int>         <int> <chr>      
1  1969 1              25              45            49 Zürich     
2  1969 1               0              50            54 Zürich     
3  1969 1               0              55            59 Zürich     
4  1969 1               0              60            64 Zürich     
5  1969 1               0              65            69 Zürich     
6  1969 2              26              45            49 Bern       
7  1969 2               0              50            54 Bern       
8  1969 2               0              55            59 Bern       
9  1969 2               0              60            64 Bern       
10  1969 2               0              65            69 Bern       
# ℹ 7,140 more rows

Now, we want to summarize() the data. We want to know the total number of children born each year. We can use the sum() function to do this.

R
1
births |>
2
  filter(mother_age_to > 45) |>
3
  group_by(year) |>
4
  summarize(total_born = sum(total_born))
# A tibble: 55 × 2
year total_born
<dbl>      <dbl>
1  1969        241
2  1970        226
3  1971        196
4  1972        180
5  1973        153
6  1974        131
7  1975        102
8  1976         96
9  1977         83
10  1978         79
# ℹ 45 more rows

Homework: Answering questions with group_by() and summarize()

Finally, use this data to answer the following questions:

  1. How many teenage births were there in Zürich between 2000 and 2020?
  2. What is the approximate age of oldest woman to ever give birth in each canton?
  3. How many children were born in Zurich, Bern and Geneva in 2019?
  4. How many children were born in each Canton in 1980?
  5. In each canton, what was the most common age range for mothers to give birth in 1970, 1990, and 2010?

Please email me the code you used to find the answers in a document named week_3_homework_(your_name).R by Tuesday, March 12th.

Bonus questions

For people with some experience working with data, these are a bit easy. If you’d like more practice, here are some bonus questions:

These aren’t part of the homework, but might be a good challenge for you.

  1. Between 2010 and 2020, what were the average number of children born each year in each canton?
  2. Building off this, which years had an above average birth rate for that decade?
  3. Here’s some data about the number of deaths in each canton. https://opendata.swiss/de/dataset/todesfalle-nach-funf-jahres-altersgruppe-geschlecht-und-kanton-1969-2023
Can you download and clean this one?
  1. Simplify the births data into just number of births by year. Then join() this data with the deaths data. What was the total
population change in Zürich in 2000? (Excluding immigration and
emmigration)
  1. Make a plot of the total births and deaths in Basel-Stadt between 1970 and 2000.
R
1
# 1. How many teenage births were there in Zürich between 2000 and 2020?
2
births |> 
3
  filter(mother_age_to < 20, canton_name == "Zürich", year >= 2000, year <= 2020) |>
4
  summarize(total_born = sum(total_born))
5
6
# 2. What is the approximate age of oldest woman to ever give birth in each canton?
7
births |> 
8
  group_by(canton_name) |> 
9
  filter(total_born > 0) |> 
10
  summarize(oldest_mother = max(mother_age_to))
11
12
# 3. How many children were born in Zurich, Bern and Geneva in 2019?
13
births |> 
14
  filter(year == 2019) |> 
15
  group_by(canton_name) |>
16
  summarize(total_born = sum(total_born)) |> 
17
  filter(canton_name %in% c("Zürich", "Bern", "Genf"))
18
19
# 4. How many children were born in each Canton in 1980?
20
births |>
21
  filter(year == 1980) |>
22
  group_by(canton_name) |>
23
  summarize(total_born = sum(total_born))
24
25
# 5. In each canton, what was the most common age range for mothers to give birth in 1970, 1990, and 2010?
26
births |> 
27
  filter(year %in% c(1970, 1990, 2010)) |>
28
  group_by(canton_name, year) |> 
29
  filter(total_born == max(total_born)) |> 
30
  select(year, canton_name, mother_age_from, mother_age_to) |> 
31
  arrange(canton_name, year)
32
33
# 6.  Between 2010 and 2020, what were the average number of children born each year in each canton?
34
35
births |>
36
  filter(year >= 2010, year <= 2020) |>
37
  group_by(canton_name, year) |>
38
  summarise(total_births_per_year = sum(total_born)) |>
39
  summarize(avg_born = mean(total_births_per_year))
40
41
# 7.  Building off this, which years had an above average birth rate for that decade?
42
43
births |> 
44
  filter(year >= 2010, year <= 2020) |>
45
  group_by(canton_name, year) |>
46
  summarise(total_births_per_year = sum(total_born)) |>
47
  summarize(year = year, total_births_per_year = total_births_per_year, avg_born = mean(total_births_per_year)) |> 
48
  mutate(above_average = total_births_per_year > avg_born) |> 
49
  filter(above_average) |> 
50
  select(canton_name, year)
51
52
# 8.  Here's some data about the number of deaths in each canton. <https://opendata.swiss/de/dataset/todesfalle-nach-funf-jahres-altersgruppe-geschlecht-und-kanton-1969-2023> Can you download and clean this one?
53
54
download.file("https://dam-api.bfs.admin.ch/hub/api/dam/assets/32007751/master", "input_data/deaths.csv")
55
56
deaths <- read_csv("input_data/deaths.csv")
57
colnames(deaths) <- c("year", "canton", "age", "sex", "total_deaths")
58
59
deaths <- deaths |> 
60
  filter(age == "_T") |> 
61
  filter(canton != "CH") |> 
62
  filter(sex == "T") |> 
63
  select(-age, -sex)
64
65
deaths
66
67
# 9.  Simplify the births data into just number of births by year. Then join this data with the deaths data. What was the total population change in each canton?
68
69
births <- births |> 
70
  group_by(year, canton_number, canton_name) |> 
71
  summarize(total_births = sum(total_born))
72
73
deaths <- deaths |>
74
  rename(canton_number = canton)
75
76
birth_and_deaths <- births |> 
77
  left_join(deaths)
78
79
birth_and_deaths <- birth_and_deaths |> 
80
  mutate(total_population_change = total_births - total_deaths)
81
82
birth_and_deaths |> 
83
  filter(year == 2000) |>
84
  filter(canton_name == "Zürich") |>
85
  select(total_population_change)
86
87
# 10. Make a plot of the total births and deaths in Basel-Stadt between 1970 and 2000.
88
89
birth_and_deaths |> 
90
  filter(canton_name == "Basel-Stadt") |>
91
  select(year, canton_name, total_births, total_deaths) |> 
92
  pivot_longer(cols = c(total_births, total_deaths), names_to = "event", values_to = "total") |> 
93
  ggplot(aes(x = year, y = total, color = event)) +
94
  geom_line()

MJW