library(erikmisc)
# erikmisc contains utility functions that Erhardt uses in his work and teaching, particularly for statistical analysis and data handling in R.
# Before loading erikmisc, install it once with the following commands:
# install.packages("devtools")
# install.packages("Hmisc")
# devtools::install_github("erikerhardt/erikmisc", dependencies = TRUE)
library(tidyverse)
library(openintro)
library(statsr)
library(broom)
ADA1: Class 08, Introduction to linear regression
[Advanced Data Analysis 1]
Rubric
The context of this assignment comes from OpenIntro Labs for R and tidyverse:
This file is a template for the assignment. Please keep the questions in their original numbered order, insert your solutions below each question, and delete any unnecessary code or text before submitting.
Some questions are answered by the code you’ve written. In those cases, in your answer write “see code”.
The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through a range of variables for many countries around the globe. It serves as a rough objective measure of the relationships between the different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The Human Freedom Index is co-published annually by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
In this lab, you’ll be analysing data from the Human Freedom Index reports. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.
Getting Started
The data we’re working with is in the openintro package and it’s called hfi, short for Human Freedom Index.
(1 p) 1. Data basics
- What are the dimensions of the dataset?
- What does each row represent?
- Write the text of your answer here…
# Insert code here
data(hfi, package = "openintro")
attr(hfi, "spec") <- NULL # remove variable class specification attribute
dim(hfi)
[1] 1458  123
hfi
# A tibble: 1,458 × 123
    year ISO_code countries  region               pf_rol_procedural pf_rol_civil
   <dbl> <chr>    <chr>      <chr>                            <dbl>        <dbl>
 1  2016 ALB      Albania    Eastern Europe                    6.66         4.55
 2  2016 DZA      Algeria    Middle East & North…                NA           NA
 3  2016 AGO      Angola     Sub-Saharan Africa                  NA           NA
 4  2016 ARG      Argentina  Latin America & the…              7.10         5.79
 5  2016 ARM      Armenia    Caucasus & Central …                NA           NA
 6  2016 AUS      Australia  Oceania                           8.44         7.53
 7  2016 AUT      Austria    Western Europe                    8.97         7.87
 8  2016 AZE      Azerbaijan Caucasus & Central …                NA           NA
 9  2016 BHS      Bahamas    Latin America & the…              6.93         6.01
10  2016 BHR      Bahrain    Middle East & North…                NA           NA
# ℹ 1,448 more rows
# ℹ 117 more variables: pf_rol_criminal <dbl>, pf_rol <dbl>,
# pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
# pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
# pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
# pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
# pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …
(1 p) 2. Data subset hfi_2016
- The dataset spans a lot of years, but we are only interested in data from year 2016.
- Filter the hfi data frame for year 2016, select the listed variables, and assign the result to a data frame named hfi_2016.
  - year
  - ISO_code
  - countries
  - region
  - pf_expression_control
  - pf_score
  - hf_score
- Write the text of your answer here…
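The filter-then-select pattern asked for here can be sketched on a small made-up table. This is only an illustration under stated assumptions: the data frame toy, its columns, and its values are hypothetical and are not part of hfi. The verbs filter() and select() come from dplyr, which is loaded with the tidyverse.

```r
library(dplyr)

# Hypothetical toy data standing in for hfi (all values are made up)
toy <- tibble(
  year      = c(2015, 2016, 2016, 2016),
  countries = c("A", "A", "B", "C"),
  score     = c(6.1, 6.3, 7.8, 5.2),
  extra     = c(1, 2, 3, 4)
)

# Keep only the 2016 rows, then keep only the columns of interest
toy_2016 <- toy %>%
  filter(year == 2016) %>%
  select(year, countries, score)

dim(toy_2016)  # 3 rows, 3 columns
```

The same two steps, with the variable names listed in this question, should produce hfi_2016.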
# Insert code here
hfi_2016 <-
hfi
# more code here...
(1 p) 3. Model 1, Linear relationship?
- What type of plot would you use to display the relationship between the personal freedom score, pf_score, and pf_expression_control?
- Plot this relationship using the variable pf_expression_control as the predictor. Does the relationship look linear?
- If you knew a country’s pf_expression_control, or its score out of 10, with 0 being the most, of political pressures and controls on media content, would you be comfortable using a linear model to predict the personal freedom score?
- If the relationship looks linear, quantify the strength of the relationship with the correlation coefficient.
- Write the text of your answer here…
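As a minimal sketch of the two steps in this question (a scatterplot, then a correlation coefficient), the example below uses base R on made-up numbers. The vectors x and y are hypothetical and stand in for a predictor and a response; they are not taken from hfi.

```r
# Hypothetical toy data (made-up values, roughly linear)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(4.9, 5.6, 6.1, 6.4, 7.2, 7.5, 8.1, 8.4)

# Scatterplot with the predictor on the horizontal axis
plot(x, y, xlab = "predictor (x)", ylab = "response (y)")

# Pearson correlation: values near +1 or -1 indicate a strong linear association
r <- cor(x, y)
r
```

If the columns you correlate contain missing values, cor() returns NA unless you pass an argument such as use = "complete.obs".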
# Insert code here
Sum of squared residuals
(1 p) 4. Model 1, Residuals
- Looking at your plot from the previous exercise, describe the relationship between these two variables.
- Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
- Write the text of your answer here…
# Insert code here
(0 p) 5. Model 1, Least squares line, building intuition (practice, only)
- Using plot_ss, choose a line that does a good job of minimizing the sum of squares.
- Run the function several times. What was the smallest sum of squares that you got?
- How does it compare to your neighbours?
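To make the quantity being minimized concrete, here is a small sketch that computes the sum of squared residuals for a line chosen by eye and for the least squares line from lm(). The data and the helper sum_sq() are hypothetical and exist only for this illustration.

```r
# Hypothetical toy data (made-up values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Sum of squared residuals for a line with a given intercept and slope
sum_sq <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

# A line chosen by eye
ss_guess <- sum_sq(intercept = 2, slope = 0.5)

# The least squares line found by lm()
fit <- lm(y ~ x)
ss_ls <- sum(residuals(fit)^2)

c(guess = ss_guess, least_squares = ss_ls)  # 3.75 and 2.4
```

No line chosen by hand can produce a smaller sum of squares than the least squares line, which is why the lm() value is the lower of the two.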
statsr::plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)
The linear model
(2 p) 6. Model 2, Linear relationship
- Plot the relationship between x = pf_expression_control and y = hf_score, or the total human freedom score.
- Fit a new model for this relationship.
- Using the estimates from the R output, write the equation of the regression line.
- What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?
- Write the text of your answer here…
(I recommend that you follow the example from the website. I’ve included two example equations for you to select from and complete by replacing “beta0” and “beta1” with numbers.)
\[ \hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control} \]
\[ \widehat{\textrm{hf\_score}} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control} \]
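One way to get from R output to an equation of this form is sketched below on made-up data. The vectors x and y are hypothetical and constructed so that the fitted line is known in advance; with your own model, the same coef() call returns the numbers that replace beta0 and beta1.

```r
# Hypothetical toy data constructed so that y = 2 + 0.5 * x exactly
x <- c(0, 2, 4, 6, 8)
y <- 2 + 0.5 * x

fit <- lm(y ~ x)

# coef() returns the estimated intercept (beta0-hat) and slope (beta1-hat)
b <- coef(fit)
b

# Paste the estimates into the equation template
sprintf("y-hat = %.2f + %.2f * x", b[["(Intercept)"]], b[["x"]])
```

In this toy fit the slope of 0.5 means that a one-unit increase in x is associated with a 0.5-unit increase in the predicted y; the slope in your model is read the same way, in the units of the two scores.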
# Insert code here
Prediction and prediction errors
(1 p) 7. Model 1, Prediction
- If someone saw the least squares regression line and not the actual data, how would they predict a country’s personal freedom score (pf_score) for one with a 3 rating for pf_expression_control?
- Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
- Write the text of your answer here…
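The arithmetic behind a prediction and its residual can be sketched as follows. The two coefficients are the rounded estimates printed by tidy(m1) in this section; they come from the template as rendered, so your own estimates may differ once question 2 is completed. The observed value is hypothetical and used only to show how a residual is formed.

```r
# Rounded estimates copied from the tidy(m1) output in this section
b0 <- 4.62   # intercept
b1 <- 0.491  # slope for pf_expression_control

# Predicted pf_score for a country with pf_expression_control = 3
predicted <- b0 + b1 * 3
predicted  # 6.093

# Residual = observed - predicted.
# The observed value below is hypothetical, for illustration only;
# look up the actual pf_score of a matching country in hfi_2016.
observed_hypothetical <- 5.5
residual <- observed_hypothetical - predicted
residual  # -0.593
```

A negative residual means the line overestimates the observed value; a positive residual means it underestimates.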
# Insert code here
m1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
#summary(m1)
tidy(m1)
# A tibble: 2 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              4.62     0.0575      80.4 0
2 pf_expression_control    0.491    0.0101      48.8 8.19e-303
#glance(m1)
Model diagnostics
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
(1 p) 8. Model 1, Residual plots, linearity
- Is there any apparent pattern in the residuals plot?
- What does this indicate about the linearity of the relationship between the two variables?
The e_plot_lm_diagnostics() function with the “simple” set of plots gives 6 plots to review. We will focus on 3 of these for now.
- QQ-plot for Normality (points stay within band)
- Cook’s Distance for influential points (ignore for now)
- Cook’s Distance by leverage for explaining influential points (ignore for now)
- Residuals vs predicted values
- Residuals vs each x variable
- Box-Cox transformation for y-variable transformation if residuals weren’t normal (ignore for now)
- Write the text of your answer here…
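For readers who want to see what a residuals vs predicted values plot is built from, here is a minimal base R sketch. It assumes simulated data (a straight line plus random noise), so the plot shows what a well-behaved pattern looks like; none of the values come from hfi.

```r
# Hypothetical simulated data: a linear trend plus random noise
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 3 + 0.5 * x + rnorm(50, sd = 0.4)

fit <- lm(y ~ x)

# Residuals against fitted values; points scattered evenly around 0
# with no curve suggest the linear form is adequate
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

A curved band of points in this plot, rather than an even scatter around the dashed line, would suggest that a straight line is missing part of the relationship.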
e_plot_lm_diagnostics(m1, sw_plot_set = "simple")


# Alternatively, plot(m1) produces the four standard diagnostic plots
#| fig-width: 8
#| fig-height: 8
par(mfrow = c(2, 2)) # 2 rows, 2 columns
plot(m1)
par(mfrow = c(1, 1)) # reset to default (optional)
(1 p) 9. Model 1, Residual and histogram plots, normality
- Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?
- Write the text of your answer here…
hist(residuals(m1), breaks = 30) # histogram of the residuals with about 30 bins
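The histogram and the QQ-plot mentioned earlier are two views of the same residuals. The sketch below draws both in base R for a model fit to simulated data; the data are hypothetical, with normally distributed noise, so both plots show roughly what "nearly normal" looks like.

```r
# Hypothetical simulated data: a linear trend plus normal noise
set.seed(2)
x <- seq(1, 10, length.out = 50)
y <- 3 + 0.5 * x + rnorm(50, sd = 0.4)

fit <- lm(y ~ x)
res <- residuals(fit)

# Histogram: look for a roughly symmetric, single-peaked shape
hist(res, breaks = 15, main = "Residuals", xlab = "Residual")

# Normal QQ-plot: points close to the reference line support near-normality
qqnorm(res)
qqline(res)
```

Strong skew in the histogram, or points bending away from the line at the ends of the QQ-plot, would be reasons to doubt the nearly normal residuals condition.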
(1 p) 10. Model 1, Constant variability
- Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?
- Write the text of your answer here…
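Alongside the visual check, a rough numeric comparison can help you describe what you see. The sketch below is not part of the assignment's method: it is one informal approach, on simulated data, that splits the observations at the median fitted value and compares the spread of the residuals in each half.

```r
# Hypothetical simulated data: a linear trend plus noise of constant spread
set.seed(3)
x <- seq(1, 10, length.out = 60)
y <- 3 + 0.5 * x + rnorm(60, sd = 0.4)

fit <- lm(y ~ x)

# Split observations at the median fitted value and compare residual spread
low <- fitted(fit) <= median(fitted(fit))
sd_low  <- sd(residuals(fit)[low])
sd_high <- sd(residuals(fit)[!low])

c(low = sd_low, high = sd_high)
```

Similar values in the two halves are consistent with constant variability; a much larger spread in one half corresponds to the fan shape you would look for in the residuals vs. fitted plot.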

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.