ADA1: Class 08, Introduction to linear regression

[Advanced Data Analysis 1]

Author

Your Name

Published

September 4, 2025

Rubric

The context of this assignment comes from OpenIntro Labs for R and tidyverse:

8. Simple linear regression

This file is a template for the assignment. Please keep the questions in their original numbered order, insert your solutions below each question, and delete any unnecessary code or text before submitting.

Some questions are answered by the code you’ve written. In those cases, write “see code” as your answer.

library(erikmisc)
# erikmisc contains utility functions that Erhardt uses in his work and teaching, mainly for statistical analysis and data handling in R.
# Before loading erikmisc, install it (and its prerequisites) by running the following commands once:
# install.packages("devtools")
# install.packages("Hmisc")
# devtools::install_github("erikerhardt/erikmisc", dependencies = TRUE)

library(tidyverse)
library(openintro)
library(statsr)
library(broom)

The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through many different variables for countries around the globe. It serves as a rough objective measure of the relationships between the different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The Human Freedom Index is co-published annually by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.

In this lab, you’ll be analysing data from the Human Freedom Index reports. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.

Getting Started

The data we’re working with is in the openintro package and it’s called hfi, short for Human Freedom Index.

(1 p) 1. Data basics

What are the dimensions of the dataset?
What does each row represent?

Write the text of your answer here…

# Insert code here
data(hfi, package = "openintro")
attr(hfi, "spec") <- NULL  # remove variable class specification attribute

dim(hfi)

[1] 1458  123

hfi

# A tibble: 1,458 × 123
    year ISO_code countries  region               pf_rol_procedural pf_rol_civil
   <dbl> <chr>    <chr>      <chr>                            <dbl>        <dbl>
 1  2016 ALB      Albania    Eastern Europe                    6.66         4.55
 2  2016 DZA      Algeria    Middle East & North…             NA           NA   
 3  2016 AGO      Angola     Sub-Saharan Africa               NA           NA   
 4  2016 ARG      Argentina  Latin America & the…              7.10         5.79
 5  2016 ARM      Armenia    Caucasus & Central …             NA           NA   
 6  2016 AUS      Australia  Oceania                           8.44         7.53
 7  2016 AUT      Austria    Western Europe                    8.97         7.87
 8  2016 AZE      Azerbaijan Caucasus & Central …             NA           NA   
 9  2016 BHS      Bahamas    Latin America & the…              6.93         6.01
10  2016 BHR      Bahrain    Middle East & North…             NA           NA   
# ℹ 1,448 more rows
# ℹ 117 more variables: pf_rol_criminal <dbl>, pf_rol <dbl>,
#   pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
#   pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
#   pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
#   pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
#   pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …

(1 p) 2. Data subset `hfi_2016`

The dataset spans many years, but we are only interested in data from the year 2016.
Filter the hfi data frame for year 2016, select the variables listed below, and assign the result to a data frame named hfi_2016.

year
ISO_code
countries
region
pf_expression_control
pf_score
hf_score

Write the text of your answer here…

# Insert code here

hfi_2016 <-
  hfi
  # more code here...
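
One possible shape for this step is sketched below, assuming dplyr's filter() and select(). The toy_hfi tibble, its values, and the extra_column variable are made up for illustration and only stand in for hfi; the same pipeline should carry over to the real data frame.

```r
library(dplyr)

# Toy stand-in for hfi (made-up values, for illustration only)
toy_hfi <- tibble(
  year                  = c(2015, 2016, 2016),
  ISO_code              = c("AAA", "AAA", "BBB"),
  countries             = c("Country A", "Country A", "Country B"),
  region                = c("Region 1", "Region 1", "Region 2"),
  pf_expression_control = c(5.0, 5.25, 7.5),
  pf_score              = c(6.8, 7.0, 8.1),
  hf_score              = c(6.5, 6.7, 7.9),
  extra_column          = c(1, 2, 3)  # hypothetical column, dropped by select()
)

toy_2016 <-
  toy_hfi %>%
  filter(year == 2016) %>%                       # keep rows from 2016 only
  select(year, ISO_code, countries, region,      # keep only the listed variables
         pf_expression_control, pf_score, hf_score)

dim(toy_2016)
```

Checking dim() after the pipeline is a quick way to confirm that both the row filter and the column selection took effect.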

(1 p) 3. Model 1, Linear relationship?

What type of plot would you use to display the relationship between the personal freedom score, pf_score, and pf_expression_control?
Plot this relationship using the variable pf_expression_control as the predictor. Does the relationship look linear?
If you knew a country’s pf_expression_control (its score out of 10 for political pressures and controls on media content, with 0 indicating the most pressure), would you be comfortable using a linear model to predict its personal freedom score?
If the relationship looks linear, quantify the strength of the relationship with the correlation coefficient.

Write the text of your answer here…

# Insert code here
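
As a rough sketch of how the plot and correlation might be produced, the example below draws a scatterplot with ggplot2 and computes the correlation with cor(). The toy data frame is made up for illustration; it includes one missing value because the hfi variables contain NAs, which is why the use = "complete.obs" argument is shown.

```r
library(ggplot2)

# Toy data (made-up values); the NA mimics missing values in hfi
toy <- data.frame(
  pf_expression_control = c(2, 4, 5, 7, 9, NA),
  pf_score              = c(5.5, 6.1, 7.0, 8.2, 9.0, 7.5)
)

# Scatterplot with the predictor on the x-axis
p <- ggplot(toy, aes(x = pf_expression_control, y = pf_score)) +
  geom_point()
print(p)  # ggplot2 warns that the row with the missing value is removed

# Without `use`, cor() returns NA when any value is missing
r_na <- cor(toy$pf_expression_control, toy$pf_score)

# Restrict the calculation to rows where both values are observed
r <- cor(toy$pf_expression_control, toy$pf_score, use = "complete.obs")
r
```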

Sum of squared residuals

(1 p) 4. Model 1, Residuals

Looking at your plot from the previous exercise, describe the relationship between these two variables.
Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

Write the text of your answer here…

# Insert code here

(0 p) 5. Model 1, Least squares line, building intuition (practice only)

Using plot_ss, choose a line that does a good job of minimizing the sum of squares.
Run the function several times. What was the smallest sum of squares that you got?
How does it compare to your neighbours?

statsr::plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)
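
To make the quantity being minimized concrete, here is a small sketch that computes the sum of squared residuals for a hand-picked line and for the least squares line. The data values and the helper name sum_sq_resid are hypothetical, chosen only for illustration.

```r
# Toy data (made-up values, for illustration only)
x <- c(2, 4, 5, 7, 9)
y <- c(5.5, 6.1, 7.0, 8.2, 9.0)

# Sum of squared residuals for a candidate line y = b0 + b1 * x
# (sum_sq_resid is a hypothetical helper, not part of statsr)
sum_sq_resid <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# A hand-picked guess, like a line chosen by eye
ss_guess <- sum_sq_resid(b0 = 4, b1 = 0.6, x = x, y = y)

# The least squares line from lm()
fit   <- lm(y ~ x)
ss_ls <- sum_sq_resid(coef(fit)[[1]], coef(fit)[[2]], x, y)

c(guess = ss_guess, least_squares = ss_ls)
```

No other line can produce a smaller sum of squares than the one lm() returns, so any guess should give a value at least as large.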

The linear model

(2 p) 6. Model 2, Linear relationship

Plot the relationship between x=pf_expression_control and y=hf_score, or the total human freedom score.
Fit a new model for this relationship.
Using the estimates from the R output, write the equation of the regression line.
What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?

Write the text of your answer here…

(I recommend that you follow the example from the website. I’ve included two example equations for you to select from and complete by replacing “beta0” and “beta1” with numbers.)

\[ \hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control} \]

\[ \widehat{\textrm{hf\_score}} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control} \]

# Insert code here
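
A minimal sketch of fitting the model and reading off the equation is given below, assuming lm() and coef(). The toy data values are made up, so the printed coefficients are not the answer to this question; only the workflow is meant to transfer.

```r
# Toy data (made-up values, for illustration only)
toy <- data.frame(
  pf_expression_control = c(2, 4, 5, 7, 9),
  hf_score              = c(5.8, 6.3, 6.9, 7.6, 8.4)
)

# Fit the simple linear regression of hf_score on pf_expression_control
m_toy <- lm(hf_score ~ pf_expression_control, data = toy)

# Named vector: "(Intercept)" is beta0-hat, the predictor's name is beta1-hat
b <- coef(m_toy)

# Assemble the fitted equation as text
equation <- sprintf(
  "hf_score_hat = %.3f + %.3f * pf_expression_control",
  b[["(Intercept)"]], b[["pf_expression_control"]]
)
equation
```

The slope is read as the average change in the predicted hf_score for each one-point increase in pf_expression_control.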

Prediction and prediction errors

(1 p) 7. Model 1, Prediction

If someone saw the least squares regression line and not the actual data, how would they predict the personal freedom score (pf_score) for a country with a pf_expression_control rating of 3?
Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

Write the text of your answer here…

# Insert code here

m1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
#summary(m1)
tidy(m1)

# A tibble: 2 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              4.62     0.0575      80.4 0        
2 pf_expression_control    0.491    0.0101      48.8 8.19e-303

#glance(m1)
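
The arithmetic behind a prediction and its residual can be sketched as follows, using the rounded estimates from the tidy(m1) output above. Your own estimates may differ once hfi_2016 has been filtered, and the observed value below is hypothetical, included only to show how the sign of a residual is interpreted.

```r
# Rounded estimates copied from the tidy(m1) output above
b0 <- 4.62   # intercept
b1 <- 0.491  # slope for pf_expression_control

# Predicted pf_score when pf_expression_control = 3
predicted <- b0 + b1 * 3

# Hypothetical observed score, for illustration only (not taken from the data)
observed <- 5.5

# Residual = observed - predicted;
# a negative residual means the line overestimates the observed value
residual <- observed - predicted

c(predicted = predicted, residual = residual)
```

With a fitted model object, predict(m1, newdata = data.frame(pf_expression_control = 3)) should give the same prediction at full precision.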

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

(1 p) 8. Model 1, Residual plots, linearity

Is there any apparent pattern in the residuals plot?
What does this indicate about the linearity of the relationship between the two variables?

Note

The e_plot_lm_diagnostics() function with the “simple” set of plots gives 6 plots to review. We will focus on 3 of these for now.

QQ-plot for Normality (points stay within band)
Cook’s Distance for influential points (ignore for now)
Cook’s Distance by leverage for explaining influential points (ignore for now)
Residuals vs predicted values
Residuals vs each x variable
Box-Cox transformation for y-variable transformation if residuals weren’t normal (ignore for now)

Write the text of your answer here…

e_plot_lm_diagnostics(m1, sw_plot_set = "simple")

#| fig-width:  8
#| fig-height: 8
# or you can simply use plot(m1) to produce the four commonly used residual plots
par(mfrow = c(2, 2))  # 2 rows, 2 columns
plot(m1)

par(mfrow = c(1, 1))  # reset to default (optional)
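
If it helps to see what the residuals vs. fitted plot is built from, the sketch below constructs one by hand with fitted() and residuals(). The data are made up for illustration. Points scattered randomly around the zero line are consistent with linearity, while a curve suggests a nonlinear relationship and a funnel shape suggests non-constant variability.

```r
# Toy data (made-up values, for illustration only)
x <- c(2, 4, 5, 7, 9, 3, 6, 8)
y <- c(5.5, 6.1, 7.0, 8.2, 9.0, 6.0, 7.4, 8.3)

fit <- lm(y ~ x)

# Residuals (observed - fitted) against fitted values
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed reference line at zero
```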

(1 p) 9. Model 1, Residual and histogram plots, normality

Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?

Write the text of your answer here…

hist(residuals(m1), breaks = 30)
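
A normal quantile plot is a common companion to the histogram when judging normality. The sketch below uses simulated data (the coefficients, noise level, and seed are arbitrary choices, not estimates from hfi) and draws both displays with base R.

```r
# Simulated data with normal errors (arbitrary, illustrative settings)
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 4.6 + 0.5 * x + rnorm(50, sd = 0.8)

fit <- lm(y ~ x)
res <- residuals(fit)

# Histogram: look for a roughly symmetric, bell-shaped distribution
hist(res, breaks = 10, main = "Residuals", xlab = "Residual")

# Normal QQ plot: points close to the line support the normality condition
qqnorm(res)
qqline(res)
```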

(1 p) 10. Model 1, Constant variability

Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?

Write the text of your answer here…

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.