library(tidyverse)
library(openintro)
ADA1: Class 07, Introduction to Data
Advanced Data Analysis 1, Stat 427/527, Fall 2025
Rubric
The context of this assignment comes from OpenIntro Labs for R and tidyverse:
This is a template for the assignment. Modify this and turn it in.
(1 p) 1. Histograms
Look carefully at these two histograms. How do they compare? Are features revealed in one that are obscured in the other?
- [SOLUTION] Write the text of your answer here…
data(nycflights)
nycflights |> names()
 [1] "year"      "month"     "day"       "dep_time"  "dep_delay" "arr_time"
 [7] "arr_delay" "carrier"   "tailnum"   "flight"    "origin"    "dest"
[13] "air_time"  "distance"  "hour"      "minute"
nycflights |> str()
tibble [32,735 × 16] (S3: tbl_df/tbl/data.frame)
$ year : int [1:32735] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
$ month : int [1:32735] 6 5 12 5 7 1 12 8 9 4 ...
$ day : int [1:32735] 30 7 8 14 21 1 9 13 26 30 ...
$ dep_time : int [1:32735] 940 1657 859 1841 1102 1817 1259 1920 725 1323 ...
$ dep_delay: num [1:32735] 15 -3 -1 -4 -3 -3 14 85 -10 62 ...
$ arr_time : int [1:32735] 1216 2104 1238 2122 1230 2008 1617 2032 1027 1549 ...
$ arr_delay: num [1:32735] -4 10 11 -34 -8 3 22 71 -8 60 ...
$ carrier : chr [1:32735] "VX" "DL" "DL" "DL" ...
$ tailnum : chr [1:32735] "N626VA" "N3760C" "N712TW" "N914DL" ...
$ flight : int [1:32735] 407 329 422 2391 3652 353 1428 1407 2279 4162 ...
$ origin : chr [1:32735] "JFK" "JFK" "JFK" "JFK" ...
$ dest : chr [1:32735] "LAX" "SJU" "LAX" "TPA" ...
$ air_time : num [1:32735] 313 216 376 135 50 138 240 48 148 110 ...
$ distance : num [1:32735] 2475 1598 2475 1005 296 ...
$ hour : num [1:32735] 9 16 8 18 11 18 12 19 7 13 ...
$ minute : num [1:32735] 40 57 59 41 2 17 59 20 25 23 ...
p1 <- ggplot(data = nycflights, aes(x = dep_delay))
p1 <- p1 + geom_histogram(binwidth = 15)
p2 <- ggplot(data = nycflights, aes(x = dep_delay))
p2 <- p2 + geom_histogram(binwidth = 150)
# grid of plots with patchwork
# https://patchwork.data-imaginist.com/articles/guides/assembly.html
library(patchwork)
p1 / p2
(1 p) 2. SFO in February
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
- [SOLUTION] Write the text of your answer here…
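One way to count the rows that survive a filter is to call nrow() on the result. The following is a minimal sketch on a made-up toy tibble, not the real nycflights data; the names toy_flights, toy_sfo_feb, and n_toy_sfo_feb and all values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for nycflights (values are made up)
toy_flights <- tibble(
    dest  = c("SFO", "LAX", "SFO", "SFO", "TPA")
  , month = c(2, 2, 3, 2, 2)
  )

# Keep only rows that satisfy both conditions
toy_sfo_feb <-
  toy_flights |>
  filter(
    dest == "SFO"
  , month == 2
  )

# Number of rows remaining after the filter
n_toy_sfo_feb <- nrow(toy_sfo_feb)
n_toy_sfo_feb
```

The same pattern should carry over to sfo_feb_flights once it has been created.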
# Insert code here
sfo_feb_flights <-
  nycflights |>
  filter(
    dest == "SFO"
  , month == 2
  )
(1 p) 3. Arrival delays
Describe the distribution of the arrival delays (variable: arr_delay) of the flights headed to SFO in February using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
- [SOLUTION] Write the text of your answer here…
# Insert code here
(1 p) 4. Carrier median and IQR
Calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
- [SOLUTION] Write the text of your answer here…
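Grouped summaries of this kind can be sketched with group_by() and summarize(). The example below uses a made-up toy tibble rather than sfo_feb_flights; toy_delays, toy_summary, and all values are hypothetical, and sorting by IQR is just one possible way to compare variability.

```r
library(dplyr)

# Hypothetical arrival delays (minutes) for two carriers; values are made up
toy_delays <- tibble(
    carrier   = c("AA", "AA", "AA", "UA", "UA", "UA")
  , arr_delay = c(-10, 0, 10, -40, 0, 40)
  )

# One row per carrier: median, IQR, and number of flights,
# sorted so the carrier with the largest IQR comes first
toy_summary <-
  toy_delays |>
  group_by(
    carrier
  ) |>
  summarize(
    median_delay = median(arr_delay)
  , iqr_delay    = IQR(arr_delay)
  , n_flights    = n()
  ) |>
  arrange(
    desc(iqr_delay)
  )

toy_summary
```

Reporting n_flights alongside the IQR helps flag carriers whose spread is estimated from only a handful of flights.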
# Insert code here
(1 p) 5. Departure delays by month
Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
- [SOLUTION] Write the text of your answer here…
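The mean and median can differ sharply when a few delays are extreme. A small sketch with made-up numbers (toy_dep_delay is hypothetical and is not drawn from nycflights) shows the two summaries side by side:

```r
# Hypothetical departure delays (minutes) for one month; values are made up,
# with a single very long delay at the end
toy_dep_delay <- c(-5, -3, 0, 2, 4, 300)

mean_delay   <- mean(toy_dep_delay)    # pulled upward by the 300-minute delay
median_delay <- median(toy_dep_delay)  # middle of the sorted values

c(mean = mean_delay, median = median_delay)
```

Computing both summaries for each month of the real data, for example with group_by(month) and summarize(), would allow the same comparison there.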
# Insert code here
(1 p) 6. On-time departure percentage
If you were selecting an airport based solely on on-time departure percentage, which NYC airport (variable "origin") would you choose to fly out of?
- [SOLUTION] Write the text of your answer here…
# Insert code here
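# Classify each flight: departures less than 5 minutes late count as "on time"
nycflights <-
  nycflights |>
  mutate(
    dep_type = ifelse(dep_delay < 5, "on time", "delayed")
  )
# Stacked bar chart of on-time vs. delayed departures by origin airport
p <- ggplot(data = nycflights, aes(x = origin, fill = dep_type))
p <- p + geom_bar()
p
# Proportion of on-time departures per origin, highest first
sum_ontime <-
  nycflights |>
  group_by(
    origin
  ) |>
  summarize(
    ot_dep_rate = sum(dep_type == "on time") / n()
  ) |>
  arrange(
    desc(ot_dep_rate)
  )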
sum_ontime
# A tibble: 3 × 2
  origin ot_dep_rate
  <chr>        <dbl>
1 LGA          0.728
2 JFK          0.694
3 EWR          0.637
(1 p) 7. Average speed
Mutate the data frame so that it includes a new variable, avg_speed, that contains the average speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
- [SOLUTION] Write the text of your answer here…
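The unit conversion in the hint can be sketched as follows: dividing air_time by 60 converts minutes to hours, so distance / (air_time / 60) is in miles per hour. The toy tibble below is made up (toy_flights and its values are hypothetical), and it assumes distance is recorded in miles.

```r
library(dplyr)

# Hypothetical flights; distance assumed to be in miles, air_time in minutes
toy_flights <- tibble(
    distance = c(2475, 300)
  , air_time = c(330, 60)
  )

# Convert minutes to hours, then divide distance by hours to get mph
toy_flights <-
  toy_flights |>
  mutate(
    avg_speed = distance / (air_time / 60)
  )

toy_flights
```

For the first toy row, 330 minutes is 5.5 hours, so 2475 / 5.5 gives 450 mph.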
# Insert code here
(1 p) 8. Scatterplot of speed and distance
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
- [SOLUTION] Write the text of your answer here…
# Insert code here
(2 p) 9. Scatterplot of arrival and departure delay
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
- [SOLUTION] Write the text of your answer here…
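Restricting a data frame to a few carriers can be done with %in% inside filter(). The sketch below uses a made-up toy tibble; toy_flights, toy_three, and all values are hypothetical. The codes "AA", "DL", and "UA" are the carrier codes for American, Delta, and United.

```r
library(dplyr)

# Hypothetical flights from five carriers; delay values are made up
toy_flights <- tibble(
    carrier   = c("AA", "B6", "DL", "UA", "VX")
  , dep_delay = c(5, 20, -2, 60, 0)
  , arr_delay = c(-3, 25, -10, 55, 4)
  )

# Keep only rows whose carrier is one of the three listed codes
toy_three <-
  toy_flights |>
  filter(
    carrier %in% c("AA", "DL", "UA")
  )

toy_three
```

The filtered data frame could then be passed to ggplot() with color = carrier inside aes() and geom_point() to color the points by carrier.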
# Insert code here