ADA1: Class 04, Study Design and Sampling

Advanced Data Analysis 1, Stat 427/527, Fall 2025

Author

Your Name

Published

August 26, 2025

Rubric

Answer each question and specify the supporting evidence.

(6 p) 1. IRS Sampling

In this case study there are two errors in the methodology: one regarding sampling and the other regarding calculation. Focus on the sampling issue.

Case Study 1: The stated case of the IRS [1]

The full details of the lawsuit brought by the IRS against the defendant will not be covered in this paper. However, parts of this case are statistically interesting. The defendant was the owner of a tax preparation firm with several locations, and he was directly or indirectly responsible for the preparation and filing of at least 24,399 federal income tax returns for the tax years 2003 through 2007. The IRS stated that they reviewed 345 of the 24,399 returns identified. Of the 345 returns reviewed, 313 resulted in an additional tax assessment. This means that 91% of the returns in the original sample owed additional tax to the IRS, and the additional tax was owed for a variety of reasons. The IRS calculated from these 345 returns that the actual tax loss directly due to these returns being improperly prepared by the defendant(s) was in excess of $1.1 million (United States v. Brier, et al., pg. 3). The IRS further stated that if this rate of loss were applied to all 24,399 returns, then the estimated loss to the United States government would be in excess of $85 million for the years 2003 through 2007 (United States v. Brier, et al., pg. 5). Thus the IRS was seeking damages close to $85 million.
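The arithmetic behind the quoted figures can be reproduced with a short calculation. The following is a minimal sketch in Python, not the IRS's actual procedure: the variable and function names are illustrative, and because the excerpt only says the sample loss was "in excess of $1.1 million", the value 1,100,000 is an assumed stand-in for the unknown exact amount.

```python
# Figures quoted in the case study excerpt.
TOTAL_RETURNS = 24_399               # returns attributed to the defendant
SAMPLED_RETURNS = 345                # returns the IRS reviewed
RETURNS_WITH_ADDITIONAL_TAX = 313    # reviewed returns needing additional tax

# Assumption: the excerpt says "in excess of $1.1 million"; the exact
# sample loss is not given, so 1.1 million is used as a stand-in.
SAMPLE_LOSS_DOLLARS = 1_100_000


def scale_to_population(sample_total: float, sample_size: int, population_size: int) -> float:
    """Scale a sample total to the population by the mean per sampled unit.

    This is only a sensible estimate if the sample is representative of
    the population, which is exactly what the questions below examine.
    """
    return sample_total / sample_size * population_size


# Share of reviewed returns that needed an additional assessment.
share_with_additional_tax = RETURNS_WITH_ADDITIONAL_TAX / SAMPLED_RETURNS

# Fraction of all identified returns that were actually reviewed.
sampling_fraction = SAMPLED_RETURNS / TOTAL_RETURNS

# Loss projected to all returns under the assumed sample loss.
projected_loss = scale_to_population(SAMPLE_LOSS_DOLLARS, SAMPLED_RETURNS, TOTAL_RETURNS)

print(f"Share with additional tax: {share_with_additional_tax:.1%}")
print(f"Sampling fraction:         {sampling_fraction:.2%}")
print(f"Projected loss:            ${projected_loss:,.0f}")
```

The projection moves in direct proportion to the assumed sample loss, so a different stand-in value changes the result accordingly.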

1. (2 p) Did the paper specify how the sample of 345 returns was selected? If so, identify the sampling method. If not, based on the description and numbers provided, what sampling method do you think might have been used?

Solution:

2. (2 p) What problems with the inferential results might have been introduced by the potential sampling method used?

Solution:

3. (2 p) What sampling method do you suggest for this case? Why?

Solution:

(4 p) 2. Electronic health records (EHR)

Background [2]: A substantial portion of the US population remains uninsured, and an even larger group uses healthcare only rarely. Although the trend is toward greater use of EHRs, only about 40% of patients currently have their information recorded in EHRs.

Case Study 2: Nurses Health Study

The large Nurses Health Study followed 48,470 postmenopausal women (all of whom were nurses), 30–63 years of age, for 10 years (337,854 person-years). The study [3] concluded that use of hormone replacement therapy cut the rate of serious coronary heart disease nearly in half.
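The person-years figure can be unpacked with a little arithmetic. Person-years add up the follow-up time contributed by every participant, so dividing by the number of participants gives the average follow-up per woman. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the case study.
PARTICIPANTS = 48_470
PERSON_YEARS = 337_854
NOMINAL_FOLLOW_UP_YEARS = 10

# Average follow-up time contributed per participant.
mean_follow_up_years = PERSON_YEARS / PARTICIPANTS

# Person-years that would result if every participant had been
# followed for the full nominal period.
max_possible_person_years = PARTICIPANTS * NOMINAL_FOLLOW_UP_YEARS

# Fraction of that maximum that was actually observed.
observed_fraction = PERSON_YEARS / max_possible_person_years

print(f"Mean follow-up per participant: {mean_follow_up_years:.2f} years")
print(f"Maximum possible person-years:  {max_possible_person_years:,}")
print(f"Observed fraction of maximum:   {observed_fraction:.1%}")
```

The average comes out below the nominal 10 years, which suggests that not every participant contributed a full 10 years of follow-up.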

1. (2 p) Despite the large sample size, why should we be cautious in interpreting the study’s conclusion about hormone replacement therapy? What sampling issue arises from the fact that all participants were nurses, and how might this affect the generalizability of the results to the broader population?

Solution:

Case Study 3: Estimating disease prevalence

A young physician, Mary, wanted to predict the number of patients she might see in her specialized field. She obtained EHRs from her university hospital for the previous year and calculated the proportion of admitted patients who had a particular ailment out of the total hospital admissions. Based on this analysis, she concluded that a substantial number of people would likely need her services in a clinic.

1. (2 p) Can Mary generalize her conclusion to the broader population? Is her estimated proportion of patients with the particular ailment likely too high or too low? Explain why.

Solution:

Footnotes

  1. Kennedy, K., & Bishop, J. (2014). Random sampling issues in a federal court case, a case study. Case Studies In Business, Industry And Government Statistics, 5(2), 111-114.

  2. Kaplan, Robert M., David A. Chambers, and Russell E. Glasgow. “Big data and large sample size: a cautionary note on the potential for bias.” Clinical and translational science 7.4 (2014): 342-346.

  3. Stampfer MJ, Colditz GA, Willett WC, Manson JE, Rosner B, Speizer FE, Hennekens CH. Postmenopausal estrogen therapy and cardiovascular disease. Ten-year follow-up from the nurses’ health study. N Engl J Med. 1991; 325: 756–762.