# Analysis of Variance, Design, and Regression


## Preface

This book examines the application of basic statistical methods: primarily analysis of variance and regression but with some discussion of count data. It is directed primarily towards Masters degree students in statistics studying analysis of variance, design of experiments, and regression analysis. I have found that the Masters level regression course is often popular with students outside of statistics. These students are often weaker mathematically and the book caters to that fact while continuing to give a complete matrix formulation of regression.

The book is complete enough to be used as a second course for upper division and beginning graduate students in statistics and for graduate students in other disciplines. To do this, one must be selective in the material covered, but the more theoretical material appropriate only for Statistics Masters students is generally isolated in separate subsections and, less often, in separate sections.

For a Masters level course in analysis of variance and design, I have the students review Chapter 2. I present Chapter 3 while simultaneously presenting the examples of Section 4.2, and then present Chapters 5 and 6. I very briefly review the first five sections of Chapter 7, present Sections 7.11 and 7.12 in detail, and then cover Chapters 9, 10, 11, 12, and 17. Depending on time constraints, I will delete material or add material from Chapter 16.

For a Masters level course in regression analysis, I again have the students review Chapter 2 and I review Chapter 3 with examples from Section 4.2. I then present Chapters 7, 13, and 14, Appendix A, Chapter 15, Sections 16.1.2, 16.3, 16.5 (along with analysis of covariance), Section 8.7 and finally Chapter 18. All of this is done in complete detail. If any time remains I like to supplement the course with discussion of response surface methods.

As a second course for upper division and beginning graduate students in statistics and graduate students in other disciplines, I cover the first eight chapters with omission of the more technical material. A follow up course covers the less technical aspects of Chapters 9 through 15 and Appendix A.

I think the book is reasonably encyclopedic. It contains everything I would like my students to know about applied statistics before they take courses in linear model theory or log-linear models.

I believe that beginning students (even Statistics Masters students) often find statistical procedures to be a morass of vaguely related special techniques. As a result, this book focuses on four connecting themes.

• Most inferential procedures are based on identifying a (scalar) parameter of interest, estimating that parameter, obtaining the standard error of the estimate, and identifying the appropriate reference distribution. Given these items, the inferential procedures are identical for various parameters.
• Balanced one-way analysis of variance has a simple, intuitive interpretation in terms of comparing the sample variance of the group means with the mean of the sample variances for each group. All balanced analysis of variance problems are considered in terms of computing sample variances for various group means.
• Comparing different models provides a structure for examining both balanced and unbalanced analysis of variance problems and for examining regression problems. In some problems the most reasonable analysis is simply to find a succinct model that fits the data well.
• Checking assumptions is a crucial part of every statistical analysis.
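
As a sketch of the first theme, in notation assumed here rather than taken from the preface: write $\theta$ for the parameter of interest, $\hat{\theta}$ for its estimate, and $\mathrm{SE}(\hat{\theta})$ for the standard error. Taking a $t$ distribution with $df$ degrees of freedom as one common choice of reference distribution (an assumption made only for illustration), the inferential procedures take the same form whatever the parameter happens to be:

$$
\frac{\hat{\theta} - \theta}{\mathrm{SE}(\hat{\theta})} \sim t(df),
\qquad
\hat{\theta} \pm t\!\left(1 - \tfrac{\alpha}{2},\, df\right)\mathrm{SE}(\hat{\theta}),
\qquad
\text{reject } H_0\colon \theta = m \ \text{ if } \
\frac{|\hat{\theta} - m|}{\mathrm{SE}(\hat{\theta})} > t\!\left(1 - \tfrac{\alpha}{2},\, df\right).
$$

Here $t(1 - \alpha/2,\, df)$ denotes the $1 - \alpha/2$ quantile of the reference distribution. Only the four ingredients (parameter, estimate, standard error, reference distribution) change from problem to problem.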

The object of statistical data analysis is to reveal useful structure within the data. In a model-based setting, I know of two ways to do this. One way is to find a succinct model for the data. In such a case, the structure revealed is simply the model. The model selection approach is particularly appropriate when the ultimate goal of the analysis is making predictions. This book uses the model selection approach for multiple regression and for general unbalanced multifactor analysis of variance. The other approach to revealing structure is to start with a general model, identify interesting one-dimensional parameters, and perform statistical inferences on these parameters. This parametric approach requires that the general model involve parameters that are easily interpretable. We use the parametric approach for one-way analysis of variance, balanced multifactor analysis of variance, and simple linear regression. In particular, the parametric approach to analysis of variance presented here involves a strong emphasis on examining contrasts, including interaction contrasts. In analyzing two-way tables of counts, we use a partitioning method that is analogous to looking at contrasts.

All statistical models involve assumptions. Checking the validity of these assumptions is crucial because the models we use are never correct. We hope that our models are good approximations to the true condition of the data and experience indicates that our models often work very well. Nonetheless, to have faith in our analyses, we need to check the modeling assumptions as best we can. Some assumptions are very difficult to evaluate, e.g., the assumption that observations are statistically independent. For checking other assumptions, a variety of standard tools has been developed. Using these tools is as integral to a proper statistical analysis as is performing an appropriate confidence interval or test. For the most part, using model-checking tools without the aid of a computer is more trouble than most people are willing to tolerate.

My experience indicates that students gain a great deal of insight into balanced analysis of variance by actually doing the computations. The computation of the mean square for treatments in a balanced one-way analysis of variance is trivial on any hand calculator with a variance or standard deviation key. More importantly, the calculation reinforces the fundamental and intuitive idea behind the balanced analysis of variance test, i.e., that a mean square for treatments is just a multiple of the sample variance of the corresponding treatment means. I believe that as long as students find the balanced analysis of variance computations challenging, they should continue to do them by hand (calculator). I think that automated computation should be motivated by boredom rather than bafflement.
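
The hand computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: the function name `balanced_one_way_anova` and the toy data are hypothetical, and the sketch assumes every group has the same number of observations.

```python
from statistics import mean, variance


def balanced_one_way_anova(groups):
    """Mean squares and F statistic for a balanced one-way ANOVA.

    `groups` is a list of equal-length lists of observations
    (hypothetical input format chosen for this sketch).
    """
    n = len(groups[0])  # observations per group
    if any(len(g) != n for g in groups):
        raise ValueError("balanced ANOVA requires equal group sizes")
    a = len(groups)  # number of groups (treatments)

    group_means = [mean(g) for g in groups]
    group_variances = [variance(g) for g in groups]  # sample variances

    # Mean square for treatments: n times the sample variance of the group means.
    ms_treatments = n * variance(group_means)
    # Mean squared error: the mean of the within-group sample variances.
    ms_error = mean(group_variances)

    f_statistic = ms_treatments / ms_error
    degrees_of_freedom = (a - 1, a * (n - 1))
    return ms_treatments, ms_error, f_statistic, degrees_of_freedom


# Toy data, invented purely for illustration.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
ms_trt, ms_err, f_stat, df = balanced_one_way_anova(toy_groups)
print(ms_trt, ms_err, f_stat, df)
```

The two commented lines are the whole computation: a variance key applied to the group means, and an average of the group variances. That is what makes the calculation feasible on a hand calculator.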

In addition to the four primary themes discussed above, there are several other characteristics that I have tried to incorporate into this book.

I have tried to use examples to motivate theory rather than to illustrate theory. Most chapters begin with data and an initial analysis of that data. After illustrating results for the particular data, we go back and examine general models and procedures. I have done this to make the book more palatable to two groups of people: those who only care about theory after seeing that it is useful and those unfortunates who can never bring themselves to care about theory. (The older I get, the more I identify with the first group. As for the other group, I find myself agreeing with W. Edwards Deming that experience without theory teaches nothing.) As mentioned earlier, the theoretical material is generally confined to separate subsections or, less often, separate sections, so it is easy to ignore.

I believe that the ultimate goal of all statistical analysis is prediction of observable quantities. I have incorporated predictive inferential procedures where they seemed natural.

The object of most statistics books is to illustrate techniques rather than to analyze data; this book is no exception. Nonetheless, I think we do students a disservice by not showing them a substantial portion of the work necessary to analyze even 'nice' data. To this end, I have tried to consistently examine residual plots, to present alternative analyses using different transformations and case deletions, and to give some final answers in plain English. I have also tried to introduce such material as early as possible. I have included reasonably detailed examinations of a three-factor analysis of variance and of a split plot design with four factors. I have included some examples in which, like real life, the final answers are not 'neat.' While I have tried to introduce statistical ideas as soon as possible, I have tried to keep the mathematics as simple as possible for as long as possible. For example, matrix formulations are postponed to the last chapter on multiple regression and the last section on unbalanced analysis of variance.

I never use side conditions or normal equations in analysis of variance.

In multiple comparison methods, (weakly) controlling the experimentwise error rate is discussed in terms of first performing an omnibus test for no treatment effects and then choosing a criterion for evaluating individual hypotheses. Most methods considered divide into those that use the omnibus $F$ test, those that use the Studentized range test, and the Bonferroni method, which does not use any omnibus test.

I have tried to be very clear about the fact that experimental designs are set up for arbitrary groups of treatments and that factorial treatment structures are simply an efficient way of defining the treatments in some problems. Thus, the nature of a randomized complete block design does not depend on how the treatments happen to be defined. The analysis always begins with a breakdown of the sum of squares into treatments, blocks, and error. Further analysis of the treatments then focuses on whatever structure happens to be present.

The analysis of covariance chapter includes an extensive discussion of how the covariates must be chosen to maintain a valid experiment. Tukey's one degree of freedom test for nonadditivity is presented as an analysis of covariance test for the need to perform a power transformation rather than as a test for a particular type of interaction.

The chapter on confounding and fractional replication has more discussion of analyzing such data than many other books contain.

Minitab commands are presented for most analyses. Minitab was chosen because I find it the easiest of the common packages to use. However, the real point of including computer commands is to illustrate the kinds of things that one needs to specify for any computer program and the various auxiliary computations that may be necessary for the analysis. The other statistical packages used in creating the book were BMDP, GLIM, and MSUSTAT.

### Acknowledgements

Many people provided comments that helped in writing this book. My colleagues Ed Bedrick, Aparna Huzurbazar, Wes Johnson, Bert Koopmans, Frank Martin, Tim O'Brien, and Cliff Qualls helped a lot. I got numerous valuable comments from my students at the University of New Mexico. Marjorie Bond, Matt Cooney, Jeff S. Davis, Barbara Evans, Mike Fugate, Jan Mines, and Jim Shields stand out in this regard. The book had several anonymous reviewers, some of whom made excellent suggestions.

I would like to thank Martin Gilchrist and Springer-Verlag for permission to reproduce Example 7.6.1 from Plane Answers to Complex Questions: The Theory of Linear Models. I also thank the Biometrika Trustees for permission to use the tables in Appendix B.5. Professor John Deely and the University of Canterbury in New Zealand were kind enough to support completion of the book during my sabbatical there.

Now my only question is what to do with the chapters on quality control, $p^n$ factorials, and response surfaces that ended up on the cutting room floor.

• Preface
• Introduction
• Probability
• Random variables and expectations
• Expected values and variances
• Chebyshev's inequality
• Covariances and correlations
• Rules for expected values and variances
• Continuous distributions
• The binomial distribution
• The multinomial distribution
• Exercises
• One sample
• Example and introduction
• Confidence intervals
• Hypothesis tests
• Prediction intervals
• Checking normality
• Transformations
• Exercises
• A general theory for testing and confidence intervals
• Theory for confidence intervals
• Theory for hypothesis tests
• Validity of tests and confidence intervals
• The relationship between confidence intervals and tests
• Theory of prediction intervals
• Sample size determination and power
• Exercises
• Two sample problems
• Two correlated samples: paired comparisons
• Two independent samples with equal variances
• Two independent samples with unequal variances
• Testing equality of the variances
• Exercises
• One-way analysis of variance
• Introduction and examples
• Theory
• Balanced ANOVA: introductory example
• Analytic and enumerative studies
• Balanced one-way analysis of variance: theory
• The analysis of variance table
• Unbalanced analysis of variance
• Choosing contrasts
• Comparing models
• The power of the analysis of variance F test
• Exercises
• Multiple comparison methods
• Fisher's least significant difference method
• Studentized range methods
• Tukey's honest significant difference
• Newman-Keuls multiple range method
• Scheffé's method
• Other methods
• Ott's analysis of means method
• Dunnett's many-one t statistic method
• Duncan's multiple range method
• Summary of multiple comparison procedures
• Exercises
• Simple linear and polynomial regression
• An example
• The simple linear regression model
• Estimation of parameters
• The analysis of variance table
• Inferential procedures
• An alternative model
• Correlation
• Recognizing randomness: simulated data with zero correlation
• Checking assumptions: residual analysis
• Transformations
• Box-Cox transformations
• Polynomial regression
• Polynomial regression and one-way ANOVA
• Exercises
• The analysis of count data
• One binomial sample
• The sign test
• Two independent binomial samples
• One multinomial sample
• Two independent multinomial samples
• Several independent multinomial samples
• Lancaster-Irwin partitioning
• Logistic regression
• Exercises
• Basic experimental designs
• Completely randomized designs
• Randomized complete block designs
• Latin square designs
• Discussion of experimental design
• Exercises
• Analysis of covariance
• An example
• Analysis of covariance in designed experiments
• Computations and contrasts
• Power transformations and Tukey's one degree of freedom
• Exercises
• Factorial treatment structures
• Two factors
• Two-way analysis of variance with replication
• Multifactor structures
• Extensions of Latin squares
• Exercises
• Split plots, repeated measures, random effects, and subsampling
• The analysis of split plot designs
• A four-factor split plot analysis
• Multivariate analysis of variance
• Random effects models
• Subsampling
• Random effects
• Exercises
• Multiple regression: introduction
• Example of inferential procedures
• Regression surfaces and prediction
• Comparing regression models
• Sequential fitting
• Reduced models and prediction
• Partial correlation coefficients and added variable plots
• Collinearity
• Exercises
• Regression diagnostics and variable selection
• Diagnostics
• Best subset model selection methods
• $R^2$ statistic
• Mallows's $C_p$ statistic
• A combined subset selection table
• Stepwise model selection methods
• Backwards elimination
• Forward selection
• Stepwise methods
• Model selection and case deletion
• Exercises
• Multiple regression: matrix formulation
• Random vectors
• Matrix formulation of regression models
• Least squares estimation of regression parameters
• Inferential procedures
• Residuals, standardized residuals, and leverage
• Principal components regression
• Weighted least squares
• Exercises
• Unbalanced multifactor analysis of variance
• Unbalanced two-way analysis of variance
• Proportional numbers
• General case
• Balanced incomplete block designs
• Unbalanced multifactor analysis of variance
• Youden squares
• Matrix formulation of analysis of variance
• Exercises
• Confounding and fractional replication in $2^n$ factorial systems
• Confounding
• Fractional replication
• Analysis of unreplicated experiments
• More on graphical analysis
• Augmenting designs for factors at two levels
• Exercises
• Nonlinear regression
• Introduction and examples
• Estimation
• The Gauss-Newton algorithm
• Maximum likelihood estimation
• Statistical inference
• Linearizable models
• Exercises
• Appendix A: Matrices
• Scalar multiplication
• Matrix multiplication
• Special matrices
• Linear dependence and rank
• Inverse matrices
• A list of useful properties
• Eigenvalues and eigenvectors
• Appendix B: Tables
• Tables of the t distribution
• Tables of the $\chi^2$ distribution
• Tables of the W' statistic
• Tables of orthogonal polynomials
• Tables of the Studentized range
• The Greek alphabet
• Tables of the F distribution
• References
• Author Index
• Subject Index

Web design by Ronald Christensen (2007) and Fletcher Christensen (2008)