John Maindonald and John Braun
Data Analysis and Graphics Using R: An Example-Based Approach
Cambridge Univ. Press, Cambridge, 2003, 362 pp., ISBN 0-521-81336-0

Contents:

Preface
A Chapter by Chapter Summary

1. A Brief Introduction to R
  1.1. A Short R Session
    1.1.1. R must be installed!
    1.1.2. Using the console (or command line) window
    1.1.3. Reading data from a file
    1.1.4. Entry of data at the command line
    1.1.5. Online help
    1.1.6. Quitting R
  1.2. The Uses of R
  1.3. The R Language
    1.3.1. R objects
    1.3.2. Retaining objects between sessions
  1.4. Vectors in R
    1.4.1. Concatenation - joining vector objects
    1.4.2. Subsets of vectors
    1.4.3. Patterned Data
    1.4.4. Missing Values
    1.4.5. Factors
  1.5. Data Frames
    1.5.1. Variable names
    1.5.2. Applying a function to the columns of a data frame
    1.5.3* Data frames and matrices
    1.5.4. Identification of rows that include missing values
  1.6. R Packages
    1.6.1. Data sets that accompany R packages
  1.7* Looping
  1.8. R Graphics
    1.8.1. The function plot() and allied functions
    1.8.2. Identification and location on the figure region
    1.8.3. Plotting mathematical symbols
    1.8.4. Row by column layouts of plots
    1.8.5. Graphs - additional notes
  1.9. Additional Points on the Use of R in This Book
  1.10. Further Reading
  1.11. Exercises

2. Styles of Data Analysis
  2.1. Revealing Views of the Data
    2.1.1. Views of a single sample
    2.1.2. Patterns of grouped data
    2.1.3. Patterns in bivariate data - the scatterplot
    2.1.4* Multiple variables and times
    2.1.5. Lattice (trellis style) graphics
    2.1.6. What to look for in plots
  2.2. Data Summary
    2.2.1. Mean and median
    2.2.2. Standard deviation and inter-quartile range
    2.2.3. Correlation
  2.3. Statistical Analysis Strategies
    2.3.1. Helpful and unhelpful questions
    2.3.2. Planning the formal analysis
    2.3.3. Changes to the intended plan of analysis
  2.4. Recap
  2.5. Further Reading
  2.6. Exercises

3. Statistical Models
  3.1. Regularities
    3.1.1. Mathematical models
    3.1.2. Models that include a random component
    3.1.3. Smooth and rough
    3.1.4. The construction and use of models
    3.1.5. Model formulae
  3.2. Distributions: Models for the Random Component
    3.2.1. Discrete distributions
    3.2.2. Continuous distributions
  3.3. The Uses of Random Numbers
    3.3.1. Simulation
    3.3.2. Sampling from populations
  3.4. Model Assumptions
    3.4.1. Random sampling assumptions - independence
    3.4.2. Checks for normality
    3.4.3. Checking other model assumptions
    3.4.4. Are non-parametric methods the answer?
    3.4.5. Why models matter - adding across contingency tables
  3.5. Recap
  3.6. Further Reading
  3.7. Exercises

4. An Introduction to Formal Inference
  4.1. Standard Errors
    4.1.1. Population parameters and sample statistics
    4.1.2. Assessing accuracy - the standard error
    4.1.3. Standard errors for differences of means
    4.1.4* The standard error of the median
    4.1.5* Resampling to estimate standard errors: bootstrapping
  4.2. Calculations Involving Standard Errors: the t-Distribution
  4.3. Confidence Intervals and Hypothesis Tests
    4.3.1. One- and two-sample intervals and tests for means
    4.3.2. Confidence intervals and tests for proportions
    4.3.3. Confidence intervals for the correlation
  4.4. Contingency Tables
    4.4.1. Rare and endangered plant species
    4.4.2. Additional notes
  4.5. One-Way Unstructured Comparisons
    4.5.1. Displaying means for the one-way layout
    4.5.2. Multiple comparisons
    4.5.3. Data with a two-way structure
    4.5.4. Presentation issues
  4.6. Response Curves
  4.7. Data with a Nested Variation Structure
    4.7.1. Degrees of freedom considerations
    4.7.2. General multi-way analysis of variance designs
  4.8* Resampling Methods for Tests and Confidence Intervals
    4.8.1. The one-sample permutation test
    4.8.2. The two-sample permutation test
    4.8.3. Bootstrap estimates of confidence intervals
  4.9. Further Comments on Formal Inference
    4.9.1. Confidence intervals versus hypothesis tests
    4.9.2. If there is strong prior information, use it!
  4.10. Recap
  4.11. Further Reading
  4.12. Exercises

5. Regression with a Single Predictor
  5.1. Fitting a Line to Data
    5.1.1. Lawn roller example
    5.1.2. Calculating fitted values and residuals
    5.1.3. Residual plots
    5.1.4. The analysis of variance table
  5.2. Outliers, Influence and Robust Regression
  5.3. Standard Errors and Confidence Intervals
    5.3.1. Confidence intervals and tests for the slope
    5.3.2. SEs and confidence intervals for predicted values
    5.3.3* Implications for design
  5.4. Regression versus Qualitative ANOVA Comparisons
  5.5. Assessing Predictive Accuracy
    5.5.1. Training/test sets, and cross-validation
    5.5.2. Cross-validation - an example
    5.5.3* Bootstrapping
  5.6* A Note on Power Transformations
  5.7. Size and Shape Data
    5.7.1. Allometric growth
    5.7.2. There are two regression lines!
  5.8. The Model Matrix in Regression
  5.9. Recap
  5.10. Methodological References
  5.11. Exercises

6. Multiple Linear Regression
  6.1. Basic Ideas: Book Weight and Brain Weight Examples
    6.1.1. Omission of the intercept term
    6.1.2. Diagnostic plots
    6.1.3. Further investigation of influential points
    6.1.4. Example: brain weight
  6.2. Multiple Regression Assumptions and Diagnostics
    6.2.1. Influential outliers and Cook's distance
    6.2.2. Component plus residual plots
    6.2.3* Further types of diagnostic plot
    6.2.4. Robust and resistant methods
  6.3. A Strategy for Fitting Multiple Regression Models
    6.3.1. Preliminaries
    6.3.2. Model fitting
    6.3.3. An example - the Scottish hill race data
  6.4. Measures for the Comparison of Regression Models
    6.4.1. R^2 and adjusted R^2
    6.4.2. AIC and related statistics
    6.4.3. How accurately does the equation predict?
    6.4.4. An external assessment of predictive accuracy
  6.5. Interpreting Regression Coefficients - the Labor Training Data
  6.6. Problems with Many Explanatory Variables
    6.6.1. Variable selection issues
    6.6.2. Principal components summaries
  6.7. Multicollinearity
    6.7.1. A contrived example
    6.7.2. The variance inflation factor (VIF)
    6.7.3. Remedying multicollinearity
  6.8. Multiple Regression Models
    6.8.1. Confusion between explanatory and dependent variables
    6.8.2. Missing explanatory variables
    6.8.3* The use of transformations
    6.8.4* Non-linear methods - an alternative to transformation?
  6.9. Further Reading
  6.10. Exercises

7. Exploiting the Linear Model Framework
  7.1. Levels of a Factor - Using Indicator Variables
    7.1.1. Example - sugar weight
    7.1.2. Different choices for the model matrix when there are factors
  7.2. Polynomial Regression
    7.2.1. Issues in the choice of model
  7.3. Fitting Multiple Lines
  7.4* Methods for Passing Smooth Curves through Data
    7.4.1. Scatterplot smoothing - regression splines
    7.4.2. Other smoothing methods
    7.4.3. Generalized additive models
  7.5. Smoothing Terms in Multiple Linear Models
  7.6. Further Reading
  7.7. Exercises

8. Logistic Regression and Other Generalized Linear Models
  8.1. Generalized Linear Models
    8.1.1. Transformation of the expected value on the left
    8.1.2. Noise terms need not be normal
    8.1.3. Log odds in contingency tables
    8.1.4. Logistic regression with a continuous explanatory variable
  8.2. Logistic Multiple Regression
    8.2.1. A plot of contributions of explanatory variables
    8.2.2. Cross-validation estimates of predictive accuracy
  8.3. Logistic Models for Categorical Data - an Example
  8.4. Poisson and Quasi-Poisson Regression
    8.4.1. Data on aberrant crypt foci
    8.4.2. Moth habitat example
    8.4.3* Residuals, and estimating the dispersion
  8.5. Ordinal Regression Models
    8.5.1. Exploratory analysis
    8.5.2* Proportional odds logistic regression
  8.6. Other Related Models
    8.6.1* Loglinear models
    8.6.2. Survival analysis
  8.7. Transformations for Count Data
  8.8. Further Reading
  8.9. Exercises

9. Multi-level Models, Time Series and Repeated Measures
  9.1. Introduction
  9.2. Example - Survey Data, with Clustering
    9.2.1. Alternative models
    9.2.2. Instructive, though faulty, analyses
    9.2.3. Predictive accuracy
  9.3. A Multi-level Experimental Design
    9.3.1. The ANOVA table
    9.3.2. Expected values of mean squares
    9.3.3* The sums of squares breakdown
    9.3.4. The variance components
    9.3.5. The mixed model analysis
    9.3.6. Predictive accuracy
    9.3.7. Different sources of variance - complication or focus of interest?
  9.4. Within and between Subject Effects - an Example
  9.5. Time Series - Some Basic Ideas
    9.5.1. Preliminary graphical explorations
    9.5.2. The autocorrelation function
    9.5.3. Autoregressive (AR) models
    9.5.4* Autoregressive moving average (ARMA) models - theory
  9.6* Regression Modeling with Moving Average Errors - an Example
  9.7. Repeated Measures in Time - Notes on the Methodology
    9.7.1. The theory of repeated measures modeling
    9.7.2. Correlation structure
    9.7.3. Different approaches to repeated measures analysis
  9.8. Further Notes on Multi-level Modeling
    9.8.1. An historical perspective on multi-level modeling
    9.8.2. Meta-analysis
  9.9. Further Reading
  9.10. Exercises

10. Tree-based Classification and Regression
  10.1. The Uses of Tree-based Methods
    10.1.1. Problems for which tree-based regression may be used
    10.1.2. Tree-based regression versus parametric regression
    10.1.3. Summary of pluses and minuses
  10.2. Detecting Email Spam - an Example
    10.2.1. Choosing the number of splits
  10.3. Terminology and Methodology
    10.3.1. Choosing the split - regression trees
    10.3.2. Within and between sums of squares
    10.3.3. Choosing the split - classification trees
    10.3.4. The mechanics of tree-based regression - a trivial example
  10.4. Assessments of Predictive Accuracy
    10.4.1. Cross-validation
    10.4.2. The training/test set methodology
    10.4.3. Predicting the future
  10.5. A Strategy for Choosing the Optimal Tree
    10.5.1. Cost-complexity pruning
    10.5.2. Prediction error versus tree size
  10.6. Detecting Email Spam - the Optimal Tree
    10.6.1. The one-standard-deviation rule
  10.7. Interpretation and Presentation of the rpart Output
    10.7.1. Data for female heart attack patients
    10.7.2. Printed information on each split
  10.8. Additional Notes
  10.9. Further Reading
  10.10. Exercises

11. Multivariate Data Exploration and Discrimination
  11.1. Multivariate Exploratory Data Analysis
    11.1.1. Scatterplot matrices
    11.1.2. Principal component analysis
  11.2. Discriminant Analysis
    11.2.1. Example - plant architecture
    11.2.2. Classical Fisherian discriminant analysis
    11.2.3. Logistic discriminant analysis
    11.2.4. An example with more than two groups
  11.3. Principal Component Scores in Regression
  11.4* Propensity Scores in Regression Comparisons - Labor Training Data
  11.5. Further Reading
  11.6. Exercises

12. The R System - Additional Topics
  12.1. Graphs in R
  12.2. Functions - Some Further Details
    12.2.1. Common useful functions
    12.2.2. User-written R functions
    12.2.3. Functions for working with dates
  12.3. Data Input and Output
    12.3.1. Input
    12.3.2. Data output
  12.4. Factors - Additional Comments
  12.5. Missing Values
  12.6. Lists and Data Frames
    12.6.1. Data frames as lists
    12.6.2. Reshaping data frames; reshape()
    12.6.3. Joining data frames and vectors - cbind()
    12.6.4. Conversion of tables and arrays into data frames
    12.6.5* Merging data frames - merge()
    12.6.6. The function sapply() and related functions
    12.6.7. Splitting vectors and data frames into lists - split()
  12.7* Matrices and Arrays
    12.7.1. Outer products
    12.7.2. Arrays
  12.8. Classes and Methods
    12.8.1. Printing and summarizing model objects
    12.8.2. Extracting information from model objects
  12.9. Databases and Environments
    12.9.1. Workspace management
    12.9.2. Function environments, and lazy evaluation
  12.10. Manipulation of Language Constructs
  12.11. Further Reading
  12.12. Exercises

Epilogue - Models
Appendix - S-PLUS Differences
References
Index of R Symbols and Functions
Index of Terms
Index of Names