Monthly Archives: July 2012
STATISTICA Solutions using the R Language Platform
R is a programming language and environment for statistical computing. The R platform and its source code are freely available under the GNU General Public License (GPL); see http://cran.r-project.org.
Overview
You want to use some R features, but the R platform doesn’t entirely meet your needs. STATISTICA Integration with R addresses the following concerns:
- R tabular results are cumbersome to manipulate; R graphs cannot be modified; output is difficult to manage.
- I want to combine an R function with custom programming.
- I need a specialized R data mining algorithm that is not available in any commercial software package.
- My company needs to run reports based on R scripts, but my users don’t know how to use R.
- My company needs to run a high volume of data-intensive analyses using R algorithms, but the R program is too slow.
Sample Applications
R tabular results are cumbersome to manipulate; R graphs cannot be modified; output is difficult to manage.
With all STATISTICA products, including STATISTICA (desktop), you can:
- Run R programs (or “scripts”) to produce output as STATISTICA spreadsheets and graphs, which can be managed in STATISTICA workbooks and/or saved in STATISTICA reports.
I want to combine an R function with custom programming.
With all STATISTICA products, including STATISTICA (desktop), you can:
- Build functions that are entirely (or partially) based on R by using STATISTICA Visual Basic (SVB).
I need a specialized R data mining algorithm that is not available in any commercial software package.
With STATISTICA Data Miner, you can:
- Create a data miner workspace where you can build and maintain models with R-based nodes.
My company needs to run reports based on R scripts, but my users don’t know how to use R.
With STATISTICA Enterprise, you can:
- Generate automated reports from reusable R-based analysis configurations, which deliver the power of R to users not familiar with R.
My company needs to run a high volume of data-intensive analyses using R algorithms, but the R program is too slow.
With WebSTATISTICA Server, you can:
- Off-load R scripts (as well as SVB scripts, Data Miner workspaces, etc.), creating a powerful multi-processor multi-user R server with load balancing, batch-job capabilities (scheduling), and more.
Multivariate Adaptive Regression Splines (MARSplines)
Introductory Overview
Multivariate Adaptive Regression Splines (MARSplines) is an implementation of techniques popularized by Friedman (1991) for solving regression-type problems (see also Multiple Regression), with the main purpose of predicting the values of a continuous dependent or outcome variable from a set of independent or predictor variables. A large number of methods are available for fitting models to continuous variables, such as linear regression [e.g., Multiple Regression, General Linear Model (GLM)], nonlinear regression (Generalized Linear/Nonlinear Models), regression trees (see Classification and Regression Trees), CHAID, Neural Networks, etc. (see also Hastie, Tibshirani, and Friedman, 2001, for an overview).
MARSplines is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARSplines constructs this relation from a set of coefficients and basis functions that are entirely “driven” from the regression data. In a sense, the method is based on the “divide and conquer” strategy, which partitions the input space into regions, each with its own regression equation. This makes MARSplines particularly suitable for problems with higher input dimensions (i.e., with more than 2 variables), where the curse of dimensionality would likely create problems for other techniques.
The MARSplines technique has become particularly popular in the area of data mining because it does not assume or impose any particular type or class of relationship (e.g., linear, logistic, etc.) between the predictor variables and the dependent (outcome) variable of interest. Instead, useful models (i.e., models that yield accurate predictions) can be derived even in situations where the relationship between the predictors and the dependent variables is non-monotone and difficult to approximate with parametric models. For more information about this technique and how it compares to other methods for nonlinear regression (or regression trees), see Hastie, Tibshirani, and Friedman (2001).
Regression Problems
Regression techniques are used to determine the relationship between one or more dependent variables (also called output, outcome, or response variables) and one or more independent variables (also known as input or predictor variables). The dependent variable is the one whose values you want to predict, based on the values of the independent (predictor) variables. For instance, one might be interested in the number of car accidents on the roads, which can be caused by 1) bad weather and 2) drunk driving. In this case one might write, for example,
Number_of_Accidents = Some Constant + 0.5*Bad_Weather + 2.0*Drunk_Driving
The variable Number of Accidents is the dependent variable that is thought to be caused by (among other variables) Bad Weather and Drunk Driving (hence the name dependent variable). Note that the independent variables are multiplied by factors, i.e., 0.5 and 2.0. These are known as regression coefficients. The larger these coefficients, the stronger the influence of the independent variables on the dependent variable. If the two predictors in this simple (fictitious) example were measured on the same scale (e.g., if the variables were standardized to a mean of 0.0 and standard deviation 1.0), then Drunk Driving could be inferred to contribute 4 times more to car accidents than Bad Weather. (If the variables are not measured on the same scale, then direct comparisons between these coefficients are not meaningful, and, usually, some other standardized measure of predictor “importance” is included in the results.)
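As a minimal sketch of how the coefficients are read, the fictitious equation can be evaluated directly. The value of the constant (10.0) is an assumption made only for illustration, and both predictors are assumed to be standardized, so a one-unit change means one standard deviation.

```python
def predict_accidents(bad_weather, drunk_driving, constant=10.0):
    """Evaluate the fictitious equation; constant=10.0 is an assumed placeholder."""
    return constant + 0.5 * bad_weather + 2.0 * drunk_driving

# Change in the prediction for a one-unit increase in each (standardized) predictor
baseline = predict_accidents(0.0, 0.0)
effect_weather = predict_accidents(1.0, 0.0) - baseline
effect_drunk = predict_accidents(0.0, 1.0) - baseline

# Ratio of the two effects, i.e., the ratio of the regression coefficients
ratio = effect_drunk / effect_weather
print(effect_weather, effect_drunk, ratio)
```

The ratio of the two effects equals the ratio of the coefficients, which is the comparison made in the text; it is only meaningful when both predictors are on the same scale.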
For additional details regarding these types of statistical models, refer to Multiple Regression or General Linear Models (GLM), as well as General Regression Models (GRM). In general, regression procedures are widely used in research in the social and natural sciences. Regression allows the researcher to ask (and hopefully answer) the general question “what is the best predictor of …” For example, educational researchers might want to learn what the best predictors of success in high school are. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether a new immigrant group will adapt and be absorbed into society.
Multivariate Adaptive Regression Splines
The car accident example we considered previously is a typical application for linear regression, where the response variable is hypothesized to depend linearly on the predictor variables. Linear regression also falls into the category of so-called parametric regression, which assumes that the nature of the relationships (but not the specific parameters) between the dependent and independent variables is known a priori (e.g., is linear). By contrast, nonparametric regression (see Nonparametrics) does not make any such assumption as to how the dependent variables are related to the predictors. Instead it allows the regression function to be “driven” directly from data.
Multivariate Adaptive Regression Splines is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARSplines constructs this relation from a set of coefficients and so-called basis functions that are entirely determined from the regression data. You can think of the general “mechanism” by which the MARSplines algorithm operates as multiple piecewise linear regression (see Nonlinear Estimation), where each breakpoint (estimated from the data) defines the “region of application” for a particular (very simple) linear regression equation.
Basis functions. Specifically, MARSplines uses two-sided truncated functions of the form (t-x)+ and (x-t)+ as basis functions for linear or nonlinear expansion, which approximates the relationships between the response and predictor variables.
A simple example of the two basis functions (t-x)+ and (x-t)+ is given in Hastie et al. (2001, Figure 9.9). Parameter t is the knot of the basis functions (defining the “pieces” of the piecewise linear regression); these knots (parameters) are also determined from the data. The “+” signs next to the terms (t-x) and (x-t) simply denote that only positive results of the respective equations are considered; otherwise the respective functions evaluate to zero. That is, (x-t)+ = max(0, x-t) and (t-x)+ = max(0, t-x).
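The pair of truncated basis functions can be sketched in a few lines. The knot value t = 0.5 and the grid of x values are assumptions chosen only to make the shapes easy to see.

```python
def hinge_right(x, t):
    """(x - t)+ : equals x - t when x is above the knot t, otherwise 0."""
    return max(0.0, x - t)

def hinge_left(x, t):
    """(t - x)+ : equals t - x when x is below the knot t, otherwise 0."""
    return max(0.0, t - x)

t = 0.5                               # assumed knot location
xs = [0.0, 0.25, 0.5, 0.75, 1.0]      # assumed evaluation points
right = [hinge_right(x, t) for x in xs]
left = [hinge_left(x, t) for x in xs]
print(right)
print(left)
```

Each function is zero on one side of the knot and linear on the other, which is what makes the fitted model piecewise linear.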
The MARSplines model. The basis functions together with the model parameters (estimated via least squares estimation) are combined to produce the predictions given the inputs. The general MARSplines model equation (see Hastie et al., 2001, equation 9.19) is given as:
$$y = f(X) = b_0 + \sum_{m=1}^{M} b_m h_m(X)$$
where the summation is over the M nonconstant terms in the model (further details regarding the model are also provided in Technical Notes). To summarize, y is predicted as a function of the predictor variables X (and their interactions); this function consists of an intercept parameter ($b_0$) and the weighted (by $b_m$) sum of one or more basis functions $h_m(X)$, of the kind illustrated earlier. You can also think of this model as “selecting” a weighted sum of basis functions from the large set of basis functions that span all values of each predictor (i.e., that set would consist of one basis function, and parameter t, for each distinct value of each predictor variable). The MARSplines algorithm then searches over the space of all inputs and predictor values (knot locations t) as well as interactions between variables. During this search, an increasing number of basis functions are added to the model (selected from the set of possible basis functions) to maximize an overall least squares goodness-of-fit criterion. As a result of these operations, MARSplines automatically determines the most important independent variables as well as the most significant interactions among them. The details of this algorithm are further described in Technical Notes, as well as in Hastie et al. (2001).
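To make the model equation concrete, the following sketch evaluates a prediction as an intercept plus a weighted sum of basis functions, including one interaction term formed as a product of two truncated functions. All coefficients and knots here are hypothetical values, not the result of any fit.

```python
def hinge(u):
    """Truncation (u)+ = max(0, u)."""
    return max(0.0, u)

# Hypothetical model with two predictors, x[0] and x[1]
intercept = 1.0
terms = [
    (2.0, lambda x: hinge(x[0] - 3.0)),                      # (x1 - 3)+
    (-1.5, lambda x: hinge(3.0 - x[0])),                     # (3 - x1)+
    (0.5, lambda x: hinge(x[0] - 3.0) * hinge(x[1] - 1.0)),  # interaction term
]

def predict(x):
    """Intercept plus the weighted sum of the basis functions."""
    return intercept + sum(weight * basis(x) for weight, basis in terms)

print(predict((5.0, 2.0)))
print(predict((1.0, 0.0)))
```

Only the basis functions that are "active" for a given case (those on the nonzero side of their knots) contribute to its prediction.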
Categorical predictors. In practice, both continuous and categorical predictors could be used, and will often yield useful results. However, the basic MARSplines algorithm assumes that the predictor variables are continuous in nature, and, for example, the knots computed by the program will usually not coincide with actual class codes found in the categorical predictors. For a detailed discussion of categorical predictor variables in MARSplines, see Friedman (1993).
Multiple dependent (outcome) variables. The MARSplines algorithm can be applied to multiple dependent (outcome) variables. In this case, the algorithm will determine a common set of basis functions in the predictors, but estimate different coefficients for each dependent variable. This method of treating multiple outcome variables is not unlike some neural networks architectures, where multiple outcome variables can be predicted from common neurons and hidden layers; in the case of MARSplines, multiple outcome variables are predicted from common basis functions, with different coefficients.
MARSplines and classification problems. Because MARSplines can handle multiple dependent variables, it is easy to apply the algorithm to classification problems as well. First, code the classes in the categorical response variable into multiple indicator variables (e.g., 1 = observation belongs to class k, 0 = observation does not belong to class k); then apply the MARSplines algorithm to fit a model, and compute predicted (continuous) values or scores; finally, for prediction, assign each case to the class for which the highest score is predicted (see also Hastie, Tibshirani, and Friedman, 2001, for a description of this procedure). Note that this type of application will yield heuristic classifications that may work very well in practice, but it is not based on a statistical model for deriving classification probabilities.
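The coding and assignment steps of this procedure can be sketched as follows. The class names and the predicted scores are hypothetical; in practice the scores would come from the fitted model, which is not shown here.

```python
def indicator_coding(labels, classes):
    """One 0/1 indicator variable per class for each observation."""
    return [[1 if label == k else 0 for k in classes] for label in labels]

def assign_class(scores, classes):
    """Assign the case to the class with the highest predicted score."""
    best = max(range(len(classes)), key=lambda j: scores[j])
    return classes[best]

classes = ["low", "medium", "high"]                       # hypothetical classes
targets = indicator_coding(["high", "low", "medium"], classes)

# Hypothetical continuous scores, one per class, for three new cases
predicted_scores = [[0.1, 0.3, 0.7], [0.8, 0.2, -0.1], [0.2, 0.5, 0.4]]
assigned = [assign_class(s, classes) for s in predicted_scores]
print(targets)
print(assigned)
```

Note that the scores are not constrained to lie between 0 and 1, which is one reason the result is a heuristic classification rather than a set of class probabilities.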
Model Selection and Pruning
In general, nonparametric models are adaptive and can exhibit a high degree of flexibility that may ultimately result in overfitting if no measures are taken to counteract it. Although such models can achieve zero error on training data, they tend to perform poorly when presented with new observations or instances (i.e., they do not generalize well to the prediction of “new” cases). MARSplines, like most methods of this kind, tends to overfit the data as well. To combat this problem, MARSplines uses a pruning technique (similar to pruning in classification trees) to limit the complexity of the model by reducing the number of its basis functions.
MARSplines as a predictor (feature) selection method. This feature – the selection of and pruning of basis functions – makes this method a very powerful tool for predictor selection. The MARSplines algorithm will pick up only those basis functions (and those predictor variables) that make a “sizeable” contribution to the prediction (refer to Technical Notes for details).
Applications
Multivariate Adaptive Regression Splines have become very popular recently for finding predictive models for “difficult” data mining problems, i.e., when the predictor variables do not exhibit simple and/or monotone relationships to the dependent variable of interest. Alternative models or approaches that you can consider for such cases are CHAID, Classification and Regression Trees, or any of the many Neural Networks architectures available. Because of the specific manner in which MARSplines selects predictors (basis functions) for the model, it does generally “well” in situations where regression-tree models are also appropriate, i.e., where hierarchically organized successive splits on the predictor variables yield good (accurate) predictions. In fact, instead of considering this technique as a generalization of multiple regression (as it was presented in this introduction), you may consider MARSplines as a generalization of regression trees, where the “hard” binary splits are replaced by “smooth” basis functions. Refer to Hastie, Tibshirani, and Friedman (2001) for additional details.
Technical Notes: The MARSplines Algorithm
Implementing MARSplines involves a two step procedure that is applied successively until a desired model is found. In the first step, we build the model, i.e. increase its complexity by adding basis functions until a preset (user-defined) maximum level of complexity has been reached. Then we begin a backward procedure to remove the least significant basis functions from the model, i.e. those whose removal will lead to the least reduction in the (least-squares) goodness of fit. This algorithm is implemented as follows:
1. Start with the simplest model involving only the constant basis function.
2. Search the space of basis functions, for each variable and for all possible knots, and add those which maximize a certain measure of goodness of fit (minimize prediction error).
3. Step 2 is recursively applied until a model of pre-determined maximum complexity is derived.
4. Finally, in the last stage, a pruning procedure is applied where those basis functions are removed that contribute least to the overall (least squares) goodness of fit.
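The steps above can be sketched for a single predictor as follows. This is a simplified illustration under stated assumptions, not the STATISTICA implementation: candidate knots are taken at the observed x values, basis functions are added as pairs, interactions are ignored, and pruning stops at a user-supplied size (`n_keep`, a hypothetical parameter) instead of using a complexity-penalized criterion. All function names are made up for this example.

```python
def least_squares(columns, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    Columns that are combinations of earlier columns get coefficient 0."""
    p = len(columns)
    A = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(p)]
         + [sum(u * v for u, v in zip(columns[i], y))] for i in range(p)]
    used = []
    for i in range(p):
        if abs(A[i][i]) < 1e-9:          # redundant or all-zero column: skip it
            continue
        used.append(i)
        for r in range(p):
            if r != i:
                factor = A[r][i] / A[i][i]
                A[r] = [a - factor * b for a, b in zip(A[r], A[i])]
    return [A[i][p] / A[i][i] if i in used else 0.0 for i in range(p)]

def sse(columns, y):
    """Least squares error of the best fit using the given basis columns."""
    b = least_squares(columns, y)
    fitted = [sum(bj * col[i] for bj, col in zip(b, columns)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def hinge_pair(x, t):
    """The two truncated basis functions (x - t)+ and (t - x)+ for knot t."""
    return [[max(0.0, xi - t) for xi in x], [max(0.0, t - xi) for xi in x]]

def forward_pass(x, y, max_terms):
    basis, knots = [[1.0] * len(x)], []          # step 1: constant term only
    while len(basis) + 2 <= max_terms:           # step 3: repeat up to the cap
        candidates = [t for t in sorted(set(x)) if t not in knots]
        if not candidates:
            break
        # step 2: add the pair whose knot gives the smallest error
        best_t = min(candidates, key=lambda t: sse(basis + hinge_pair(x, t), y))
        knots.append(best_t)
        basis = basis + hinge_pair(x, best_t)
    return basis, knots

def backward_pass(basis, y, n_keep):
    basis = list(basis)
    while len(basis) > n_keep:                   # step 4: pruning
        # drop the nonconstant term whose removal raises the error least
        j = min(range(1, len(basis)), key=lambda k: sse(basis[:k] + basis[k + 1:], y))
        del basis[j]
    return basis

# Hypothetical data with a single kink at x = 5
x = [float(i) for i in range(11)]
y = [abs(xi - 5.0) for xi in x]
basis, knots = forward_pass(x, y, max_terms=3)
pruned = backward_pass(basis, y, n_keep=2)
print(knots, sse(basis, y), sse(pruned, y))
```

On this toy data the forward pass places its knot at the kink, and pruning one of the two truncated functions increases the error, which illustrates the trade-off the backward procedure has to evaluate.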
Technical Notes: The Multivariate Adaptive Regression Splines (MARSplines) Model
The MARSplines algorithm builds models from two-sided truncated functions of the predictors (x) of the form:
$$(x-t)_+ = \max(0,\, x-t), \qquad (t-x)_+ = \max(0,\, t-x)$$
These serve as basis functions for linear or nonlinear expansion that approximates some true underlying function f(x).
The MARSplines model for a dependent (outcome) variable y, and M terms, can be summarized in the following equation:
$$y = f(x) = b_0 + \sum_{m=1}^{M} b_m H_{km}(x_{v(k,m)})$$
where the summation is over the M terms in the model, and $b_0$ and $b_m$ are parameters of the model (along with the knots t for each basis function, which are also estimated from the data). Function H is defined as:
$$H_{km}(x_{v(k,m)}) = \prod_{k=1}^{K} h_{km}$$
where $x_{v(k,m)}$ is the predictor in the k’th factor of the m’th product, and each $h_{km}$ is a truncated function of the form given above. For order of interactions K=1, the model is additive, and for K=2 the model is pairwise interactive.
During the forward stepwise phase, basis functions are added to the model up to a pre-determined maximum number, which should be considerably larger (at least twice as large) than the number in the optimal (best least-squares fit) model.
After the forward stepwise selection of basis functions, a backward procedure is applied in which the model is pruned by removing those basis functions whose removal produces the smallest increase in the least squares error (i.e., the smallest loss in goodness of fit). A least squares error function (inverse of goodness-of-fit) is computed. The so-called Generalized Cross Validation (GCV) error is a measure of the goodness of fit that takes into account not only the residual error but also the model complexity. It is given by
$$GCV = \frac{\sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2}{\left(1 - C/N\right)^2}$$
with
$$C = 1 + c\,d$$
where N is the number of cases in the data set and d is the effective degrees of freedom, which is equal to the number of independent basis functions. The quantity c is the penalty for adding a basis function. Experiments have shown that the best value for c can be found somewhere in the range 2 < c < 3 (see Hastie et al., 2001).
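A small sketch of this criterion, assuming the form GCV = (sum of squared residuals) / (1 - C/N)^2 with C = 1 + c*d, where N, d, and c are as defined above. The residuals and the default penalty c = 3 are assumptions for illustration only.

```python
def gcv_error(residuals, d, c=3.0):
    """Generalized Cross Validation error under the assumed form
    GCV = SSE / (1 - C/N)^2 with C = 1 + c*d."""
    n = len(residuals)
    complexity = 1.0 + c * d
    if complexity >= n:
        raise ValueError("model too complex for the number of cases")
    sum_squares = sum(r * r for r in residuals)
    return sum_squares / (1.0 - complexity / n) ** 2

# Hypothetical residuals from 20 cases, compared at two model sizes
residuals = [1.0, -1.0] * 10
gcv_small = gcv_error(residuals, d=2)
gcv_large = gcv_error(residuals, d=4)
print(gcv_small, gcv_large)
```

For the same residual error, the model with more basis functions receives the larger GCV error, so an added basis function must reduce the residuals enough to pay for its penalty.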
Log-Linear Analysis of Frequency Tables
- General Purpose
- Two-way Frequency Tables
- Multi-Way Frequency Tables
- The Log-Linear Model
- Goodness-of-fit
- Automatic Model Fitting
General Purpose
One basic and straightforward method for analyzing data is via crosstabulation. For example, a medical researcher may tabulate the frequency of different symptoms by patients’ age and gender; an educational researcher may tabulate the number of high school drop-outs by age, gender, and ethnic background; an economist may tabulate the number of business failures by industry, region, and initial capitalization; a market researcher may tabulate consumer preferences by product, age, and gender; etc. In all of these cases, the major results of interest can be summarized in a multi-way frequency table, that is, in a crosstabulation table with two or more factors.
Log-Linear provides a more “sophisticated” way of looking at crosstabulation tables. Specifically, you can test the different factors that are used in the crosstabulation (e.g., gender, region, etc.) and their interactions for statistical significance (see Elementary Concepts for a discussion of statistical significance testing). The following text will present a brief introduction to these methods, their logic, and interpretation.
Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by Factor Analysis techniques, and they allow one to explore the structure of the categorical variables included in the table.
Two-way Frequency Tables
Let us begin with the simplest possible crosstabulation, the 2 by 2 table. Suppose we were interested in the relationship between age and the graying of people’s hair. We took a sample of 100 subjects, and determined who does and does not have gray hair. We also recorded the approximate age of the subjects. The results of this study may be summarized as follows:
Gray Hair | Age: Below 40 | Age: 40 or older | Total |
---|---|---|---|
No | 40 | 5 | 45 |
Yes | 20 | 35 | 55 |
Total | 60 | 40 | 100 |
While interpreting the results of our little study, let us introduce the terminology that will allow us to generalize to complex tables more easily.
Design variables and response variables. In multiple regression (Multiple Regression) or analysis of variance (ANOVA/MANOVA) one customarily distinguishes between independent and dependent variables. Dependent variables are those that we are trying to explain, that is, that we hypothesize to depend on the independent variables. We could classify the factors in the 2 by 2 table accordingly: we may think of hair color (gray, not gray) as the dependent variable, and age as the independent variable. Alternative terms that are often used in the context of frequency tables are response variables and design variables, respectively. Response variables are those that vary in response to the design variables. Thus, in the example table above, hair color can be considered to be the response variable, and age the design variable.
Fitting marginal frequencies. Let us now turn to the analysis of our example table. We could ask ourselves what the frequencies would look like if there were no relationship between variables (the null hypothesis). Without going into details, intuitively one could expect that the frequencies in each cell would proportionately reflect the marginal frequencies (Totals). For example, consider the following table:
Gray Hair | Age: Below 40 | Age: 40 or older | Total |
---|---|---|---|
No | 27 | 18 | 45 |
Yes | 33 | 22 | 55 |
Total | 60 | 40 | 100 |
In this table, the proportions of the marginal frequencies are reflected in the individual cells. Thus, 27/33=18/22=45/55 and 27/18=33/22=60/40. Given the marginal frequencies, these are the cell frequencies that we would expect if there were no relationship between age and graying. If you compare this table with the previous one you will see that the previous table does reflect a relationship between the two variables: There are more than expected (under the null hypothesis) cases below age 40 without gray hair, and more cases above age 40 with gray hair.
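The expected frequencies under the null hypothesis can be reproduced from the marginal totals alone: each cell is its row total times its column total, divided by the overall total. The sketch below uses the observed counts from the example (rows: no gray hair, gray hair; columns: below 40, 40 or older).

```python
# Observed counts: rows = gray hair (No, Yes), columns = age (below 40, 40 or older)
observed = [[40, 5],
            [20, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected cell frequency under independence: row total * column total / N
expected = [[r * c / n for c in col_totals] for r in row_totals]
print(row_totals, col_totals, n)
print(expected)
```

The result matches the second table, confirming that it contains exactly the frequencies implied by the marginal totals.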
This example illustrates the general principle on which the log-linear analysis is based: Given the marginal totals for two (or more) factors, we can compute the cell frequencies that would be expected if the two (or more) factors are unrelated. Significant deviations of the observed frequencies from those expected frequencies reflect a relationship between the two (or more) variables.
Model fitting approach. Let us now rephrase our discussion of the 2 by 2 table so far. We can say that fitting the model of two variables that are not related (age and hair color) amounts to computing the cell frequencies in the table based on the respective marginal frequencies (totals). Significant deviations of the observed table from those fitted frequencies reflect the lack of fit of the independence (between two variables) model. In that case we would reject that model for our data, and instead accept the model that allows for a relationship or association between age and hair color.
Multi-way Frequency Tables
The reasoning presented for the analysis of the 2 by 2 table can be generalized to more complex tables. For example, suppose we had a third variable in our study, namely whether or not the individuals in our sample experience stress at work. Because we are interested in the effect of stress on graying, we will consider Stress as another design variable. (Note that, if our study were concerned with the effect of gray hair on subsequent stress, variable stress would be the response variable, and hair color would be the design variable.) The resultant table is a three-way frequency table.
Fitting models. We can apply our previous reasoning to analyze this table. Specifically, we could fit different models that reflect different hypotheses about the data. For example, we could begin with a model that hypothesizes independence between all factors. As before, the expected frequencies in that case would reflect the respective marginal frequencies. If any significant deviations occur, we would reject this model.
Interaction effects. Another conceivable model would be that age is related to hair color, and stress is related to hair color, but the two (age and stress) factors do not interact in their effect. In that case, we would need to simultaneously fit the marginal totals for the two-way table of age by hair color collapsed across levels of stress, and the two-way table of stress by hair color collapsed across the levels of age. If this model does not fit the data, we would have to conclude that age, stress, and hair color all are interrelated. Put another way, we would conclude that age and stress interact in their effect on graying.
The concept of interaction here is analogous to that used in analysis of variance (ANOVA/MANOVA). For example, the age by stress interaction could be interpreted such that the relationship of age to hair color is modified by stress. While age brings about only little graying in the absence of stress, age is highly related to graying when stress is present. Put another way, the effects of age and stress on graying are not additive, but interactive.
If you are not familiar with the concept of interaction, we recommend that you read the Introductory Overview to ANOVA/MANOVA. Many aspects of the interpretation of results from a log-linear analysis of a multi-way frequency table are very similar to ANOVA.
Iterative proportional fitting. The computation of expected frequencies becomes increasingly complex when there are more than two factors in the table. However, they can be computed, and, therefore, we can easily apply the reasoning developed for the 2 by 2 table to complex tables. The commonly used method for computing the expected frequencies is the so-called iterative proportional fitting procedure.
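A minimal sketch of iterative proportional fitting for the model described earlier (age related to hair color, stress related to hair color, no joint effect) is given below. The three-way counts are hypothetical; they were chosen so that collapsing across stress reproduces the 2 by 2 example table. The function name and the fixed number of iterations are also assumptions.

```python
def ipf_age_hair_and_stress_hair(table, n_iter=20):
    """Fit expected frequencies that reproduce the age-by-hair and the
    stress-by-hair marginal tables. table[a][s][h] holds observed counts."""
    A, S, H = len(table), len(table[0]), len(table[0][0])
    fitted = [[[1.0] * H for _ in range(S)] for _ in range(A)]
    for _ in range(n_iter):
        # Rescale so the age-by-hair margin (collapsed across stress) matches
        for a in range(A):
            for h in range(H):
                target = sum(table[a][s][h] for s in range(S))
                current = sum(fitted[a][s][h] for s in range(S))
                for s in range(S):
                    fitted[a][s][h] *= target / current
        # Rescale so the stress-by-hair margin (collapsed across age) matches
        for s in range(S):
            for h in range(H):
                target = sum(table[a][s][h] for a in range(A))
                current = sum(fitted[a][s][h] for a in range(A))
                for a in range(A):
                    fitted[a][s][h] *= target / current
    return fitted

# Hypothetical counts: table[age][stress][hair], hair = (not gray, gray)
table = [[[30, 10], [10, 10]],    # below 40: no stress, stress
         [[3, 12], [2, 23]]]      # 40 or older: no stress, stress
fitted = ipf_age_hair_and_stress_hair(table)
print(fitted)
```

The fitted frequencies reproduce both two-way marginal tables while containing no age by stress effect; comparing them with the observed counts is what tests whether that effect is needed. This sketch assumes all marginal totals are positive.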
The Log-Linear Model
The term log-linear derives from the fact that one can, through logarithmic transformations, restate the problem of analyzing multi-way frequency tables in terms that are very similar to ANOVA. Specifically, one may think of the multi-way frequency table to reflect various main effects and interaction effects that add together in a linear fashion to bring about the observed table of frequencies. Bishop, Fienberg, and Holland (1974) provide details on how to derive log-linear equations to express the relationship between factors in a multi-way frequency table.
Goodness-of-Fit
In the previous discussion we have repeatedly made reference to the “significance” of deviations of the observed frequencies from the expected frequencies. One can evaluate the statistical significance of the goodness-of-fit of a particular model via a Chi-square test. You can compute two types of Chi-squares, the traditional Pearson Chi-square statistic and the maximum likelihood ratio Chi-square statistic (the term likelihood ratio was first introduced by Neyman and Pearson, 1931; the term maximum likelihood was first used by Fisher, 1922a). In practice, the interpretation and magnitude of those two Chi-square statistics are essentially identical. Both tests evaluate whether the expected cell frequencies under the respective model are significantly different from the observed cell frequencies. If so, the respective model for the table is rejected.
Reviewing and plotting residual frequencies. After one has chosen a model for the observed table, it is always a good idea to inspect the residual frequencies, that is, the observed minus the expected frequencies. If the model is appropriate for the table, then all residual frequencies should be “random noise,” that is, consist of positive and negative values of approximately equal magnitudes that are distributed evenly across the cells of the table.
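For the 2 by 2 example, the Pearson and maximum likelihood ratio Chi-square statistics and the residual frequencies can be sketched as follows, using the observed counts and the frequencies expected under independence. The variable names are chosen for this illustration only.

```python
import math

# Rows = gray hair (No, Yes), columns = age (below 40, 40 or older)
observed = [[40, 5], [20, 35]]
expected = [[27.0, 18.0], [33.0, 22.0]]

cells = [(o, e) for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row)]

# Pearson Chi-square: sum of (observed - expected)^2 / expected
pearson = sum((o - e) ** 2 / e for o, e in cells)

# Maximum likelihood ratio Chi-square: 2 * sum of observed * ln(observed / expected)
likelihood_ratio = 2.0 * sum(o * math.log(o / e) for o, e in cells)

# Residual frequencies: observed minus expected
residuals = [[o - e for o, e in zip(obs_row, exp_row)]
             for obs_row, exp_row in zip(observed, expected)]
print(pearson, likelihood_ratio)
print(residuals)
```

Both statistics come out far above 3.84, the .05 critical value for one degree of freedom, so the independence model is rejected for this table. The residuals are also not "random noise": they are large and follow a systematic pattern across the cells.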
Statistical significance of effects. The Chi-squares of models that are hierarchically related to each other can be directly compared. For example, if we first fit a model with the age by hair color interaction and the stress by hair color interaction, and then fit a model with the age by stress by hair color (three-way) interaction, then the second model is a superset of the previous model. We could evaluate the difference in the Chi-square statistics, based on the difference in the degrees of freedom; if the differential Chi-square statistic is significant, then we would conclude that the three-way interaction model provides a significantly better fit to the observed table than the model without this interaction. Therefore, the three-way interaction is statistically significant.
In general, two models are hierarchically related to each other if one can be produced from the other by either adding terms (variables or interactions) or deleting terms (but not both at the same time).
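The comparison of two hierarchically related models can be sketched as a test on the difference of their Chi-square statistics. The Chi-square values and degrees of freedom below are hypothetical, and the tail-probability function is a simplification that only covers differences of 1 or 2 degrees of freedom, where closed forms exist.

```python
import math

def chi2_upper_tail(x, df):
    """Probability that a Chi-square variate exceeds x (df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only covers df = 1 or 2")

def compare_nested(chi2_smaller, df_smaller, chi2_larger, df_larger):
    """Differential Chi-square between a model and a superset of that model."""
    diff = chi2_smaller - chi2_larger
    df_diff = df_smaller - df_larger
    return diff, df_diff, chi2_upper_tail(diff, df_diff)

# Hypothetical fits: the smaller model has the larger Chi-square and more df
diff, df_diff, p_value = compare_nested(9.2, 2, 1.1, 1)
print(diff, df_diff, p_value)
```

With these hypothetical numbers the differential Chi-square is significant at the .05 level, so the additional term would be retained.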
Automatic Model Fitting
When analyzing four- or higher-way tables, finding the best fitting model can become increasingly difficult. You can use automatic model fitting options to facilitate the search for a “good model” that fits the data. The general logic of this algorithm is as follows. First, the program fits a model with no relationships between factors; if that model does not fit (i.e., the respective Chi-square statistic is significant), then it will fit a model with all two-way interactions. If that model does not fit either, then the program will fit all three-way interactions, and so on. Let’s assume that this process found the model with all two-way interactions to fit the data. The program will then proceed to eliminate all two-way interactions that are not statistically significant. The resulting model will be the one that includes the fewest interactions necessary to fit the observed table.
How To Find Relationship Between Variables, Multiple Regression
General Purpose
The general purpose of multiple regression (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, you might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how “pretty” the house is (subjective rating). You may also detect “outliers,” that is, houses that should really sell for more, given their location and characteristics.
Personnel professionals customarily use multiple regression procedures to determine equitable compensation. You can determine a number of factors or dimensions such as “amount of responsibility” (Resp) or “number of people to supervise” (No_Super) that you believe to contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:
Salary = .5*Resp + .8*No_Super
Once this so-called regression line has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably.
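As an illustration of this use of the regression equation, the sketch below compares predicted and actual salaries. The position names, their Resp and No_Super values, and the actual salaries are all hypothetical, as is the tolerance used to call a position equitably paid.

```python
def predicted_salary(resp, no_super):
    """Regression equation from the text: Salary = .5*Resp + .8*No_Super."""
    return 0.5 * resp + 0.8 * no_super

def pay_status(actual, predicted, tolerance=0.5):
    """Classify a position relative to the regression line (assumed tolerance)."""
    if actual < predicted - tolerance:
        return "underpaid"
    if actual > predicted + tolerance:
        return "overpaid"
    return "equitable"

# Hypothetical positions: (name, Resp, No_Super, actual salary)
positions = [("Analyst", 40, 5, 22.0),
             ("Manager", 60, 10, 41.0),
             ("Clerk", 20, 0, 10.0)]

status = {name: pay_status(actual, predicted_salary(resp, n_sup))
          for name, resp, n_sup, actual in positions}
print(status)
```

A position whose actual salary falls below its predicted value lies below the regression line and is flagged as underpaid; one above the line is flagged as overpaid.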
In the social and natural sciences multiple regression procedures are very widely used in research. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question “what is the best predictor of …”. For example, educational researchers might want to learn what are the best predictors of success in high-school. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society.
See also Exploratory Data Analysis and Data Mining Techniques, the General Stepwise Regression topic, and the General Linear Models topic.
Computational Approach
The general computational problem that needs to be solved in multiple regression analysis is to fit a straight line to a number of points.
In the simplest case – one dependent and one independent variable – you can visualize this in a scatterplot.
- Least Squares
- The Regression Equation
- Unique Prediction and Partial Correlation
- Predicted and Residual Scores
- Residual Variance and R-square
- Interpreting the Correlation Coefficient R
Least Squares
In the scatterplot, we have an independent or X variable, and a dependent or Y variable. These variables may, for example, represent IQ (intelligence as measured by a test) and school achievement (grade point average; GPA), respectively. Each point in the plot represents one student, that is, the respective student’s IQ and GPA. The goal of linear regression procedures is to fit a line through the points. Specifically, the program will compute a line so that the squared deviations of the observed points from that line are minimized. Thus, this general procedure is sometimes also referred to as least squares estimation.
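The least squares idea can be sketched in a few lines for the one-predictor case. The IQ and GPA values below are hypothetical, and the function names are illustrative rather than taken from any particular package.

```python
def fit_line(xs, ys):
    """Return intercept a and slope b that minimize the squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_deviations(xs, ys, a, b):
    """Sum of squared vertical distances of the points from the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical students: IQ scores and grade point averages
iq = [90, 100, 110, 120, 130]
gpa = [2.7, 3.1, 3.0, 3.5, 3.6]

a, b = fit_line(iq, gpa)
best = sum_squared_deviations(iq, gpa, a, b)
print(a, b, best)
```

Any other line through the same points, for example one with a slightly different slope, yields a larger sum of squared deviations than `best`.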
The Regression Equation
A line in a two-dimensional or two-variable space is defined by the equation Y=a+b*X; in full text: the Y variable can be expressed in terms of a constant (a) and a slope (b) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient or B coefficient. For example, GPA may best be predicted as 1+.02*IQ. Thus, knowing that a student has an IQ of 130 would lead us to predict that her GPA would be 3.6 (since 1+.02*130=3.6).
For example, the animation below shows a two dimensional regression equation plotted with three different confidence intervals (90%, 95% and 99%).
In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in two-dimensional space, but it can be computed just as easily. For example, if in addition to IQ we had additional predictors of achievement (e.g., Motivation, Self-discipline), we could construct a linear equation containing all those variables. In general then, multiple regression procedures will estimate a linear equation of the form:
Y = a + b_{1}*X_{1} + b_{2}*X_{2} + … + b_{p}*X_{p}
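Evaluating such an equation is a matter of multiplying each predictor by its coefficient and adding the intercept. A minimal sketch follows; the first call uses the GPA example from the text, while the Motivation coefficient in the second call is purely hypothetical.

```python
def predict(intercept, coefficients, values):
    """Evaluate Y = a + b1*X1 + ... + bp*Xp for one case."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# One predictor: GPA = 1 + .02*IQ for a student with an IQ of 130
gpa_one_predictor = predict(1.0, [0.02], [130])

# Two predictors: IQ plus a hypothetical Motivation score with coefficient 0.1
gpa_two_predictors = predict(1.0, [0.02, 0.1], [130, 5])

print(gpa_one_predictor, gpa_two_predictors)
```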
Unique Prediction and Partial Correlation
Note that in this equation, the regression coefficients (or B coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable. Another way to express this fact is to say that, for example, variable X_{1} is correlated with the Y variable, after controlling for all other independent variables. This type of correlation is also referred to as a partial correlation (this term was first used by Yule, 1907). Perhaps the following example will clarify this issue. You would probably find a significant negative correlation between hair length and height in the population (i.e., short people have longer hair). At first this may seem odd; however, if we were to add the variable Gender into the multiple regression equation, this correlation would probably disappear. This is because women, on the average, have longer hair than men; they also are shorter on the average than men. Thus, after we remove this gender difference by entering Gender into the equation, the relationship between hair length and height disappears because hair length does not make any unique contribution to the prediction of height, above and beyond what it shares in the prediction with variable Gender. Put another way, after controlling for the variable Gender, the partial correlation between hair length and height is zero.
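The hair length example can be illustrated with a small constructed data set. The sketch below uses the standard first-order partial correlation formula; the eight cases are hypothetical and were chosen so that hair length and height are unrelated within each gender.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical cases: gender (0 = male, 1 = female), height in cm, hair length in cm
gender = [0, 0, 0, 0, 1, 1, 1, 1]
height = [175, 175, 185, 185, 160, 160, 170, 170]
hair = [5, 15, 5, 15, 25, 35, 25, 35]

simple_r = pearson(hair, height)                       # clearly negative
partial_r = partial_correlation(hair, height, gender)  # essentially zero
print(simple_r, partial_r)
```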
Predicted and Residual Scores
The regression line expresses the best prediction of the dependent variable (Y), given the independent variables (X). However, nature is rarely (if ever) perfectly predictable, and usually there is substantial variation of the observed points around the fitted regression line (as in the scatterplot shown earlier). The deviation of a particular point from the regression line (its predicted value) is called the residual value.
Residual Variance and R-square
R-square, also known as the coefficient of determination, is a commonly used statistic to evaluate model fit. R-square is 1 minus the ratio of residual variability to overall variability. When the variability of the residual values around the regression line is small relative to the overall variability, the predictions from the regression equation are good. For example, if there is no relationship between the X and Y variables, then the ratio of the residual variability of the Y variable to the original variance is equal to 1.0, and R-square is 0. If X and Y are perfectly related, then there is no residual variance and the ratio of variance is 0.0, making R-square = 1. In most cases, the ratio and R-square will fall somewhere between these extremes, that is, between 0.0 and 1.0. This ratio value is immediately interpretable in the following manner. If we have an R-square of 0.4, then we know that the variability of the Y values around the regression line is 1-0.4 = 0.6 times the original variance; in other words, we have explained 40% of the original variability and are left with 60% residual variability. Ideally, we would like to explain most if not all of the original variability. The R-square value is an indicator of how well the model fits the data (e.g., an R-square close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model).
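The two extreme cases described above can be checked directly. This is a minimal sketch with hypothetical observed values; `r_squared` is an illustrative helper, not a library function.

```python
def r_squared(observed, predicted):
    """1 minus the ratio of residual variability to overall variability."""
    mean_y = sum(observed) / len(observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_residual / ss_total

observed = [2.0, 4.0, 5.0, 4.0, 5.0]   # hypothetical Y values, mean 4.0

# Perfect prediction: no residual variance, so the ratio is 0
perfect_fit = r_squared(observed, observed)

# No relationship: the best available prediction is the mean of Y for every case
no_relationship = r_squared(observed, [4.0] * len(observed))

print(perfect_fit, no_relationship)
```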
Interpreting the Correlation Coefficient R
Customarily, the degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed in the correlation coefficient R, which is the square root of R-square. In multiple regression, R can assume values between 0 and 1. To interpret the direction of the relationship between variables, look at the signs (plus or minus) of the regression or B coefficients. If a B coefficient is positive, then the relationship of this variable with the dependent variable is positive (e.g., the greater the IQ the better the grade point average); if the B coefficient is negative then the relationship is negative (e.g., the lower the class size the better the average test scores). Of course, if the B coefficient is equal to 0 then there is no relationship between the variables.
Assumptions, Limitations, Practical Considerations
- Assumption of Linearity
- Normality Assumption
- Limitations
- Choice of the number of variables
- Multicollinearity and matrix ill-conditioning
- The importance of residual analysis
Assumption of Linearity
First of all, as is evident in the name multiple linear regression, it is assumed that the relationship between variables is linear. In practice this assumption can virtually never be confirmed; fortunately, multiple regression procedures are not greatly affected by minor deviations from this assumption. However, as a rule it is prudent to always look at bivariate scatterplots of the variables of interest. If curvature in the relationships is evident, you may consider either transforming the variables or explicitly allowing for nonlinear components.
Normality Assumption
It is assumed in multiple regression that the residuals (observed minus predicted values) are distributed normally (i.e., follow the normal distribution). Again, even though most tests (specifically the F-test) are quite robust with regard to violations of this assumption, it is always a good idea, before drawing final conclusions, to review the distributions of the major variables of interest. You can produce histograms for the residuals as well as normal probability plots in order to inspect the distribution of the residual values.
Limitations
The major conceptual limitation of all regression techniques is that you can only ascertain relationships, but never be sure about the underlying causal mechanism. For example, you would find a strong positive relationship (correlation) between the damage that a fire does and the number of firemen involved in fighting the blaze. Do we conclude that the firemen cause the damage? Of course, the most likely explanation of this correlation is that the size of the fire (an external variable that we forgot to include in our study) caused the damage as well as the involvement of a certain number of firemen (i.e., the bigger the fire, the more firemen are called to fight the blaze). Even though this example is fairly obvious, in real correlation research, alternative causal explanations are often not considered.
Choice of the Number of Variables
Multiple regression is a seductive technique: “plug in” as many predictor variables as you can think of and usually at least a few of them will come out significant. This is because you are capitalizing on chance when simply including as many variables as you can think of as predictors of some other variable of interest. This problem is compounded when, in addition, the number of observations is relatively low. Intuitively, it is clear that you can hardly draw conclusions from an analysis of 100 questionnaire items based on 10 respondents. Most authors recommend that you should have at least 10 to 20 times as many observations (cases, respondents) as you have variables; otherwise the estimates of the regression line are probably very unstable and unlikely to replicate if you were to conduct the study again.
Multicollinearity and Matrix Ill-Conditioning
This is a common problem in many correlation analyses. Imagine that you have two predictors (X variables) of a person’s height: (1) weight in pounds and (2) weight in ounces. Obviously, our two predictors are completely redundant; weight is one and the same variable, regardless of whether it is measured in pounds or ounces. Trying to decide which one of the two measures is a better predictor of height would be rather silly; however, this is exactly what you would try to do if you were to perform a multiple regression analysis with height as the dependent (Y) variable and the two measures of weight as the independent (X) variables. When there are very many variables involved, it is often not immediately apparent that this problem exists, and it may only manifest itself after several variables have already been entered into the regression equation. Nevertheless, when this problem occurs it means that at least one of the predictor variables is (practically) completely redundant with other predictors. There are many statistical indicators of this type of redundancy (tolerances, semi-partial R, etc.), as well as some remedies (e.g., ridge regression).
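To make the pounds and ounces example concrete, the sketch below computes the tolerance of one predictor, taken here as 1 minus the squared correlation with the other predictor (which is what the usual definition reduces to when there are only two predictors). The weights are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical weights, recorded once in pounds and once in ounces
pounds = [120, 150, 165, 180, 210]
ounces = [16 * p for p in pounds]

r = pearson(pounds, ounces)
tolerance = 1 - r ** 2   # share of one predictor NOT explained by the other
print(r, tolerance)
```

A tolerance at or near zero signals that the predictor carries no information beyond what the other predictor already provides.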
Fitting Centered Polynomial Models
The fitting of higher-order polynomials of an independent variable with a mean not equal to zero can create difficult multicollinearity problems. Specifically, the polynomials will be highly correlated due to the mean of the primary independent variable. With large numbers (e.g., Julian dates), this problem is very serious, and if proper protections are not put in place, it can cause wrong results. The solution is to “center” the independent variable (sometimes, this procedure is referred to as “centered polynomials”), i.e., to subtract the mean, and then to compute the polynomials. See, for example, the classic text by Neter, Wasserman, & Kutner (1985, Chapter 9), for a detailed discussion of this issue (and analyses with polynomial models in general).
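The effect of centering is easy to demonstrate. In the sketch below, the date values are hypothetical stand-ins for large Julian dates; the correlation between the variable and its square is computed before and after subtracting the mean.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical large-valued predictor (eleven consecutive day numbers)
dates = [2456000 + d for d in range(11)]

# Uncentered: x and x**2 are almost perfectly correlated
raw_r = pearson(dates, [x ** 2 for x in dates])

# Centered: subtract the mean first, then compute the polynomial term
mean_date = sum(dates) / len(dates)
centered = [x - mean_date for x in dates]
centered_r = pearson(centered, [x ** 2 for x in centered])

print(raw_r, centered_r)
```

Because these values are evenly spaced, centering removes the correlation between the linear and quadratic terms entirely; with unevenly spaced data it would typically be reduced rather than eliminated.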
The Importance of Residual Analysis
Even though most assumptions of multiple regression cannot be tested explicitly, gross violations can be detected and should be dealt with appropriately. In particular, outliers (i.e., extreme cases) can seriously bias the results by “pulling” or “pushing” the regression line in a particular direction (see the animation below), thereby leading to biased regression coefficients. Often, excluding just a single extreme case can yield a completely different set of results.
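A small numerical sketch shows how much a single extreme case can matter. The data are hypothetical: five points lying exactly on a line with slope 1, plus one added outlier.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]

_, slope_clean = fit_line(xs, ys)

# Add one hypothetical extreme case at x = 6
_, slope_with_outlier = fit_line(xs + [6], ys + [-10])

print(slope_clean, slope_with_outlier)
```

Here the added case is enough to reverse the sign of the slope, which is why inspecting residuals before interpreting coefficients is worthwhile.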
Independent Components Analysis
Introductory Overview
Independent Component Analysis is a well established and reliable statistical method that performs signal separation. Signal separation is a frequently occurring problem and is central to Statistical Signal Processing, which has a wide range of applications in many areas of technology ranging from Audio and Image Processing to Biomedical Signal Processing, Telecommunications, and Econometrics.
Imagine being in a room with a crowd of people and two speakers giving presentations at the same time. The crowd is making comments and noises in the background. We are interested in what the speakers say and not the comments emanating from the crowd. There are two microphones at different locations, recording the speakers’ voices as well as the noise coming from the crowd. Our task is to separate the voice of each speaker while ignoring the background noise (see illustration below).
This is a classic example of Independent Component Analysis (ICA), a well established stochastic technique. ICA can be used as a method of Blind Source Separation, meaning that it can separate independent signals from linear mixtures with virtually no prior knowledge of the signals. An example is the decomposition of electro- or magnetoencephalographic signals. In computational neuroscience, ICA has been used for feature extraction, in which case it seems to adequately model the basic cortical processing of visual and auditory information. New application areas are being discovered at an increasing pace.
How to Visualize Data (Graph Types)
Brief Overviews of Types of Graphs
- 2D Graphs: Bar/Column, Bar Dev, Bar Left Y, Bar Right Y, Bar Top, Bar X, Box, Detrended Probability, Half-Normal Probability, Hanging Bar Histograms, Histograms, Line, Pie Charts, Probability, Probability-Probability, Quantile-Quantile, Range, Scatterplots, Sequential/Stacked, Voronoi Scatterplot
- 3D XYZ Graphs: Spectral, Trace
- 3D Sequential Graphs
- 4D/Ternary Graphs
- 2D Categorized Graphs
- 3D Categorized Graphs: Contour, Deviation, Scatterplots, Space, Spectral, Surface
- Ternary Categorized Graphs
- nD/Icon Graphs
- Matrix Graphs
Categorized Graphs
One of the most important, general, and also powerful analytic methods involves dividing (“splitting”) the data set into categories in order to compare the patterns of data between the resulting subsets. This common technique is known under a variety of terms (such as breaking down, grouping, categorizing, splitting, slicing, drilling-down, or conditioning) and it is used both in exploratory data analyses and hypothesis testing. For example, a positive relation between age and the risk of a heart attack may be different in males and females (it may be stronger in males). A promising relation between taking a drug and a decrease of the cholesterol level may be present only in women with low blood pressure and only in their thirties and forties. The process capability indices or capability histograms can be different for periods of time supervised by different operators. The regression slopes can be different in different experimental groups.
There are many computational techniques that capitalize on grouping and that are designed to quantify the differences that the grouping will reveal (e.g., ANOVA/MANOVA). However, graphical techniques (such as categorized graphs discussed in this section) offer unique advantages that cannot be substituted by any computational method alone: they can reveal patterns that cannot be easily quantified (e.g., complex interactions, exceptions, anomalies) and they provide unique, multidimensional, global analytic perspectives to explore or “mine” the data.
What are Categorized Graphs?
Categorized graphs (the term first used in STATISTICA software by StatSoft in 1990; also recently called Trellis graphs, by Becker, Cleveland, and Clark, at Bell Labs) produce a series of 2D, 3D, ternary, or nD graphs (such as histograms, scatterplots, line plots, surface plots, ternary scatterplots, etc.), one for each selected category of cases (i.e., subset of cases), for example, respondents from New York, Chicago, Dallas, etc. These “component” graphs are placed sequentially in one display, allowing for comparisons between the patterns of data shown in graphs for each of the requested groups (e.g., cities).
A variety of methods can be used to select the subsets; the simplest of them is using a categorical variable (e.g., a variable City, with three values New York, Chicago, and Dallas). For example, the following graph shows histograms of a variable representing self-reported stress levels in each of the three cities.
We could conclude that the data suggest that people who live in Dallas are less likely to report being stressed, while the patterns (distributions) of stress reporting in New York and Chicago are quite similar.
Categorized graphs in some software systems (e.g., in STATISTICA) also support two-way or multi-way categorizations, where not one criterion (e.g., City) but two or more criteria (e.g., City and Time of the day) are used to create the subsets. Two-way categorized graphs can be thought of as “crosstabulations of graphs” where each component graph represents a cross-section of one level of one grouping variable (e.g., City) and one level of the other grouping variable (e.g., Time).
Adding this second factor reveals that the patterns of stress reporting in New York and Chicago are actually quite different when the Time of questioning is taken into consideration, whereas the Time factor makes little difference in Dallas.
Categorized graphs vs. matrix graphs. Matrix graphs also produce displays containing multiple component graphs; however, each of those component graphs is (or can be) based on the same set of cases and the graphs are generated for all combinations of variables from one or two lists. Categorized graphs require a selection of variables that normally would be selected for non-categorized graphs of the respective type (e.g., two variables for a scatterplot). However, in categorized plots, you also need to specify at least one grouping variable (or some criteria to be used for sorting the observations into the categories) that contains information on group membership of each case (e.g., Chicago, Dallas). That grouping variable will not be included in the graph directly (i.e., it will not be plotted) but it will serve as a criterion for dividing all analyzed cases into separate graphs. As illustrated above, one graph will be created for each group (category) identified by the grouping variable.
Common vs. Independent scaling. Each individual category graph may be scaled according to its own range of values (independent scaling), or all graphs may be scaled to a common scale wide enough to accommodate all values in all of the category graphs.
Common scaling allows the analyst to make comparisons of ranges and distributions of values among categories. However, if the ranges of values in graph categories are considerably different (causing a very wide common scale), then some of the graphs may be difficult to examine. The use of independent scaling may make it easier to spot trends and specific patterns within categories, but it may be more difficult to make comparisons of ranges of values among categories.
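The difference between the two scaling choices comes down to how the axis range is computed. A minimal sketch, using hypothetical stress scores for three cities and illustrative function names:

```python
def independent_scales(groups):
    """One (min, max) axis range per category graph."""
    return {name: (min(values), max(values)) for name, values in groups.items()}

def common_scale(groups):
    """A single (min, max) range wide enough for every category graph."""
    all_values = [v for values in groups.values() for v in values]
    return min(all_values), max(all_values)

# Hypothetical self-reported stress scores by city
stress = {
    "New York": [4, 6, 7, 9],
    "Chicago": [5, 6, 8, 9],
    "Dallas": [1, 2, 3, 5],
}

per_graph = independent_scales(stress)
shared = common_scale(stress)
print(per_graph, shared)
```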
Categorization Methods
There are five general methods of categorization of values and they will be reviewed briefly in this section: Integer mode, Categories, Boundaries, Codes, and Multiple subsets. Note that the same methods of categorization can be used to categorize cases into component graphs and to categorize cases within component graphs (e.g., in histograms or box plots).
Integer Mode. When you use Integer Mode, integer values of the selected grouping variable will be used to define the categories, and one graph will be created for all cases that belong to each category (defined by those integer values). If the selected grouping variable contains non-integer values, the software will usually truncate each encountered value of the selected grouping variable to an integer value.
Categories. With this mode of categorization, you will specify the number of categories which you wish to use. The software will divide the entire range of values of the selected grouping variable (from minimum to maximum) into the requested number of equal length intervals.
Boundaries. The Boundaries method will also create interval categorization; however, the intervals can be of arbitrary (e.g., uneven) width as defined by custom interval boundaries (for example, “less than –10,” “greater than or equal to –10 but less than 0,” “greater than or equal to 0 but less than 10,” and “equal to or greater than 10”).
Codes. Use this method if the selected grouping variable contains “codes” (i.e., specific, meaningful values such as Male, Female) from which you want to specify the categories.
Multiple subsets. This method allows you to custom-define the categories and enables you to use more than one variable to define the category. In other words, categorizations based on multiple subset definitions of categories may not represent distributions of specific (individual) variables but distributions of frequencies of specific “events” defined by particular combinations of values of several variables (and defined by conditions which may involve any number of variables from the current data set). For example, you might specify six categories based on combinations of three variables Gender, Age, and Employment.
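Three of these methods (Integer mode, Categories, and Boundaries) can be sketched as simple functions that map each value to a category index. The function names are illustrative and the sketch does not reproduce how any particular software implements them; in particular, placing the maximum value in the last interval is an assumption.

```python
from bisect import bisect_right

def integer_mode(values):
    """Truncate each value to an integer and use it as the category."""
    return [int(v) for v in values]

def equal_intervals(values, n_categories):
    """Split the range from minimum to maximum into equal-length intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_categories
    result = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        result.append(min(index, n_categories - 1))  # maximum goes in last interval
    return result

def by_boundaries(values, boundaries):
    """Category k holds values with boundaries[k-1] <= v < boundaries[k]."""
    return [bisect_right(boundaries, v) for v in values]

truncated = integer_mode([1.9, 2.0, 3.7])
intervals = equal_intervals([0, 2.5, 5, 7.5, 10], 4)
# Boundaries from the example in the text: -10, 0, and 10
custom = by_boundaries([-11, -10, -0.5, 0, 9.9, 10], [-10, 0, 10])
print(truncated, intervals, custom)
```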
Histograms
In general, histograms are used to examine frequency distributions of values of variables. For example, the frequency distribution plot shows which specific values or ranges of values of the examined variable are most frequent, how differentiated the values are, whether most observations are concentrated around the mean, whether the distribution is symmetrical or skewed, whether it is multimodal (i.e., has two or more peaks) or unimodal, etc. Histograms are also useful for evaluating the similarity of an observed distribution with theoretical or expected distributions.
Categorized Histograms allow you to produce histograms broken down by one or more categorical variables, or by any other one or more sets of logical categorization rules (see Categorization Methods).
There are two major reasons why frequency distributions are of interest.
- We can learn from the shape of the distribution about the nature of the examined variable (e.g., a bimodal distribution may suggest that the sample is not homogeneous and consists of observations that belong to two populations that are more or less normally distributed).
- Many statistics are based on assumptions about the distributions of analyzed variables; histograms help us to test whether those assumptions are met.
Often, the first step in the analysis of a new data set is to run histograms on all variables.
Histograms vs. Breakdown. Categorized Histograms provide information similar to breakdowns (e.g., mean, median, minimum, maximum, differentiation of values, etc.; see Basic Statistics and Tables). Although specific (numerical) descriptive statistics are easier to read in a table, the overall shape and global descriptive characteristics of a distribution are much easier to examine in a graph. Moreover, the graph provides qualitative information about the distribution that cannot be fully represented by any single index. For example, the overall skewed distribution of income may indicate that the majority of people have an income that is much closer to the minimum than maximum of the range of income. Moreover, when broken down by gender and ethnic background, this characteristic of the income distribution may be found to be more pronounced in certain subgroups. Although this information will be contained in the index of skewness (for each sub-group), when presented in the graphical form of a histogram, the information is usually more easily recognized and remembered. The histogram may also reveal “bumps” that may represent important facts about the specific social stratification of the investigated population or anomalies in the distribution of income in a particular group caused by a recent tax reform.
Categorized histograms and scatterplots. A useful application of the categorization methods for continuous variables is to represent the simultaneous relationships between three variables. Shown below is a scatterplot for two variables Load 1 and Load 2.
Now suppose you would like to add a third variable (Output) and examine how it is distributed at different levels of the joint distribution of Load 1 and Load 2. The following graph could be produced:
In this graph, Load 1 and Load 2 are both categorized into 5 intervals, and within each combination of intervals the distribution for variable Output is computed. Note that the “box” (parallelogram) encloses approximately the same observations (cases) in both graphs shown above.
Scatterplots
In general, two-dimensional scatterplots are used to visualize relations between two variables X and Y (e.g., weight and height). In scatterplots, individual data points are represented by point markers in two-dimensional space, where axes represent the variables. The two coordinates (X and Y) which determine the location of each point correspond to its specific values on the two variables. If the two variables are strongly related, then the data points form a systematic shape (e.g., a straight line or a clear curve). If the variables are not related, then the points form a round “cloud.”
The categorized scatterplot option allows you to produce scatterplots categorized by one or more variables. Via the Multiple Subsets method (see Categorization Methods), you can also categorize the scatterplot based on logical selection conditions that define each category or group of observations.
Categorized scatterplots offer a powerful exploratory and analytic technique for investigating relationships between two or more variables within different sub-groups.
Homogeneity of Bivariate Distributions (Shapes of Relations). Scatterplots are typically used to identify the nature of relations between two variables (e.g., blood pressure and cholesterol level), because they can provide much more information than a correlation coefficient.
For example, a lack of homogeneity in the sample from which a correlation was calculated can bias the value of the correlation. Imagine a case where a correlation coefficient is calculated from data points which came from two different experimental groups, but this fact was ignored when the correlation was calculated. Suppose the experimental manipulation in one of the groups increased the values of both correlated variables, and thus the data from each group form a distinctive “cloud” in the scatterplot (as shown in the following illustration).
In this example, the high correlation is entirely due to the arrangement of the two groups, and it does not represent the “true” relation between the two variables, which is practically equal to 0 (as could be seen if we looked at each group separately).
If you suspect that such a pattern may exist in your data and you know how to identify the possible “subsets” of data, then producing a categorized scatterplot may yield a more accurate picture of the strength of the relationship between the X and Y variables within each group (i.e., after controlling for group membership).
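The situation described here can be reproduced with a few constructed points. In this sketch, the two hypothetical groups each show no relation between X and Y, yet pooling them produces a high correlation because the second group is shifted upward on both variables.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical group 1: no relation between X and Y
x1, y1 = [1, 1, 3, 3], [1, 3, 1, 3]
# Hypothetical group 2: same pattern, shifted by 10 on both variables
x2, y2 = [x + 10 for x in x1], [y + 10 for y in y1]

pooled_r = pearson(x1 + x2, y1 + y2)   # ignores group membership
within_r1 = pearson(x1, y1)            # computed separately per group
within_r2 = pearson(x2, y2)
print(pooled_r, within_r1, within_r2)
```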
Curvilinear Relations. Curvilinearity is another aspect of the relationships between variables that can be examined in scatterplots. There are no “automatic” or easy-to-use tests to measure curvilinear relationships between variables: the standard Pearson r coefficient measures only linear relations; some nonparametric correlations such as the Spearman R can measure curvilinear relations, but only monotonic ones. Examining scatterplots enables us to identify the shape of relations, so that later an appropriate data transformation can be chosen to “straighten” the data, or an appropriate nonlinear estimation equation can be chosen to be fit.
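The difference between the two coefficients can be seen on two hypothetical relations: a monotonic curve (Y = X cubed) and a U-shaped one (Y = X squared around zero). The sketch computes the Spearman coefficient as the Pearson correlation of average ranks.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Monotonic but curved relation
x_mono = [1, 2, 3, 4, 5, 6, 7]
y_mono = [x ** 3 for x in x_mono]
pearson_mono, spearman_mono = pearson(x_mono, y_mono), spearman(x_mono, y_mono)

# Non-monotonic (U-shaped) relation
x_u = [-3, -2, -1, 0, 1, 2, 3]
y_u = [x ** 2 for x in x_u]
pearson_u, spearman_u = pearson(x_u, y_u), spearman(x_u, y_u)

print(pearson_mono, spearman_mono, pearson_u, spearman_u)
```

For the monotonic curve the rank correlation is perfect while Pearson r falls short of 1; for the U shape both coefficients are zero even though Y is completely determined by X.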
For more information, refer to Basic Statistics, Nonparametrics and Distributions, Multiple Regression, and Nonlinear Estimation.
Probability Plots
Three types of categorized probability plots are Normal, Half-Normal, and Detrended. Normal probability plots provide a quick way to visually inspect to what extent the pattern of data follows a normal distribution.
Via categorized probability plots, we can examine how closely the distribution of a variable follows the normal distribution in different sub-groups.
Categorized normal probability plots provide an efficient tool to examine the normality aspect of group homogeneity.
Quantile-Quantile Plots
The categorized Quantile-Quantile (or Q-Q) plot is useful for finding the best fitting distribution within a family of distributions.
With Categorized Q-Q plots, a series of Quantile-Quantile (or Q-Q) plots, one for each category of cases identified by the X or X and Y category variables (or identified by the Multiple Subset criteria, see Categorization Methods) are produced. Examples of distributions which are used for Q-Q plots are the Exponential Distribution, Extreme Distribution, Normal, Rayleigh, Beta, Gamma, Lognormal, and Weibull distributions.
Probability-Probability Plots
The categorized Probability-Probability (or P-P) plot is useful for determining how well a specific theoretical distribution fits the observed data. This type of graph includes a series of Probability-Probability (or P-P) plots, one for each category of cases identified by the X or X and Y category variables (or identified by the Multiple Subset criteria, see Categorization Methods).
In the P-P plot, the observed cumulative distribution function (the proportion of non-missing values less than or equal to x) is plotted against a theoretical cumulative distribution function in order to assess the fit of the theoretical distribution to the observed data. If all points in this plot fall onto a diagonal line (with intercept 0 and slope 1), then you can conclude that the theoretical cumulative distribution adequately approximates the observed distribution.
If the data points do not all fall on the diagonal line, then you can use this plot to visually assess where the data do and do not follow the distribution (e.g., if the points form an S shape along the diagonal line, then the data may need to be transformed in order to bring them to the desired distribution pattern).
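The coordinates of a P-P plot can be computed as in the following sketch, which pairs the theoretical and observed cumulative proportions for each sorted value. It assumes a normal theoretical distribution, no tied values, and a hypothetical sample constructed to be close to normal; `pp_points` is an illustrative name.

```python
from statistics import NormalDist

def pp_points(data, dist):
    """Return (theoretical CDF, observed CDF) pairs for each sorted value."""
    xs = sorted(data)
    n = len(xs)
    # Observed CDF: proportion of values less than or equal to x (no ties assumed)
    return [(dist.cdf(x), (i + 1) / n) for i, x in enumerate(xs)]

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Hypothetical sample: 50 evenly spaced quantiles of the standard normal
n = 50
sample = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]

points = pp_points(sample, standard_normal)
max_gap = max(abs(theoretical - observed) for theoretical, observed in points)
print(max_gap)
```

A small maximum gap means the points lie close to the diagonal; larger, systematic gaps would show up as the S shape mentioned above.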
Line Plots
In line plots, individual data points are connected by a line. Line plots provide a simple way to visually present a sequence of many values (e.g., stock market quotes over a number of days). The categorized Line Plots graph is useful when we want to view such data broken down (categorized) by a grouping variable (e.g., closing stock quotes on Mondays, Tuesdays, etc.) or some other logical criteria involving one or more other variables (e.g., closing quotes only for those days when two other stocks and the Dow Jones index went up, versus all other closing quotes; see Categorization Methods).
Box Plots
In Box Plots (the term first used by Tukey, 1970), ranges of values of a selected variable (or variables) are plotted separately for groups of cases defined by values of up to three categorical (grouping) variables, or as defined by Multiple Subsets categories.
The central tendency (e.g., median or mean), and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each group of cases, and the selected values are presented in one of five styles (Box Whiskers, Whiskers, Boxes, Columns, or High-Low Close). Outlier data points can also be plotted (see the sections on outliers and extremes).
For example, in the following graph, outliers (in this case, points lying more than 1.5 times the inter-quartile range above or below the box) indicate a particularly “unfortunate” flaw in an otherwise nearly perfect combination of factors:
However, in the following graph, no outliers or extreme values are evident.
There are two typical applications for box plots: (a) showing ranges of values for individual items, cases or samples (e.g., a typical MIN-MAX plot for stocks or commodities or aggregated sequence data plots with ranges), and (b) showing variation of scores in individual groups or samples (e.g., box and whisker plots presenting the mean for each sample as a point inside the box, standard errors as the box, and standard deviations around the mean as a narrower box or a pair of “whiskers”).
Box plots showing variation of scores allow us to quickly evaluate and “intuitively envision” the strength of the relation between the grouping and dependent variable. Specifically, assuming that the dependent variable is normally distributed, and knowing what proportion of observations fall, for example, within ±1 or ±2 standard deviations from the mean (see Elementary Concepts), we can easily evaluate the results of an experiment and say that, for example, the scores in about 95% of cases in experimental group 1 belong to a different range than scores in about 95% of cases in group 2.
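The proportions mentioned above can be checked numerically. The sketch below uses the standard normal distribution to compute the share of observations within ±1 and ±2 standard deviations of the mean, then compares the ±2 standard deviation ranges of two groups. The group means and standard deviations are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def proportion_within(k):
    """Proportion of a normal distribution within +/- k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1sd = proportion_within(1)   # roughly 0.68
within_2sd = proportion_within(2)   # roughly 0.95

def two_sd_range(mean, sd):
    """Interval expected to hold about 95% of normally distributed scores."""
    return (mean - 2 * sd, mean + 2 * sd)

def ranges_overlap(a, b):
    """True when the two closed intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

group1 = two_sd_range(mean=50.0, sd=4.0)   # hypothetical group statistics
group2 = two_sd_range(mean=70.0, sd=5.0)
separated = not ranges_overlap(group1, group2)
```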
In addition, so-called trimmed means (this term was first used by Tukey, 1962) may be plotted by excluding a user-specified percentage of cases from the extremes (i.e., tails) of the distribution of cases.
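The statistics behind a box plot are simple to compute. The following sketch derives quartiles, flags outliers lying more than 1.5 inter-quartile ranges beyond the box, and computes a trimmed mean. The quartile method ("inclusive") and the choice to trim the given percentage from each tail are assumptions; a particular software package may define both differently.

```python
from statistics import mean, quantiles

def box_stats(values, coef=1.5):
    """Quartiles and outliers for one group of cases."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - coef * iqr, q3 + coef * iqr   # outlier fences
    outliers = [v for v in data if v < lower or v > upper]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the cases from each tail."""
    data = sorted(values)
    k = int(len(data) * percent / 100)
    return mean(data[k:len(data) - k])

scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]   # hypothetical scores
stats = box_stats(scores)
tm = trimmed_mean(scores, 10)
```

In this hypothetical sample the value 45 falls above the upper fence, and trimming 10% from each tail removes it from the mean.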
Pie Charts
The pie chart is one of the most common graph formats used for representing proportions or values of variables. This graph allows you to produce pie charts broken down by one or more other variables (e.g., grouping variables such as gender) or categorized according to some logical selection conditions that identify Multiple Subsets (see Categorization Methods).
For purposes of this discussion, categorized pie charts will always be interpreted as frequency pie charts (as opposed to data pie charts). This type of pie chart interprets data like a histogram: it categorizes all values of the selected variable following the selected categorization technique and then displays the relative frequencies as pie slices of proportional sizes. Thus, these pie charts offer an alternative method to display frequency histogram data (see the section on Categorized Histograms).
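A frequency pie chart can be sketched in a few lines: categorize each value, count the categories, and turn the relative frequencies into slice angles. The income bands and the function names below are hypothetical.

```python
from collections import Counter

def pie_slices(values, categorize):
    """Map each category to (relative frequency, slice angle in degrees)."""
    counts = Counter(categorize(v) for v in values)
    total = sum(counts.values())
    return {cat: (n / total, 360.0 * n / total) for cat, n in counts.items()}

def income_band(x):
    """Hypothetical categorization rule for the selected variable."""
    if x < 30:
        return "low"
    if x < 60:
        return "medium"
    return "high"

incomes = [12, 25, 31, 44, 58, 61, 75, 90]   # hypothetical values
slices = pie_slices(incomes, income_band)
```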
Pie-Scatterplots. Another useful application of categorized pie charts is to represent the relative frequency distribution of a variable at each “location” of the joint distribution of two other variables. Here is an example:
Note that pies are only drawn in “places” where there are data. Thus, the graph shown above takes on the appearance of a scatterplot (of variables L1 and L2), with the individual pies as point markers. However, in addition to the information contained in a simple scatterplot, each pie shows the relative distribution of a third variable at the respective location (i.e., Low, Medium, and High Quality).
Missing/Range Data Points Plots
This graph produces a series of 2D graphs (one for each category of cases identified by the grouping variables or by the Multiple Subset criteria; see Categorization Methods) of missing data points and/or user-specified “out of range” points from which you can visualize the pattern or distribution of missing data (and/or user-specified “out of range” points) within each subset of cases (category).
This graph is useful in exploratory data analysis to determine the extent of missing (and/or “out of range”) data and whether the patterns of those data occur randomly.
3D Plots
This type of graph allows you to produce 3D scatterplots (space plots, spectral plots, deviation plots, and trace plots), contour plots, and surface plots for subsets of cases defined by the specified categories of a selected variable or categories determined by user-defined case selection conditions (see Categorization Methods). Thus, the general purpose of this plot is to facilitate comparisons between groups or categories regarding the relationships between three or more variables.
Applications. In general, 3D XYZ graphs summarize the interactive relationships between three variables. The different ways in which data can be categorized (in a Categorized Graph) enable us to review those relationships contingent on some other criterion (e.g., group membership).
For example, from the categorized surface plot shown below, we can conclude that the setting of the tolerance level in an apparatus does not affect the investigated relationship between the measurements (Depend1, Depend2, and Height) unless the setting is 3.
The effect is more salient when you switch to the contour plot representation.
Ternary Plots
A categorized ternary plot can be used to examine relations between three or more dimensions where three of those dimensions represent components of a mixture (i.e., the relations between them are constrained such that the values of the three variables add up to the same constant for each case) for each level of a grouping variable.
In ternary plots, the triangular coordinate systems are used to plot four (or more) variables (the components X, Y, and Z, and the responses V1, V2, etc.) in two dimensions (ternary scatterplots or contours) or three dimensions (ternary surface plots). In order to produce ternary graphs, the relative proportions of each component within each case are constrained to add up to the same value (e.g., 1).
In a categorized ternary plot, one component graph is produced for each level of the grouping variable (or user-defined subset of data) and all the component graphs are arranged in one display to allow for comparisons between the subsets of data (categories).
Applications. A typical application of this graph is when the measured response(s) from an experiment depends on the relative proportions of three components (e.g., three different chemicals) which are varied in order to determine an optimal combination of those components (e.g., in mixture designs). This type of graph can also be used for other applications where relations between constrained variables need to be compared across categories or subsets of data.
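To make the triangular coordinate system concrete, the sketch below maps the three components of a mixture to a point inside an equilateral triangle. The vertex layout (pure X at the lower left, pure Y at the lower right, pure Z at the top) is an assumption; other layouts are equally valid.

```python
import math

def ternary_to_xy(x, y, z):
    """Map mixture components to 2D coordinates in an equilateral triangle.

    Components are rescaled to sum to 1. Assumed vertices: pure X at (0, 0),
    pure Y at (1, 0), pure Z at (0.5, sqrt(3) / 2).
    """
    total = x + y + z
    if total <= 0:
        raise ValueError("components must have a positive sum")
    y, z = y / total, z / total
    return (y + 0.5 * z, z * math.sqrt(3) / 2)

corner_x = ternary_to_xy(1, 0, 0)
corner_y = ternary_to_xy(0, 1, 0)
center = ternary_to_xy(1, 1, 1)   # equal proportions land at the centroid
```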
Brushing
Perhaps the most common and historically first widely used technique explicitly identified as graphical exploratory data analysis is brushing, an interactive method that enables us to select on-screen specific data points or subsets of data and identify their (e.g., common) characteristics, or to examine their effects on relations between relevant variables (e.g., in scatterplot matrices) or to identify (e.g., label) outliers.
Those relations between variables can be visualized by fitted functions (e.g., 2D lines or 3D surfaces) and their confidence intervals; thus, for example, we can examine changes in those functions by interactively (temporarily) removing or adding specific subsets of data. For example, one of many applications of the brushing technique is to select (i.e., highlight) in a matrix scatterplot all data points that belong to a certain category (e.g., a “medium” income level; see the highlighted subset in the upper right component graph in the illustration below) in order to examine how those specific observations contribute to relations between other variables in the same data set (e.g., the correlation between the “debt” and “assets” in the current example).
If the brushing facility supports features like “animated brushing” (see example below) or “automatic function re-fitting,” we can define a dynamic brush that would move over the consecutive ranges of a criterion variable (e.g., “income” measured on a continuous scale and not a discrete scale as in the illustration above) and examine the dynamics of the contribution of the criterion variable to the relations between other relevant variables in the same data set.
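The effect of brushing on a fitted relation can be imitated without a graphical interface. The following sketch computes the correlation between two variables for all cases and again after temporarily removing a "brushed" subset. The data values and variable names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical cases: (income level, debt, assets)
cases = [
    ("low", 2, 3), ("low", 3, 5), ("low", 4, 6),
    ("medium", 5, 2), ("medium", 6, 1),
    ("high", 7, 10), ("high", 8, 12),
]

def debt_assets_r(rows):
    return pearson_r([r[1] for r in rows], [r[2] for r in rows])

r_all = debt_assets_r(cases)
# "Brush" the medium-income cases and re-fit without them
r_without_medium = debt_assets_r([r for r in cases if r[0] != "medium"])
```

Comparing r_all with r_without_medium shows how much the brushed subset contributes to the overall relation.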
Smoothing Bivariate Distributions
Three-dimensional histograms are used to visualize crosstabulations of values in two variables. They can be considered to be a conjunction of two simple (i.e., univariate) histograms, combined such that the frequencies of co-occurrences of values on the two analyzed variables can be examined. In the most common format of this graph, a 3D bar is drawn for each “cell” of the crosstabulation table and the height of the bar represents the frequency of values for the respective cell of the table. Different methods of categorization can be used for each of the two variables for which the bivariate distribution is visualized (see below).
If the software provides smoothing facilities, you can fit surfaces to 3D representations of bivariate frequency data. Thus, every 3D histogram can be turned into a smoothed surface. This technique is of relatively little help if applied to a simple pattern of categorized data (such as the histogram that was shown above).
However, if applied to more complex patterns of frequencies, it may provide a valuable exploratory technique, allowing identification of regularities which are less salient when examining the standard 3D histogram representations (e.g., see the systematic surface “wave-patterns” shown on the smoothed histogram above).
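As a rough sketch of the two steps described above, the code below first builds the crosstabulation that a 3D histogram displays and then smooths it. The smoothing used here, a plain moving average over neighbouring cells, is an assumption chosen for simplicity and is not necessarily the surface-fitting method a given package applies.

```python
def cell_index(value, edges):
    """Index of the interval [edges[i], edges[i + 1]) holding value, else None."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None

def crosstab(xs, ys, x_edges, y_edges):
    """Frequency of co-occurrences of x and y values in each cell."""
    table = [[0] * (len(y_edges) - 1) for _ in range(len(x_edges) - 1)]
    for x, y in zip(xs, ys):
        i, j = cell_index(x, x_edges), cell_index(y, y_edges)
        if i is not None and j is not None:
            table[i][j] += 1
    return table

def smooth(table):
    """Replace each cell by the mean of itself and its existing neighbours."""
    rows, cols = len(table), len(table[0])
    result = []
    for i in range(rows):
        result_row = []
        for j in range(cols):
            block = [table[a][b]
                     for a in range(max(0, i - 1), min(rows, i + 2))
                     for b in range(max(0, j - 1), min(cols, j + 2))]
            result_row.append(sum(block) / len(block))
        result.append(result_row)
    return result

xs = [1, 2, 2, 3, 5, 5]   # hypothetical data
ys = [1, 1, 2, 3, 5, 4]
table = crosstab(xs, ys, [0, 2, 4, 6], [0, 2, 4, 6])
smoothed = smooth(table)
```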
Layered Compression
When layered compression is used, the main graph plotting area is reduced in size to leave space for Margin Graphs in the upper and right side of the display (and a miniature graph in the corner). These smaller Margin Graphs represent vertically and horizontally compressed images (respectively) of the main graph.
In 2D graphs, layered compression is an exploratory data analysis technique that may facilitate the identification of otherwise obscured trends and patterns in 2-dimensional data sets. For example, in the following illustration (based on an example discussed by Cleveland, 1993), it can be seen that the number of sunspots in each cycle decays more slowly than it rises at the onset of each cycle. This tendency is not readily apparent when examining the standard line plot; however, the compressed graph uncovers the hidden pattern.
Projections of 3D Data Sets
Contour plots generated by projecting surfaces (created from multivariate, typically three-variable, data sets) offer a useful method to explore and analytically examine the shapes of surfaces.
As compared to surface plots, they may be less effective for quickly visualizing the overall shape of 3D data structures; however, their main advantage is that they allow for precise examination and analysis of the shape of the surface (contour plots display a series of undistorted horizontal “cross sections” of the surface).
Icon Plots
Icon Graphs represent cases or units of observation as multidimensional symbols and they offer a powerful although not easy to use exploratory technique. The general idea behind this method capitalizes on the human ability to “automatically” spot complex (sometimes interactive) relations between multiple variables if those relations are consistent across a set of instances (in this case “icons”). Sometimes the observation (or a “feeling”) that certain instances are “somehow similar” to each other comes before the observer (in this case an analyst) can articulate which specific variables are responsible for the observed consistency (Lewicki, Hill, & Czyzewska, 1992). However, further analysis that focuses on such intuitively spotted consistencies can reveal the specific nature of the relevant relations between variables.
The basic idea of icon plots is to represent individual units of observation as particular graphical objects where values of variables are assigned to specific features or dimensions of the objects (usually one case = one object). The assignment is such that the overall appearance of the object changes as a function of the configuration of values.
Thus, the objects are given visual “identities” that are unique for configurations of values and that can be identified by the observer. Examining such icons may help to discover specific clusters of both simple relations and interactions between variables.
Analyzing Icon Plots
The “ideal” design of the analysis of icon plots consists of five phases:
- Select the order of variables to be analyzed. In many cases a random starting sequence is the best solution. You may also try to enter variables based on the order in a multiple regression equation, factor loadings on an interpretable factor (see Factor Analysis), or a similar multivariate technique. That method may simplify and “homogenize” the general appearance of the icons which may facilitate the identification of non-salient patterns. It may also, however, make some interactive patterns more difficult to find. No universal recommendations can be given at this point, other than to try the quicker (random order) method before getting involved in the more time-consuming method.
- Look for any potential regularities, such as similarities between groups of icons, outliers, or specific relations between aspects of icons (e.g., “if the first two rays of the star icon are long, then one or two rays on the other side of the icon are usually short”). The Circular type of icon plots is recommended for this phase.
- If any regularities are found, try to identify them in terms of the specific variables involved.
- Reassign variables to features of icons (or switch to one of the sequential icon plots) to verify the identified structure of relations (e.g., try to move the related aspects of the icon closer together to facilitate further comparisons). In some cases, at the end of this phase it is recommended to drop the variables that appear not to contribute to the identified pattern.
- Finally, use a quantitative method (such as a regression method, nonlinear estimation, discriminant function analysis, or cluster analysis) to test and quantify the identified pattern or at least some aspects of the pattern.
Taxonomy of Icon Plots
Most icon plots can be assigned to one of two categories: circular and sequential.
Circular icons. Circular icon plots (star plots, sun ray plots, polygon icons) follow a “spoked wheel” format where values of variables are represented by distances between the center (“hub”) of the icon and its edges.
Those icons may help to identify interactive relations between variables because the overall shape of the icon may assume distinctive and identifiable overall patterns depending on multivariate configurations of values of input variables.
In order to translate such “overall patterns” into specific models (in terms of relations between variables) or verify specific observations about the pattern, it is helpful to switch to one of the sequential icon plots, which may prove more efficient when we already know what to look for.
Sequential icons. Sequential icon plots (column icons, profile icons, line icons) follow a simpler format where individual symbols are represented by small sequence plots (of different types).
The values of consecutive variables are represented in those plots by distances between the base of the icon and the consecutive break points of the sequence (e.g., the height of the columns shown above). Those plots may be less efficient as a tool for the initial exploratory phase of icon analysis because the icons may look alike. However, as mentioned before, they may be helpful in the phase when some hypothetical pattern has already been revealed and we need to verify it or articulate it in terms of relations between individual variables.
Pie icons. Pie icon plots fall somewhere in-between the previous two categories; all icons have the same shape (pie) but are sequentially divided in a different way according to the values of consecutive variables.
From a functional point of view, they belong rather to the sequential than circular category, although they can be used for both types of applications.
Chernoff faces. This type of icon is a category by itself. Cases are visualized by schematic faces such that relative values of variables selected for the graph are represented by variations of specific facial features.
Due to its unique features, it is considered by some researchers as an ultimate exploratory multivariate technique that is capable of revealing hidden patterns of interrelations between variables that cannot be uncovered by any other technique. This statement may be an exaggeration, however. Also, it must be admitted that Chernoff Faces is a method that is difficult to use, and it requires a great deal of experimentation with the assignment of variables to facial features. See also Data Mining Techniques.
Standardization of Values
Except for unusual cases when you intend for the icons to reflect the global differences in ranges of values between the selected variables, the values of the variables should be standardized once to assure within-icon compatibility of value ranges. For example, because the largest value sets the global scaling reference point for the icons, variables whose values are of a much smaller order of magnitude may not appear in the icon at all (e.g., in a star plot, the rays that represent them will be too short to be visible).
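One simple way to make value ranges compatible within an icon is to rescale every variable to a common range before drawing. The sketch below uses min-max rescaling to 0..1; this particular method is an assumption, since the text does not prescribe one (z-scores would be another option).

```python
def rescale_columns(rows):
    """Min-max rescale every variable (column) to the range 0..1.

    `rows` holds one list of values per case. A constant column is mapped
    to 0.5 so that it still produces a visible feature.
    """
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.5 for v in column])
    return [list(case) for case in zip(*scaled_columns)]

# Hypothetical cases with two variables on very different scales
cases = [[1200.0, 0.2], [1500.0, 0.9], [1800.0, 0.4]]
scaled = rescale_columns(cases)
```

After rescaling, the second variable is no longer dwarfed by the first, so its ray or column stays visible in every icon.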
Applications
Icon plots are generally applicable (1) to situations where we want to find systematic patterns or clusters of observations, and (2) when we want to explore possible complex relationships between several variables. The first type of application is similar to cluster analysis; that is, it can be used to classify observations.
For example, suppose you studied the personalities of artists, and you recorded the scores for several artists on a number of personality questionnaires. The icon plot may help you determine whether there are natural clusters of artists distinguished by particular patterns of scores on different questionnaires (e.g., you may find that some artists are very creative, undisciplined, and independent, while a second group is particularly intelligent, disciplined, and concerned with publicly-acknowledged success).
The second type of application — the exploration of relationships between several variables — is more similar to factor analysis; that is, it can be used to detect which variables tend to “go together.” For example, suppose you were studying the structure of people’s perception of cars. Several subjects completed detailed questionnaires rating different cars on numerous dimensions. In the data file, the average ratings on each dimension (entered as the variables) for each car (entered as cases or observations) are recorded.
When you now study the Chernoff faces (each face representing the perceptions for one car), it may occur to you that smiling faces tend to have big ears; if price was assigned to the amount of smile and acceleration to the size of ears, then this “discovery” means that fast cars are more expensive. This, of course, is only a simple example; in real-life exploratory data analyses, non-obvious complex relationships between variables may become apparent.
Related Graphs
Matrix plots visualize relations between variables from one or two lists. If the software allows you to mark selected subsets, matrix plots may provide information similar to that in icon plots.
If the software allows you to create and identify user-defined subsets in scatterplots, simple 2D scatterplots can be used to explore the relationships between two variables; likewise, when exploring the relationships between three variables, 3D scatterplots provide an alternative to icon plots.
Graph Type
There are various types of Icon Plots.
Chernoff Faces. A separate “face” icon is drawn for each case; relative values of the selected variables for each case are assigned to shapes and sizes of individual facial features (e.g., length of nose, angle of eyebrows, width of face).
For more information see Chernoff Faces in Taxonomy of Icon Plots.
Stars. Star Icons is a circular type of icon plot. A separate star-like icon is plotted for each case; relative values of the selected variables for each case are represented (clockwise, starting at 12:00) by the length of individual rays in each star. The ends of the rays are connected by a line.
Sun Rays. Sun Ray Icons is a circular type of icon plot. A separate sun-like icon is plotted for each case; each ray represents one of the selected variables (clockwise, starting at 12:00), and the length of the ray represents the relative value of the respective variable. Data values of the variables for each case are connected by a line.
Polygons. Polygon Icons is a circular type of icon plot. A separate polygon icon is plotted for each case; relative values of the selected variables for each case are represented by the distance from the center of the icon to consecutive corners of the polygon (clockwise, starting at 12:00).
Pies. Pie Icons is a circular type of icon plot. Data values for each case are plotted as a pie chart (clockwise, starting at 12:00); relative values of selected variables are represented by the size of the pie slices.
Columns. Column Icons is a sequential type of icon plot. An individual column graph is plotted for each case; relative values of the selected variables for each case are represented by the height of consecutive columns.
Lines. Line Icons is a sequential type of icon plot. An individual line graph is plotted for each case; relative values of the selected variables for each case are represented by the height of consecutive break points of the line above the baseline.
Profiles. Profile Icons is a sequential type of icon plot. An individual area graph is plotted for each case; relative values of the selected variables for each case are represented by the height of consecutive peaks of the profile above the baseline.
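The geometry of the circular icons described above (stars, sun rays, polygons) can be sketched as follows: each variable gets one ray, the first ray points to 12:00, and the remaining rays follow clockwise at equal angles. The function name, the 0..1 value range, and a y axis that points upward are assumptions made for illustration.

```python
import math

def star_vertices(values, center=(0.0, 0.0), radius=1.0):
    """End points of the rays of one circular icon.

    `values` are relative values in 0..1, one per variable.
    """
    n = len(values)
    cx, cy = center
    points = []
    for k, v in enumerate(values):
        angle = math.pi / 2 - 2 * math.pi * k / n   # clockwise from 12:00
        points.append((cx + radius * v * math.cos(angle),
                       cy + radius * v * math.sin(angle)))
    return points

# One hypothetical case with four standardized variables
vertices = star_vertices([1.0, 0.5, 0.25, 0.75])
```

Connecting consecutive vertices with a line gives the outline of a star or polygon icon; drawing a segment from the center to each vertex gives the rays.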
Mark Icons
If the software allows you to specify multiple subsets, it is useful to specify the cases (subjects) whose icons will be marked (i.e., frames will be placed around the selected icons) in the plot.
The line patterns of frames which identify specific subsets should be listed in the legend along with the case selection conditions. The following graph shows an example of marked subsets.
All cases (observations) which meet the condition specified in Subset 1 (i.e., cases for which the value of variable Iristype is equal to Setosa and for which the case number is less than 100) are marked with a specific frame around the selected icons.
All cases which meet the condition outlined in Subset 2 (i.e., cases for which the value of Iristype is equal to Virginic and for which the case number is less than 100) are assigned a different frame around the selected icons.
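Case selection conditions of this kind can be expressed as simple predicates. In the sketch below, the record layout, the function name, and the third iris label are hypothetical; the two conditions mirror Subset 1 and Subset 2 as described above.

```python
def mark_subsets(cases, subsets):
    """Return {case number: subset name} using the first matching condition."""
    marks = {}
    for case in cases:
        for name, condition in subsets.items():
            if condition(case):
                marks[case["case"]] = name
                break
    return marks

cases = [
    {"case": 1, "iristype": "Setosa"},
    {"case": 60, "iristype": "Virginic"},
    {"case": 75, "iristype": "Versicol"},   # hypothetical third label
    {"case": 120, "iristype": "Setosa"},
]
subsets = {
    "Subset 1": lambda c: c["iristype"] == "Setosa" and c["case"] < 100,
    "Subset 2": lambda c: c["iristype"] == "Virginic" and c["case"] < 100,
}
marks = mark_subsets(cases, subsets)
```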
Data Reduction
Sometimes plotting an extremely large data set can obscure an existing pattern (see the animation below). When you have a very large data file, it can be useful to plot only a subset of the data, so that the pattern is not hidden by the number of point markers.
Some software products offer methods for data reduction (or optimizing) which can be useful in these instances. Ideally, a data reduction option will allow you to specify an integer value n less than the number of cases in the data file. Then the software will randomly select approximately n cases from the available cases and create the plot based on these cases only.
Note that such data set (or sample size) reduction methods effectively draw a random sample from the current data set. Obviously, the nature of such data reduction is entirely different than when data are selectively reduced only to a specific subset or split into subgroups based on certain criteria (e.g., gender, region, or cholesterol level). The latter methods can be implemented interactively (e.g., using animated brushing facilities) or with other techniques (e.g., categorized graphs or case selection conditions). All these methods can further aid in identifying patterns in large data sets.
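A random data reduction of the kind described above can be sketched by keeping each case independently with probability n / N, which yields approximately (not exactly) n cases. The function name and the seed parameter are hypothetical; this is one possible way to do it, not a description of any particular product.

```python
import random

def reduce_cases(cases, n, seed=None):
    """Randomly keep approximately n of the available cases, in original order."""
    total = len(cases)
    if not 0 < n < total:
        raise ValueError("n must be a positive integer below the number of cases")
    rng = random.Random(seed)   # a fixed seed makes the selection repeatable
    p = n / total
    return [case for case in cases if rng.random() < p]

cases = list(range(10_000))   # stand-in for the rows of a large data file
subset = reduce_cases(cases, 500, seed=1)
```

If exactly n cases are required, random.sample(cases, n) from the standard library can be used instead.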
Data Rotation (in 3D space)
Changing the viewpoint for 3D scatterplots (e.g., simple, spectral, or space plots) may prove to be an effective exploratory technique since it can reveal patterns that are easily obscured unless you look at the “cloud” of data points from an appropriate angle (see the animation below).
Some software products offer interactive perspective, rotation, and continuous spinning controls which can be useful in these instances. Ideally, these controls will allow you to adjust the graph’s angle and perspective to find the most informative location of the “viewpoint” for the graph as well as allowing you to control the vertical and horizontal rotation of the graph.
While these facilities are useful for initial exploratory data analysis, they can also be quite beneficial in exploring the factorial space (see Factor Analysis) and exploring the dimensional space (see Multidimensional Scaling).
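Changing the viewpoint amounts to rotating the data points before projecting them onto the screen. The sketch below applies a horizontal rotation (about the vertical z axis) followed by a vertical rotation (about the x axis) and then an orthographic projection. The axis conventions and function names are assumptions, and no perspective is applied.

```python
import math

def rotate_point(point, horizontal_deg, vertical_deg):
    """Rotate (x, y, z) about the z axis, then about the x axis."""
    x, y, z = point
    a = math.radians(horizontal_deg)
    x, y = x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a)
    b = math.radians(vertical_deg)
    y, z = y * math.cos(b) - z * math.sin(b), y * math.sin(b) + z * math.cos(b)
    return (x, y, z)

def project(point):
    """Orthographic projection onto the screen plane (the depth y is dropped)."""
    x, _, z = point
    return (x, z)

cloud = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]   # hypothetical points
view = [project(rotate_point(p, horizontal_deg=30, vertical_deg=20)) for p in cloud]
```

Redrawing the projected points while stepping the two angles produces the spinning effect described above.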
General Regression Models (GRM)
This topic describes the use of the general linear model for finding the “best” linear model from a number of possible models. If you are unfamiliar with the basic methods of ANOVA and regression in linear models, it may be useful to first review the basic information on these topics in Elementary Concepts. A detailed discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA topic; a discussion of multiple regression methods is also provided in the Multiple Regression topic. Discussion of the ways in which the linear regression model is extended by the general linear model can be found in the General Linear Models topic.
Basic Ideas: The Need for Simple Models
A good theory is the end result of a winnowing process. We start with a comprehensive model that includes all conceivable, testable influences on the phenomena under investigation. Then we test the components of the initial comprehensive model, to identify the less comprehensive submodels that adequately account for the phenomena under investigation. Finally, from these candidate submodels, we single out the simplest submodel, which by the principle of parsimony we take to be the “best” explanation for the phenomena under investigation.
We prefer simple models not just for philosophical but also for practical reasons. Simple models are easier to put to test again in replication and cross-validation studies. Simple models are less costly to put into practice in predicting and controlling the outcome in the future. The philosophical reasons for preferring simple models should not be downplayed, however. Simpler models are easier to understand and appreciate, and therefore have a “beauty” that their more complicated counterparts often lack.
The entire winnowing process described above is encapsulated in the model-building techniques of stepwise and best-subset regression. The use of these model-building techniques begins with the specification of the design for a comprehensive “whole model.” Less comprehensive submodels are then tested to determine if they adequately account for the outcome under investigation. Finally, the simplest of the adequate submodels is adopted as the “best.”
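The winnowing idea can be sketched as a tiny best-subset search: fit the whole model, then look for the smallest subset of predictors whose fit is nearly as good. The adequacy rule used here (R-squared within a fixed tolerance of the whole model) is an assumption chosen for brevity; actual stepwise and best-subset procedures typically rely on significance tests or other criteria. The data are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def r_squared(columns, y):
    """R^2 of a least squares fit of y on an intercept plus the given columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    b = solve(xtx, xty)   # normal equations (X'X) b = X'y
    fitted = [sum(bi * ri for bi, ri in zip(b, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def simplest_adequate(predictors, y, tolerance=0.01):
    """Smallest subset whose R^2 is within `tolerance` of the whole model."""
    names = sorted(predictors)
    whole = r_squared([predictors[n] for n in names], y)
    for size in range(len(names) + 1):
        fits = [(r_squared([predictors[n] for n in sub], y), sub)
                for sub in combinations(names, size)]
        best_r2, best = max(fits)
        if whole - best_r2 <= tolerance:
            return best, best_r2

predictors = {
    "P": [1, 2, 3, 4, 5, 6, 7, 8],
    "Q": [2, 1, 4, 3, 6, 5, 8, 7],
    "R": [5, 3, 8, 1, 7, 2, 6, 4],
}
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
y = [3 + 2 * p + e for p, e in zip(predictors["P"], noise)]
best, best_r2 = simplest_adequate(predictors, y)
```

Because the hypothetical outcome depends only on P, the search settles on the one-predictor submodel.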
Model Building in GSR
Unlike the multiple regression model, which is used to analyze designs with continuous predictor variables, the general linear model can be used to analyze any ANOVA design with categorical predictor variables, any ANCOVA design with both categorical and continuous predictor variables, as well as any regression design with continuous predictor variables. Effects for categorical predictor variables can be coded in the design matrix X using either the overparameterized model or the sigma-restricted model.
Only the sigma-restricted parameterization can be used for model-building. True to its description as general, the general linear model can be used to analyze designs with effects for categorical predictor variables which are coded using either parameterization method. In many uses of the general linear model, it is arbitrary whether categorical predictors are coded using the sigma-restricted or the overparameterized coding. When one desires to build models, however, the use of the overparameterized model is unsatisfactory; lower-order effects for categorical predictor variables are redundant with higher-order containing interactions, and therefore cannot be fairly evaluated for inclusion in the model when higher-order containing interactions are already in the model.
This problem does not occur when categorical predictors are coded using the sigma-restricted parameterization, so only the sigma-restricted parameterization is necessary in general stepwise regression.
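To illustrate the two coding schemes for a single categorical predictor with three levels, the sketch below builds one design-matrix row per level under each parameterization. The level names are hypothetical, and assigning -1 to the last level in the sigma-restricted coding is a convention assumed here.

```python
def overparameterized(levels, value):
    """One indicator column per level of the categorical predictor."""
    return [1 if value == level else 0 for level in levels]

def sigma_restricted(levels, value):
    """k - 1 contrast columns; the last level is coded -1 in every column."""
    if value == levels[-1]:
        return [-1] * (len(levels) - 1)
    return [1 if value == level else 0 for level in levels[:-1]]

levels = ["A1", "A2", "A3"]
# Each row starts with the intercept column X0 = 1
over_rows = [[1] + overparameterized(levels, v) for v in levels]
sigma_rows = [[1] + sigma_restricted(levels, v) for v in levels]
```

In the overparameterized rows the three indicator columns always add up to the intercept column, which is one simple form of the redundancy discussed above; the sigma-restricted rows contain no such redundant column.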
Designs which cannot be represented using the sigma-restricted parameterization. The sigma-restricted parameterization can be used to represent most, but not all types of designs. Specifically, the designs which cannot be represented using the sigma-restricted parameterization are designs with nested effects, such as nested ANOVA and separate slope, and random effects. Any other type of ANOVA, ANCOVA, or regression design can be represented using the sigma-restricted parameterization, and can therefore be analyzed with general stepwise regression.
Model building for designs with multiple dependent variables. Stepwise and best-subset model-building techniques are well-developed for regression designs with a single dependent variable (e.g., see Cooley and Lohnes, 1971; Darlington, 1990; Hocking Lindeman, Merenda, and Gold, 1980; Morrison, 1967; Neter, Wasserman, and Kutner, 1985; Pedhazur, 1973; Stevens, 1986; Younger, 1985). Using the sigma-restricted parameterization and general linear model methods, these model-building techniques can be readily applied to any ANOVA design with categorical predictor variables, any ANCOVA design with both categorical and continuous predictor variables, as well as any regression design with continuous predictor variables. Building models for designs with multiple dependent variables, however, involves considerations that are not typically addressed by the general linear model. Model-building techniques for designs with multiple dependent variables are available with Structural Equation Modeling.
Types of Analyses
A wide variety of types of designs can be represented using the sigma-restricted coding of the design matrix X, and any such design can be analyzed using the general linear model. The following topics describe these different types of designs and how they differ. Some general ways in which designs might differ can be suggested, but keep in mind that any particular design can be a “hybrid” in the sense that it could have combinations of features of a number of different types of designs.
Between-subject designs
- Overview
- Simple regression
- Multiple regression
- Factorial regression
- Polynomial regression
- Response surface regression
- Mixture surface regression
- One-way ANOVA
- Main effect ANOVA
- Factorial ANOVA
- Analysis of covariance (ANCOVA)
- Homogeneity of slopes
Overview. The levels or values of the predictor variables in an analysis describe the differences between the n subjects or the n valid cases that are analyzed. Thus, when we speak of the between-subject design (or simply the between design) for an analysis, we are referring to the nature, number, and arrangement of the predictor variables.
Concerning the nature or type of predictor variables, between designs which contain only categorical predictor variables can be called ANOVA (analysis of variance) designs, between designs which contain only continuous predictor variables can be called regression designs, and between designs which contain both categorical and continuous predictor variables can be called ANCOVA (analysis of covariance) designs.
Between designs may involve only a single predictor variable and therefore be described as simple (e.g., simple regression) or may employ numerous predictor variables (e.g., multiple regression).
Concerning the arrangement of predictor variables, some between designs employ only “main effect” or first-order terms for predictors, that is, the values for different predictor variables are independent and raised only to the first power. Other between designs may employ higher-order terms for predictors by raising the values for the original predictor variables to a power greater than 1 (e.g., in polynomial regression designs), or by forming products of different predictor variables (i.e., interaction terms). A common arrangement for ANOVA designs is the full-factorial design, in which every combination of levels for each of the categorical predictor variables is represented in the design. Designs with some but not all combinations of levels for each of the categorical predictor variables are aptly called fractional factorial designs.
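A full-factorial arrangement can be enumerated directly as the Cartesian product of the factor levels. The factor names and levels below are hypothetical.

```python
from itertools import product

# Hypothetical categorical predictors and their levels
factors = {
    "Gender": ["male", "female"],
    "Dose": ["low", "medium", "high"],
}

# Every combination of levels appears exactly once in a full-factorial design
full_factorial = [dict(zip(factors, combo)) for combo in product(*factors.values())]

# Dropping some combinations leaves a design with some but not all cells
partial_design = [cell for cell in full_factorial if cell["Dose"] != "medium"]
```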
These basic distinctions about the nature, number, and arrangement of predictor variables can be used in describing a variety of different types of between designs. Some of the more common between designs can now be described.
Simple Regression. Simple regression designs involve a single continuous predictor variable. If there were 3 cases with values on a predictor variable P of, say, 7, 4, and 9, and the design is for the first-order effect of P, the X matrix (with X_{0} as the intercept column) would be
Case   X_{0}   X_{1}
1      1       7
2      1       4
3      1       9
and using P for X_{1} the regression equation would be
Y = b_{0} + b_{1}P
If the simple regression design is for a higher-order effect of P, say the quadratic effect, the values in the X_{1} column of the design matrix would be raised to the 2nd power, that is, squared
Case   X_{0}   X_{1}
1      1       49
2      1       16
3      1       81
and using P^{2} for X_{1} the regression equation would be
Y = b_{0} + b_{1}P^{2}
In regression designs, values on the continuous predictor variables are raised to the desired power and used as the values for the X variables. No recoding is performed. It is therefore sufficient, in describing regression designs, to simply describe the regression equation without explicitly describing the design matrix X.
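As a minimal sketch of the rule just stated, the following Python snippet builds the design matrix for the example values 7, 4, and 9. The function name `regression_design_matrix` is a hypothetical helper introduced here for illustration only.

```python
def regression_design_matrix(values, power=1):
    """Return one row [X0, X1] per case: X0 = 1 (intercept), X1 = value ** power."""
    return [[1, v ** power] for v in values]

p_values = [7, 4, 9]                                      # example scores on P
first_order = regression_design_matrix(p_values)          # first-order effect of P
quadratic = regression_design_matrix(p_values, power=2)   # quadratic effect of P

print(first_order)
print(quadratic)
```

No recoding step appears anywhere: the predictor values are used directly, raised to the requested power.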
Multiple Regression. Multiple regression designs are to continuous predictor variables as main effect ANOVA designs are to categorical predictor variables, that is, multiple regression designs contain the separate simple regression designs for 2 or more continuous predictor variables. The regression equation for a multiple regression design for the first-order effects of 3 continuous predictor variables P, Q, and R would be
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}R
A discussion of multiple regression methods is also provided in the Multiple Regression topic.
Factorial Regression. Factorial regression designs are similar to factorial ANOVA designs, in which combinations of the levels of the factors are represented in the design. In factorial regression designs, however, there may be many more such possible combinations of distinct levels for the continuous predictor variables than there are cases in the data set. To simplify matters, full-factorial regression designs are defined as designs in which all possible products of the continuous predictor variables are represented in the design. For example, the full-factorial regression design for two continuous predictor variables P and Q would include the main effects (i.e., the first-order effects) of P and Q and their 2-way P by Q interaction effect, which is represented by the product of P and Q scores for each case. The regression equation would be
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}P*Q
Factorial regression designs can also be fractional, that is, higher-order effects can be omitted from the design. A fractional factorial design to degree 2 for 3 continuous predictor variables P, Q, and R would include the main effects and all 2-way interactions between the predictor variables
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}R + b_{4}P*Q + b_{5}P*R + b_{6}Q*R
Polynomial Regression. Polynomial regression designs are designs which contain main effects and higher-order effects for the continuous predictor variables but do not include interaction effects between predictor variables. For example, the polynomial regression design to degree 2 for three continuous predictor variables P, Q, and R would include the main effects (i.e., the first-order effects) of P, Q, and R and their quadratic (i.e., second-order) effects, but not the 2-way interaction effects or the P by Q by R 3-way interaction effect.
Y = b_{0} + b_{1}P + b_{2}P^{2} + b_{3}Q + b_{4}Q^{2} + b_{5}R + b_{6}R^{2}
Polynomial regression designs do not have to contain all effects up to the same degree for every predictor variable. For example, main, quadratic, and cubic effects could be included in the design for some predictor variables, and effects up to the fourth degree could be included in the design for other predictor variables.
Response Surface Regression. Quadratic response surface regression designs are a hybrid type of design with characteristics of both polynomial regression designs and fractional factorial regression designs. Quadratic response surface regression designs contain all the same effects of polynomial regression designs to degree 2 and additionally the 2-way interaction effects of the predictor variables. The regression equation for a quadratic response surface regression design for 3 continuous predictor variables P, Q, and R would be
Y = b_{0} + b_{1}P + b_{2}P^{2} + b_{3}Q + b_{4}Q^{2} + b_{5}R + b_{6}R^{2} + b_{7}P*Q + b_{8}P*R + b_{9}Q*R
These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the Experimental Design topic (see Central composite designs).
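The factorial, polynomial, and response surface regression designs described above differ only in which terms they include. The sketch below lists the terms of each design to degree 2 for any set of predictor names; the function `design_terms` and its `kind` labels are hypothetical names chosen for this illustration, and the terms are grouped by type rather than in the order used in the equations above.

```python
from itertools import combinations

def design_terms(names, kind):
    """List the terms (excluding the intercept) of a degree-2 regression design."""
    main = list(names)                                   # first-order effects
    squares = [f"{n}^2" for n in names]                  # quadratic effects
    two_way = [f"{a}*{b}" for a, b in combinations(names, 2)]  # 2-way interactions
    if kind == "factorial":
        return main + two_way
    if kind == "polynomial":
        return main + squares
    if kind == "response_surface":
        return main + squares + two_way
    raise ValueError(f"unknown design kind: {kind}")

for kind in ("factorial", "polynomial", "response_surface"):
    print(kind, design_terms(["P", "Q", "R"], kind))
```

For three predictors, the response surface design has 9 terms plus the intercept, matching the coefficients b_{1} through b_{9} in its regression equation.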
Mixture Surface Regression. Mixture surface regression designs are identical to factorial regression designs to degree 2 except for the omission of the intercept. Mixtures, as the name implies, add up to a constant value; the proportions of ingredients in different recipes for some material must all add up to 100%. Thus, the proportion of one ingredient in a material is redundant with the remaining ingredients. Mixture surface regression designs deal with this redundancy by omitting the intercept from the design. The regression equation for a mixture surface regression design for 3 continuous predictor variables P, Q, and R would be
Y = b_{1}P + b_{2}Q + b_{3}R + b_{4}P*Q + b_{5}P*R + b_{6}Q*R
These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the Experimental Design topic (see Mixture designs and triangular surfaces).
One-Way ANOVA. A design with a single categorical predictor variable is called a one-way ANOVA design. For example, a study of 4 different fertilizers used on different individual plants could be analyzed via one-way ANOVA, with four levels for the factor Fertilizer.
Consider a single categorical predictor variable A with 1 case in each of its 3 categories. Using the sigma-restricted coding of A into 2 quantitative contrast variables, the matrix X defining the between design is
Group   X_{0}   X_{1}   X_{2}
A_{1}   1       1       0
A_{2}   1       0       1
A_{3}   1       -1      -1
That is, cases in groups A_{1}, A_{2}, and A_{3} are all assigned values of 1 on X_{0} (the intercept), the case in group A_{1} is assigned a value of 1 on X_{1} and a value of 0 on X_{2}, the case in group A_{2} is assigned a value of 0 on X_{1} and a value of 1 on X_{2}, and the case in group A_{3} is assigned a value of -1 on X_{1} and a value of -1 on X_{2}. Of course, any additional cases in any of the 3 groups would be coded similarly. If there were 1 case in group A_{1}, 2 cases in group A_{2}, and 1 case in group A_{3}, the X matrix would be
Case     X_{0}   X_{1}   X_{2}
A_{11}   1       1       0
A_{12}   1       0       1
A_{22}   1       0       1
A_{13}   1       -1      -1
where the first subscript for A gives the replicate number for the cases in each group. For brevity, replicates usually are not shown when describing ANOVA design matrices.
Note that in one-way designs with an equal number of cases in each group, sigma-restricted coding yields X_{1} … X_{k} variables all of which have means of 0.
These simple examples show that the X matrix actually serves two purposes. It specifies (1) the coding for the levels of the original predictor variables on the X variables used in the analysis as well as (2) the nature, number, and arrangement of the X variables, that is, the between design.
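The sigma-restricted coding described for the one-way design can be sketched in a few lines of Python. The function name `sigma_restricted_row` is a hypothetical helper, and group levels are assumed to be numbered 1 through the number of groups.

```python
def sigma_restricted_row(level, n_levels):
    """Code one case: intercept X0 = 1, then n_levels - 1 contrast variables.

    Cases in the last group get -1 on every contrast; cases in group j get
    1 on contrast j and 0 on the others.
    """
    if not 1 <= level <= n_levels:
        raise ValueError("level must be between 1 and n_levels")
    if level == n_levels:
        contrasts = [-1] * (n_levels - 1)
    else:
        contrasts = [1 if j == level else 0 for j in range(1, n_levels)]
    return [1] + contrasts

# 1 case in group 1, 2 cases in group 2, 1 case in group 3
unbalanced = [sigma_restricted_row(g, 3) for g in [1, 2, 2, 3]]
# 1 case in each group: every contrast column sums (and averages) to 0
balanced = [sigma_restricted_row(g, 3) for g in [1, 2, 3]]
column_sums = [sum(row[j] for row in balanced) for j in (1, 2)]

print(unbalanced)
print(column_sums)
```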
Main Effect ANOVA. Main effect ANOVA designs contain separate one-way ANOVA designs for 2 or more categorical predictors. A good example of main effect ANOVA would be the typical analysis performed on screening designs as described in the context of the Experimental Design chapter.
Consider 2 categorical predictor variables A and B each with 2 categories. Using the sigma-restricted coding, the X matrix defining the between design is
Group        X_{0}   X_{1}   X_{2}
A_{1}B_{1}   1       1       1
A_{1}B_{2}   1       1       -1
A_{2}B_{1}   1       -1      1
A_{2}B_{2}   1       -1      -1
Note that if there are equal numbers of cases in each group, the sum of the cross-products of values for the X_{1} and X_{2} columns is 0, for example, with 1 case in each group (1*1)+(1*-1)+(-1*1)+(-1*-1)=0.
Factorial ANOVA. Factorial ANOVA designs contain X variables representing combinations of the levels of 2 or more categorical predictors (e.g., a study of boys and girls in four age groups, resulting in a 2 (Gender) x 4 (Age Group) design). In particular, full-factorial designs represent all possible combinations of the levels of the categorical predictors. A full-factorial design with 2 categorical predictor variables A and B each with 2 levels would be called a 2 x 2 full-factorial design. Using the sigma-restricted coding, the X matrix for this design would be
Group        X_{0}   X_{1}   X_{2}   X_{3}
A_{1}B_{1}   1       1       1       1
A_{1}B_{2}   1       1       -1      -1
A_{2}B_{1}   1       -1      1       -1
A_{2}B_{2}   1       -1      -1      1
Several features of this X matrix deserve comment. Note that the X_{1} and X_{2} columns represent main effect contrasts for one variable (i.e., A and B, respectively), collapsing across the levels of the other variable. The X_{3} column instead represents a contrast between different combinations of the levels of A and B. Note also that the values for X_{3} are products of the corresponding values for X_{1} and X_{2}. Product variables such as X_{3} represent the multiplicative or interaction effects of their factors, so X_{3} would be said to represent the 2-way interaction of A and B. The relationship of such product variables to the dependent variables indicates the interactive influences of the factors on responses above and beyond their independent (i.e., main effect) influences on responses. Thus, factorial designs provide more information about the relationships between categorical predictor variables and responses on the dependent variables than is provided by corresponding one-way or main effect designs.
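The product construction of the interaction column can be illustrated with a short sketch for the 2 x 2 design, assuming 1 case per cell and the sigma-restricted coding 1 / -1 for the two levels of each factor. The helper name `code_two_level` is hypothetical.

```python
from itertools import product

def code_two_level(level):
    """Sigma-restricted contrast for a 2-level factor: level 1 -> 1, level 2 -> -1."""
    return 1 if level == 1 else -1

rows = []
for a, b in product([1, 2], repeat=2):        # cells A1B1, A1B2, A2B1, A2B2
    x1, x2 = code_two_level(a), code_two_level(b)
    rows.append([1, x1, x2, x1 * x2])         # X3 is the product of X1 and X2

# With equal cell sizes the main effect columns are orthogonal
cross_product_x1_x2 = sum(r[1] * r[2] for r in rows)

for r in rows:
    print(r)
print(cross_product_x1_x2)
```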
When many factors are being investigated, however, full-factorial designs sometimes require more data than reasonably can be collected to represent all possible combinations of levels of the factors, and high-order interactions between many factors can become difficult to interpret. With many factors, a useful alternative to the full-factorial design is the fractional factorial design. As an example, consider a 2 x 2 x 2 fractional factorial design to degree 2 with 3 categorical predictor variables each with 2 levels. The design would include the main effects for each variable and all 2-way interactions between the three variables, but would not include the 3-way interaction among all three variables. These types of designs are discussed in detail in the 2**(k-p) Fractional Factorial Designs section of the Experimental Design topic.
Analysis of Covariance. In general, between designs which contain both categorical and continuous predictor variables can be called ANCOVA designs. Traditionally, however, ANCOVA designs have referred more specifically to designs in which the first-order effects of one or more continuous predictor variables are taken into account when assessing the effects of one or more categorical predictor variables. A basic introduction to analysis of covariance can also be found in the Analysis of covariance (ANCOVA) topic of the ANOVA/MANOVA chapter.
To illustrate, suppose a researcher wants to assess the influences of a categorical predictor variable A with 3 levels on some outcome, and that measurements on a continuous predictor variable P, known to covary with the outcome, are available. If the data for the analysis are
then the sigma-restricted X matrix for the design that includes the separate first-order effects of P and A would be
The b_{2} and b_{3} coefficients in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}
represent the influences of group membership on the A categorical predictor variable, controlling for the influence of scores on the P continuous predictor variable. Similarly, the b_{1} coefficient represents the influence of scores on P controlling for the influences of group membership on A. This traditional ANCOVA analysis gives a more sensitive test of the influence of A to the extent that P reduces the prediction error, that is, the residuals for the outcome variable.
Homogeneity of Slopes. The appropriate design for modeling the influences of continuous and categorical predictor variables depends on whether the continuous and categorical predictors interact in influencing the outcome. The traditional analysis of covariance (ANCOVA) design for continuous and categorical predictor variables is appropriate when the continuous and categorical predictors do not interact in influencing responses on the outcome. The homogeneity of slopes designs can be used to test whether the continuous and categorical predictors interact in influencing responses. For the same example data used to illustrate the traditional ANCOVA design, the sigma-restricted X matrix for the homogeneity of slopes design would be
Using this design matrix X, if the b_{4} and b_{5} coefficients in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5}
are zero, the simpler traditional ANCOVA design should be used.
Multivariate Designs
When there are multiple dependent variables in a design, the design is said to be multivariate. Multivariate measures of association are by nature more complex than their univariate counterparts (such as the correlation coefficient, for example). This is because multivariate measures of association must take into account not only the relationships of the predictor variables with responses on the dependent variables, but also the relationships among the multiple dependent variables. By doing so, however, these measures of association provide information about the strength of the relationships between predictor and dependent variables independent of the dependent variables’ interrelationships. A basic discussion of multivariate designs is also presented in the Multivariate Designs section in the ANOVA/MANOVA topic.
The most commonly used multivariate measures of association all can be expressed as functions of the eigenvalues of the product matrix
E^{-1}H
where E is the error SSCP matrix (i.e., the matrix of sums of squares and cross-products for the dependent variables that are not accounted for by the predictors in the between design), and H is a hypothesis SSCP matrix (i.e., the matrix of sums of squares and cross-products for the dependent variables that are accounted for by all the predictors in the between design, or the sums of squares and cross-products for the dependent variables that are accounted for by a particular effect). If
λ_{i} = the ordered eigenvalues of E^{-1}H, if E^{-1} exists
then the 4 commonly used multivariate measures of association are
Wilks’ lambda = Π[1/(1+λ_{i})]
Pillai’s trace = Σ[λ_{i}/(1+λ_{i})]
Hotelling-Lawley trace = Σλ_{i}
Roy’s largest root = λ_{1}
These 4 measures have different upper and lower bounds, with Wilks’ lambda perhaps being the most easily interpretable of the four measures. Wilks’ lambda can range from 0 to 1, with 1 indicating no relationship of predictors to responses and 0 indicating a perfect relationship of predictors to responses. 1 – Wilks’ lambda can be interpreted as the multivariate counterpart of a univariate R-squared, that is, it indicates the proportion of generalized variance in the dependent variables that is accounted for by the predictors.
The 4 measures of association are also used to construct multivariate tests of significance. These multivariate tests are covered in detail in a number of sources (e.g., Finn, 1974; Tatsuoka, 1971).
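Given the eigenvalues of E^{-1}H, the four measures of association follow directly from their definitions. The sketch below assumes the eigenvalues have already been computed elsewhere; the function name and the example eigenvalues 1.0 and 0.25 are hypothetical and chosen only to keep the arithmetic easy to check.

```python
from math import prod

def multivariate_measures(eigenvalues):
    """Compute the four measures from a non-empty list of eigenvalues of E^-1 H."""
    lam = sorted(eigenvalues, reverse=True)   # order from largest to smallest
    return {
        "wilks_lambda": prod(1 / (1 + l) for l in lam),
        "pillai_trace": sum(l / (1 + l) for l in lam),
        "hotelling_lawley_trace": sum(lam),
        "roy_largest_root": lam[0],
    }

measures = multivariate_measures([0.25, 1.0])   # illustrative eigenvalues only
for name, value in measures.items():
    print(name, round(value, 4))
```

With these eigenvalues, 1 minus Wilks' lambda is 0.6, which would be read as 60% of the generalized variance in the dependent variables being accounted for by the predictors.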
Building the Whole Model
The following sections discuss details for building and testing hypotheses about the “whole model”, for example, how sums of squares are partitioned and how the overall fit for the whole model is tested.
Partitioning Sums of Squares
A fundamental principle of least squares methods is that variation on a dependent variable can be partitioned, or divided into parts, according to the sources of the variation. Suppose that a dependent variable is regressed on one or more predictor variables, and that for convenience the dependent variable is scaled so that its mean is 0. Then a basic least squares identity is that the total sum of squared values on the dependent variable equals the sum of squared predicted values plus the sum of squared residual values. Stated more generally,
Σ(y – y-bar)^{2} = Σ(y-hat – y-bar)^{2} + Σ(y – y-hat)^{2}
where the term on the left is the total sum of squared deviations of the observed values on the dependent variable from the dependent variable mean, and the respective terms on the right are (1) the sum of squared deviations of the predicted values for the dependent variable from the dependent variable mean and (2) the sum of the squared deviations of the observed values on the dependent variable from the predicted values, that is, the sum of the squared residuals. Stated yet another way,
Total SS = Model SS + Error SS
Note that the Total SS is always the same for any particular data set, but that the Model SS and the Error SS depend on the regression equation. Assuming again that the dependent variable is scaled so that its mean is 0, the Model SS and the Error SS can be computed using
Model SS = b’X’Y
Error SS = Y’Y – b’X’Y
Testing the Whole Model
Given the Model SS and the Error SS, one can perform a test that all the regression coefficients for the X variables (b_{1} through b_{k}, excluding the b_{0} coefficient for the intercept) are zero. This test is equivalent to a comparison of the fit of the regression surface defined by the predicted values (computed from the whole model regression equation) to the fit of the regression surface defined solely by the dependent variable mean (computed from the reduced regression equation containing only the intercept). Assuming that X’X is full-rank, the whole model hypothesis mean square
MSH = (Model SS)/k
where k is the number of columns of X (excluding the intercept column), is an estimate of the variance of the predicted values. The error mean square
s^{2} = MSE = (Error SS)/(n-k-1)
where n is the number of observations, is an unbiased estimate of the residual or error variance. The test statistic is
F = MSH/MSE
where F has (k, n – k – 1) degrees of freedom.
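The partition of sums of squares and the whole model F statistic can be checked numerically. The following sketch fits a simple regression (k = 1) by least squares on a small made-up data set; the function name `whole_model_test` and the data are assumptions for illustration, not part of the original text.

```python
def whole_model_test(x, y):
    """Fit y = b0 + b1*x by least squares and return the SS partition and F."""
    n, k = len(y), 1                          # k = number of non-intercept columns
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    y_hat = [b0 + b1 * xi for xi in x]
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    model_ss = sum((yh - mean_y) ** 2 for yh in y_hat)
    error_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    msh = model_ss / k                        # hypothesis mean square
    mse = error_ss / (n - k - 1)              # error mean square
    return {"total_ss": total_ss, "model_ss": model_ss, "error_ss": error_ss,
            "F": msh / mse, "df": (k, n - k - 1)}

result = whole_model_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(result)
```

For these data the Total SS of 6 splits into a Model SS of 3.6 and an Error SS of 2.4, giving F = 4.5 on (1, 3) degrees of freedom, up to floating-point rounding.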
If X’X is not full rank, r – 1 is substituted for k, where r is the rank of X’X, that is, the number of non-redundant columns of X (counting the intercept column).
If the whole model test is not significant the analysis is complete; the whole model is concluded to fit the data no better than the reduced model using the dependent variable mean alone. It is futile to seek a submodel which adequately fits the data when the whole model is inadequate.
Note that in the case of non-intercept models, some multiple regression programs will only compute the full model test based on the proportion of variance around 0 (zero) accounted for by the predictors; for more information, see Kvålseth (1985) and Okunade, Chang, and Evans (1993). Other programs will actually compute both values (i.e., based on the residual variance around 0, and around the respective dependent variable means).
Limitations of Whole Models
For designs such as one-way ANOVA or simple regression designs, the whole model test by itself may be sufficient for testing general hypotheses about whether or not the single predictor variable is related to the outcome. In complex designs, however, finding a statistically significant test of whole model fit is often just the first step in the analysis; one then seeks to identify simpler submodels that fit the data equally well (see the section on Basic ideas: The need for simple models). It is to this task, the search for submodels that fit the data well, that stepwise and best-subset regression are devoted.
Building Models via Stepwise Regression
Stepwise model-building techniques for regression designs with a single dependent variable are described in numerous sources (e.g., see Darlington, 1990; Hocking, 1966; Lindeman, Merenda, and Gold, 1980; Morrison, 1967; Neter, Wasserman, and Kutner, 1985; Pedhazur, 1973; Stevens, 1986; Younger, 1985). The basic procedures involve (1) identifying an initial model, (2) iteratively “stepping,” that is, repeatedly altering the model at the previous step by adding or removing a predictor variable in accordance with the “stepping criteria,” and (3) terminating the search when stepping is no longer possible given the stepping criteria, or when a specified maximum number of steps has been reached. The following topics provide details on the use of stepwise model-building procedures.
The Initial Model in Stepwise Regression. The initial model is designated the model at Step 0. The initial model always includes the regression intercept (unless the No intercept option has been specified). For the backward stepwise and backward removal methods, the initial model also includes all effects specified to be included in the design for the analysis. The initial model for these methods is therefore the whole model.
For the forward stepwise and forward entry methods, the initial model always includes the regression intercept (unless the No intercept option has been specified). The initial model may also include 1 or more effects specified to be forced into the model. If j is the number of effects specified to be forced into the model, the first j effects specified to be included in the design are entered into the model at Step 0. Any such effects are not eligible to be removed from the model during subsequent Steps. Effects may also be specified to be forced into the model when the backward stepwise and backward removal methods are used. As in the forward stepwise and forward entry methods, any such effects are not eligible to be removed from the model during subsequent Steps.
The Forward Entry Method. The forward entry method is a simple model-building procedure. At each Step after Step 0, the entry statistic is computed for each effect eligible for entry in the model. If no effect has a value on the entry statistic which exceeds the specified critical value for model entry, then stepping is terminated; otherwise, the effect with the largest value on the entry statistic is entered into the model. Stepping is also terminated if the maximum number of steps is reached.
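The forward entry loop can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used by any particular program: each effect is assumed to be a single column, the entry statistic is assumed to be the partial F for adding that column, and the names `error_ss`, `forward_entry`, and `f_to_enter`, as well as the data, are hypothetical.

```python
def error_ss(columns, y):
    """Error SS from a least squares fit of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for c in range(p):                        # Gauss-Jordan with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(bj * xj for bj, xj in zip(b, X[i]))) ** 2
               for i in range(n))

def forward_entry(effects, y, f_to_enter, max_steps):
    """Enter the effect with the largest partial F until none exceeds f_to_enter."""
    entered, n = [], len(y)
    for _ in range(max_steps):
        current_ss = error_ss([effects[e] for e in entered], y)
        best_name, best_f = None, f_to_enter
        for name in effects:
            if name in entered:
                continue
            trial = entered + [name]
            trial_ss = error_ss([effects[e] for e in trial], y)
            f_stat = (current_ss - trial_ss) / (trial_ss / (n - len(trial) - 1))
            if f_stat > best_f:
                best_name, best_f = name, f_stat
        if best_name is None:                 # no eligible effect exceeds the criterion
            break
        entered.append(best_name)
    return entered

effects = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [2, 1, 2, 1, 2, 1, 2, 1]}
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]
selected = forward_entry(effects, y, f_to_enter=4.0, max_steps=5)
print(selected)
```

In this made-up example only x1 is entered: once x1 is in the model, the partial F for x2 falls well below the critical value of 4.0, so stepping terminates.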
The Backward Removal Method. The backward removal method is also a simple model-building procedure. At each Step after Step 0, the removal statistic is computed for each effect eligible to be removed from the model. If no effect has a value on the removal statistic which is less than the critical value for removal from the model, then stepping is terminated; otherwise, the effect with the smallest value on the removal statistic is removed from the model. Stepping is also terminated if the maximum number of steps is reached.
The Forward Stepwise Method. The forward stepwise method employs a combination of the procedures used in the forward entry and backward removal methods. At Step 1 the procedures for forward entry are performed. At any subsequent step where 2 or more effects have been selected for entry into the model, forward entry is performed if possible, and backward removal is performed if possible, until neither procedure can be performed and stepping is terminated. Stepping is also terminated if the maximum number of steps is reached.
The Backward Stepwise Method. The backward stepwise method employs a combination of the procedures used in the forward entry and backward removal methods. At Step 1 the procedures for backward removal are performed. At any subsequent step where 2 or more effects have been selected for entry into the model, forward entry is performed if possible, and backward removal is performed if possible, until neither procedure can be performed and stepping is terminated. Stepping is also terminated if the maximum number of steps is reached.
Entry and Removal Criteria. Either critical F values or critical p values can be specified to be used to control entry and removal of effects from the model. If p values are specified, the actual values used to control entry and removal of effects from the model are 1 minus the specified p values. The critical value for model entry must exceed the critical value for removal from the model. A maximum number of Steps can also be specified. If not previously terminated, stepping stops when the specified maximum number of Steps is reached.
Building Models via Best-Subset Regression
All-possible-subset regression can be used as an alternative to or in conjunction with stepwise methods for finding the “best” possible submodel.
Neter, Wasserman, and Kutner (1985) discuss the use of all-possible-subset regression in conjunction with stepwise regression: “A limitation of the stepwise regression search approach is that it presumes there is a single ‘best’ subset of X variables and seeks to identify it. As noted earlier, there is often no unique ‘best’ subset. Hence, some statisticians suggest that all possible regression models with a similar number of X variables as in the stepwise regression solution be fitted subsequently to study whether some other subsets of X variables might be better” (p. 435). This reasoning suggests that after finding a stepwise solution, the “best” of all the possible subsets of the same number of effects should be examined to determine if the stepwise solution is among the “best.” If not, the stepwise solution is suspect.
All-possible-subset regression can also be used as an alternative to stepwise regression. Using this approach, one first decides on the range of subset sizes that could be considered to be useful. For example, one might expect that inclusion of at least 3 effects in the model is necessary to adequately account for responses, and also might expect there is no advantage to considering models with more than 6 effects. Only the “best” of all possible subsets of 3, 4, 5, and 6 effects are then considered.
Note that several different criteria can be used for ordering subsets in terms of “goodness.” The most often used criteria are the subset multiple R-square, adjusted R-square, and Mallows’ Cp statistics. When all-possible-subset regression is used in conjunction with stepwise methods, the subset multiple R-square statistic allows direct comparisons of the “best” subsets identified using each approach.
The number of possible submodels increases very rapidly as the number of effects in the whole model increases, and as subset size approaches half of the number of effects in the whole model. The amount of computation required to perform all-possible-subset regression increases as the number of possible submodels increases, and holding all else constant, also increases very rapidly as the number of levels for effects involving categorical predictors increases, thus resulting in more columns in the design matrix X. For example, all possible subsets of up to a dozen or so effects could certainly theoretically be computed for a design that includes two dozen or so effects all of which have many levels, but the computation would be very time consuming (e.g., there are about 2.7 million different ways to select 12 predictors from 24 predictors, i.e., 2.7 million models to evaluate just for subset size 12). Simpler is generally better when using all-possible-subset regression.
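The count quoted above is a binomial coefficient and can be verified with Python's standard library. The variable names below are illustrative only.

```python
from math import comb

n_effects = 24

# Number of distinct subsets of exactly 12 effects out of 24
subsets_of_size_12 = comb(n_effects, 12)

# Number of subsets for every size from 1 through 12 combined
subsets_up_to_12 = sum(comb(n_effects, size) for size in range(1, 13))

print(subsets_of_size_12)
print(subsets_up_to_12)
```

The first value is 2,704,156, the "about 2.7 million" models for subset size 12 alone; the second shows how much larger the search becomes when all smaller subset sizes are included as well.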
Generalized Linear Models (GLZ) – Statistics
Generalized Linear Models (GLZ)
This topic describes the use of the generalized linear model for analyzing linear and non-linear effects of continuous and categorical predictor variables on a discrete or continuous dependent variable. If you are unfamiliar with the basic methods of regression in linear models, it may be useful to first review the basic information on these topics in the Elementary Concepts topic. Discussion of the ways in which the linear regression model is extended by the general linear model can be found in the General Linear Models topic.
For additional information about generalized linear models, see also Dobson (1990), Green and Silverman (1994), or McCullagh and Nelder (1989).
Basic Ideas
The Generalized Linear Model (GLZ) is a generalization of the general linear model (see, e.g., the General Linear Models, Multiple Regression, and ANOVA/MANOVA topics). In its simplest form, a linear model specifies the (linear) relationship between a dependent (or response) variable Y, and a set of predictor variables, the X’s, so that
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + … + b_{k}X_{k}
In this equation b_{0} is the regression coefficient for the intercept and the b_{i} values are the regression coefficients (for variables 1 through k) computed from the data.
So for example, we could estimate (i.e., predict) a person’s weight as a function of the person’s height and gender. You could use linear regression to estimate the respective regression coefficients from a sample of data, measuring height, weight, and observing the subjects’ gender. For many data analysis problems, estimates of the linear relationships between variables are adequate to describe the observed data, and to make reasonable predictions for new observations (see the Multiple Regression topic for additional details).
However, there are many relationships that cannot adequately be summarized by a simple linear equation, for two major reasons:
Distribution of dependent variable. First, the dependent variable of interest may have a non-continuous distribution, and thus, the predicted values should also follow the respective distribution; any other predicted values are not logically possible. For example, a researcher may be interested in predicting one of three possible discrete outcomes (e.g., a consumer’s choice of one of three alternative products). In that case, the dependent variable can only take on 3 distinct values, and the distribution of the dependent variable is said to be multinomial. Or suppose you are trying to predict people’s family planning choices, specifically, how many children families will have, as a function of income and various other socioeconomic indicators. The dependent variable – number of children – is discrete (i.e., a family may have 1, 2, or 3 children and so on, but cannot have 2.4 children), and most likely the distribution of that variable is highly skewed (i.e., most families have 1, 2, or 3 children, fewer will have 4 or 5, very few will have 6 or 7, and so on). In this case it would be reasonable to assume that the dependent variable follows a Poisson distribution.
Link function. A second reason why the linear (multiple regression) model might be inadequate to describe a particular relationship is that the effect of the predictors on the dependent variable may not be linear in nature. For example, the relationship between a person’s age and various indicators of health is most likely not linear in nature: During early adulthood, the (average) health status of people who are 30 years old as compared to the (average) health status of people who are 40 years old is not markedly different. However, the difference in health status of 60 year old people and 70 year old people is probably greater. Thus, the relationship between age and health status is likely non-linear in nature. Probably some kind of a power function would be adequate to describe the relationship between a person’s age and health, so that each increment in years of age at older ages will have greater impact on health status, as compared to each increment in years of age during early adulthood. Put in other words, the link between age and health status is best described as non-linear, or as a power relationship in this particular example.
The generalized linear model can be used to predict responses both for dependent variables with discrete distributions and for dependent variables which are nonlinearly related to the predictors.
Computational Approach
To summarize the basic ideas, the generalized linear model differs from the general linear model (of which, for example, multiple regression is a special case) in two major respects: First, the distribution of the dependent or response variable can be (explicitly) non-normal, and does not have to be continuous, i.e., it can be binomial, multinomial, or ordinal multinomial (i.e., contain information on ranks only); second, the dependent variable values are predicted from a linear combination of predictor variables, which are “connected” to the dependent variable via a link function. The general linear model for a single dependent variable can be considered a special case of the generalized linear model: In the general linear model the dependent variable values are expected to follow the normal distribution, and the link function is a simple identity function (i.e., the linear combination of values for the predictor variables is not transformed).
To illustrate, in the general linear model a response variable Y is linearly associated with values on the X variables by
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + … + b_{k}X_{k} + e
(where e stands for the error variability that cannot be accounted for by the predictors; note that the expected value of e is assumed to be 0), while the relationship in the generalized linear model is assumed to be
Y = g(b_{0} + b_{1}X_{1} + b_{2}X_{2} + … + b_{k}X_{k}) + e
where e is the error, and g(…) is a function. Formally, the inverse function of g(…), say f(…), is called the link function; so that:
f(mu_{y}) = b_{0} + b_{1}X_{1} + b_{2}X_{2} + … + b_{k}X_{k}
where mu_{y} stands for the expected value of y.
Link functions and distributions. Various link functions (see McCullagh and Nelder, 1989) can be chosen, depending on the assumed distribution of the y variable values:
Normal, Gamma, Inverse normal, and Poisson distributions:
Identity link: f(z) = z
Log link: f(z) = log(z)
Power link: f(z) = z^{a}, for a given a
Binomial and Ordinal Multinomial distributions:
Logit link: f(z) = log(z/(1-z))
Probit link: f(z) = invnorm(z), where invnorm is the inverse of the standard normal cumulative distribution function
Complementary log-log link: f(z) = log(-log(1-z))
Log-log link: f(z) = -log(-log(z))
Multinomial distribution:
Generalized logit link: f(z1|z2,…,zc) = log(z1/(1-z1-…-zc)), where the model has c+1 categories
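The binomial link functions listed above, and the inverse functions g(…) that map a linear predictor back to an expected value, can be sketched in a few lines of Python. This is only an illustration: the function names are chosen here and are not part of STATISTICA or R, and the probit link relies on NormalDist from the Python standard library.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

# Link functions f(z): map an expected value z in (0, 1) to the linear predictor.
def logit(z):
    return math.log(z / (1.0 - z))

def probit(z):
    return _STD_NORMAL.inv_cdf(z)  # "invnorm" in the text

def cloglog(z):
    return math.log(-math.log(1.0 - z))

def loglog(z):
    return -math.log(-math.log(z))

# Inverse links g(eta): map the linear predictor back to an expected value.
def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    return _STD_NORMAL.cdf(eta)

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

def inv_loglog(eta):
    return math.exp(-math.exp(-eta))
```

Applying a link and then its inverse returns the original value, which is a quick way to check that each pair is consistent.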
Estimation in the generalized linear model. The values of the parameters (b_{0} through b_{k} and the scale parameter) in the generalized linear model are obtained by maximum likelihood (ML) estimation, which requires iterative computational procedures. There are many iterative methods for ML estimation in the generalized linear model, of which the Newton-Raphson and Fisher scoring methods are among the most efficient and widely used (see Dobson, 1990). The Fisher scoring (or iteratively reweighted least squares) method in particular provides a unified algorithm for all generalized linear models, and it yields the expected variance-covariance matrix of parameter estimates as a byproduct of its computations.
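As a rough illustration of the iterative estimation described above, the following sketch applies Fisher scoring to a logit model with an intercept and one predictor. It is a minimal teaching example and not the algorithm as implemented in any particular package; the function name and the small data set are hypothetical.

```python
import math

def fit_logistic_irls(x, y, max_iter=25, tol=1e-10):
    """Fisher scoring for logit(mu) = b0 + b1*x with 0/1 responses y."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Accumulate the score vector (u0, u1) and the Fisher information
        # matrix [[i00, i01], [i01, i11]] at the current estimates.
        u0 = u1 = i00 = i01 = i11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = mu * (1.0 - mu)  # binomial variance function
            u0 += yi - mu
            u1 += (yi - mu) * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        # Scoring step: b_new = b + I^{-1} u (2 x 2 inverse written out).
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (-i01 * u0 + i00 * u1) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Inverse information from the last iteration: the expected
    # variance-covariance matrix of the estimates (a byproduct of the fit).
    cov = [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]
    return (b0, b1), cov

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
(b0, b1), cov = fit_logistic_irls(x, y)
```

At convergence the score vector is (numerically) zero, which is the defining property of the maximum likelihood estimates.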
Statistical significance testing. Tests for the significance of the effects in the model can be performed via the Wald statistic, the likelihood ratio (LR), or the score statistic. Detailed descriptions of these tests can be found in McCullagh and Nelder (1989). The Wald statistic (e.g., see Dobson, 1990), which is computed as the generalized inner product of the parameter estimates with the inverse of the respective variance-covariance matrix, is an easily computed, efficient statistic for testing the significance of effects. The score statistic is obtained from the generalized inner product of the score vector with the inverse of the information matrix, which is based on the Hessian matrix (the matrix of the second-order partial derivatives of the log-likelihood function with respect to the parameters). The likelihood ratio (LR) test requires the greatest computational effort (another iterative estimation procedure) and is thus not as fast as the first two methods; however, the LR test provides the most asymptotically efficient test known. For details concerning these different test statistics, see Agresti (1996), McCullagh and Nelder (1989), and Dobson (1990).
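For a single parameter, the Wald statistic reduces to the squared estimate divided by its variance, referred to a Chi-square distribution with 1 degree of freedom. The sketch below shows that special case only; the function name and the input values are illustrative assumptions, not output from a fitted model.

```python
import math

def wald_test(estimate, variance):
    """Wald statistic and p-value for one parameter (Chi-square, 1 df)."""
    w = estimate ** 2 / variance
    # For 1 df: P(Chi-square > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p

# Hypothetical estimate and variance, for illustration only.
w, p = wald_test(1.96, 1.0)
```

With several parameters tested jointly, the same idea applies with the inverse of the full variance-covariance matrix in place of the single variance.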
Diagnostics in the generalized linear model. The two basic types of residuals are the so-called Pearson residuals and deviance residuals. Pearson residuals are based on the difference between observed responses and the predicted values; deviance residuals are based on the contribution of the observed responses to the log-likelihood statistic. In addition, leverage scores, studentized residuals, generalized Cook’s D, and other observational statistics (statistics based on individual observations) can be computed. For a description and discussion of these statistics, see Hosmer and Lemeshow (1989).
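To make the two residual types concrete, here is a minimal sketch for a Poisson response, where the variance equals the mean. The Poisson case is chosen only as an example; the function names are assumptions made for this illustration.

```python
import math

def pearson_residual(y, mu):
    """Raw residual scaled by the Poisson standard deviation sqrt(mu)."""
    return (y - mu) / math.sqrt(mu)

def deviance_residual(y, mu):
    """Signed square root of the case's contribution to the Poisson deviance."""
    term = y * math.log(y / mu) if y > 0 else 0.0  # y*log(y/mu) -> 0 as y -> 0
    d = 2.0 * (term - (y - mu))
    return math.copysign(math.sqrt(max(d, 0.0)), y - mu)
```

Both residuals are zero when the observed and predicted values agree, and both take the sign of the raw residual; they differ in how strongly they penalize large discrepancies.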
Types of Analyses
The design for an analysis can include effects for continuous as well as categorical predictor variables. Designs may include polynomials for continuous predictors (e.g., squared or cubic terms) as well as interaction effects (i.e., product terms) for continuous predictors. For categorical predictor variables, we can fit ANOVA-like designs, including full factorial, nested, and fractional factorial designs, etc. Designs can be incomplete (i.e., involve missing cells), and effects for categorical predictor variables can be represented using either the sigma-restricted parameterization or the overparameterized (i.e., indicator variable) representation of effects.
The topics below give complete descriptions of the types of designs that can be analyzed using the generalized linear model, as well as types of designs that can be analyzed using the general linear model.
Signal detection theory. The list of designs shown below is by no means comprehensive, i.e., it does not describe all possible research problems to which the generalized linear model can be applied. For example, an important application of the generalized linear model is the estimation of parameters for signal detection theory (SDT) models. SDT is an application of statistical decision theory used to detect a signal embedded in noise. SDT is used in psychophysical studies of detection, recognition, and discrimination, and in other areas such as medical research, weather forecasting, survey research, and marketing research. For example, DeCarlo (1998) shows how signal detection models based on different underlying distributions can easily be considered by using the generalized linear model with different link functions.
For discussion of the generalized linear model and the link functions it uses, see Computational Approach.
Between-Subject Designs
- Overview
- One-way ANOVA
- Main effect ANOVA
- Factorial ANOVA
- Nested designs
- Simple regression
- Multiple regression
- Factorial regression
- Polynomial regression
- Response surface regression
- Mixture surface regression
- Analysis of covariance (ANCOVA)
- Separate slopes designs
- Homogeneity of slopes
Overview. The levels or values of the predictor variables in an analysis describe the differences between the n subjects or the n valid cases that are analyzed. Thus, when we speak of the between subject design (or simply the between design) for an analysis, we are referring to the nature, number, and arrangement of the predictor variables.
Concerning the nature or type of predictor variables, between designs which contain only categorical predictor variables can be called ANOVA (analysis of variance) designs, between designs which contain only continuous predictor variables can be called regression designs, and between designs which contain both categorical and continuous predictor variables can be called ANCOVA (analysis of covariance) designs. Further, continuous predictors are always considered to have fixed values, but the levels of categorical predictors can be considered to be fixed or to vary randomly. Designs which contain random categorical factors are called mixed-model designs (see the Variance Components and Mixed Model ANOVA/ANCOVA topic).
Between designs may involve only a single predictor variable and therefore be described as simple (e.g., simple regression) or may employ numerous predictor variables (e.g., multiple regression).
Concerning the arrangement of predictor variables, some between designs employ only “main effect” or first-order terms for predictors, that is, the values for different predictor variables are independent and raised only to the first power. Other between designs may employ higher-order terms for predictors by raising the values for the original predictor variables to a power greater than 1 (e.g., in polynomial regression designs), or by forming products of different predictor variables (i.e., interaction terms). A common arrangement for ANOVA designs is the full-factorial design, in which every combination of levels for each of the categorical predictor variables is represented in the design. Designs with some but not all combinations of levels for each of the categorical predictor variables are aptly called fractional factorial designs. Designs with a hierarchy of combinations of levels for the different categorical predictor variables are called nested designs.
These basic distinctions about the nature, number, and arrangement of predictor variables can be used in describing a variety of different types of between designs. Some of the more common between designs can now be described.
One-Way ANOVA. A design with a single categorical predictor variable is called a one-way ANOVA design. For example, a study of 4 different fertilizers used on different individual plants could be analyzed via one-way ANOVA, with four levels for the factor Fertilizer.
In general, consider a single categorical predictor variable A with 1 case in each of its 3 categories. Using the sigma-restricted coding of A into 2 quantitative contrast variables, the matrix X defining the between design is
         X_{0}   X_{1}   X_{2}
A_{1}      1       1       0
A_{2}      1       0       1
A_{3}      1      -1      -1
That is, cases in groups A_{1}, A_{2}, and A_{3} are all assigned values of 1 on X_{0} (the intercept), the case in group A_{1} is assigned a value of 1 on X_{1} and a value 0 on X_{2}, the case in group A_{2} is assigned a value of 0 on X_{1} and a value 1 on X_{2}, and the case in group A_{3} is assigned a value of -1 on X_{1} and a value -1 on X_{2}. Of course, any additional cases in any of the 3 groups would be coded similarly. If there were 1 case in group A_{1}, 2 cases in group A_{2}, and 1 case in group A_{3}, the X matrix would be
         X_{0}   X_{1}   X_{2}
A_{11}     1       1       0
A_{12}     1       0       1
A_{22}     1       0       1
A_{13}     1      -1      -1
where the first subscript for A gives the replicate number for the cases in each group. For brevity, replicates usually are not shown when describing ANOVA design matrices.
Note that in one-way designs with an equal number of cases in each group, sigma-restricted coding yields X_{1} … X_{k} variables all of which have means of 0.
Using the overparameterized model to represent A, the X matrix defining the between design is simply
These simple examples show that the X matrix actually serves two purposes. It specifies (1) the coding for the levels of the original predictor variables on the X variables used in the analysis as well as (2) the nature, number, and arrangement of the X variables, that is, the between design.
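The two coding schemes for a single categorical predictor can be generated programmatically. The sketch below follows the coding rules described in the text (the last level receives -1 on every contrast under sigma-restricted coding, and one indicator column per level is used in the overparameterized model); the function names and level labels are assumptions made for illustration.

```python
def sigma_restricted(cases, levels):
    """Rows [X0, X1, ..., X(k-1)] for a factor with k levels."""
    k = len(levels)
    rows = []
    for level in cases:
        idx = levels.index(level)
        if idx == k - 1:
            contrasts = [-1] * (k - 1)  # last level: -1 on every contrast
        else:
            contrasts = [1 if j == idx else 0 for j in range(k - 1)]
        rows.append([1] + contrasts)    # leading 1 is the intercept X0
    return rows

def overparameterized(cases, levels):
    """Rows [X0, X1, ..., Xk]: intercept plus one indicator per level."""
    return [[1] + [1 if level == a else 0 for a in levels] for level in cases]

levels = ["A1", "A2", "A3"]
one_per_group = sigma_restricted(["A1", "A2", "A3"], levels)
unbalanced = sigma_restricted(["A1", "A2", "A2", "A3"], levels)
indicator_coding = overparameterized(["A1", "A2", "A3"], levels)
```

The sigma-restricted version needs one column fewer than the overparameterized version, which is why the latter is not of full column rank.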
Main Effect ANOVA. Main effect ANOVA designs contain separate one-way ANOVA designs for 2 or more categorical predictors. A good example of main effect ANOVA would be the typical analysis performed on screening designs as described in the Experimental Design topic.
Consider 2 categorical predictor variables A and B each with 2 categories. Using the sigma-restricted coding, the X matrix defining the between design is
Note that if there are equal numbers of cases in each group, the sum of the cross-products of values for the X_{1} and X_{2} columns is 0, for example, with 1 case in each group (1*1)+(1*-1)+(-1*1)+(-1*-1)=0. Using the overparameterized model, the matrix X defining the between design is
Comparing the two types of coding, it can be seen that the overparameterized coding takes almost twice as many values as the sigma-restricted coding to convey the same information.
Factorial ANOVA. Factorial ANOVA designs contain X variables representing combinations of the levels of 2 or more categorical predictors (e.g., a study of boys and girls in four age groups, resulting in a 2 (Gender) x 4 (Age Group) design). In particular, full-factorial designs represent all possible combinations of the levels of the categorical predictors. A full-factorial design with 2 categorical predictor variables A and B, each with 2 levels, would be called a 2 x 2 full-factorial design. Using the sigma-restricted coding, the X matrix for this design would be
Several features of this X matrix deserve comment. Note that the X_{1} and X_{2} columns represent main effect contrasts for one variable (i.e., A and B, respectively), collapsing across the levels of the other variable. The X_{3} column instead represents a contrast between different combinations of the levels of A and B. Note also that the values for X_{3} are products of the corresponding values for X_{1} and X_{2}. Product variables such as X_{3} represent the multiplicative or interaction effects of their factors, so X_{3} would be said to represent the 2-way interaction of A and B. The relationship of such product variables to the dependent variables indicates the interactive influences of the factors on responses above and beyond their independent (i.e., main effect) influences on responses. Thus, factorial designs provide more information about the relationships between categorical predictor variables and responses on the dependent variables than is provided by corresponding one-way or main effect designs.
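The structure just described (two main effect contrasts plus their product) can be reproduced in a short sketch. It assumes one case per cell and the usual 1 / -1 coding for two-level factors; the function name is hypothetical.

```python
def factorial_2x2_sigma_restricted():
    """Rows [X0, X1, X2, X3] for cells A1B1, A1B2, A2B1, A2B2."""
    rows = []
    for x1 in (1, -1):          # A1 coded 1, A2 coded -1
        for x2 in (1, -1):      # B1 coded 1, B2 coded -1
            rows.append([1, x1, x2, x1 * x2])  # X3 is the A x B interaction
    return rows

def cross_product(rows, i, j):
    """Sum of cross-products of columns i and j."""
    return sum(r[i] * r[j] for r in rows)

X = factorial_2x2_sigma_restricted()
```

With equal numbers of cases per cell, every pair of contrast columns has a cross-product sum of 0, so the main effects and the interaction are coded orthogonally.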
When many factors are being investigated, however, full-factorial designs sometimes require more data than reasonably can be collected to represent all possible combinations of levels of the factors, and high-order interactions between many factors can become difficult to interpret. With many factors, a useful alternative to the full-factorial design is the fractional factorial design. As an example, consider a 2 x 2 x 2 fractional factorial design to degree 2 with 3 categorical predictor variables each with 2 levels. The design would include the main effects for each variable, and all 2-way interactions between the three variables, but would not include the 3-way interaction between all three variables. Using the overparameterized model, the X matrix for this design is
The 2-way interactions are the highest degree effects included in the design. These types of designs are discussed in detail in the 2**(k-p) Fractional Factorial Designs section of the Experimental Design topic.
Nested ANOVA Designs. Nested designs are similar to fractional factorial designs in that all possible combinations of the levels of the categorical predictor variables are not represented in the design. In nested designs, however, the omitted effects are lower-order effects. Nested effects are effects in which the nested variables never appear as main effects. Suppose that for 2 variables A and B with 3 and 2 levels, respectively, the design includes the main effect for A and the effect of B nested within the levels of A. The X matrix for this design using the overparameterized model is
Note that if the sigma-restricted coding were used, there would be only 2 columns in the X matrix for the B nested within A effect instead of the 6 columns in the X matrix for this effect when the overparameterized model coding is used (i.e., columns X_{4} through X_{9}). The sigma-restricted coding method is overly restrictive for nested designs, so only the overparameterized model is used to represent nested designs.
Simple Regression. Simple regression designs involve a single continuous predictor variable. If there were 3 cases with values on a predictor variable P of, say, 7, 4, and 9, and the design is for the first-order effect of P, the X matrix would be
X_{0}   X_{1}
  1       7
  1       4
  1       9
and using P for X_{1} the regression equation would be
Y = b_{0} + b_{1}P
If the simple regression design is for a higher-order effect of P, say the quadratic effect, the values in the X_{1} column of the design matrix would be raised to the 2nd power, that is, squared
X_{0}   X_{1}
  1      49
  1      16
  1      81
and using P^{2} for X_{1} the regression equation would be
Y = b_{0} + b_{1}P^{2}
The sigma-restricted and overparameterized coding methods do not apply to simple regression designs and any other design containing only continuous predictors (since there are no categorical predictors to code). Regardless of which coding method is chosen, values on the continuous predictor variables are raised to the desired power and used as the values for the X variables. No recoding is performed. It is therefore sufficient, in describing regression designs, to simply describe the regression equation without explicitly describing the design matrix X.
Multiple Regression. Multiple regression designs are to continuous predictor variables as main effect ANOVA designs are to categorical predictor variables, that is, multiple regression designs contain the separate simple regression designs for 2 or more continuous predictor variables. The regression equation for a multiple regression design for the first-order effects of 3 continuous predictor variables P, Q, and R would be
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}R
Factorial Regression. Factorial regression designs are similar to factorial ANOVA designs, in which combinations of the levels of the factors are represented in the design. In factorial regression designs, however, there may be many more such possible combinations of distinct levels for the continuous predictor variables than there are cases in the data set. To simplify matters, full-factorial regression designs are defined as designs in which all possible products of the continuous predictor variables are represented in the design. For example, the full-factorial regression design for two continuous predictor variables P and Q would include the main effects (i.e., the first-order effects) of P and Q and their 2-way P by Q interaction effect, which is represented by the product of P and Q scores for each case. The regression equation would be
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}P*Q
Factorial regression designs can also be fractional, that is, higher-order effects can be omitted from the design. A fractional factorial design to degree 2 for 3 continuous predictor variables P, Q, and R would include the main effects and all 2-way interactions between the predictor variables
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}R + b_{4}P*Q + b_{5}P*R + b_{6}Q*R
Polynomial Regression. Polynomial regression designs are designs which contain main effects and higher-order effects for the continuous predictor variables but do not include interaction effects between predictor variables. For example, the polynomial regression design to degree 2 for three continuous predictor variables P, Q, and R would include the main effects (i.e., the first-order effects) of P, Q, and R and their quadratic (i.e., second-order) effects, but not the 2-way interaction effects or the P by Q by R 3-way interaction effect.
Y = b_{0} + b_{1}P + b_{2}P^{2} + b_{3}Q + b_{4}Q^{2} + b_{5}R + b_{6}R^{2}
Polynomial regression designs do not have to contain all effects up to the same degree for every predictor variable. For example, main, quadratic, and cubic effects could be included in the design for some predictor variables, and effects up to the fourth degree could be included in the design for other predictor variables.
Response Surface Regression. Quadratic response surface regression designs are a hybrid type of design with characteristics of both polynomial regression designs and fractional factorial regression designs. Quadratic response surface regression designs contain all the same effects of polynomial regression designs to degree 2 and additionally the 2-way interaction effects of the predictor variables. The regression equation for a quadratic response surface regression design for 3 continuous predictor variables P, Q, and R would be
Y = b_{0} + b_{1}P + b_{2}P^{2} + b_{3}Q + b_{4}Q^{2} + b_{5}R + b_{6}R^{2} + b_{7}P*Q + b_{8}P*R + b_{9}Q*R
These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the Experimental Design topic (see Central composite designs).
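Because regression designs are fully described by their equations, the columns of the design matrix can be generated directly from the predictor values. The sketch below builds the terms of a quadratic response surface design in the same order as the equation above; the function names and the example case are assumptions made for illustration.

```python
from itertools import combinations

def response_surface_terms(names):
    """Term labels (without the intercept) for a quadratic response surface."""
    terms = []
    for name in names:
        terms += [name, f"{name}^2"]          # main and quadratic effects
    terms += [f"{a}*{b}" for a, b in combinations(names, 2)]  # 2-way interactions
    return terms

def response_surface_row(case):
    """Design-matrix row (intercept first) for one case given as {name: value}."""
    names = list(case)
    row = [1]
    for name in names:
        row += [case[name], case[name] ** 2]
    row += [case[a] * case[b] for a, b in combinations(names, 2)]
    return row

terms = response_surface_terms(["P", "Q", "R"])
row = response_surface_row({"P": 1, "Q": 2, "R": 3})  # hypothetical case
```

Dropping the squared terms gives the factorial regression design to degree 2, and dropping the product terms gives the polynomial regression design to degree 2.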
Mixture Surface Regression. Mixture surface regression designs are identical to factorial regression designs to degree 2 except for the omission of the intercept. Mixtures, as the name implies, add up to a constant value; the proportions of ingredients in different recipes for some material must all add up to 100%. Thus, the proportion of one ingredient in a material is redundant with the remaining ingredients. Mixture surface regression designs deal with this redundancy by omitting the intercept from the design. The regression equation for a mixture surface regression design for 3 continuous predictor variables P, Q, and R would be
Y = b_{1}P + b_{2}Q + b_{3}R + b_{4}P*Q + b_{5}P*R + b_{6}Q*R
These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the Experimental Design topic (see Mixture designs and triangular surfaces).
Analysis of Covariance. In general, between designs which contain both categorical and continuous predictor variables can be called ANCOVA designs. Traditionally, however, ANCOVA designs have referred more specifically to designs in which the first-order effects of one or more continuous predictor variables are taken into account when assessing the effects of one or more categorical predictor variables. A basic introduction to analysis of covariance can also be found in the Analysis of covariance (ANCOVA) section of the ANOVA/MANOVA topic.
To illustrate, suppose a researcher wants to assess the influences of a categorical predictor variable A with 3 levels on some outcome, and that measurements on a continuous predictor variable P, known to covary with the outcome, are available. If the data for the analysis are
then the sigma-restricted X matrix for the design that includes the separate first-order effects of P and A would be
The b_{2} and b_{3} coefficients in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}
represent the influences of group membership on the A categorical predictor variable, controlling for the influence of scores on the P continuous predictor variable. Similarly, the b_{1} coefficient represents the influence of scores on P controlling for the influences of group membership on A. This traditional ANCOVA analysis gives a more sensitive test of the influence of A to the extent that P reduces the prediction error, that is, the residuals for the outcome variable.
The X matrix for the same design using the overparameterized model would be
The interpretation is unchanged except that the influences of group membership on the A categorical predictor variables are represented by the b_{2}, b_{3} and b_{4} coefficients in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4}
Separate Slope Designs. The traditional analysis of covariance (ANCOVA) design for categorical and continuous predictor variables is inappropriate when the categorical and continuous predictors interact in influencing responses on the outcome. The appropriate design for modeling the influences of the predictors in this situation is called the separate slope design. For the same example data used to illustrate traditional ANCOVA, the overparameterized X matrix for the design that includes the main effect of the three-level categorical predictor A and the 2-way interaction of P by A would be
The b_{4}, b_{5}, and b_{6} coefficients in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5} + b_{6}X_{6}
give the separate slopes for the regression of the outcome on P within each group on A, controlling for the main effect of A.
As with nested ANOVA designs, the sigma-restricted coding of effects for separate slope designs is overly restrictive, so only the overparameterized model is used to represent separate slope designs. In fact, separate slope designs are identical in form to nested ANOVA designs, since the main effects for continuous predictors are omitted in separate slope designs.
Homogeneity of Slopes. The appropriate design for modeling the influences of continuous and categorical predictor variables depends on whether the continuous and categorical predictors interact in influencing the outcome. The traditional analysis of covariance (ANCOVA) design for continuous and categorical predictor variables is appropriate when the continuous and categorical predictors do not interact in influencing responses on the outcome, and the separate slope design is appropriate when the continuous and categorical predictors do interact in influencing responses. The homogeneity of slopes designs can be used to test whether the continuous and categorical predictors interact in influencing responses, and thus, whether the traditional ANCOVA design or the separate slope design is appropriate for modeling the effects of the predictors. For the same example data used to illustrate the traditional ANCOVA and separate slope designs, the overparameterized X matrix for the design that includes the main effect of P, the main effect of the three-level categorical predictor A, and the 2-way interaction of P by A would be
If the b_{5}, b_{6}, or b_{7} coefficient in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5} + b_{6}X_{6} + b_{7}X_{7}
is non-zero, the separate slope model should be used. If instead all 3 of these regression coefficients are zero the traditional ANCOVA design should be used.
The sigma-restricted X matrix for the homogeneity of slopes design would be
Using this X matrix, if the b_{4} or b_{5} coefficient in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5}
is non-zero, the separate slope model should be used. If instead both of these regression coefficients are zero the traditional ANCOVA design should be used.
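The three overparameterized designs discussed above (traditional ANCOVA, separate slopes, and homogeneity of slopes) differ only in which blocks of columns they include. The following sketch builds one design-matrix row for each; the function names and the example case are hypothetical, since the example data themselves are not reproduced here.

```python
def indicators(level, levels):
    """One 0/1 indicator column per level of the categorical predictor."""
    return [1 if level == a else 0 for a in levels]

def ancova_row(p, a, levels):
    """[X0, P, A indicators]: main effects of P and A."""
    return [1, p] + indicators(a, levels)

def separate_slopes_row(p, a, levels):
    """[X0, A indicators, P x A products]: no main effect of P."""
    ind = indicators(a, levels)
    return [1] + ind + [p * d for d in ind]

def homogeneity_of_slopes_row(p, a, levels):
    """[X0, P, A indicators, P x A products]: main effects plus interaction."""
    ind = indicators(a, levels)
    return [1, p] + ind + [p * d for d in ind]

levels = ["A1", "A2", "A3"]
# Hypothetical case: P = 5, group A2.
r_ancova = ancova_row(5, "A2", levels)
r_separate = separate_slopes_row(5, "A2", levels)
r_homogeneity = homogeneity_of_slopes_row(5, "A2", levels)
```

The row lengths (5, 7, and 8) match the number of coefficients in the corresponding regression equations, b_{0} through b_{4}, b_{6}, and b_{7}.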
Model Building
In addition to fitting the whole model for the specified type of analysis, different methods for automatic model building can be employed in analyses using the generalized linear model. Specifically, forward entry, backward removal, forward stepwise, and backward stepwise procedures can be performed, as well as best-subset search procedures. In forward methods of selection of effects to include in the model (i.e., forward entry and forward stepwise methods), score statistics are compared to select new (significant) effects. The Wald statistic can be used for backward removal methods (i.e., backward removal and backward stepwise, when effects are selected for removal from the model).
The best subsets search method can be based on three different test statistics: the score statistic, the model likelihood, and the AIC (Akaike Information Criterion, see Akaike, 1973). Note that, since the score statistic does not require iterative computations, best subset selection based on the score statistic is computationally fastest, while selection based on the other two statistics usually provides more accurate results; see McCullagh and Nelder (1989) for additional details.
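As a small illustration of selection by AIC, the sketch below uses the standard definition AIC = 2k - 2 log(L), where k is the number of estimated parameters and L is the maximized likelihood, and picks the candidate subset with the smallest value. The log-likelihood values are invented purely for illustration and do not come from any fitted model.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: smaller values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: subset of predictors -> (log-likelihood, parameters).
candidates = {
    ("P",): (-52.1, 2),
    ("P", "Q"): (-48.3, 3),
    ("P", "Q", "R"): (-48.0, 4),
}

scores = {subset: aic(ll, k) for subset, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
```

In this made-up example, adding R improves the likelihood too little to offset the penalty for the extra parameter, so the two-predictor subset is preferred.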
Interpretation of Results and Diagnostics
Simple estimation and test statistics may not be sufficient for adequate interpretation of the effects in an analysis. Especially for higher order (e.g., interaction) effects, inspection of the observed and predicted means can be invaluable for understanding the nature of an effect. Plots of these means (with error bars) can be useful for quickly grasping the role of the effects in the model.
Inspection of the distributions of variables is critically important when using the generalized linear model. Histograms and probability plots for variables, and scatterplots showing the relationships between observed values, predicted values, and residuals (e.g., Pearson residuals, deviance residuals, studentized residuals, differential Chi-square statistics, differential deviance statistics, and generalized Cook’s D) provide invaluable model-checking tools.
Generalized Additive Models (GAM)
The methods available in Generalized Additive Models are implementations of techniques developed and popularized by Hastie and Tibshirani (1990). A detailed description of these and related techniques, the algorithms used to fit these models, and discussions of recent research in this area of statistical modeling can also be found in Schimek (2000).
Additive Models
The methods described in this section represent a generalization of multiple regression (which is a special case of general linear models). Specifically, in linear regression, a linear least-squares fit is computed for a set of predictor or X variables, to predict a dependent Y variable. The well known linear regression equation with m predictors, to predict a dependent variable Y, can be stated as:
Y = b0 + b1*X1 + … + bm*Xm
where Y stands for the (predicted values of the) dependent variable, X1 through Xm represent the m values for the predictor variables, and b0, and b1 through bm are the regression coefficients estimated by multiple regression. A generalization of the multiple regression model would be to maintain the additive nature of the model, but to replace the simple terms of the linear equation bi*Xi with fi(Xi) where fi is a non-parametric function of the predictor Xi. In other words, instead of a single coefficient for each variable (additive term) in the model, in additive models an unspecified (non-parametric) function is estimated for each predictor, to achieve the best prediction of the dependent variable values.
Generalized Linear Models
To summarize the basic idea, the generalized linear model differs from the general linear model (of which multiple regression is a special case) in two major respects: First, the distribution of the dependent or response variable can be (explicitly) non-normal, and does not have to be continuous, e.g., it can be binomial; second, the dependent variable values are predicted from a linear combination of predictor variables, which are “connected” to the dependent variable via a link function. The general linear model for a single dependent variable can be considered a special case of the generalized linear model: In the general linear model the dependent variable values are expected to follow the normal distribution, and the link function is a simple identity function (i.e., the linear combination of values for the predictor variables is not transformed).
To illustrate, in the general linear model a response variable Y is linearly associated with values on the X variables while the relationship in the generalized linear model is assumed to be
Y = g(b0 + b1*X1 + … + bm*Xm)
where g(…) is a function. Formally, the inverse function of g(…), say gi(…), is called the link function; so that:
gi(muY) = b0 + b1*X1 + … + bm*Xm
where muY stands for the expected value of Y.
Distributions and Link Functions
Generalized Additive Models allows you to choose from a wide variety of distributions for the dependent variable, and link functions for the effects of the predictor variables on the dependent variable (see McCullagh and Nelder, 1989; Hastie and Tibshirani, 1990; see also GLZ Introductory Overview – Computational Approach for a discussion of link functions and distributions):
Normal, Gamma, and Poisson distributions:
Log link: f(z) = log(z)
Inverse link: f(z) = 1/z
Identity link: f(z) = z
Binomial distributions:
Logit link: f(z)=log(z/(1-z))
Generalized Additive Models
We can combine the notion of additive models with generalized linear models, to derive the notion of generalized additive models, as:
gi(muY) = Σi(fi(Xi)), where Σi denotes the sum over the m predictors
In other words, the purpose of generalized additive models is to maximize the quality of prediction of a dependent variable Y from various distributions, by estimating unspecific (non-parametric) functions of the predictor variables which are “connected” to the dependent variable via a link function.
Estimating the Nonparametric Function of Predictors via Scatterplot Smoothers
A unique aspect of generalized additive models is their use of non-parametric functions fi of the predictor variables Xi. Specifically, instead of some kind of simple or complex parametric functions, Hastie and Tibshirani (1990) discuss various general scatterplot smoothers that can be applied to the X variable values, with the target criterion to maximize the quality of prediction of the (transformed) Y variable values. One such scatterplot smoother is the cubic smoothing splines smoother, which generally produces a smooth generalization of the relationship between the two variables in the scatterplot. Computational details regarding this smoother can be found in Hastie and Tibshirani (1990; see also Schimek, 2000).
To summarize, instead of estimating single parameters (like the regression weights in multiple regression), in generalized additive models, we find a general unspecific (non-parametric) function that relates the predicted (transformed) Y values to the predictor values.
A Specific Example: The Generalized Additive Logistic Model
Let us consider a specific example of the generalized additive models: A generalization of the logistic (logit) model for binary dependent variable values. As also described in detail in the context of Nonlinear Estimation and Generalized Linear/Nonlinear Models, the logistic regression model for binary responses can be written as follows:
y=exp(b0+b1*x1+…+bm*xm)/{1+exp(b0+b1*x1+…+bm*xm)}
Note that the distribution of the dependent variable is assumed to be binomial, i.e., the response variable can only assume the values 0 or 1 (e.g., in a market research study, the purchasing decision would be binomial: The customer either did or did not make a particular purchase). We can apply the logistic link function to the probability p (ranging between 0 and 1) so that:
p’ = log {p/(1-p)}
By applying the logistic link function, we can now rewrite the model as:
p’ = b0 + b1*X1 + … + bm*Xm
Finally, we substitute the simple single-parameter additive terms to derive the generalized additive logistic model:
p’ = b0 + f1(X1) + … + fm(Xm)
An example application of this model can be found in Hastie and Tibshirani (1990).
Fitting Generalized Additive Models
Detailed descriptions of how generalized additive models are fit to data can be found in Hastie and Tibshirani (1990), as well as Schimek (2000, p. 300). In general there are two separate iterative operations involved in the algorithm, which are usually labeled the outer and inner loop. The purpose of the outer loop is to maximize the overall fit of the model, by maximizing the overall likelihood of the data given the model (similar to the maximum likelihood estimation procedures as described in, for example, the context of Nonlinear Estimation). The purpose of the inner loop is to refine the scatterplot smoother, which is the cubic splines smoother. The smoothing is performed with respect to the partial residuals; i.e., for every predictor k, the weighted cubic spline fit is found that best represents the relationship between variable k and the (partial) residuals computed by removing the effect of all other j predictors (j ≠ k). The iterative estimation procedure will terminate when the likelihood of the data given the model cannot be improved.
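The inner loop described above (smoothing each predictor against its partial residuals) can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm as implemented in any particular program: it covers only the normal-distribution, identity-link case (so no outer loop is needed), it uses unweighted fits, and it swaps in a cubic polynomial least-squares fit as a stand-in for the cubic spline smoother. The function names poly_smoother and backfit are hypothetical.

```python
import numpy as np


def poly_smoother(x, r, degree=3):
    """Stand-in smoother (assumption): least-squares cubic polynomial of r on x."""
    coefs = np.polyfit(x, r, degree)
    return np.polyval(coefs, x)


def backfit(X, y, n_iter=20, tol=1e-8):
    """Fit y = b0 + sum_k f_k(X[:, k]) + error by cycling over the predictors."""
    n, m = X.shape
    b0 = y.mean()
    f = np.zeros((n, m))  # f[:, k] holds the current estimate of f_k at the data
    prev_rss = np.inf
    for _ in range(n_iter):
        for k in range(m):
            # Partial residuals: remove the intercept and the effect of all
            # other predictors j != k from the dependent variable.
            partial = y - b0 - f.sum(axis=1) + f[:, k]
            f[:, k] = poly_smoother(X[:, k], partial)
            f[:, k] -= f[:, k].mean()  # center each function
        rss = float(np.sum((y - b0 - f.sum(axis=1)) ** 2))
        if prev_rss - rss < tol:  # stop when the fit no longer improves
            break
        prev_rss = rss
    return b0, f, rss


# Simulated data with two nonlinear additive effects (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 1.0 + X[:, 0] ** 3 + X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=200)
b0, f, rss = backfit(X, y)
total_ss = float(np.sum((y - y.mean()) ** 2))
```

Comparing rss with total_ss shows how much of the variation in y the additive fit accounts for. For non-normal distributions, an outer loop would repeatedly re-weight and re-linearize the problem around the current fit before calling an inner loop of this kind.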
Interpreting the Results
Many of the standard results statistics computed by Generalized Additive Models are similar to those customarily reported by linear or nonlinear model fitting procedures. For example, predicted and residual values for the final model can be computed, and various graphs of the residuals can be displayed to help the user identify possible outliers, etc. Refer also to the description of the residual statistics computed by Generalized Linear/Nonlinear Models for details.
The main result of interest, of course, is how the predictors are related to the dependent variable. Scatterplots can be computed showing the smoothed predictor variable values plotted against the partial residuals, i.e., the residuals after removing the effect of all other predictor variables.
This plot allows you to evaluate the nature of the relationship between the predictor and the residualized (adjusted) dependent variable values (see Hastie & Tibshirani, 1990; in particular formula 6.3), and hence the nature of the influence of the respective predictor in the overall model.
Degrees of Freedom
To reiterate, the generalized additive models approach replaces the simple products of (estimated) parameter values times the predictor values with a cubic spline smoother for each predictor. When estimating a single parameter value, we lose one degree of freedom, i.e., we add one degree of freedom to the overall model. It is not clear how many degrees of freedom are lost due to estimating the cubic spline smoother for each variable. Intuitively, a smoother can either be very smooth, not following the pattern of data in the scatterplot very closely, or it can be less smooth, following the pattern of the data more closely. In the most extreme case, a simple line would be very smooth, and require us to estimate a single slope parameter, i.e., we would use one degree of freedom to fit the smoother (simple straight line); on the other hand, we could force a very “non-smooth” line to connect each actual data point, in which case we could “use-up” approximately as many degrees of freedom as there are points in the plot. Generalized Additive Models allows you to specify the degrees of freedom for the cubic spline smoother; the fewer degrees of freedom you specify, the smoother is the cubic spline fit to the partial residuals, and typically, the worse is the overall fit of the model. The issue of degrees of freedom for smoothers is discussed in detail in Hastie and Tibshirani (1990).
A Word of Caution
Generalized additive models are very flexible, and can provide an excellent fit in the presence of nonlinear relationships and significant noise in the predictor variables. However, note that because of this flexibility, you must be extra cautious not to over-fit the data, i.e., apply an overly complex model (with many degrees of freedom) to data so as to produce a good fit that likely will not replicate in subsequent validation studies. Also, compare the quality of the fit obtained from Generalized Additive Models to the fit obtained via Generalized Linear/Nonlinear Models. In other words, evaluate whether the added complexity (generality) of generalized additive models (regression smoothers) is necessary in order to obtain a satisfactory fit to the data. Often, this is not the case, and given a comparable fit of the models, the simpler generalized linear model is preferable to the more complex generalized additive model. These issues are discussed in greater detail in Hastie and Tibshirani (1990).
Another issue to keep in mind pertains to the interpretability of results obtained from (generalized) linear models vs. generalized additive models. Linear models are easily understood, summarized, and communicated to others (e.g., in technical reports). Moreover, parameter estimates can be used to predict or classify new cases in a simple and straightforward manner. Generalized additive models are not easily interpreted, in particular when they involve complex nonlinear effects of some or all of the predictor variables (and, of course, it is in those instances where generalized additive models may yield a better fit than generalized linear models). To reiterate, it is usually preferable to rely on a simple well understood model for predicting future cases, than on a complex model that is difficult to interpret and summarize.
General Linear Models (GLM)
This topic describes the use of the general linear model in a wide variety of statistical analyses. If you are unfamiliar with the basic methods of ANOVA and regression in linear models, it may be useful to first review the basic information on these topics in Elementary Concepts. A detailed discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA topic.
Basic Ideas: The General Linear Model
The following topics summarize the historical, mathematical, and computational foundations for the general linear model. For a basic introduction to ANOVA (MANOVA, ANCOVA) techniques, refer to ANOVA/MANOVA; for an introduction to multiple regression, see Multiple Regression; for an introduction to the design and analysis of experiments in applied (industrial) settings, see Experimental Design.
Historical Background
The roots of the general linear model surely go back to the origins of mathematical thought, but it is the emergence of the theory of algebraic invariants in the 1800’s that made the general linear model, as we know it today, possible. The theory of algebraic invariants developed from the groundbreaking work of 19th century mathematicians such as Gauss, Boole, Cayley, and Sylvester. The theory seeks to identify those quantities in systems of equations which remain unchanged under linear transformations of the variables in the system. Stated more imaginatively (but in a way in which the originators of the theory would not consider an overstatement), the theory of algebraic invariants searches for the eternal and unchanging amongst the chaos of the transitory and the illusory. That is no small goal for any theory, mathematical or otherwise.
The wonder of it all is the theory of algebraic invariants was successful far beyond the hopes of its originators. Eigenvalues, eigenvectors, determinants, matrix decomposition methods; all derive from the theory of algebraic invariants. The contributions of the theory of algebraic invariants to the development of statistical theory and methods are numerous, but a simple example familiar to even the most casual student of statistics is illustrative. The correlation between two variables is unchanged by linear transformations of either or both variables. We probably take this property of correlation coefficients for granted, but what would data analysis be like if we did not have statistics that are invariant to the scaling of the variables involved? Some thought on this question should convince you that without the theory of algebraic invariants, the development of useful statistical techniques would be nigh impossible.
The development of the linear regression model in the late 19th century, and the development of correlational methods shortly thereafter, are clearly direct outgrowths of the theory of algebraic invariants. Regression and correlational methods, in turn, serve as the basis for the general linear model. Indeed, the general linear model can be seen as an extension of linear multiple regression for a single dependent variable. Understanding the multiple regression model is fundamental to understanding the general linear model, so we will look at the purpose of multiple regression, the computational algorithms used to solve regression problems, and how the regression model is extended in the case of the general linear model. A basic introduction to multiple regression methods and the analytic problems to which they are applied is provided in the Multiple Regression topic.
The Purpose of Multiple Regression
The general linear model can be seen as an extension of linear multiple regression for a single dependent variable, and understanding the multiple regression model is fundamental to understanding the general linear model. The general purpose of multiple regression (the term was first used by Pearson, 1908) is to quantify the relationship between several independent or predictor variables and a dependent or criterion variable. For a detailed introduction to multiple regression, also refer to the Multiple Regression section. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, we might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how “pretty” the house is (subjective rating). We may also detect “outliers,” for example, houses that should really sell for more, given their location and characteristics.
Personnel professionals customarily use multiple regression procedures to determine equitable compensation. We can determine a number of factors or dimensions such as “amount of responsibility” (Resp) or “number of people to supervise” (No_Super) that we believe to contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:
Salary = .5*Resp + .8*No_Super
Once this so-called regression equation has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably.
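As a small illustration of this use of the regression equation, the sketch below computes predicted salaries from the equation given above and labels each position relative to the regression line. The positions and their values are invented for the example, and the function names predicted_salary and classify are hypothetical.

```python
def predicted_salary(resp, no_super):
    """Regression equation from the text: Salary = .5*Resp + .8*No_Super."""
    return 0.5 * resp + 0.8 * no_super


def classify(actual, predicted, tol=1e-9):
    """Label a position by where its actual salary falls relative to the line."""
    if actual < predicted - tol:
        return "underpaid"
    if actual > predicted + tol:
        return "overpaid"
    return "equitable"


# Hypothetical positions: (title, Resp, No_Super, actual salary).
positions = [
    ("Position A", 40, 5, 20.0),
    ("Position B", 40, 5, 30.0),
    ("Position C", 40, 5, 24.0),
]

labels = {
    title: classify(actual, predicted_salary(resp, n_sup))
    for title, resp, n_sup, actual in positions
}
```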
In the social and natural sciences multiple regression procedures are very widely used in research. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question “what is the best predictor of …”. For example, educational researchers might want to learn what are the best predictors of success in high-school. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society.
Computations for Solving the Multiple Regression Equation
A one-dimensional surface in a two-dimensional or two-variable space is a line defined by the equation Y = b_{0} + b_{1}X. According to this equation, the Y variable can be expressed in terms of or as a function of a constant (b_{0}) and a slope (b_{1}) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient. For example, GPA may best be predicted as 1+.02*IQ. Thus, knowing that a student has an IQ of 130 would lead us to predict that her GPA would be 3.6 (since, 1+.02*130=3.6). In the multiple regression case, when there are multiple predictor variables, the regression surface usually cannot be visualized in a two dimensional space, but the computations are a straightforward extension of the computations in the single predictor case. For example, if in addition to IQ we had additional predictors of achievement (e.g., Motivation, Self-discipline) we could construct a linear equation containing all those variables. In general then, multiple regression procedures will estimate a linear equation of the form:
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + … + b_{k}X_{k}
where k is the number of predictors. Note that in this equation, the regression coefficients (or b_{1} … b_{k} coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable. Another way to express this fact is to say that, for example, variable X_{1} is correlated with the Y variable, after controlling for all other independent variables. This type of correlation is also referred to as a partial correlation (this term was first used by Yule, 1907). Perhaps the following example will clarify this issue. We would probably find a significant negative correlation between hair length and height in the population (i.e., short people have longer hair). At first this may seem odd; however, if we were to add the variable Gender into the multiple regression equation, this correlation would probably disappear. This is because women, on the average, have longer hair than men; they also are shorter on the average than men. Thus, after we remove this gender difference by entering Gender into the equation, the relationship between hair length and height disappears because hair length does not make any unique contribution to the prediction of height, above and beyond what it shares in the prediction with variable Gender. Put another way, after controlling for the variable Gender, the partial correlation between hair length and height is zero.
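The hair length and height example can be mimicked with simulated data. The sketch below is illustrative only: the group means and standard deviations are invented, and the partial correlation is computed in one common way, by correlating the residuals of each variable after regressing it on Gender. The function name residualize is hypothetical.

```python
import numpy as np


def residualize(v, z):
    """Residuals of v after a least-squares regression on z (with intercept)."""
    Z = np.column_stack([np.ones_like(z, dtype=float), z])
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ coef


rng = np.random.default_rng(42)
n = 5000
female = rng.integers(0, 2, size=n).astype(float)  # 1 = female, 0 = male

# Invented population values: hair length and height depend only on gender.
hair = 10.0 + 20.0 * female + rng.normal(0.0, 5.0, size=n)
height = 178.0 - 13.0 * female + rng.normal(0.0, 7.0, size=n)

# Simple correlation, ignoring gender.
raw_r = np.corrcoef(hair, height)[0, 1]

# Partial correlation, controlling for gender.
partial_r = np.corrcoef(residualize(hair, female),
                        residualize(height, female))[0, 1]
```

In this simulation raw_r is clearly negative while partial_r is close to zero, because within each gender group hair length and height were generated independently.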
The regression surface (a line in simple regression, a plane or higher-dimensional surface in multiple regression) expresses the best prediction of the dependent variable (Y), given the independent variables (X’s). However, nature is rarely (if ever) perfectly predictable, and usually there is substantial variation of the observed points from the fitted regression surface. The deviation of a particular point from the nearest corresponding point on the predicted regression surface (its predicted value) is called the residual value. Since the goal of linear regression procedures is to fit a surface, which is a linear function of the X variables, as closely as possible to the observed Y variable, the residual values for the observed points can be used to devise a criterion for the “best fit.” Specifically, in regression problems the surface is computed for which the sum of the squared deviations of the observed points from that surface is minimized. Thus, this general procedure is sometimes also referred to as least squares estimation (see also the description of weighted least squares estimation).
The actual computations involved in solving regression problems can be expressed compactly and conveniently using matrix notation. Suppose that there are n observed values of Y and n associated observed values for each of k different X variables. Then Y_{i}, X_{ik}, and e_{i} can represent the ith observation of the Y variable, the ith observation of each of the X variables, and the ith unknown residual value, respectively. Collecting these terms into matrices, we have an n x 1 vector Y, an n x (k + 1) matrix X whose first column consists of 1’s (for the intercept), and an n x 1 vector e.
The multiple regression model in matrix notation then can be expressed as
Y = Xb + e
where b is a column vector of 1 (for the intercept) + k unknown regression coefficients. Recall that the goal of multiple regression is to minimize the sum of the squared residuals. Regression coefficients that satisfy this criterion are found by solving the set of normal equations
X’Xb = X’Y
When the X variables are linearly independent (i.e., they are nonredundant, yielding an X’X matrix which is of full rank) there is a unique solution to the normal equations. Premultiplying both sides of the matrix formula for the normal equations by the inverse of X’X gives
(X’X)^{-1}X’Xb = (X’X)^{-1}X’Y
or
b = (X’X)^{-1}X’Y
This last result is very satisfying in view of its simplicity and its generality. With regard to its simplicity, it expresses the solution for the regression equation in terms of just 2 matrices (X and Y) and 3 basic matrix operations, (1) matrix transposition, which involves interchanging the elements in the rows and columns of a matrix, (2) matrix multiplication, which involves finding the sum of the products of the elements for each row and column combination of two conformable (i.e., multipliable) matrices, and (3) matrix inversion, which involves finding the matrix equivalent of a numeric reciprocal, that is, the matrix that satisfies
A^{-1}AA=A
for a matrix A.
It took literally centuries for the ablest mathematicians and statisticians to find a satisfactory method for solving the linear least square regression problem. But their efforts have paid off, for it is hard to imagine a simpler solution.
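The solution of the normal equations can be sketched with NumPy. The example reuses the GPA and IQ illustration from earlier in this section, with made-up IQ values and GPA values generated exactly from 1+.02*IQ so that the recovered coefficients are known in advance. The function name ols_normal_equations is hypothetical; the sketch solves X’Xb = X’Y directly rather than forming the inverse explicitly, which gives the same b when X’X is of full rank.

```python
import numpy as np


def ols_normal_equations(X, y):
    """Solve X'Xb = X'Y after adding a column of 1's for the intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])  # n x (k + 1) design matrix
    XtX = X1.T @ X1   # matrix transposition and multiplication
    Xty = X1.T @ y
    return np.linalg.solve(XtX, Xty)  # requires X'X to be of full rank


# Made-up IQ scores; GPA is generated exactly from GPA = 1 + .02*IQ.
iq = np.array([90.0, 100.0, 110.0, 120.0, 130.0])
gpa = 1.0 + 0.02 * iq

b = ols_normal_equations(iq, gpa)   # b[0] is the intercept, b[1] the slope
predicted_130 = b[0] + b[1] * 130.0
```

Because the data lie exactly on the line, b recovers the intercept 1 and the slope .02, and the predicted GPA for an IQ of 130 is 3.6, up to floating-point rounding.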
With regard to the generality of the multiple regression model, its only notable limitations are that (1) it can be used to analyze only a single dependent variable, and (2) it cannot provide a solution for the regression coefficients when the X variables are not linearly independent and the inverse of X’X therefore does not exist. These restrictions, however, can be overcome, and in doing so the multiple regression model is transformed into the general linear model.
Extension of Multiple Regression to the General Linear Model
One way in which the general linear model differs from the multiple regression model is in terms of the number of dependent variables that can be analyzed. The Y vector of n observations of a single Y variable can be replaced by a Y matrix of n observations of m different Y variables. Similarly, the b vector of regression coefficients for a single Y variable can be replaced by a b matrix of regression coefficients, with one vector of b coefficients for each of the m dependent variables. These substitutions yield what is sometimes called the multivariate regression model, but it should be emphasized that the matrix formulations of the multiple and multivariate regression models are identical, except for the number of columns in the Y and b matrices. The method for solving for the b coefficients is also identical, that is, m different sets of regression coefficients are separately found for the m different dependent variables in the multivariate regression model.
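The point that the multivariate regression model is solved exactly like m separate multiple regressions can be checked numerically. The sketch below uses random, purely illustrative data; the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 50, 3, 2  # cases, predictors, dependent variables

X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # with intercept
Y = rng.normal(size=(n, m))                                 # n x m matrix

# One solve with the full Y matrix: b is a (k + 1) x m matrix.
B_joint = np.linalg.solve(X.T @ X, X.T @ Y)

# m separate solves, one per dependent variable, collected column by column.
B_separate = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, j]) for j in range(m)]
)
```

The two coefficient matrices agree column by column, which is why the multivariate regression model remains, in effect, a set of univariate analyses.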
The general linear model goes a step beyond the multivariate regression model by allowing for linear transformations or linear combinations of multiple dependent variables. This extension gives the general linear model important advantages over the multiple and the so-called multivariate regression models, both of which are inherently univariate (single dependent variable) methods. One advantage is that multivariate tests of significance can be employed when responses on multiple dependent variables are correlated. Separate univariate tests of significance for correlated dependent variables are not independent and may not be appropriate. Multivariate tests of significance of independent linear combinations of multiple dependent variables also can give insight into which dimensions of the response variables are, and are not, related to the predictor variables. Another advantage is the ability to analyze effects of repeated measure factors. Repeated measure designs, or within-subject designs, have traditionally been analyzed using ANOVA techniques. Linear combinations of responses reflecting a repeated measure effect (for example, the difference of responses on a measure under differing conditions) can be constructed and tested for significance using either the univariate or multivariate approach to analyzing repeated measures in the general linear model.
A second important way in which the general linear model differs from the multiple regression model is in its ability to provide a solution for the normal equations when the X variables are not linearly independent and the inverse of X’X does not exist. Redundancy of the X variables may be incidental (e.g., two predictor variables might happen to be perfectly correlated in a small data set), accidental (e.g., two copies of the same variable might unintentionally be used in an analysis) or designed (e.g., indicator variables with exactly opposite values might be used in the analysis, as when both Male and Female predictor variables are used in representing Gender). Finding the regular inverse of a non-full-rank matrix is reminiscent of the problem of finding the reciprocal of 0 in ordinary arithmetic. No such inverse or reciprocal exists because division by 0 is not permitted. This problem is solved in the general linear model by using a generalized inverse of the X’X matrix in solving the normal equations. A generalized inverse is any matrix that satisfies
AA^{–}A = A
for a matrix A. A generalized inverse is unique and is the same as the regular inverse only if the matrix A is full rank. A generalized inverse for a non-full-rank matrix can be computed by the simple expedient of zeroing the elements in redundant rows and columns of the matrix. Suppose that an X’X matrix with r non-redundant columns is partitioned as
X’X = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
where A_{11} is an r by r matrix of rank r. Then the regular inverse of A_{11} exists and a generalized inverse of X’X is
(X’X)^{–} = \begin{bmatrix} A_{11}^{-1} & 0_{12} \\ 0_{21} & 0_{22} \end{bmatrix}
where each 0 (null) matrix is a matrix of 0’s (zeroes) and has the same dimensions as the corresponding A matrix.
In practice, however, a particular generalized inverse of X’X for finding a solution to the normal equations is usually computed using the sweep operator (Dempster, 1960). This generalized inverse, called a g2 inverse, has two important properties. One is that zeroing of the elements in redundant rows is unnecessary. Another is that partitioning or reordering of the columns of X’X is unnecessary, so that the matrix can be inverted “in place.”
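The zeroing construction of a generalized inverse (not the sweep operator) can be sketched as follows. The example assumes that the r non-redundant columns come first, as in the partitioning described above; the function name ginv_by_zeroing and the small data set (two males, two females, with an intercept and both Male and Female indicator columns) are invented for illustration.

```python
import numpy as np


def ginv_by_zeroing(A, r):
    """Generalized inverse of A: invert the leading r x r block, zero the rest.

    Assumes the leading r x r block of A has rank r and that r equals the
    rank of A (i.e., the redundant rows and columns come last).
    """
    G = np.zeros_like(A, dtype=float)
    G[:r, :r] = np.linalg.inv(A[:r, :r])
    return G


# Columns: intercept, Male, Female. Female is redundant given the other two.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
XtX = X.T @ X

G = ginv_by_zeroing(XtX, r=2)
```

The result can be checked against the defining property of a generalized inverse: XtX @ G @ XtX reproduces XtX, even though XtX has rank 2 and no regular inverse.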
There are infinitely many generalized inverses of a non-full-rank X’X matrix, and thus, infinitely many solutions to the normal equations. This can make it difficult to understand the nature of the relationships of the predictor variables to responses on the dependent variables, because the regression coefficients can change depending on the particular generalized inverse chosen for solving the normal equations. It is not cause for dismay, however, because of the invariance properties of many results obtained using the general linear model.
A simple example may be useful for illustrating one of the most important invariance properties of the use of generalized inverses in the general linear model. If both Male and Female predictor variables with exactly opposite values are used in an analysis to represent Gender, it is essentially arbitrary as to which predictor variable is considered to be redundant (e.g., Male can be considered to be redundant with Female, or vice versa). No matter which predictor variable is considered to be redundant, no matter which corresponding generalized inverse is used in solving the normal equations, and no matter which resulting regression equation is used for computing predicted values on the dependent variables, the predicted values and the corresponding residuals for males and females will be unchanged. In using the general linear model, we must keep in mind that finding a particular arbitrary solution to the normal equations is primarily a means to the end of accounting for responses on the dependent variables, and not necessarily an end in itself.
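This invariance can be demonstrated with a small numerical sketch. Two different generalized inverses are built by treating either Female or Male as the redundant column; the regression coefficients differ, but the predicted values do not. The data values and the function name ginv_keep are invented for the example.

```python
import numpy as np


def ginv_keep(A, keep):
    """Generalized inverse of A: invert the block for the kept (non-redundant)
    rows/columns and leave zeros in the redundant rows and columns."""
    G = np.zeros_like(A, dtype=float)
    idx = np.ix_(keep, keep)
    G[idx] = np.linalg.inv(A[idx])
    return G


# Columns: intercept, Male, Female (two males followed by two females).
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([70.0, 74.0, 60.0, 64.0])  # invented responses

XtX = X.T @ X
Xty = X.T @ y

b_female_redundant = ginv_keep(XtX, [0, 1]) @ Xty  # keep intercept and Male
b_male_redundant = ginv_keep(XtX, [0, 2]) @ Xty    # keep intercept and Female

pred_1 = X @ b_female_redundant
pred_2 = X @ b_male_redundant
```

Here the two coefficient vectors are (62, 10, 0) and (72, 0, -10), yet both sets of predicted values equal the group means, 72 for males and 62 for females.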
Sigma-Restricted and Overparameterized Model
Unlike the multiple regression model, which is usually applied to cases where the X variables are continuous, the general linear model is frequently applied to analyze any ANOVA or MANOVA design with categorical predictor variables, any ANCOVA or MANCOVA design with both categorical and continuous predictor variables, as well as any multiple or multivariate regression design with continuous predictor variables. To illustrate, Gender is clearly a nominal level variable (anyone who attempts to rank order the sexes on any dimension does so at his or her own peril in today’s world). There are two basic methods by which Gender can be coded into one or more (non-offensive) predictor variables, and analyzed using the general linear model.
Sigma-restricted model (coding of categorical predictors). Using the first method, males and females can be assigned any two arbitrary, but distinct values on a single predictor variable. The values on the resulting predictor variable will represent a quantitative contrast between males and females. Typically, the values corresponding to group membership are chosen not arbitrarily but rather to facilitate interpretation of the regression coefficient associated with the predictor variable. In one widely used strategy, cases in the two groups are assigned values of 1 and -1 on the predictor variable, so that if the regression coefficient for the variable is positive, the group coded as 1 on the predictor variable will have a higher predicted value (i.e., a higher group mean) on the dependent variable, and if the regression coefficient is negative, the group coded as -1 on the predictor variable will have a higher predicted value on the dependent variable. An additional advantage is that since each group is coded with a value one unit from zero, this helps in interpreting the magnitude of differences in predicted values between groups, because regression coefficients reflect the units of change in the dependent variable for each unit change in the predictor variable. This coding strategy is aptly called the sigma-restricted parameterization, because the values used to represent group membership (1 and -1) sum to zero.
Note that the sigma-restricted parameterization of categorical predictor variables usually leads to X’X matrices which do not require a generalized inverse for solving the normal equations. Potentially redundant information, such as the characteristics of maleness and femaleness, is literally reduced to full-rank by creating quantitative contrast variables representing differences in characteristics.
Overparameterized model (coding of categorical predictors). The second basic method for recoding categorical predictors is the indicator variable approach. In this method a separate predictor variable is coded for each group identified by a categorical predictor variable. To illustrate, females might be assigned a value of 1 and males a value of 0 on a first predictor variable identifying membership in the female Gender group, and males would then be assigned a value of 1 and females a value of 0 on a second predictor variable identifying membership in the male Gender group. Note that this method of recoding categorical predictor variables will almost always lead to X’X matrices with redundant columns, and thus require a generalized inverse for solving the normal equations. As such, this method is often called the overparameterized model for representing categorical predictor variables, because it results in more columns in the X’X matrix than are necessary for determining the relationships of categorical predictor variables to responses on the dependent variables.
True to its description as general, the general linear model can be used to perform analyses with categorical predictor variables which are coded using either of the two basic methods that have been described.
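The two coding methods can be compared side by side in a short sketch. The function names sigma_restricted and overparameterized, the group labels, and the choice of which group receives the value 1 are assumptions made for the example; both design matrices include a leading column of 1’s for the intercept.

```python
import numpy as np

gender = ["M", "M", "F", "F", "F"]  # invented cases


def sigma_restricted(groups, plus="M"):
    """Single contrast column: 1 for one group, -1 for the other."""
    contrast = np.array([1.0 if g == plus else -1.0 for g in groups])
    return np.column_stack([np.ones(len(groups)), contrast])


def overparameterized(groups, levels=("F", "M")):
    """One 0/1 indicator column per group."""
    indicators = np.array(
        [[1.0 if g == lev else 0.0 for lev in levels] for g in groups]
    )
    return np.column_stack([np.ones(len(groups)), indicators])


X_sigma = sigma_restricted(gender)
X_over = overparameterized(gender)

rank_sigma = np.linalg.matrix_rank(X_sigma.T @ X_sigma)
rank_over = np.linalg.matrix_rank(X_over.T @ X_over)
```

With sigma-restricted coding, X’X has 2 columns and rank 2, so a regular inverse exists. With overparameterized coding, X’X has 3 columns but still rank 2, because the two indicator columns sum to the intercept column, so a generalized inverse is required.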
Summary of Computations
To conclude this discussion of the ways in which the general linear model extends and generalizes regression methods, the general linear model can be expressed as
YM = Xb + e
Here Y, X, b, and e are as described for the multivariate regression model and M is an m x s matrix of coefficients defining s linear transformations of the dependent variables. The normal equations are
X’Xb = X’YM
and a solution for the normal equations is given by
b = (X’X)^{–}X’YM
Here the inverse of X’X is a generalized inverse if X’X contains redundant columns.
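The full computation can be sketched for a small invented data set with an overparameterized Gender coding and two repeated measurements per case. As an assumption of this example, NumPy's Moore-Penrose pseudoinverse (np.linalg.pinv) stands in for the generalized inverse; it is one particular generalized inverse, not the g2 inverse obtained from the sweep operator. M here defines a single transformation (s = 1), the difference between the second and first measurement.

```python
import numpy as np

# Columns: intercept, Male, Female (overparameterized, so X'X is singular).
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)

# Invented responses: n = 4 cases measured under m = 2 conditions.
Y = np.array([[10.0, 12.0],
              [11.0, 15.0],
              [20.0, 21.0],
              [22.0, 25.0]])

# m x s matrix defining s = 1 linear combination: second minus first measure.
M = np.array([[-1.0],
              [1.0]])

XtX_ginv = np.linalg.pinv(X.T @ X, rcond=1e-10)  # a generalized inverse
b = XtX_ginv @ X.T @ Y @ M                       # b = (X'X)^- X'YM

predicted = X @ b           # predicted difference scores
residuals = Y @ M - predicted
```

The predicted values are the group means of the difference scores (3 for the two males, 2 for the two females), and they would be the same for any other choice of generalized inverse.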
Add a provision for analyzing linear combinations of multiple dependent variables, add a method for dealing with redundant predictor variables and recoded categorical predictor variables, and the major limitations of multiple regression are overcome by the general linear model.
Types of Analyses
A wide variety of types of designs can be analyzed using the general linear model. In fact, the flexibility of the general linear model allows it to handle so many different types of designs that it is difficult to develop simple typologies of the ways in which these designs might differ. Some general ways in which designs might differ can be suggested, but keep in mind that any particular design can be a “hybrid” in the sense that it could have combinations of features of a number of different types of designs.
In the following discussion, references will be made to the design matrix X, as well as sigma-restricted and overparameterized model coding. For an explanation of this terminology, refer to the section entitled Basic Ideas: The General Linear Model, or, for a brief summary, to the Summary of Computations section.
A basic discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA topic; a discussion of multiple regression methods is also provided in the Multiple Regression topic.
Between-Subject Designs
- Overview
- One-way ANOVA
- Main effect ANOVA
- Factorial ANOVA
- Nested designs
- Balanced ANOVA
- Simple regression
- Multiple regression
- Factorial regression
- Polynomial regression
- Response surface regression
- Mixture surface regression
- Analysis of covariance (ANCOVA)
- Separate slopes designs
- Homogeneity of slopes
- Mixed-model ANOVA and ANCOVA
Overview. The levels or values of the predictor variables in an analysis describe the differences between the n subjects or the n valid cases that are analyzed. Thus, when we speak of the between subject design (or simply the between design) for an analysis, we are referring to the nature, number, and arrangement of the predictor variables.
Concerning the nature or type of predictor variables, between designs which contain only categorical predictor variables can be called ANOVA (analysis of variance) designs, between designs which contain only continuous predictor variables can be called regression designs, and between designs which contain both categorical and continuous predictor variables can be called ANCOVA (analysis of covariance) designs. Further, continuous predictors are always considered to have fixed values, but the levels of categorical predictors can be considered to be fixed or to vary randomly. Designs which contain random categorical factors are called mixed-model designs (see the Variance Components and Mixed Model ANOVA/ANCOVA section).
Between designs may involve only a single predictor variable and therefore be described as simple (e.g., simple regression) or may employ numerous predictor variables (e.g., multiple regression).
Concerning the arrangement of predictor variables, some between designs employ only “main effect” or first-order terms for predictors, that is, the values for different predictor variables are independent and raised only to the first power. Other between designs may employ higher-order terms for predictors by raising the values for the original predictor variables to a power greater than 1 (e.g., in polynomial regression designs), or by forming products of different predictor variables (i.e., interaction terms). A common arrangement for ANOVA designs is the full-factorial design, in which every combination of levels for each of the categorical predictor variables is represented in the design. Designs with some but not all combinations of levels for each of the categorical predictor variables are aptly called fractional factorial designs. Designs with a hierarchy of combinations of levels for the different categorical predictor variables are called nested designs.
These basic distinctions about the nature, number, and arrangement of predictor variables can be used in describing a variety of different types of between designs. Some of the more common between designs can now be described.
One-Way ANOVA. A design with a single categorical predictor variable is called a one-way ANOVA design. For example, a study of 4 different fertilizers used on different individual plants could be analyzed via one-way ANOVA, with four levels for the factor Fertilizer.
In general, consider a single categorical predictor variable A with 1 case in each of its 3 categories. Using the sigma-restricted coding of A into 2 quantitative contrast variables, the matrix X defining the between design is
That is, cases in groups A_{1}, A_{2}, and A_{3} are all assigned values of 1 on X_{0} (the intercept), the case in group A_{1} is assigned a value of 1 on X_{1} and a value 0 on X_{2}, the case in group A_{2} is assigned a value of 0 on X_{1} and a value 1 on X_{2}, and the case in group A_{3} is assigned a value of -1 on X_{1} and a value -1 on X_{2}. Of course, any additional cases in any of the 3 groups would be coded similarly. If there were 1 case in group A_{1}, 2 cases in group A_{2}, and 1 case in group A_{3}, the X matrix would be
where the first subscript for A gives the replicate number for the cases in each group. For brevity, replicates usually are not shown when describing ANOVA design matrices.
Note that in one-way designs with an equal number of cases in each group, sigma-restricted coding yields X_{1} … X_{k} variables all of which have means of 0.
Using the overparameterized model to represent A, the X matrix defining the between design is simply
These simple examples show that the X matrix actually serves two purposes. It specifies (1) the coding for the levels of the original predictor variables on the X variables used in the analysis as well as (2) the nature, number, and arrangement of the X variables, that is, the between design.
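The two coding schemes for a one-way design can be sketched in a few lines of Python. This is only an illustration of the coding rules described above, not the software's own implementation; the function names and the convention of numbering the groups 1 to k are assumptions made for this sketch.

```python
def sigma_restricted(groups, k):
    """Rows [X0, X1, ..., X(k-1)] for cases whose group numbers run from 1 to k."""
    rows = []
    for g in groups:
        if g == k:
            # The last group is assigned -1 on every contrast variable.
            contrasts = [-1] * (k - 1)
        else:
            # Group g is assigned 1 on X_g and 0 on the other contrasts.
            contrasts = [1 if j == g else 0 for j in range(1, k)]
        rows.append([1] + contrasts)  # X0 is the intercept
    return rows


def overparameterized(groups, k):
    """Rows [X0, X1, ..., Xk] with one indicator column per group."""
    return [[1] + [1 if j == g else 0 for j in range(1, k + 1)] for g in groups]


X_sigma = sigma_restricted([1, 2, 3], 3)          # 1 case per group
X_unbalanced = sigma_restricted([1, 2, 2, 3], 3)  # 2 cases in group A2
X_over = overparameterized([1, 2, 3], 3)

print(X_sigma)  # [[1, 1, 0], [1, 0, 1], [1, -1, -1]]
print(X_over)   # [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
```

With 1 case per group, the X_{1} and X_{2} columns of the sigma-restricted matrix each sum to 0, in line with the note about column means in balanced one-way designs.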
Main Effect ANOVA. Main effect ANOVA designs contain separate one-way ANOVA designs for 2 or more categorical predictors. A good example of main effect ANOVA would be the typical analysis performed on screening designs as described in the context of the Experimental Design section.
Consider 2 categorical predictor variables A and B each with 2 categories. Using the sigma-restricted coding, the X matrix defining the between design is
Note that if there are equal numbers of cases in each group, the sum of the cross-products of values for the X_{1} and X_{2} columns is 0, for example, with 1 case in each group (1*1)+(1*-1)+(-1*1)+(-1*-1)=0. Using the overparameterized model, the matrix X defining the between design is
Comparing the two types of coding, it can be seen that the overparameterized coding takes almost twice as many values as the sigma-restricted coding to convey the same information.
Factorial ANOVA. Factorial ANOVA designs contain X variables representing combinations of the levels of 2 or more categorical predictors (e.g., a study of boys and girls in four age groups, resulting in a 2 (Gender) x 4 (Age Group) design). In particular, full-factorial designs represent all possible combinations of the levels of the categorical predictors. A full-factorial design with 2 categorical predictor variables A and B each with 2 levels would be called a 2 x 2 full-factorial design. Using the sigma-restricted coding, the X matrix for this design would be
Several features of this X matrix deserve comment. Note that the X_{1} and X_{2} columns represent main effect contrasts for one variable (i.e., A and B, respectively), collapsing across the levels of the other variable. The X_{3} column instead represents a contrast between different combinations of the levels of A and B. Note also that the values for X_{3} are products of the corresponding values for X_{1} and X_{2}. Product variables such as X_{3} represent the multiplicative or interaction effects of their factors, so X_{3} would be said to represent the 2-way interaction of A and B. The relationship of such product variables to the dependent variables indicates the interactive influences of the factors on responses above and beyond their independent (i.e., main effect) influences on responses. Thus, factorial designs provide more information about the relationships between categorical predictor variables and responses on the dependent variables than is provided by corresponding one-way or main effect designs.
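As a minimal sketch of how the interaction column arises as a product, the following Python builds a sigma-restricted matrix for a 2 x 2 full-factorial design with 1 case per cell. The cell order (A1B1, A1B2, A2B1, A2B2) and the mapping of level 1 to 1 and level 2 to -1 are assumptions chosen to agree with the cross-product example given for the main effect design.

```python
from itertools import product

# Assumed sigma-restricted contrast for a 2-level factor: level 1 -> 1, level 2 -> -1
code = {1: 1, 2: -1}

X = []
for a, b in product([1, 2], repeat=2):  # cells A1B1, A1B2, A2B1, A2B2
    x1, x2 = code[a], code[b]
    # Columns: X0 (intercept), X1 (A), X2 (B), X3 (A by B interaction = X1 * X2)
    X.append([1, x1, x2, x1 * x2])

for row in X:
    print(row)
# [1, 1, 1, 1]
# [1, 1, -1, -1]
# [1, -1, 1, -1]
# [1, -1, -1, 1]
```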
When many factors are being investigated, however, full-factorial designs sometimes require more data than reasonably can be collected to represent all possible combinations of levels of the factors, and high-order interactions between many factors can become difficult to interpret. With many factors, a useful alternative to the full-factorial design is the fractional factorial design. As an example, consider a 2 x 2 x 2 fractional factorial design to degree 2 with 3 categorical predictor variables each with 2 levels. The design would include the main effects for each variable, and all 2-way interactions between the three variables, but would not include the 3-way interaction between all three variables. Using the overparameterized model, the X matrix for this design is
The 2-way interactions are the highest degree effects included in the design. These types of designs are discussed in detail in the 2**(k-p) Fractional Factorial Designs section of the Experimental Design topic.
Nested ANOVA Designs. Nested designs are similar to fractional factorial designs in that not all possible combinations of the levels of the categorical predictor variables are represented in the design. In nested designs, however, the omitted effects are lower-order effects. Nested effects are effects in which the nested variables never appear as main effects. Suppose that for 2 variables A and B with 3 and 2 levels, respectively, the design includes the main effect for A and the effect of B nested within the levels of A. The X matrix for this design using the overparameterized model is
Note that if the sigma-restricted coding were used, there would be only 2 columns in the X matrix for the B nested within A effect instead of the 6 columns in the X matrix for this effect when the overparameterized model coding is used (i.e., columns X_{4} through X_{9}). The sigma-restricted coding method is overly restrictive for nested designs, so only the overparameterized model is used to represent nested designs.
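A rough Python sketch of the overparameterized row for this nested design follows. It assumes that columns X_{1} to X_{3} are indicators for the levels of A and that columns X_{4} to X_{9} are indicators for each B-within-A combination, ordered B1(A1), B2(A1), B1(A2), and so on; that ordering and the function name are assumptions for illustration.

```python
def nested_row(a, b, a_levels=3, b_levels=2):
    """Overparameterized row for the main effect of A plus B nested within A."""
    row = [1]  # X0, the intercept
    # X1..X3: one indicator per level of A
    row += [1 if a == i else 0 for i in range(1, a_levels + 1)]
    # X4..X9: one indicator per (level of A, level of B) combination
    row += [
        1 if (a, b) == (i, j) else 0
        for i in range(1, a_levels + 1)
        for j in range(1, b_levels + 1)
    ]
    return row


print(nested_row(2, 1))  # [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```

There is no separate set of columns for a main effect of B, which reflects the statement that nested variables never appear as main effects.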
Balanced ANOVA. Most of the between designs discussed in this section can be analyzed much more efficiently, when they are balanced, i.e., when all cells in the ANOVA design have equal n, when there are no missing cells in the design, and, if nesting is present, when the nesting is balanced so that equal numbers of levels of the factors that are nested appear in the levels of the factor(s) that they are nested in. In that case, the X’X matrix (where X stands for the design matrix) is a diagonal matrix, and many of the computations necessary to compute the ANOVA results (such as matrix inversion) are greatly simplified.
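The effect of balance on X’X can be checked on a small case. The sketch below, which assumes the sigma-restricted 2 x 2 full-factorial matrix with 1 case per cell, computes X’X for the balanced design and for the same design with one extra case added to a single cell. The helper names are hypothetical.

```python
def xtx(X):
    """Return X'X for a design matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]


def is_diagonal(M):
    """True if every off-diagonal element of the square matrix M is zero."""
    return all(M[i][j] == 0 for i in range(len(M)) for j in range(len(M)) if i != j)


balanced = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]]
unbalanced = balanced + [[1, 1, 1, 1]]  # one extra case in cell A1B1

print(is_diagonal(xtx(balanced)))    # True
print(is_diagonal(xtx(unbalanced)))  # False
```

In the balanced case every diagonal element equals 4, so inverting X’X amounts to dividing by 4; the single extra case is enough to introduce non-zero off-diagonal elements.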
Simple Regression. Simple regression designs involve a single continuous predictor variable. If there were 3 cases with values on a predictor variable P of, say, 7, 4, and 9, and the design is for the first-order effect of P, the X matrix would be
and using P for X_{1} the regression equation would be
Y = b_{0} + b_{1}P
If the simple regression design is for a higher-order effect of P, say the quadratic effect, the values in the X_{1} column of the design matrix would be raised to the 2nd power, that is, squared
and using P^{2} for X_{1} the regression equation would be
Y = b_{0} + b_{1}P^{2}
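The design matrices for the first-order and quadratic effects of P, and the solution of the normal equations X’Xb = X’Y for the first-order design, can be sketched as follows. The predictor values 7, 4, and 9 come from the example above; the outcome values are hypothetical and were chosen to lie exactly on the line Y = 2 + 3P so that the result is easy to check.

```python
from fractions import Fraction


def design(values, power=1):
    """Rows [X0, X1] with X1 equal to the predictor value raised to `power`."""
    return [[1, v ** power] for v in values]


def solve_normal_equations(X, y):
    """Solve X'Xb = X'y for a two-column X using Cramer's rule."""
    s00 = sum(r[0] * r[0] for r in X)
    s01 = sum(r[0] * r[1] for r in X)
    s11 = sum(r[1] * r[1] for r in X)
    t0 = sum(r[0] * yi for r, yi in zip(X, y))
    t1 = sum(r[1] * yi for r, yi in zip(X, y))
    det = s00 * s11 - s01 * s01
    return Fraction(t0 * s11 - s01 * t1, det), Fraction(s00 * t1 - s01 * t0, det)


P = [7, 4, 9]
X_linear = design(P)        # [[1, 7], [1, 4], [1, 9]]
X_quadratic = design(P, 2)  # [[1, 49], [1, 16], [1, 81]]

y = [2 + 3 * p for p in P]  # hypothetical outcomes on the line Y = 2 + 3P
b0, b1 = solve_normal_equations(X_linear, y)
print(b0, b1)  # 2 3
```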
The sigma-restricted and overparameterized coding methods do not apply to simple regression designs and any other design containing only continuous predictors (since there are no categorical predictors to code). Regardless of which coding method is chosen, values on the continuous predictor variables are raised to the desired power and used as the values for the X variables. No recoding is performed. It is therefore sufficient, in describing regression designs, to simply describe the regression equation without explicitly describing the design matrix X.
Multiple Regression. Multiple regression designs are to continuous predictor variables as main effect ANOVA designs are to categorical predictor variables, that is, multiple regression designs contain the separate simple regression designs for 2 or more continuous predictor variables. The regression equation for a multiple regression design for the first-order effects of 3 continuous predictor variables P, Q, and R would be
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}R
Factorial Regression. Factorial regression designs are similar to factorial ANOVA designs, in which combinations of the levels of the factors are represented in the design. In factorial regression designs, however, there may be many more such possible combinations of distinct levels for the continuous predictor variables than there are cases in the data set. To simplify matters, full-factorial regression designs are defined as designs in which all possible products of the continuous predictor variables are represented in the design. For example, the full-factorial regression design for two continuous predictor variables P and Q would include the main effects (i.e., the first-order effects) of P and Q and their 2-way P by Q interaction effect, which is represented by the product of P and Q scores for each case. The regression equation would be
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}P*Q
Factorial regression designs can also be fractional, that is, higher-order effects can be omitted from the design. A fractional factorial design to degree 2 for 3 continuous predictor variables P, Q, and R would include the main effects and all 2-way interactions between the predictor variables
Y = b_{0} + b_{1}P + b_{2}Q + b_{3}R + b_{4}P*Q + b_{5}P*R + b_{6}Q*R
Polynomial Regression. Polynomial regression designs are designs which contain main effects and higher-order effects for the continuous predictor variables but do not include interaction effects between predictor variables. For example, the polynomial regression design to degree 2 for three continuous predictor variables P, Q, and R would include the main effects (i.e., the first-order effects) of P, Q, and R and their quadratic (i.e., second-order) effects, but not the 2-way interaction effects or the P by Q by R 3-way interaction effect.
Y = b_{0} + b_{1}P + b_{2}P^{2} + b_{3}Q + b_{4}Q^{2} + b_{5}R + b_{6}R^{2}
Polynomial regression designs do not have to contain all effects up to the same degree for every predictor variable. For example, main, quadratic, and cubic effects could be included in the design for some predictor variables, and effects up to the fourth degree could be included in the design for other predictor variables.
Response Surface Regression. Quadratic response surface regression designs are a hybrid type of design with characteristics of both polynomial regression designs and fractional factorial regression designs. Quadratic response surface regression designs contain all the same effects of polynomial regression designs to degree 2 and additionally the 2-way interaction effects of the predictor variables. The regression equation for a quadratic response surface regression design for 3 continuous predictor variables P, Q, and R would be
Y = b_{0} + b_{1}P + b_{2}P^{2} + b_{3}Q + b_{4}Q^{2} + b_{5}R + b_{6}R^{2} + b_{7}P*Q + b_{8}P*R + b_{9}Q*R
These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the Experimental Design topic (see Central composite designs).
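To make the structure of the quadratic response surface equation concrete, the sketch below computes the X values for one case in the same order as the coefficients b_{0} through b_{9}. The function name and the predictor values are hypothetical.

```python
from itertools import combinations


def response_surface_row(values):
    """X values for one case: intercept, each predictor and its square, then all 2-way products."""
    row = [1]  # intercept (b0)
    for v in values:
        row += [v, v ** 2]  # first-order and quadratic effects
    row += [u * v for u, v in combinations(values, 2)]  # P*Q, P*R, Q*R
    return row


# Hypothetical values for P, Q, and R
print(response_surface_row([2, 3, 5]))  # [1, 2, 4, 3, 9, 5, 25, 6, 10, 15]
```

Dropping the last three columns gives the polynomial regression design to degree 2, and dropping the squared columns gives the fractional factorial regression design to degree 2.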
Mixture Surface Regression. Mixture surface regression designs are identical to factorial regression designs to degree 2 except for the omission of the intercept. Mixtures, as the name implies, add up to a constant value; the sum of the proportions of ingredients in different recipes for some material all must add up to 100%. Thus, the proportion of one ingredient in a material is redundant with the remaining ingredients. Mixture surface regression designs deal with this redundancy by omitting the intercept from the design. The regression equation for a mixture surface regression design for 3 continuous predictor variables P, Q, and R would be
Y = b_{1}P + b_{2}Q + b_{3}R + b_{4}P*Q + b_{5}P*R + b_{6}Q*R
These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the Experimental Design topic (see Mixture designs and triangular surfaces).
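The redundancy that motivates dropping the intercept can be seen directly: when the proportions in every recipe sum to 1, an intercept column of 1s would equal P + Q + R for every case. The following sketch uses hypothetical recipes and a hypothetical helper name.

```python
from itertools import combinations


def mixture_row(proportions):
    """X values for one case: main effects, then 2-way products, with no intercept."""
    return list(proportions) + [u * v for u, v in combinations(proportions, 2)]


# Hypothetical recipes: proportions of ingredients P, Q, and R
recipes = [(0.5, 0.25, 0.25), (0.2, 0.3, 0.5), (1.0, 0.0, 0.0)]
X = [mixture_row(r) for r in recipes]

# P + Q + R reproduces a column of 1s, so an intercept would be redundant.
intercept_is_redundant = all(abs(sum(r) - 1.0) < 1e-12 for r in recipes)

print(X[0])                    # [0.5, 0.25, 0.25, 0.125, 0.125, 0.0625]
print(intercept_is_redundant)  # True
```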
Analysis of Covariance. In general, between designs which contain both categorical and continuous predictor variables can be called ANCOVA designs. Traditionally, however, ANCOVA designs have referred more specifically to designs in which the first-order effects of one or more continuous predictor variables are taken into account when assessing the effects of one or more categorical predictor variables. A basic introduction to analysis of covariance can also be found in the Analysis of covariance (ANCOVA) section of the ANOVA/MANOVA topic.
To illustrate, suppose a researcher wants to assess the influences of a categorical predictor variable A with 3 levels on some outcome, and that measurements on a continuous predictor variable P, known to covary with the outcome, are available. If the data for the analysis are
then the sigma-restricted X matrix for the design that includes the separate first-order effects of P and A would be
The b_{2} and b_{3} coefficients in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}
represent the influences of group membership on the A categorical predictor variable, controlling for the influence of scores on the P continuous predictor variable. Similarly, the b_{1} coefficient represents the influence of scores on P controlling for the influences of group membership on A. This traditional ANCOVA analysis gives a more sensitive test of the influence of A to the extent that P reduces the prediction error, that is, the residuals for the outcome variable.
The X matrix for the same design using the overparameterized model would be
The interpretation is unchanged except that the influences of group membership on the A categorical predictor variables are represented by the b_{2}, b_{3} and b_{4} coefficients in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4}
Separate Slope Designs. The traditional analysis of covariance (ANCOVA) design for categorical and continuous predictor variables is inappropriate when the categorical and continuous predictors interact in influencing responses on the outcome. The appropriate design for modeling the influences of the predictors in this situation is called the separate slope design. For the same example data used to illustrate traditional ANCOVA, the overparameterized X matrix for the design that includes the main effect of the three-level categorical predictor A and the 2-way interaction of P by A would be
The b_{4}, b_{5}, and b_{6} coefficients in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5} + b_{6}X_{6}
give the separate slopes for the regression of the outcome on P within each group on A, controlling for the main effect of A.
As with nested ANOVA designs, the sigma-restricted coding of effects for separate slope designs is overly restrictive, so only the overparameterized model is used to represent separate slope designs. In fact, separate slope designs are identical in form to nested ANOVA designs, since the main effects for continuous predictors are omitted in separate slope designs.
Homogeneity of Slopes. The appropriate design for modeling the influences of continuous and categorical predictor variables depends on whether the continuous and categorical predictors interact in influencing the outcome. The traditional analysis of covariance (ANCOVA) design for continuous and categorical predictor variables is appropriate when the continuous and categorical predictors do not interact in influencing responses on the outcome, and the separate slope design is appropriate when the continuous and categorical predictors do interact in influencing responses. The homogeneity of slopes designs can be used to test whether the continuous and categorical predictors interact in influencing responses, and thus, whether the traditional ANCOVA design or the separate slope design is appropriate for modeling the effects of the predictors. For the same example data used to illustrate the traditional ANCOVA and separate slope designs, the overparameterized X matrix for the design that includes the main effect of P, the main effect of the three-level categorical predictor A, and the 2-way interaction of P by A would be
If the b_{5}, b_{6}, or b_{7} coefficient in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5} + b_{6}X_{6} + b_{7}X_{7}
is non-zero, the separate slope model should be used. If instead all 3 of these regression coefficients are zero, the traditional ANCOVA design should be used.
The sigma-restricted X matrix for the homogeneity of slopes design would be
Using this X matrix, if the b_{4} or b_{5} coefficient in the regression equation
Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5}
is non-zero, the separate slope model should be used. If instead both of these regression coefficients are zero, the traditional ANCOVA design should be used.
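The three overparameterized designs that combine a continuous predictor P with a three-level categorical predictor A differ only in which columns are included. The sketch below builds one row for each design; the column order is inferred from the coefficient numbering in the regression equations above, and the function names and the example case are hypothetical.

```python
def indicators(group, k=3):
    """One indicator column per level of A (overparameterized coding)."""
    return [1 if group == j else 0 for j in range(1, k + 1)]


def ancova_row(p, group):
    """Traditional ANCOVA: intercept, P, then A (b0 through b4)."""
    return [1, p] + indicators(group)


def separate_slope_row(p, group):
    """Separate slopes: intercept, A, then P by A, with no main effect of P (b0 through b6)."""
    ind = indicators(group)
    return [1] + ind + [p * i for i in ind]


def homogeneity_row(p, group):
    """Homogeneity of slopes: intercept, P, A, then P by A (b0 through b7)."""
    ind = indicators(group)
    return [1, p] + ind + [p * i for i in ind]


# Hypothetical case: P = 9, member of group A2
print(ancova_row(9, 2))          # [1, 9, 0, 1, 0]
print(separate_slope_row(9, 2))  # [1, 0, 1, 0, 0, 9, 0]
print(homogeneity_row(9, 2))     # [1, 9, 0, 1, 0, 0, 9, 0]
```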
Mixed Model ANOVA and ANCOVA. Designs that contain random effects for one or more categorical predictor variables are called mixed-model designs. Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. The solution for the normal equations in mixed-model designs is identical to the solution for fixed-effect designs (i.e., designs which do not contain random effects). Mixed-model designs differ from fixed-effect designs only in the way in which effects are tested for significance. In fixed-effect designs, between effects are always tested using the mean squared residual as the error term. In mixed-model designs, between effects are tested using relevant error terms based on the covariation of random sources of variation in the design. Specifically, this is done using Satterthwaite’s method of denominator synthesis (Satterthwaite, 1946), which finds the linear combinations of sources of random variation that serve as appropriate error terms for testing the significance of the respective effect of interest. A basic discussion of these types of designs, and methods for estimating variance components for the random effects can also be found in the Variance Components and Mixed Model ANOVA/ANCOVA topic.
Mixed-model designs, like nested designs and separate slope designs, are designs in which the sigma-restricted coding of categorical predictors is overly restrictive. Mixed-model designs require estimation of the covariation between the levels of categorical predictor variables, and the sigma-restricted coding of categorical predictors suppresses this covariation. Thus, only the overparameterized model is used to represent mixed-model designs (some programs will use the sigma-restricted approach and a so-called “restricted model” for random effects; however, only the overparameterized model as described in General Linear Models applies to both balanced and unbalanced designs, as well as designs with missing cells; see Searle, Casella, & McCulloch, 1992, p. 127). It is important to recognize, however, that sigma-restricted coding can be used to represent any between design, with the exceptions of mixed-model, nested, and separate slope designs. Furthermore, some types of hypotheses can only be tested using the sigma-restricted coding (i.e., the effective hypothesis, Hocking, 1996), thus the greater generality of the overparameterized model for representing between designs does not justify it being used exclusively for representing categorical predictors in the general linear model.
Within-Subject (Repeated Measures) Designs
- Overview
- One-way within-subject designs
- Multi-way within-subject designs
- The multivariate approach to Repeated Measures
- Doubly multivariate within-subject designs
Overview. It is quite common for researchers to administer the same test to the same subjects repeatedly over a period of time or under varying circumstances. In essence, we are interested in examining differences within each subject, for example, subjects’ improvement over time. Such designs are referred to as within-subject designs or repeated measures designs. A basic introduction to repeated measures designs is also provided in the Between-groups and repeated measures section of the ANOVA/MANOVA topic.
For example, imagine that we want to monitor the improvement of students’ algebra skills over two months of instruction. A standardized algebra test is administered after one month (level 1 of the repeated measures factor), and a comparable test is administered after two months (level 2 of the repeated measures factor). Thus, the repeated measures factor (Time) has 2 levels. Now, suppose that scores for the 2 algebra tests (i.e., values on the Y_{1} and Y_{2} variables at Time 1 and Time 2, respectively) are transformed into scores on a new composite variable (i.e., values on the T_{1} variable), using the linear transformation
T = YM
where M is an orthonormal contrast matrix. Specifically, if
then the difference of the mean score on T_{1} from 0 indicates the improvement (or deterioration) of scores across the 2 levels of Time.
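The transformation T = YM for the two-level Time factor can be sketched as follows. The contrast used here, M = (-1/√2, 1/√2), is an assumption: it is one orthonormal contrast for 2 levels, signed so that a positive T_{1} score means a higher score at Time 2. The test scores are hypothetical.

```python
import math

# Assumed orthonormal contrast for 2 levels: Time 2 minus Time 1, scaled to unit length
M = [-1 / math.sqrt(2), 1 / math.sqrt(2)]

# Hypothetical (Y1, Y2) algebra scores for three students
Y = [(60, 70), (55, 65), (80, 82)]

# T = YM: one transformed score per student
T = [y1 * M[0] + y2 * M[1] for y1, y2 in Y]
mean_T = sum(T) / len(T)

print(mean_T > 0)  # True: scores improved on average from Time 1 to Time 2
```

The elements of M sum to 0 (it is a contrast) and their squares sum to 1 (it has unit length), which is what makes the transformation orthonormal.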
One-Way Within-Subject Designs. The example algebra skills study with the Time repeated measures factor (see also within-subjects design Overview) illustrates a one-way within-subject design. In such designs, orthonormal contrast transformations of the scores on the original dependent Y variables are performed via the M transformation (orthonormal transformations correspond to orthogonal rotations of the original variable axes). If any b_{0} coefficient in the regression of a transformed T variable on the intercept is non-zero, this indicates a change in responses across the levels of the repeated measures factor, that is, the presence of a main effect for the repeated measure factor on responses.
What if the between design includes effects other than the intercept? If any of the b_{1} through b_{k} coefficients in the regression of a transformed T variable on X are non-zero, this indicates a different change in responses across the levels of the repeated measures factor for different levels of the corresponding between effect, i.e., the presence of a within by between interaction effect on responses.
The same between-subject effects that can be tested in designs with no repeated-measures factors can also be tested in designs that do include repeated-measures factors. This is accomplished by creating a transformed dependent variable which is the sum of the original dependent variables divided by the square root of the number of original dependent variables. The same tests of between-subject effects that are performed in designs with no repeated-measures factors (including tests of the between intercept) are performed on this transformed dependent variable.
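The transformed dependent variable used for between-subject tests is simple to compute. This short sketch follows the description above (the sum of the original dependent variables divided by the square root of their number); the function name and the scores are hypothetical.

```python
import math


def between_score(y):
    """Sum of a case's repeated measures divided by the square root of their number."""
    return sum(y) / math.sqrt(len(y))


# Hypothetical (Y1, Y2) scores for three students
scores = [between_score(y) for y in [(60, 70), (55, 65), (80, 82)]]

print(between_score((3, 4, 5, 4)))  # 8.0
```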
Multi-Way Within-Subject Designs. Suppose that in the example algebra skills study with the Time repeated measures factor (see the within-subject designs Overview), students were given a number problem test and then a word problem test on each testing occasion. Test could then be considered as a second repeated measures factor, with scores on the number problem tests representing responses at level 1 of the Test repeated measure factor, and scores on the word problem tests representing responses at level 2 of the Test repeated measure factor. The within subject design for the study would be a 2 (Time) by 2 (Test) full-factorial design, with effects for Time, Test, and the Time by Test interaction.
To construct transformed dependent variables representing the effects of Time, Test, and the Time by Test interaction, three respective M transformations of the original dependent Y variables are performed. Assuming that the original Y variables are in the order Time 1 – Test 1, Time 1 – Test 2, Time 2 – Test 1, and Time 2 – Test 2, the M matrices for the Time, Test, and the Time by Test interaction would be
The differences of the mean scores on the transformed T variables from 0 are then used to interpret the corresponding within-subject effects. If the b_{0} coefficient in the regression of a transformed T variable on the intercept is non-zero, this indicates a change in responses across the levels of a repeated measures effect, that is, the presence of the corresponding main or interaction effect for the repeated measure factors on responses.
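One way to obtain M vectors for a 2 (Time) by 2 (Test) within-subject design is to take Kronecker products of a two-level averaging vector and a two-level contrast. The sketch below assumes the variable order Time 1 – Test 1, Time 1 – Test 2, Time 2 – Test 1, Time 2 – Test 2 given above; the signs of the contrasts and the helper names are assumptions.

```python
import math

r = 1 / math.sqrt(2)
avg = [r, r]    # averages over the 2 levels of a factor
diff = [r, -r]  # contrasts level 1 against level 2 of a factor


def kron(u, v):
    """Kronecker product of two vectors (first factor varies slowest)."""
    return [a * b for a in u for b in v]


M_time = kron(diff, avg)   # proportional to ( 1,  1, -1, -1)
M_test = kron(avg, diff)   # proportional to ( 1, -1,  1, -1)
M_inter = kron(diff, diff) # proportional to ( 1, -1, -1,  1)
M_between = kron(avg, avg) # proportional to ( 1,  1,  1,  1)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

The four vectors are mutually orthogonal and each has unit length. M_between gives the sum of the four original variables divided by the square root of 4, the transformed variable used for between-subject tests.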
Interpretation of within by between interaction effects follows the same procedures as for one-way within designs, except that now within by between interactions are examined for each within effect by between effect combination.
Multivariate Approach to Repeated Measures. When the repeated measures factor has more than 2 levels, then the M matrix will have more than a single column. For example, for a repeated measures factor with 3 levels (e.g., Time 1, Time 2, Time 3), the M matrix will have 2 columns (e.g., the two transformations of the dependent variables could be (1) Time 1 vs. Time 2 and Time 3 combined, and (2) Time 2 vs. Time 3). Consequently, the nature of the design is really multivariate, that is, there are two simultaneous dependent variables, which are transformations of the original dependent variables. Therefore, when testing repeated measures effects involving more than a single degree of freedom (e.g., a repeated measures main effect with more than 2 levels), you can compute multivariate test statistics to test the respective hypotheses. This is a different (and usually the preferred) approach than the univariate method that is still widely used. For a further discussion of the multivariate approach to testing repeated measures effects, and a comparison to the traditional univariate approach, see the Sphericity and compound symmetry section of the ANOVA/MANOVA topic.
Doubly Multivariate Designs. If the product of the number of levels for each within-subject factor is equal to the number of original dependent variables, the within-subject design is called a univariate repeated measures design. The within design is univariate because there is one dependent variable representing each combination of levels of the within-subject factors. Note that this use of the term univariate design is not to be confused with the univariate and multivariate approach to the analysis of repeated measures designs, both of which can be used to analyze such univariate (single-dependent-variable-only) designs. When there are two or more dependent variables for each combination of levels of the within-subject factors, the within-subject design is called a multivariate repeated measures design, or more commonly, a doubly multivariate within-subject design. This term is used because the analysis for each dependent measure can be done via the multivariate approach; so when there is more than one dependent measure, the design can be considered doubly-multivariate.
Doubly multivariate designs are analyzed using a combination of univariate repeated measures and multivariate analysis techniques. To illustrate, suppose in an algebra skills study, tests are administered three times (repeated measures factor Time with 3 levels). Two test scores are recorded at each level of Time: a Number Problem score and a Word Problem score. Thus, scores on the two types of tests could be treated as multiple measures on which improvement (or deterioration) across Time could be assessed. M transformed variables could be computed for each set of test measures, and multivariate tests of significance could be performed on the multiple transformed measures, as well as on each individual test measure.
Multivariate Designs
Overview. When there are multiple dependent variables in a design, the design is said to be multivariate. Multivariate measures of association are by nature more complex than their univariate counterparts (such as the correlation coefficient, for example). This is because multivariate measures of association must take into account not only the relationships of the predictor variables with responses on the dependent variables, but also the relationships among the multiple dependent variables. By doing so, however, these measures of association provide information about the strength of the relationships between predictor and dependent variables independent of the dependent variable interrelationships. A basic discussion of multivariate designs is also presented in the Multivariate Designs section in the ANOVA/MANOVA topic.
The most commonly used multivariate measures of association all can be expressed as functions of the eigenvalues of the product matrix
E^{-1}H
where E is the error SSCP matrix (i.e., the matrix of sums of squares and cross-products for the dependent variables that are not accounted for by the predictors in the between design), and H is a hypothesis SSCP matrix (i.e., the matrix of sums of squares and cross-products for the dependent variables that are accounted for by all the predictors in the between design, or the sums of squares and cross-products for the dependent variables that are accounted for by a particular effect). If
λ_{i} = the ordered eigenvalues of E^{-1}H, if E^{-1} exists
then the 4 commonly used multivariate measures of association are
Wilks’ lambda = Π[1/(1+λ_{i})]
Pillai’s trace = Σ[λ_{i}/(1+λ_{i})]
Hotelling-Lawley trace = Σλ_{i}
Roy’s largest root = λ_{1}
These 4 measures have different upper and lower bounds, with Wilks’ lambda perhaps being the most easily interpretable of the 4 measures. Wilks’ lambda can range from 0 to 1, with 1 indicating no relationship of predictors to responses and 0 indicating a perfect relationship of predictors to responses. 1 – Wilks’ lambda can be interpreted as the multivariate counterpart of a univariate R-squared, that is, it indicates the proportion of generalized variance in the dependent variables that is accounted for by the predictors.
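As a minimal sketch of how these four measures follow from the eigenvalues of E^{-1}H, the Python code below computes them with NumPy. The function name `multivariate_measures` and the two SSCP matrices are hypothetical illustrations, not part of STATISTICA.

```python
import numpy as np

def multivariate_measures(E, H):
    """Compute the four measures from the eigenvalues of E^{-1}H."""
    # Solve E x = H instead of forming the inverse of E explicitly.
    eigvals = np.linalg.eigvals(np.linalg.solve(E, H))
    lam = np.sort(eigvals.real)[::-1]      # ordered, largest first
    lam = np.clip(lam, 0.0, None)          # guard against tiny negative rounding
    return {
        "wilks_lambda": float(np.prod(1.0 / (1.0 + lam))),
        "pillai_trace": float(np.sum(lam / (1.0 + lam))),
        "hotelling_lawley_trace": float(np.sum(lam)),
        "roy_largest_root": float(lam[0]),
    }

# Hypothetical error and hypothesis SSCP matrices for two dependent variables.
E = np.array([[10.0, 2.0], [2.0, 8.0]])
H = np.array([[5.0, 1.0], [1.0, 4.0]])
measures = multivariate_measures(E, H)
print(measures)
```

In this made-up example H is exactly half of E, so both eigenvalues are 0.5, which makes the results easy to check by hand.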
The 4 measures of association are also used to construct multivariate tests of significance. These multivariate tests are covered in detail in a number of sources (e.g., Finn, 1974; Tatsuoka, 1971).
Estimation and Hypothesis Testing
The following sections discuss details concerning hypothesis testing in the context of STATISTICA’s GLM module, for example, how the test for the overall model fit is computed, the options for computing tests for categorical effects in unbalanced or incomplete designs, how and when custom-error terms can be chosen, and the logic of testing custom-hypotheses in factorial or regression designs.
Whole Model Tests
Partitioning Sums of Squares. A fundamental principle of least squares methods is that variation on a dependent variable can be partitioned, or divided into parts, according to the sources of the variation. Suppose that a dependent variable is regressed on one or more predictor variables, and that for convenience the dependent variable is scaled so that its mean is 0. Then a basic least squares identity is that the total sum of squared values on the dependent variable equals the sum of squared predicted values plus the sum of squared residual values. Stated more generally,
Σ(y – y-bar)^{2} = Σ(y-hat – y-bar)^{2} + Σ(y – y-hat)^{2}
where the term on the left is the total sum of squared deviations of the observed values on the dependent variable from the dependent variable mean, and the respective terms on the right are (1) the sum of squared deviations of the predicted values for the dependent variable from the dependent variable mean and (2) the sum of the squared deviations of the observed values on the dependent variable from the predicted values, that is, the sum of the squared residuals. Stated yet another way,
Total SS = Model SS + Error SS
Note that the Total SS is always the same for any particular data set, but that the Model SS and the Error SS depend on the regression equation. Assuming again that the dependent variable is scaled so that its mean is 0, the Model SS and the Error SS can be computed using
Model SS = b’X’Y
Error SS = Y’Y – b’X’Y
Testing the Whole Model. Given the Model SS and the Error SS, we can perform a test that all the regression coefficients for the X variables (b1 through bk) are zero. This test is equivalent to a comparison of the fit of the regression surface defined by the predicted values (computed from the whole model regression equation) to the fit of the regression surface defined solely by the dependent variable mean (computed from the reduced regression equation containing only the intercept). Assuming that X’X is full-rank, the whole model hypothesis mean square
MSH = (Model SS)/k
is an estimate of the variance of the predicted values. The error mean square
s^{2} = MSE = (Error SS)/(n-k-1)
is an unbiased estimate of the residual or error variance. The test statistic is
F = MSH/MSE
where F has (k, n – k – 1) degrees of freedom.
If X’X is not full rank, r + 1 is substituted for k, where r is the rank or the number of non-redundant columns of X’X.
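The partitioning of sums of squares and the whole model F statistic can be sketched in Python as follows. The simulated data, the variable names, and the choice to center both X and y (so that the mean-zero formulas apply directly) are assumptions made for illustration; the sketch also assumes that X’X is full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 2
X = rng.normal(size=(n, k))                       # hypothetical predictors
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Center the data so the dependent variable has mean 0 and no
# intercept column is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

b = np.linalg.lstsq(Xc, yc, rcond=None)[0]        # regression coefficients

total_ss = float(yc @ yc)                         # Y'Y
model_ss = float(b @ Xc.T @ yc)                   # b'X'Y
error_ss = total_ss - model_ss                    # Y'Y - b'X'Y

msh = model_ss / k                                # hypothesis mean square
mse = error_ss / (n - k - 1)                      # error mean square
f_stat = msh / mse                                # F with (k, n - k - 1) df
print(total_ss, model_ss, error_ss, f_stat)
```

The p-value is omitted to keep the sketch dependent on NumPy only; it could be obtained from an F distribution with (k, n – k – 1) degrees of freedom.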
Note that in the case of non-intercept models, some multiple regression programs will compute the full model test based on the proportion of variance around 0 (zero) accounted for by the predictors (for more information, see Kvålseth, 1985; Okunade, Chang, and Evans, 1993), while others will actually compute both values (i.e., based on the residual variance around 0 and around the respective dependent variable means).
Limitations of Whole Model Tests. For designs such as one-way ANOVA or simple regression designs, the whole model test by itself may be sufficient for testing general hypotheses about whether or not the single predictor variable is related to the outcome. In more complex designs, however, hypotheses about specific X variables or subsets of X variables are usually of interest. For example, you might want to make inferences about whether a subset of regression coefficients are 0, or you might want to test whether subpopulation means corresponding to combinations of specific X variables differ. The whole model test is usually insufficient for such purposes.
A variety of methods have been developed for testing specific hypotheses. Like whole model tests, many of these methods rely on comparisons of the fit of different models (e.g., Type I, Type II, and the effective hypothesis sums of squares). Other methods construct tests of linear combinations of regression coefficients in order to test mean differences (e.g., Type III, Type IV, and Type V sums of squares). For designs that contain only first-order effects of continuous predictor variables (i.e., multiple regression designs), many of these methods are equivalent (i.e., Type II through Type V sums of squares all test the significance of partial regression coefficients). However, there are important distinctions between the different hypothesis testing techniques for certain types of ANOVA designs (i.e., designs with unequal cell n’s and/or missing cells).
All methods for testing hypotheses, however, involve the same hypothesis testing strategy employed in whole model tests, that is, the sums of squares attributable to an effect (using a given criterion) is computed, and then the mean square for the effect is tested using an appropriate error term.
When there are categorical predictors in the model, arranged in a factorial ANOVA design, then we are typically interested in the main effects for and interaction effects between the categorical predictors. However, when the design is not balanced (has unequal cell n’s, and consequently, the coded effects for the categorical factors are usually correlated), or when there are missing cells in a full factorial ANOVA design, then there is ambiguity regarding the specific comparisons between the (population, or least-squares) cell means that constitute the main effects and interactions of interest. These issues are discussed in great detail in Milliken and Johnson (1986), and if you routinely analyze incomplete factorial designs, you should consult their discussion of various problems and approaches to solving them.
In addition to the widely used methods that are commonly labeled Type I, II, III, and IV sums of squares (see Goodnight, 1980), we also offer different methods for testing effects in incomplete designs, that are widely used in other areas (and traditions) of research.
Type V sums of squares. Specifically, we propose the term Type V sums of squares to denote the approach that is widely used in industrial experimentation, to analyze fractional factorial designs; these types of designs are discussed in detail in the 2**(k-p) Fractional Factorial Designs section of the Experimental Design topic. In effect, for those effects for which tests are performed all population marginal means (least squares means) are estimable.
Type VI sums of squares. Second, in keeping with the Type i labeling convention, we propose the term Type VI sums of squares to denote the approach that is often used in programs that only implement the sigma-restricted model (which is not well suited for certain types of designs; we offer a choice between the sigma-restricted and overparameterized models). This approach is identical to what is described as the effective hypothesis method in Hocking (1996).
Contained Effects. The following descriptions will use the term contained effect. An effect E1 (e.g., A * B interaction) is contained in another effect E2 if:
- Both effects involve the same continuous predictor variable (if included in the model; e.g., A * B * X would be contained in A * C * X, where A, B, and C are categorical predictors, and X is a continuous predictor); or
- E2 has more categorical predictors than does E1, and, if E1 includes any categorical predictors, they also appear in E2 (e.g., A * B would be contained in the A * B * C interaction).
Type I Sums of Squares. Type I sums of squares involve a sequential partitioning of the whole model sums of squares. A hierarchical series of regression equations is estimated, at each step adding an additional effect into the model. In Type I sums of squares, the sums of squares for each effect are determined by subtracting the predicted sums of squares for the preceding model not including the effect from the predicted sums of squares with the effect in the model. Tests of significance for each effect are then performed on the increment in the predicted sums of squares accounted for by the effect. Type I sums of squares are therefore sometimes called sequential or hierarchical sums of squares.
Type I sums of squares are appropriate to use in balanced (equal n) ANOVA designs in which effects are entered into the model in their natural order (i.e., any main effects are entered before any two-way interaction effects, any two-way interaction effects are entered before any three-way interaction effects, and so on). Type I sums of squares are also useful in polynomial regression designs in which any lower-order effects are entered before any higher-order effects. A third use of Type I sums of squares is to test hypotheses for hierarchically nested designs, in which the first effect in the design is nested within the second effect, the second effect is nested within the third, and so on.
One important property of Type I sums of squares is that the sums of squares attributable to each effect add up to the whole model sums of squares. Thus, Type I sums of squares provide a complete decomposition of the predicted sums of squares for the whole model. This is not generally true for any other type of sums of squares. An important limitation of Type I sums of squares, however, is that the sums of squares attributable to a specific effect will generally depend on the order in which the effects are entered into the model. This lack of invariance to order of entry into the model limits the usefulness of Type I sums of squares for testing hypotheses for certain designs (e.g., fractional factorial designs).
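Both properties (the additive decomposition and the dependence on order of entry) can be illustrated with a short Python sketch. The helper names `model_ss` and `type1_ss` and the simulated, deliberately correlated predictors are hypothetical.

```python
import numpy as np

def model_ss(X, y):
    """Predicted sum of squares about the mean for intercept + columns of X."""
    design = np.column_stack([np.ones(len(y)), X])
    fitted = design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.sum((fitted - y.mean()) ** 2))

def type1_ss(columns, y, order):
    """Sequential SS: the increment in model SS as each effect is added."""
    result, included, previous = {}, [], 0.0
    for name in order:
        included.append(columns[name])
        current = model_ss(np.column_stack(included), y)
        result[name] = current - previous
        previous = current
    return result

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)     # correlated with x1 on purpose
y = 1.0 + x1 + x2 + rng.normal(size=n)
columns = {"x1": x1, "x2": x2}

forward = type1_ss(columns, y, ["x1", "x2"])
backward = type1_ss(columns, y, ["x2", "x1"])
whole = model_ss(np.column_stack([x1, x2]), y)
print(forward, backward, whole)
```

In either order the increments sum to the whole model sums of squares, but the amount credited to each predictor changes with the order because the predictors are correlated.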
Type II Sums of Squares. Type II sums of squares are sometimes called partially sequential sums of squares. Like Type I sums of squares, Type II sums of squares for an effect control for the influence of other effects. Which other effects to control for, however, is determined by a different criterion. In Type II sums of squares, the sums of squares for an effect are computed by controlling for the influence of all other effects of equal or lower degree. Thus, sums of squares for main effects control for all other main effects, sums of squares for two-way interactions control for all main effects and all other two-way interactions, and so on.
Unlike Type I sums of squares, Type II sums of squares are invariant to the order in which effects are entered into the model. This makes Type II sums of squares useful for testing hypotheses for multiple regression designs, for main effect ANOVA designs, for full-factorial ANOVA designs with equal cell ns, and for hierarchically nested designs.
There is a drawback to the use of Type II sums of squares for factorial designs with unequal cell ns. In these situations, Type II sums of squares test hypotheses that are complex functions of the cell ns that ordinarily are not meaningful. Thus, a different method for testing hypotheses is usually preferred.
Type III Sums of Squares. Type I and Type II sums of squares usually are not appropriate for testing hypotheses for factorial ANOVA designs with unequal ns. For ANOVA designs with unequal ns, however, Type III sums of squares test the same hypothesis that would be tested if the cell ns were equal, provided that there is at least one observation in every cell. Specifically, in no-missing-cell designs, Type III sums of squares test hypotheses about differences in subpopulation (or marginal) means. When there are no missing cells in the design, these subpopulation means are least squares means, which are the best linear-unbiased estimates of the marginal means for the design (see, Milliken and Johnson, 1986).
Tests of differences in least squares means have the important property that they are invariant to the choice of the coding of effects for categorical predictor variables (e.g., the use of the sigma-restricted or overparameterized model) and to the choice of the particular g2 inverse of X’X used to solve the normal equations. Thus, tests of linear combinations of least squares means in general, including Type III tests of differences in least squares means, are said to not depend on the parameterization of the design. This makes Type III sums of squares useful for testing hypotheses for any design for which Type I or Type II sums of squares are appropriate, as well as for any unbalanced ANOVA design with no missing cells.
The Type III sums of squares attributable to an effect is computed as the sums of squares for the effect controlling for any effects of equal or lower degree and orthogonal to any higher-order interaction effects (if any) that contain it. The orthogonality to higher-order containing interactions is what gives Type III sums of squares the desirable properties associated with linear combinations of least squares means in ANOVA designs with no missing cells. But for ANOVA designs with missing cells, Type III sums of squares generally do not test hypotheses about least squares means, but instead test hypotheses that are complex functions of the patterns of missing cells in higher-order containing interactions and that are ordinarily not meaningful. In this situation Type V sums of squares or tests of the effective hypothesis (Type VI sums of squares) are preferred.
Type IV Sums of Squares. Type IV sums of squares were designed to test “balanced” hypotheses for lower-order effects in ANOVA designs with missing cells. Type IV sums of squares are computed by equitably distributing cell contrast coefficients for lower-order effects across the levels of higher-order containing interactions.
Type IV sums of squares are not recommended for testing hypotheses for lower-order effects in ANOVA designs with missing cells, even though this is the purpose for which they were developed. This is because Type IV sums of squares are invariant to some but not all g2 inverses of X’X that could be used to solve the normal equations. Specifically, Type IV sums of squares are invariant to the choice of a g2 inverse of X’X given a particular ordering of the levels of the categorical predictor variables, but are not invariant to different orderings of levels. Furthermore, as with Type III sums of squares, Type IV sums of squares test hypotheses that are complex functions of the patterns of missing cells in higher-order containing interactions and that are ordinarily not meaningful.
Statisticians who have examined the usefulness of Type IV sums of squares have concluded that Type IV sums of squares are not up to the task for which they were developed:
- Milliken & Johnson (1992, p. 204) write: “It seems likely that few, if any, of the hypotheses tested by the Type IV analysis of [some programs] will be of particular interest to the experimenter.”
- Searle (1987, p. 463-464) writes: “In general, [Type IV] hypotheses determined in this nature are not necessarily of any interest.”; and (p. 465) “This characteristic of Type IV sums of squares for rows depending on the sequence of rows establishes their non-uniqueness, and this in turn emphasizes that the hypotheses they are testing are by no means necessarily of any general interest.”
- Hocking (1985, p. 152), in an otherwise comprehensive introduction to general linear models, writes: “For the missing cell problem, [some programs] offers a fourth analysis, Type IV, which we shall not discuss.”
So, we recommend that you use the Type IV sums of squares solution with caution, and that you understand fully the nature of the (often non-unique) hypotheses that are being tested, before attempting interpretations of the results. Furthermore, in ANOVA designs with no missing cells, Type IV sums of squares are always equal to Type III sums of squares, so the use of Type IV sums of squares is either (potentially) inappropriate, or unnecessary, depending on the presence of missing cells in the design.
Type V Sums of Squares. Type V sums of squares were developed as an alternative to Type IV sums of squares for testing hypotheses in ANOVA designs with missing cells. Also, this approach is widely used in industrial experimentation, to analyze fractional factorial designs; these types of designs are discussed in detail in the 2**(k-p) Fractional Factorial Designs section of the Experimental Design topic. In effect, for effects for which tests are performed, all population marginal means (least squares means) are estimable.
Type V sums of squares involve a combination of the methods employed in computing Type I and Type III sums of squares. Specifically, whether or not an effect is eligible to be dropped from the model is determined using Type I procedures, and then hypotheses are tested for effects not dropped from the model using Type III procedures. Type V sums of squares can be illustrated by using a simple example. Suppose that the effects considered are A, B, and A by B, in that order, and that A and B are both categorical predictors with, say, 3 and 2 levels, respectively. The intercept is first entered into the model. Then A is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for A in X’X, given the intercept). If A’s degrees of freedom are less than 2 (i.e., its number of levels minus 1), it is eligible to be dropped. Then B is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for B in X’X, given the intercept and A). If B’s degrees of freedom are less than 1 (i.e., its number of levels minus 1), it is eligible to be dropped. Finally, A by B is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for A by B in X’X, given the intercept, A, and B). If the degrees of freedom for A by B are less than 2 (i.e., the product of the degrees of freedom for its factors if there were no missing cells), it is eligible to be dropped. Type III sums of squares are then computed for the effects that were not found to be eligible to be dropped, using the reduced model in which any eligible effects are dropped. Tests of significance, however, use the error term for the whole model prior to dropping any eligible effects.
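The eligibility step of this procedure can be sketched in Python by tracking how much the rank of the design matrix grows as each effect is entered. The helper names, the overparameterized 0/1 coding, and the 3 x 2 layout with one empty cell are hypothetical choices for illustration; this is not STATISTICA’s implementation.

```python
import numpy as np

def indicators(codes):
    """One 0/1 column per observed level (overparameterized coding)."""
    levels = np.unique(codes)
    return (codes[:, None] == levels[None, :]).astype(float)

def eligible_to_drop(effects, nominal_df):
    """Enter effects in order; flag those with fewer df than nominal."""
    n = next(iter(effects.values())).shape[0]
    design = np.ones((n, 1))                  # intercept is entered first
    rank = 1
    flags = {}
    for name, cols in effects.items():
        design = np.column_stack([design, cols])
        new_rank = np.linalg.matrix_rank(design)
        # Non-redundant columns added by this effect, given earlier effects.
        flags[name] = bool((new_rank - rank) < nominal_df[name])
        rank = new_rank
    return flags

# A has 3 levels, B has 2 levels; the cell (A=2, B=1) has no observations.
a = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
b = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 0])
effects = {"A": indicators(a), "B": indicators(b), "A*B": indicators(a * 2 + b)}
nominal_df = {"A": 2, "B": 1, "A*B": 2}

flags = eligible_to_drop(effects, nominal_df)
print(flags)
```

With this layout only the interaction is flagged, because the single missing cell leaves it one degree of freedom instead of two.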
Note that Type V sums of squares involve determining a reduced model for which all effects remaining in the model have at least as many degrees of freedom as they would have if there were no missing cells. This is equivalent to finding a subdesign with no missing cells such that the Type III sums of squares for all effects in the subdesign reflect differences in least squares means.
Appropriate caution should be exercised when using Type V sums of squares. Dropping an effect from a model is the same as assuming that the effect is unrelated to the outcome (see, e.g., Hocking, 1996). The reasonableness of the assumption does not necessarily ensure its validity, so when possible the relationships of dropped effects to the outcome should be inspected. It is also important to note that Type V sums of squares are not invariant to the order in which eligibility for dropping effects from the model is evaluated. Different orders of effects could produce different reduced models.
In spite of these limitations, Type V sums of squares for the reduced model have all the same properties of Type III sums of squares for ANOVA designs with no missing cells. Even in designs with many missing cells (such as fractional factorial designs, in which many high-order interaction effects are assumed to be zero), Type V sums of squares provide tests of meaningful hypotheses, and sometimes hypotheses that cannot be tested using any other method.
Type VI (Effective Hypothesis) Sums of Squares. Type I through Type V sums of squares can all be viewed as providing tests of hypotheses that subsets of partial regression coefficients (controlling for or orthogonal to appropriate additional effects) are zero. Effective hypothesis tests (developed by Hocking, 1996) are based on the philosophy that the only unambiguous estimate of an effect is the proportion of variability on the outcome that is uniquely attributable to the effect. The overparameterized coding of effects for categorical predictor variables generally cannot be used to provide such unique estimates for lower-order effects. Effective hypothesis tests, which we propose to call Type VI sums of squares, use the sigma-restricted coding of effects for categorical predictor variables to provide unique effect estimates even for lower-order effects.
The method for computing Type VI sums of squares is straightforward. The sigma-restricted coding of effects is used, and for each effect, its Type VI sums of squares is the difference of the model sums of squares for all other effects from the whole model sums of squares. As such, the Type VI sums of squares provide an unambiguous estimate of the variability of predicted values for the outcome uniquely attributable to each effect.
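A minimal Python sketch of this computation for an unbalanced 2 x 2 design with no missing cells follows. The data values and the helper name `model_ss` are hypothetical; the +1/-1 columns are the sigma-restricted codes for two-level factors.

```python
import numpy as np

def model_ss(design, y):
    """Predicted sum of squares about the mean for a design with intercept."""
    fitted = design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.sum((fitted - y.mean()) ** 2))

# Sigma-restricted (+1 / -1) codes; cell sizes are 3, 2, 2, and 4.
a = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1], dtype=float)
b = np.array([1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1], dtype=float)
y = np.array([4, 5, 6, 7, 9, 3, 5, 8, 9, 10, 11], dtype=float)

columns = {"A": a, "B": b, "A*B": a * b}
ones = np.ones(len(y))
whole = model_ss(np.column_stack([ones, *columns.values()]), y)

type6 = {}
for name in columns:
    # Model SS for all other effects, subtracted from the whole model SS.
    others = [col for key, col in columns.items() if key != name]
    type6[name] = whole - model_ss(np.column_stack([ones, *others]), y)
print(whole, type6)
```

Because every cell is observed here, each value is also the sums of squares for the corresponding contrast among the unweighted cell means.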
In ANOVA designs with missing cells, Type VI sums of squares for effects can have fewer degrees of freedom than they would have if there were no missing cells, and for some missing cell designs, can even have zero degrees of freedom. The philosophy of Type VI sums of squares is to test as much as possible of the original hypothesis given the observed cells. If the pattern of missing cells is such that no part of the original hypothesis can be tested, so be it. The inability to test hypotheses is simply the price we pay for having no observations at some combinations of the levels of the categorical predictor variables. The philosophy is that it is better to admit that a hypothesis cannot be tested than it is to test a distorted hypothesis that may not meaningfully reflect the original hypothesis.
Type VI sums of squares cannot generally be used to test hypotheses for nested ANOVA designs, separate slope designs, or mixed-model designs, because the sigma-restricted coding of effects for categorical predictor variables is overly restrictive in such designs. This limitation, however, does not diminish the fact that Type VI sums of squares can b
Error Terms for Tests
Lack-of-Fit Tests using Pure Error. Whole model tests and tests based on the 6 types of sums of squares use the mean square residual as the error term for tests of significance. For certain types of designs, however, the residual sum of squares can be further partitioned into meaningful parts which are relevant for testing hypotheses. One such type of design is a simple regression design in which there are subsets of cases all having the same values on the predictor variable. For example, performance on a task could be measured for subjects who work on the task under several different room temperature conditions. The test of significance for the Temperature effect in the linear regression of Performance on Temperature would not necessarily provide complete information on how Temperature relates to Performance; the regression coefficient for Temperature only reflects its linear effect on the outcome.
One way to glean additional information from this type of design is to partition the residual sums of squares into lack-of-fit and pure error components. In the example just described, this would involve determining the difference between the sum of squares that cannot be predicted by Temperature levels, given the linear effect of Temperature (residual sums of squares) and the pure error; this difference would be the sums of squares associated with the lack-of-fit (in this example, of the linear model). The test of lack-of-fit, using the mean square pure error as the error term, would indicate whether non-linear effects of Temperature are needed to adequately model Temperature’s influence on the outcome. Further, the linear effect could be tested using the pure error term, thus providing a more sensitive test of the linear effect independent of any possible nonlinear effect.
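The partition into lack-of-fit and pure error can be sketched in Python for the Temperature example. The temperature levels and performance scores below are made-up numbers chosen to show a curved relationship.

```python
import numpy as np

# Hypothetical data: three subjects at each of four room temperatures.
temperature = np.array([60, 60, 60, 70, 70, 70, 80, 80, 80, 90, 90, 90], dtype=float)
performance = np.array([10, 11, 12, 15, 16, 14, 16, 15, 17, 12, 11, 13], dtype=float)
n = len(performance)

# Residual SS from the linear regression of Performance on Temperature.
X = np.column_stack([np.ones(n), temperature])
b = np.linalg.lstsq(X, performance, rcond=None)[0]
residual_ss = float(np.sum((performance - X @ b) ** 2))

# Pure error SS: variation of replicates around their own group mean.
levels = np.unique(temperature)
pure_error_ss = sum(
    float(np.sum((performance[temperature == t]
                  - performance[temperature == t].mean()) ** 2))
    for t in levels
)

lack_of_fit_ss = residual_ss - pure_error_ss
df_pure = n - len(levels)          # cases minus distinct predictor values
df_lof = len(levels) - 2           # distinct values minus linear-model parameters
f_lack_of_fit = (lack_of_fit_ss / df_lof) / (pure_error_ss / df_pure)
print(residual_ss, pure_error_ss, lack_of_fit_ss, f_lack_of_fit)
```

A large lack-of-fit F relative to the pure error mean square would suggest that the straight line misses systematic (here, curved) structure in the group means.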
Designs with Zero Degrees of Freedom for Error. When the model degrees of freedom equal the number of cases or subjects, the residual sums of squares will have zero degrees of freedom and preclude the use of standard hypothesis tests. This sometimes occurs for overfitted designs (designs with many predictors, or designs with categorical predictors having many levels). However, in some designed experiments, such as experiments using split-plot designs or highly fractionalized factorial designs as commonly used in industrial experimentation, it is no accident that the residual sum of squares has zero degrees of freedom. In such experiments, mean squares for certain effects are planned to be used as error terms for testing other effects, and the experiment is designed with this in mind. It is entirely appropriate to use alternatives to the mean square residual as error terms for testing hypotheses in such designs.
Tests in Mixed Model Designs. Designs which contain random effects for one or more categorical predictor variables are called mixed-model designs. These types of designs, and the analysis of those designs, are also described in detail in the Variance Components and Mixed Model ANOVA/ANCOVA topic. Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. The solution for the normal equations in mixed-model designs is identical to the solution for fixed-effect designs (i.e., designs which do not contain random effects). Mixed-model designs differ from fixed-effect designs only in the way in which effects are tested for significance. In fixed-effect designs, between effects are always tested using the mean square residual as the error term. In mixed-model designs, between effects are tested using relevant error terms based on the covariation of sources of variation in the design. Also, only the overparameterized model is used to code effects for categorical predictors in mixed-models, because the sigma-restricted model is overly restrictive.
The covariation of sources of variation in the design is estimated by the elements of a matrix called the Expected Mean Squares (EMS) matrix. This non-square matrix contains elements for the covariation of each combination of pairs of sources of variation and for each source of variation with Error. Specifically, each element is the mean square for one effect (indicated by the column) that is expected to be accounted for by another effect (indicated by the row), given the observed covariation in their levels. Note that expected mean squares can be computed using any type of sums of squares from Type I through Type V. Once the EMS matrix is computed, it is used to solve for the linear combinations of sources of random variation that are appropriate to use as error terms for testing the significance of the respective effects. This is done using Satterthwaite’s method of denominator synthesis (Satterthwaite, 1946). Detailed discussions of methods for testing effects in mixed-models, and related methods for estimating variance components for random effects, can be found in the Variance Components and Mixed Model ANOVA/ANCOVA topic.
Testing Specific Hypotheses
Whole model tests and tests based on sums of squares attributable to specific effects illustrate two general types of hypotheses that can be tested using the general linear model. Still, there may be other types of hypotheses the researcher wishes to test that do not fall into either of these categories. For example, hypotheses about subsets of effects may be of interest, or hypotheses involving comparisons of specific levels of categorical predictor variables may be of interest.
Estimability of Hypotheses. Before considering tests of specific hypotheses of this sort, it is important to address the issue of estimability. A test of a specific hypothesis using the general linear model must be framed in terms of the regression coefficients for the solution of the normal equations. If the X’X matrix is less than full rank, the regression coefficients depend on the particular g2 inverse used for solving the normal equations, and the regression coefficients will not be unique. When the regression coefficients are not unique, linear functions (f) of the regression coefficients having the form
f = Lb
where L is a vector of coefficients, will also in general not be unique. However, Lb for an L which satisfies
L = L(X’X)^{–}X’X
is invariant for all possible g2 inverses, and is therefore called an estimable function.
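The estimability condition can be checked numerically. The Python sketch below uses the Moore-Penrose inverse from NumPy as one particular g2 inverse; the function name `is_estimable` and the small overparameterized one-way design are hypothetical.

```python
import numpy as np

def is_estimable(L, X, tol=1e-8):
    """Check whether L = L (X'X)^- X'X holds for the design matrix X."""
    XtX = X.T @ X
    g_inv = np.linalg.pinv(XtX)     # Moore-Penrose inverse, one valid g2 inverse
    return bool(np.allclose(L @ g_inv @ XtX, L, atol=tol))

# Overparameterized one-way design: intercept plus one indicator per group.
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
])

difference = np.array([0.0, 1.0, -1.0])    # group 1 effect minus group 2 effect
single = np.array([0.0, 1.0, 0.0])         # group 1 effect on its own

print(is_estimable(difference, X))
print(is_estimable(single, X))
```

The difference between the two group effects is a combination of rows of X and passes the check, whereas a single group effect in this coding is not and fails it.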
The theory of estimability of linear functions is an advanced topic in the theory of algebraic invariants (Searle, 1987, provides a comprehensive introduction), but its implications are clear enough. One instance of non-estimability of a hypothesis has been encountered in tests of the effective hypothesis which have zero degrees of freedom. On the other hand, Type III sums of squares for categorical predictor variable effects in ANOVA designs with no missing cells (and the least squares means in such designs) provide an example of estimable functions which do not depend on the model parameterization (i.e., the particular g2 inverse used to solve the normal equations). The general implication of the theory of estimability of linear functions is that hypotheses which cannot be expressed as linear combinations of the rows of X (i.e., the combinations of observed levels of the categorical predictor variables) are not estimable, and therefore cannot be tested. Stated another way, we simply cannot test specific hypotheses that are not represented in the data. The notion of estimability is valuable because the test for estimability makes explicit which specific hypotheses can be tested and which cannot.
Linear Combinations of Effects. In multiple regression designs, it is common for hypotheses of interest to involve subsets of effects. In mixture designs, for example, we might be interested in simultaneously testing whether the main effect and any of the two-way interactions involving a particular predictor variable are non-zero. It is also common in multiple regression designs for hypotheses of interest to involve comparisons of slopes. For example, we might be interested in whether the regression coefficients for two predictor variables differ. In both factorial regression and factorial ANOVA designs with many factors, it is often of interest whether sets of effects, say, all three-way and higher-order interactions, are nonzero. Tests of these types of specific hypotheses involve (1) constructing one or more Ls reflecting the hypothesis, (2) testing the estimability of the hypothesis by determining whether
L = L(X’X)^{–}X’X
and if so, using (3)
(Lb)’(L(X’X)^{–}L’)^{-1}(Lb)
to estimate the sums of squares accounted for by the hypothesis. Finally, (4) the hypothesis is tested for significance using the usual mean square residual as the error term. To illustrate this 4-step procedure, suppose that a test of the difference in the regression slopes is desired for the (intercept plus) 2 predictor variables in a first-order multiple regression design. The coefficients for L would be
L = [0 1 -1]
(note that the first coefficient 0 excludes the intercept from the comparison) for which Lb is estimable if the 2 predictor variables are not redundant with each other. The hypothesis sums of squares reflect the difference in the partial regression coefficients for the 2 predictor variables, which is tested for significance using the mean square residual as the error term.
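The 4-step procedure for this slope comparison can be sketched numerically as follows. The data are simulated, the variable names are hypothetical, and the pseudoinverse stands in for the generalized inverse; this is an illustration under those assumptions rather than the routine STATISTICA itself runs.

```python
import numpy as np

# Simulated data for a first-order design with an intercept and 2 predictors.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Step 1: construct L for "slope of x1 equals slope of x2".
L = np.array([[0.0, 1.0, -1.0]])

# Solve the normal equations with a generalized inverse of X'X.
G = np.linalg.pinv(X.T @ X)
b = G @ X.T @ y

# Step 2: estimability check, L = L (X'X)^- X'X.
estimable = bool(np.allclose(L @ G @ X.T @ X, L))

# Step 3: hypothesis sums of squares, (Lb)' (L (X'X)^- L')^{-1} (Lb).
Lb = L @ b
ss_hyp = float(Lb.T @ np.linalg.inv(L @ G @ L.T) @ Lb)

# Step 4: F ratio against the mean square residual.
resid = y - X @ b
df_resid = n - np.linalg.matrix_rank(X)
ms_resid = float(resid @ resid) / df_resid
df_hyp = np.linalg.matrix_rank(L)
F = (ss_hyp / df_hyp) / ms_resid

print("estimable:", estimable)
print("SS hypothesis:", ss_hyp)
print("F with", df_hyp, "and", df_resid, "df:", F)
```

The hypothesis sums of squares computed this way equal the increase in residual sums of squares when the model is refit with the two slopes constrained to be equal.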
Planned Comparisons of Least Squares Means. Usually, experimental hypotheses are stated in terms that are more specific than simply main effects or interactions. We may have the specific hypothesis that a particular textbook will improve math skills in males, but not in females, while another book would be about equally effective for both genders, but less effective overall for males. Now generally, we are predicting an interaction here: the effectiveness of the book is modified (qualified) by the student’s gender. However, we have a particular prediction concerning the nature of the interaction: we expect a significant difference between genders for one book, but not the other. This type of specific prediction is usually tested by testing planned comparisons of least squares means (estimates of the population marginal means), or as it is sometimes called, contrast analysis.
Briefly, contrast analysis allows us to test the statistical significance of predicted specific differences in particular parts of our complex design. The 4-step procedure for testing specific hypotheses is used to specify and test specific predictions. Contrast analysis is a major and indispensable component of the analysis of many complex experimental designs.
To learn more about the logic and interpretation of contrast analysis refer to the ANOVA/MANOVA topic Overview section.
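To make the textbook example concrete, here is a small sketch of two planned comparisons in a balanced 2 x 2 design, where the least squares means coincide with the cell means. The scores are invented purely for illustration, and contrast_test is a hypothetical helper. For a cell-means model the general expression (Lb)’(L(X’X)^{–}L’)^{-1}(Lb) reduces to the squared contrast estimate divided by the sum of squared weights over cell sizes.

```python
from statistics import mean

# Invented math scores for each book x gender cell (illustrative values only).
cells = {
    ("book1", "male"):   [78, 82, 85, 80],
    ("book1", "female"): [70, 72, 69, 73],
    ("book2", "male"):   [74, 76, 73, 77],
    ("book2", "female"): [75, 74, 77, 74],
}


def contrast_test(cells, weights):
    """Return (estimate, contrast SS, F) for a contrast on cell means."""
    estimate = sum(w * mean(cells[k]) for k, w in weights.items())
    scale = sum(w ** 2 / len(cells[k]) for k, w in weights.items())
    ss_contrast = estimate ** 2 / scale
    # Pooled within-cell variation serves as the error term.
    ss_resid = sum((y - mean(ys)) ** 2 for ys in cells.values() for y in ys)
    df_resid = sum(len(ys) - 1 for ys in cells.values())
    f_stat = ss_contrast / (ss_resid / df_resid)
    return estimate, ss_contrast, f_stat


# Predicted: a gender difference for book 1, but not for book 2.
book1_gap = contrast_test(cells, {("book1", "male"): 1, ("book1", "female"): -1})
book2_gap = contrast_test(cells, {("book2", "male"): 1, ("book2", "female"): -1})

print("book 1 male minus female (estimate, SS, F):", book1_gap)
print("book 2 male minus female (estimate, SS, F):", book2_gap)
```

Each F ratio has 1 numerator degree of freedom and is referred to the residual degrees of freedom of the full design.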
Post-Hoc Comparisons. Sometimes we find effects in an experiment that were not expected. Even though in most cases a creative experimenter will be able to explain almost any pattern of means, it would not be appropriate to analyze and evaluate that pattern as if we had predicted it all along. The problem here is one of capitalizing on chance when performing multiple tests post-hoc, that is, without a priori hypotheses. To illustrate this point, let’s consider the following “experiment.” Imagine we were to write down a number between 1 and 10 on 100 pieces of paper. We then put all of those pieces into a hat and draw 20 samples (of pieces of paper) of 5 observations each, and compute the means (from the numbers written on the pieces of paper) for each group. How likely do you think it is that we will find two sample means that are significantly different from each other? It is very likely! Selecting the extreme means obtained from 20 samples is very different from taking only 2 samples from the hat in the first place, which is what the test via the contrast analysis implies. Without going into further detail, there are several so-called post-hoc tests that are explicitly based on the first scenario (taking the extremes from 20 samples), that is, they are based on the assumption that we have chosen for our comparison the most extreme (different) means out of k total means in the design. Those tests apply “corrections” that are designed to offset the advantage of post-hoc selection of the most extreme comparisons. Whenever we find unexpected results in an experiment, we should use those post-hoc procedures to test their statistical significance.
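The hat experiment is easy to simulate. The sketch below assumes an ordinary pooled-variance t test at the 5% level (critical value 2.306 for 8 degrees of freedom) and a default of 2000 repetitions; both choices, and the function names, are ours. It compares how often the two most extreme of the 20 sample means differ "significantly" with how often two samples chosen in advance do.

```python
import random
from statistics import variance

T_CRIT = 2.306  # two-sided 5% critical value of t with 8 degrees of freedom


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal group sizes assumed)."""
    pooled_var = (variance(a) + variance(b)) / 2
    diff = sum(a) / len(a) - sum(b) / len(b)
    if pooled_var == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / (pooled_var * (1 / len(a) + 1 / len(b))) ** 0.5


def simulate(trials=2000, seed=1):
    """Return rejection rates for (most extreme pair, pair chosen in advance)."""
    rng = random.Random(seed)
    hits_extreme = 0
    hits_planned = 0
    for _ in range(trials):
        # 100 pieces of paper, each with a number between 1 and 10.
        hat = [rng.randint(1, 10) for _ in range(100)]
        rng.shuffle(hat)
        # 20 samples of 5 observations each.
        groups = [hat[i:i + 5] for i in range(0, 100, 5)]
        lowest = min(groups, key=sum)
        highest = max(groups, key=sum)
        hits_extreme += abs(pooled_t(highest, lowest)) > T_CRIT
        hits_planned += abs(pooled_t(groups[0], groups[1])) > T_CRIT
    return hits_extreme / trials, hits_planned / trials


rate_extreme, rate_planned = simulate()
print("most extreme pair flagged:", rate_extreme)
print("pre-chosen pair flagged:  ", rate_planned)
```

The pre-chosen pair is flagged at roughly the nominal rate, whereas the pair selected after looking at all 20 means is flagged far more often, which is the inflation that post-hoc procedures are designed to correct.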
Testing Hypotheses for Repeated Measures and Dependent Variables
In the discussion of different hypotheses that can be tested using the general linear model, the tests have been described as tests for “the dependent variable” or “the outcome.” This has been done solely to simplify the discussion. When there are multiple dependent variables reflecting the levels of repeated measure factors, the general linear model performs tests using orthonormalized M-transformations of the dependent variables. When there are multiple dependent variables but no repeated measure factors, the general linear model performs tests using the hypothesis sums of squares and cross-products for the multiple dependent variables, which are tested against the residual sums of squares and cross-products for the multiple dependent variables. Thus, the same hypothesis testing procedures which apply to univariate designs with a single dependent variable also apply to repeated measure and multivariate designs.
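As a rough illustration of what an orthonormalized M-transformation does, the sketch below applies orthonormal polynomial contrasts to a 3-level repeated measures factor. The scores are invented, and polynomial contrasts are only one possible choice of M; the text does not say which transformation STATISTICA constructs internally. The point shown is that the sums of squares for the repeated measures effect can be recovered from the means of the transformed variables.

```python
import math

# Orthonormal polynomial contrasts for 3 levels (columns: linear, quadratic).
# An assumed choice of M; any orthonormal basis of the contrast space would do.
M = [
    [-1 / math.sqrt(2),  1 / math.sqrt(6)],
    [0.0,               -2 / math.sqrt(6)],
    [1 / math.sqrt(2),   1 / math.sqrt(6)],
]

# Invented scores: rows are subjects, columns are the 3 repeated measurements.
Y = [
    [5.0, 7.0, 9.0],
    [4.0, 6.0, 7.0],
    [6.0, 7.0, 10.0],
    [5.0, 8.0, 9.0],
]


def transform(Y, M):
    """Return the transformed dependent variables Y M."""
    return [
        [sum(row[k] * M[k][j] for k in range(len(M))) for j in range(len(M[0]))]
        for row in Y
    ]


T = transform(Y, M)
n = len(Y)

# Effect sums of squares from the transformed variables.
t_means = [sum(row[j] for row in T) / n for j in range(len(M[0]))]
ss_from_contrasts = n * sum(m ** 2 for m in t_means)

# The same quantity computed directly from the level means.
level_means = [sum(row[k] for row in Y) / n for k in range(3)]
grand_mean = sum(level_means) / 3
ss_direct = n * sum((m - grand_mean) ** 2 for m in level_means)

print("SS from transformed variables:", ss_from_contrasts)
print("SS from level means:          ", ss_direct)
```

Because the columns of M are orthonormal and each sums to zero, the two computations agree, so tests on the transformed variables address the same hypothesis about differences among the repeated measures levels.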