# Blog Archives

## Completing the value chain: data, insight, action

#### Thomas Hill, Ph.D. Dell Contributor at Tech Page One

The value of effective predictive/prescriptive analytics is easily explained: The best and largest storage capabilities, fastest data access and ETL functionality, and most robust hardware infrastructure will not guarantee success in a highly competitive market place. If, however, one can predict what will happen next – how consumer sentiments will shift, which large insurance claim provides opportunities for subrogation, or how specific changes in the manufacturing process will drastically reduce warranty claims in the field – critical actions can be taken yielding competitive advantages that could pay off within weeks or even days for the entire investment required to achieve those insights.

I sometimes like to point out that I have predicted every stock market crash in the past 30 years – after they happened. Obviously, reporting on what happened to gain insight is interesting and perhaps useful, but the value of predicting outcomes and “pre-acting” rather than reacting to those outcomes can be priceless.

I cannot think of a single successful business that is not continuously working to complete the value chain from the collection of data to predictive modeling, and automating mission critical decisions through effective prescriptive decisioning systems, i.e., some (semi-) automated system by which the best pre-actions to anticipated events and outcomes become part of the routine day-to-day operations and SOPs.

There are near infinite numbers of specific examples. I have had the privilege of collaborating with some brilliant visionaries and practitioners on several books around predictive modeling, the analysis of unstructured data, and (in a forthcoming book) on the application of these technologies to optimize healthcare in various ways. These books describe the near-infinite universe of use cases and examples to illustrate what successful businesses and government agencies are doing today.

## When good projects go bad

So what are the real challenges to adopting successfully predictive and prescriptive analytics? The biggest challenge in any such project – in order to incorporate these technologies into mission critical processes – is to complete successfully every single step of the value chain, from data collection, to data storage, data preparation, predictive modeling, validated analytic reporting, to providing decisioning support and prescriptive tools to realize value.

There are near infinite numbers of ways by which well-intended and sometimes planned projects can drive off the rails. But in our experience, it almost always has to do with the difficulty to connect to the right data at the right time, to deliver the right results to the right stakeholder within the actionable time interval where the right decision can make a difference, or to incorporate the predictions and prescriptions into an effective automated process that implements the right decisions.

Sometimes, it is an overworked IT department dealing with outdated and inadequate hardware and storage technologies, trying to manage the “prevention of IT” given these other challenges. Sometimes there are challenges integrating diverse data sources that span structured data in relational databases on premise, information that needs to be accessed in the cloud or from internet-based services, with unstructured textual information stored in distributed file systems.

For example, many manufacturing customers of StatSoft need to integrate manufacturing data upstream with final product testing data, and then link it to unstructured warranty claim narratives that capture failures in the field stored in diverse systems. In the financial services industry, in particular the established “brick-and-mortar” players are challenged to build the right systems to capture all customer touch points and connect them with the right prediction/prescription models, to deliver superior services when they are most needed.

So in short, the data may be there, the technologies to do useful things with those data exist (and are comprehensively available in StatSoft’s products), but the two cannot readily be connected. It is generally acknowledged that data preparation consumes about 90% or more of the effort in analytic projects.

## Completing the value chain

That is why we are excited at StatSoft to be part of Dell, and why our customers almost immediately “get it”: Dell hardware, combined with the cutting edge tools and technologies in Dell’s software stack, combined with Dell’s thought leaders and effective services across different domains, and now combined with StatSoft’s tools and solutions for predictive and prescriptive analytics deliver the only ecosystem of its kind that can integrate very heterogeneous data sources, and connect them to effective predictive and prescriptive analytics. It does not matter if, as is the case in the real world, these data sources are structured or unstructured, involve multiple data storage technologies and vendors, are implemented on-prem or cloud based. We can deliver solutions based on robust hardware with cutting-edge software and effective and efficient services, combined with the right analytics capabilities to drive effective action.

So pausing for a moment to reflect on this, I cannot really think of any other provider of these capabilities that can complete the data-to-insight-and-action value chain for driving competitive advantages to all businesses small or large. StatSoft’s motto was “Making the World more Productive” which naturally goes with Dell and the Power to do more.

This will be an exciting time going forward for StatSoft and Dell, and our customers.

## Success Story – Nelson Mandela Metropolitan University

Ms Jennifer Bowler Lecturer in Industrial Psychology and Human Resources at Nelson Mandela Metropolitan University, testifies on her Statistica programme provided By Statsoft Southern Africa’s trainer Merle Weberlof and how it has benefited her.

” Personally, the two days were personally extremely beneficial. I had expected a “how to work with Statistica” and what I got was how to understand the relationship between research design, analytical tools and then how to do that in Statistica. I felt very sad last night that I had not been exposed to someone like Merle much earlier on in my research career- it would have saved me many, many hours of confusion and frustration. I am however, pleased to say that my nervousness regarding Statistica has been put to rest and I have clarified many issues regarding analysis and design.

I know that Merle was worried that she lost people at certain times and I would not presume to comment for the others but each one of the group was at different stages of personal development as far as research is concerned plus we had different disciplines represented- my sense was that each person took away something of value and application even if they could not understand and/or utilise all that was offered.

Thanks very much for accommodating us for the two days.

If there is anything else that you want feedback on – please let me know.

I hope that you had a good trip to CT ”

Kind regards

Jennifer Bowler

## Generalized Additive Models (GAM)

The methods available in *Generalized Additive Models* are implementations of techniques developed and popularized by Hastie and Tibshirani (1990). A detailed description of these and related techniques, the algorithms used to fit these models, and discussions of recent research in this area of statistical modeling can also be found in Schimek (2000).

## Additive Models

The methods described in this section represent a generalization of multiple regression (which is a special case of general linear models). Specifically, in linear regression, a linear least-squares fit is computed for a set of predictor or X variables, to predict a dependent Y variable. The well known linear regression equation with m predictors, to predict a dependent variable Y, can be stated as:

Y = b0 + b1*X1 + … + bm*Xm

Where Y stands for the (predicted values of the) dependent variable, X1through Xm represent the m values for the predictor variables, and b0, and b1 through bm are the regression coefficients estimated by multiple regression. A generalization of the multiple regression model would be to maintain the additive nature of the model, but to replace the simple terms of the linear equation bi*Xi with fi(Xi) where fi is a non-parametric function of the predictor Xi. In other words, instead of a single coefficient for each variable (additive term) in the model, in additive models an unspecified (non-parametric) function is estimated for each predictor, to achieve the best prediction of the dependent variable values.

## Generalized Linear Models

To summarize the basic idea, the generalized linear model differs from the general linear model (of which multiple regression is a special case) in two major respects: First, the distribution of the dependent or response variable can be (explicitly) non-normal, and does not have to be continuous, e.g., it can be binomial; second, the dependent variable values are predicted from a linear combination of predictor variables, which are “connected” to the dependent variable via a link function. The general linear model for a single dependent variable can be considered a special case of the generalized linear model: In the general linear model the dependent variable values are expected to follow the normal distribution, and the link function is a simple identity function (i.e., the linear combination of values for the predictor variables is not transformed).

To illustrate, in the general linear model a response variable Y is linearly associated with values on the X variables while the relationship in the generalized linear model is assumed to be

Y = g(b0 + b1*X1 + … + bm*Xm)

where g(…) is a function. Formally, the inverse function of g(…), say gi(…), is called the link function; so that:

gi(muY) = b0 + b1*X1 + … + bm*Xm

where mu-Y stands for the expected value of Y.

## Distributions and Link Functions

*Generalized Additive Models* allows you to choose from a wide variety of distributions for the dependent variable, and link functions for the effects of the predictor variables on the dependent variable (see McCullagh and Nelder, 1989; Hastie and Tibshirani, 1990; see also *GLZ* Introductory Overview – Computational Approach for a discussion of link functions and distributions):

Normal, Gamma, and Poisson distributions:

Log link: f(z) = log(z)

Inverse link: f(z) = 1/z

Identity link: f(z) = z

Binomial distributions:

Logit link: f(z)=log(z/(1-z))

## Generalized Additive Models

We can combine the notion of additive models with generalized linear models, to derive the notion of generalized additive models, as:

gi(muY) = Si(fi(Xi))

In other words, the purpose of generalized additive models is to maximize the quality of prediction of a dependent variable Y from various distributions, by estimating unspecific (non-parametric) functions of the predictor variables which are “connected” to the dependent variable via a link function.

## Estimating the Nonparametric Function of Predictors via Scatterplot Smoothers

A unique aspect of generalized additive models are the non-parametric functions fi of the predictor variables Xi. Specifically, instead of some kind of simple or complex parametric functions, Hastie and Tibshirani (1990) discuss various general scatterplot smoothers that can be applied to the X variable values, with the target criterion to maximize the quality of prediction of the (transformed) Y variable values. One such scatterplot smoother is the cubic smoothing splines smoother, which generally produces a smooth generalization of the relationship between the two variables in the scatterplot. Computational details regarding this smoother can be found in Hastie and Tibshirani (1990; see also Schimek, 2000).

To summarize, instead of estimating single parameters (like the regression weights in multiple regression), in generalized additive models, we find a general unspecific (non-parametric) function that relates the predicted (transformed) Y values to the predictor values.

## A Specific Example: The Generalized Additive Logistic Model

Let us consider a specific example of the generalized additive models: A generalization of the logistic (logit) model for binary dependent variable values. As also described in detail in the context of Nonlinear Estimation and Generalized Linear/Nonlinear Models, the logistic regression model for binary responses can be written as follows:

y=exp(b0+b1*x1+…+bm*xm)/{1+exp(b0+b1*x1+…+bm*xm)}

Note that the distribution of the dependent variable is assumed to be binomial, i.e., the response variable can only assume the values 0 or 1 (e.g., in a market research study, the purchasing decision would be binomial: The customer either did or did not make a particular purchase). We can apply the logistic link function to the probability p (ranging between 0 and 1) so that:

p’ = log {p/(1-p)}

By applying the logistic link function, we can now rewrite the model as:

p’ = b0 + b1*X1 + … + bm*Xm

Finally, we substitute the simple single-parameter additive terms to derive the generalized additive logistic model:

p’ = b0 + f1(X1) + … + fm(Xm)

An example application of the this model can be found in Hastie and Tibshirani (1990).

## Fitting Generalized Additive Models

Detailed descriptions of how generalized additive models are fit to data can be found in Hastie and Tibshirani (1990), as well as Schimek (2000, p. 300). In general there are two separate iterative operations involved in the algorithm, which are usually labeled the outer and inner loop. The purpose of the outer loop is to maximize the overall fit of the model, by minimizing the overall likelihood of the data given the model (similar to the maximum likelihood estimation procedures as described in, for example, the context of Nonlinear Estimation). The purpose of the inner loop is to refine the scatterplot smoother, which is the cubic splines smoother. The smoothing is performed with respect to the partial residuals; i.e., for every predictor k, the weighted cubic spline fit is found that best represents the relationship between variable k and the (partial) residuals computed by removing the effect of all other j predictors (j ¹ k). The iterative estimation procedure will terminate, when the likelihood of the data given the model can not be improved.

## Interpreting the Results

Many of the standard results statistics computed by *Generalized Additive Models* are similar to those customarily reported by linear or nonlinear model fitting procedures. For example, predicted and residual values for the final model can be computed, and various graphs of the residuals can be displayed to help the user identify possible outliers, etc. Refer also to the description of the residual statistics computed by Generalized Linear/Nonlinear Models for details.

The main result of interest, of course, is how the predictors are related to the dependent variable. Scatterplots can be computed showing the smoothed predictor variable values plotted against the partial residuals, i.e., the residuals after removing the effect of all other predictor variables.

This plot allows you to evaluate the nature of the relationship between the predictor with the residualized (adjusted) dependent variable values (see Hastie & Tibshirani, 1990; in particular formula 6.3), and hence the nature of the influence of the respective predictor in the overall model.

## Degrees of Freedom

To reiterate, the generalized additive models approach replaces the simple products of (estimated) parameter values times the predictor values with a cubic spline smoother for each predictor. When estimating a single parameter value, we lose one degree of freedom, i.e., we add one degree of freedom to the overall model. It is not clear how many degrees of freedom are lost due to estimating the cubic spline smoother for each variable. Intuitively, a smoother can either be very smooth, not following the pattern of data in the scatterplot very closely, or it can be less smooth, following the pattern of the data more closely. In the most extreme case, a simple line would be very smooth, and require us to estimate a single slope parameter, i.e., we would use one degree of freedom to fit the smoother (simple straight line); on the other hand, we could force a very “non-smooth” line to connect each actual data point, in which case we could “use-up” approximately as many degrees of freedom as there are points in the plot. Generalized Additive Models allows you to specify the degrees of freedom for the cubic spline smoother; the fewer degrees of freedom you specify, the smoother is the cubic spline fit to the partial residuals, and typically, the worse is the overall fit of the model. The issue of degrees of freedom for smoothers is discussed in detail in Hastie and Tibshirani (1990).

## A word of Caution

Generalized additive models are very flexible, and can provide an excellent fit in the presence of nonlinear relationships and significant noise in the predictor variables. However, note that because of this flexibility, you must be extra cautious not to over-fit the data, i.e., apply an overly complex model (with many degrees of freedom) to data so as to produce a good fit that likely will not replicate in subsequent validation studies. Also, compare the quality of the fit obtained from *Generalized Additive Models* to the fit obtained via *Generalized Linear/Nonlinear Models*. In other words, evaluate whether the added complexity (generality) of generalized additive models (regression smoothers) is necessary in order to obtain a satisfactory fit to the data. Often, this is not the case, and given a comparable fit of the models, the simpler generalized linear model is preferable to the more complex generalized additive model. These issues are discussed in greater detail in Hastie and Tibshirani (1990).

Another issue to keep in mind pertains to the interpretability of results obtained from (generalized) linear models vs. generalized additive models. Linear models are easily understood, summarized, and communicated to others (e.g., in technical reports). Moreover, parameter estimates can be used to predict or classify new cases in a simple and straightforward manner. Generalized additive models are not easily interpreted, in particular when they involve complex nonlinear effects of some or all of the predictor variables (and, of course, it is in those instances where generalized additive models may yield a better fit than generalized linear models). To reiterate, it is usually preferable to rely on a simple well understood model for predicting future cases, than on a complex model that is difficult to interpret and summarize.

## General Linear Models (GLM)

This topic describes the use of the general linear model in a wide variety of statistical analyses. If you are unfamiliar with the basic methods of ANOVA and regression in linear models, it may be useful to first review the basic information on these topics in *Elementary Concepts*. A detailed discussion of univariate and multivariate ANOVA techniques can also be found in the *ANOVA/MANOVA* topic.

## Basic Ideas: The General Linear Model

The following topics summarize the historical, mathematical, and computational foundations for the general linear model. For a basic introduction to ANOVA (MANOVA, ANCOVA) techniques, refer to *ANOVA/MANOVA*; for an introduction to multiple regression, see *Multiple Regression*; for an introduction to the design an analysis of experiments in applied (industrial) settings, see *Experimental Design*.

### Historical Background

The roots of the general linear model surely go back to the origins of mathematical thought, but it is the emergence of the theory of algebraic invariants in the 1800’s that made the general linear model, as we know it today, possible. The theory of algebraic invariants developed from the groundbreaking work of 19th century mathematicians such as Gauss, Boole, Cayley, and Sylvester. The theory seeks to identify those quantities in systems of equations which remain unchanged under linear transformations of the variables in the system. Stated more imaginatively (but in a way in which the originators of the theory would *not* consider an overstatement), the theory of algebraic invariants searches for the eternal and unchanging amongst the chaos of the transitory and the illusory. That is no small goal for any theory, mathematical or otherwise.

The wonder of it all is the theory of algebraic invariants was successful far beyond the hopes of its originators. Eigenvalues, eigenvectors, determinants, matrix decomposition methods; all derive from the theory of algebraic invariants. The contributions of the theory of algebraic invariants to the development of statistical theory and methods are numerous, but a simple example familiar to even the most casual student of statistics is illustrative. The correlation between two variables is unchanged by linear transformations of either or both variables. We probably take this property of correlation coefficients for granted, but what would data analysis be like if we did not have statistics that are invariant to the scaling of the variables involved? Some thought on this question should convince you that without the theory of algebraic invariants, the development of useful statistical techniques would be nigh impossible.

The development of the linear regression model in the late 19th century, and the development of correlational methods shortly thereafter, are clearly direct outgrowths of the theory of algebraic invariants. Regression and correlational methods, in turn, serve as the basis for the general linear model. Indeed, the general linear model can be seen as an extension of linear multiple regression for a single dependent variable. Understanding the multiple regression model is fundamental to understanding the general linear model, so we will look at the purpose of multiple regression, the computational algorithms used to solve regression problems, and how the regression model is extended in the case of the general linear model. A basic introduction to multiple regression methods and the analytic problems to which they are applied is provided in the *Multiple Regression*.

### The Purpose of Multiple Regression

The general linear model can be seen as an extension of linear multiple regression for a single dependent variable, and understanding the multiple regression model is fundamental to understanding the general linear model. The general purpose of multiple regression (the term was first used by Pearson, 1908) is to quantify the relationship between several independent or predictor variables and a dependent or criterion variable. For a detailed introduction to multiple regression, also refer to the *Multiple Regression* section. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, we might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how “pretty” the house is (subjective rating). We may also detect “outliers,” for example, houses that should really sell for more, given their location and characteristics.

Personnel professionals customarily use multiple regression procedures to determine equitable compensation. We can determine a number of factors or dimensions such as “amount of responsibility” (*Resp*) or “number of people to supervise” (*No_Super*) that we believe to contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:

Salary = .5*Resp + .8*No_Super

Once this so-called regression equation has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably.

In the social and natural sciences multiple regression procedures are very widely used in research. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question “what is the best predictor of …”. For example, educational researchers might want to learn what are the best predictors of success in high-school. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society.

### Computations for Solving the Multiple Regression Equation

A one-dimensional surface in a two-dimensional or two-variable space is a line defined by the equation *Y = b _{0} + b_{1}X*. According to this equation, the

*Y*variable can be expressed in terms of or as a function of a constant (

*b*

_{0}) and a slope (

*b*

_{1}) times the

*X*variable. The constant is also referred to as the intercept, and the slope as the regression coefficient. For example,

*GPA*may best be predicted as

*1+.02*IQ*. Thus, knowing that a student has an

*IQ*of 130 would lead us to predict that her

*GPA*would be 3.6 (since, 1+.02*130=3.6). In the multiple regression case, when there are multiple predictor variables, the regression surface usually cannot be visualized in a two dimensional space, but the computations are a straightforward extension of the computations in the single predictor case. For example, if in addition to

*IQ*we had additional predictors of achievement (e.g.,

*Motivation*,

*Self-discipline*) we could construct a linear equation containing all those variables. In general then, multiple regression procedures will estimate a linear equation of the form:

Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + … + b_{k}X_{k}

where *k* is the number of predictors. Note that in this equation, the regression coefficients (or* b*_{1} … *b*_{k} coefficients) represent the *independent* contributions of each in dependent variable to the prediction of the dependent variable. Another way to express this fact is to say that, for example, variable *X*_{1} is correlated with the *Y* variable, after controlling for all other independent variables. This type of correlation is also referred to as a *partial correlation* (this term was first used by Yule, 1907). Perhaps the following example will clarify this issue. We would probably find a significant negative correlation between hair length and height in the population (i.e., short people have longer hair). At first this may seem odd; however, if we were to add the variable Gender into the multiple regression equation, this correlation would probably disappear. This is because women, on the average, have longer hair than men; they also are shorter on the average than men. Thus, after we remove this gender difference by entering Gender into the equation, the relationship between hair length and height disappears because hair length does *not *make any unique contribution to the prediction of height, above and beyond what it shares in the prediction with variable Gender. Put another way, after controlling for the variable Gender, the *partial correlation* between hair length and height is zero.

The regression surface (a line in simple regression, a plane or higher-dimensional surface in multiple regression) expresses the best prediction of the dependent variable (*Y*), given the independent variables (*X*‘s). However, nature is rarely (if ever) perfectly predictable, and usually there is substantial variation of the observed points from the fitted regression surface. The deviation of a particular point from the nearest corresponding point on the predicted regression surface (its predicted value) is called the *residual* value. Since the goal of linear regression procedures is to fit a surface, which is a linear function of the *X* variables, as closely as possible to the observed *Y* variable, the *residual* values for the observed points can be used to devise a criterion for the “best fit.” Specifically, in regression problems the surface is computed for which the sum of the squared deviations of the observed points from that surface are minimized. Thus, this general procedure is sometimes also referred to as *least squares* estimation. (see also the description of *weighted least squares* estimation).

The actual computations involved in solving regression problems can be expressed compactly and conveniently using matrix notation. Suppose that there are *n* observed values of *Y *and *n* associated observed values for each of *k* different *X *variables. Then *Y*_{i}, *X*_{ik}, and *e*_{i }can represent the *i*th observation of the *Y *variable, the *i*th observation of each of the *X *variables, and the *i*th unknown residual value, respectively. Collecting these terms into matrices we have

The multiple regression model in matrix notation then can be expressed as

**Y** = **Xb** + **e**

where ** b **is a column vector of 1 (for the intercept) +

*k*unknown regression coefficients. Recall that the goal of multiple regression is to minimize the sum of the squared residuals. Regression coefficients that satisfy this criterion are found by solving the set of normal equations

**X’Xb** = **X’Y**

When the *X *variables are linearly independent (i.e., they are nonredundant, yielding an ** X’X **matrix which is of full rank) there is a unique solution to the normal equations. Premultiplying both sides of the matrix formula for the normal equations by the inverse of

*X’X*gives

**(X’X)^{-1}X’Xb = (X’X)^{-1}X’Y**

or

**b = (X’X) ^{-1}X’Y**

This last result is very satisfying in view of its simplicity and its generality. With regard to its simplicity, it expresses the solution for the regression equation in terms just 2 matrices (** X **and

*Y*) and 3 basic matrix operations, (1) matrix transposition, which involves interchanging the elements in the rows and columns of a matrix, (2) matrix multiplication, which involves finding the sum of the products of the elements for each row and column combination of two conformable (i.e., multipliable) matrices, and (3) matrix inversion, which involves finding the matrix equivalent of a numeric reciprocal, that is, the matrix that satisfies

**A ^{-1}AA=A**

for a matrix *A*.

It took literally centuries for the ablest mathematicians and statisticians to find a satisfactory method for solving the linear least square regression problem. But their efforts have paid off, for it is hard to imagine a simpler solution.

With regard to the generality of the multiple regression model, its only notable limitations are that (1) it can be used to analyze only a single dependent variable, (2) it cannot provide a solution for the regression coefficients when the *X* variables are not linearly independent and the inverse of **X’X** therefore does not exist. These restrictions, however, can be overcome, and in doing so the multiple regression model is transformed into the general linear model.

### Extension of Multiple Regression to the General Linear Model

One way in which the general linear model differs from the multiple regression model is in terms of the number of dependent variables that can be analyzed. The *Y *vector of *n* observations of a single *Y *variable can be replaced by a *Y* matrix of *n* observations of *m* different *Y* variables. Similarly, the *b *vector of regression coefficients for a single *Y* variable can be replaced by a *b* matrix of regression coefficients, with one vector of *b *coefficients for each of the *m *dependent variables. These substitutions yield what is sometimes called the multivariate regression model, but it should be emphasized that the matrix formulations of the multiple and multivariate regression models are identical, except for the number of columns in the *Y* and *b *matrices. The method for solving for the *b* coefficients is also identical, that is, *m* different sets of regression coefficients are separately found for the *m* different dependent variables in the multivariate regression model.

The general linear model goes a step beyond the multivariate regression model by allowing for linear transformations or linear combinations of multiple dependent variables. This extension gives the general linear model important advantages over the multiple and the so-called multivariate regression models, both of which are inherently univariate (single dependent variable) methods. One advantage is that multivariate tests of significance can be employed when responses on multiple dependent variables are correlated. Separate univariate tests of significance for correlated dependent variables are not independent and may not be appropriate. Multivariate tests of significance of independent linear combinations of multiple dependent variables also can give insight into which dimensions of the response variables are, and are not, related to the predictor variables. Another advantage is the ability to analyze effects of repeated measure factors. Repeated measure designs, or within-subject designs, have traditionally been analyzed using ANOVA techniques. Linear combinations of responses reflecting a repeated measure effect (for example, the difference of responses on a measure under differing conditions) can be constructed and tested for significance using either the univariate or multivariate approach to analyzing repeated measures in the general linear model.

A second important way in which the general linear model differs from the multiple regression model is in its ability to provide a solution for the normal equations when the *X* variables are not linearly independent and the inverse of ** X’X** does not exist. Redundancy of the

*X*variables may be incidental (e.g., two predictor variables might happen to be perfectly correlated in a small data set), accidental (e.g., two copies of the same variable might unintentionally be used in an analysis) or designed (e.g., indicator variables with exactly opposite values might be used in the analysis, as when both

*Male*and

*Female*predictor variables are used in representing

*Gender*). Finding the regular inverse of a non-full-rank matrix is reminiscent of the problem of finding the reciprocal of 0 in ordinary arithmetic. No such inverse or reciprocal exists because division by 0 is not permitted. This problem is solved in the general linear model by using a generalized inverse of the

**matrix in solving the normal equations. A generalized inverse is any matrix that satisfies**

*X’X***AA ^{–}A = A**

for a matrix *A*. A generalized inverse is unique and is the same as the regular inverse only if the matrix *A* is full rank. A generalized inverse for a non-full-rank matrix can be computed by the simple expedient of zeroing the elements in redundant rows and columns of the matrix. Suppose that an *X’X* matrix with *r* non-redundant columns is partitioned as

where *A _{11}* is an

*r*by

*r*matrix of rank

*r*. Then the regular inverse of

*A*exists and a generalized inverse of

_{11}*X’X*is

where each ** 0 **(null) matrix is a matrix of 0’s (zeroes) and has the same dimensions as the corresponding

*A*matrix.

In practice, however, a particular generalized inverse of ** X’X **for finding a solution to the normal equations is usually computed using the sweep operator (Dempster, 1960). This generalized inverse, called a g2 inverse, has two important properties. One is that zeroing of the elements in redundant rows is unnecessary. Another is that partitioning or reordering of the columns of

*X’X*is unnecessary, so that the matrix can be inverted “in place.”

There are infinitely many generalized inverses of a non-full-rank ** X’X **matrix, and thus, infinitely many solutions to the normal equations. This can make it difficult to understand the nature of the relationships of the predictor variables to responses on the dependent variables, because the regression coefficients can change depending on the particular generalized inverse chosen for solving the normal equations. It is not cause for dismay, however, because of the invariance properties of many results obtained using the general linear model.

A simple example may be useful for illustrating one of the most important invariance properties of the use of generalized inverses in the general linear model. If both *Male* and *Female* predictor variables with exactly opposite values are used in an analysis to represent *Gender*, it is essentially arbitrary as to which predictor variable is considered to be redundant (e.g., *Male *can be considered to be redundant with *Female*, or vice versa). No matter which predictor variable is considered to be redundant, no matter which corresponding generalized inverse is used in solving the normal equations, and no matter which resulting regression equation is used for computing predicted values on the dependent variables, the predicted values and the corresponding residuals for males and females will be unchanged. In using the general linear model, we must keep in mind that finding a particular arbitrary solution to the normal equations is primarily a means to the end of accounting for responses on the dependent variables, and not necessarily an end in itself.

### Sigma-Restricted and Overparameterized Model

Unlike the multiple regression model, which is usually applied to cases where the *X *variables are continuous, the general linear model is frequently applied to analyze any ANOVA or MANOVA design with categorical predictor variables, any ANCOVA or MANCOVA design with both categorical and continuous predictor variables, as well as any multiple or multivariate regression design with continuous predictor variables. To illustrate, *Gender* is clearly a nominal level variable (anyone who attempts to rank order the sexes on any dimension does so at his or her own peril in today’s world). There are two basic methods by which *Gender* can be coded into one or more (non-offensive) predictor variables, and analyzed using the general linear model.

**Sigma-restricted model (coding of ****categorical predictors ). **Using the first method, males and females can be assigned any two arbitrary, but distinct values on a single predictor variable. The values on the resulting predictor variable will represent a quantitative contrast between males and females. Typically, the values corresponding to group membership are chosen not arbitrarily but rather to facilitate interpretation of the regression coefficient associated with the predictor variable. In one widely used strategy, cases in the two groups are assigned values of 1 and -1 on the predictor variable, so that if the regression coefficient for the variable is positive, the group coded as 1 on the predictor variable will have a higher predicted value (i.e., a higher group mean) on the dependent variable, and if the regression coefficient is negative, the group coded as -1 on the predictor variable will have a higher predicted value on the dependent variable. An additional advantage is that since each group is coded with a value one unit from zero, this helps in interpreting the magnitude of differences in predicted values between groups, because regression coefficients reflect the units of change in the dependent variable for each unit change in the predictor variable. This coding strategy is aptly called the sigma-restricted parameterization, because the values used to represent group membership (1 and -1) sum to zero.

Note that the sigma-restricted parameterization of categorical predictor variables usually leads to ** X’X **matrices which do not require a generalized inverse for solving the normal equations. Potentially redundant information, such as the characteristics of maleness and femaleness, is literally reduced to full-rank by creating quantitative contrast variables representing differences in characteristics.

**Overparameterized model (coding of ****categorical predictors ). **The second basic method for recoding categorical predictors is the indicator variable approach. In this method a separate predictor variable is coded for each group identified by a categorical predictor variable. To illustrate, females might be assigned a value of 1 and males a value of 0 on a first predictor variable identifying membership in the female

*Gender*group, and males would then be assigned a value of 1 and females a value of 0 on a second predictor variable identifying membership in the male

*Gender*group. Note that this method of recoding categorical predictor variables will almost always lead to

*X’X*matrices with redundant columns, and thus require a generalized inverse for solving the normal equations. As such, this method is often called the overparameterized model for representing categorical predictor variables, because it results in more columns in the

*X’X*than are necessary for determining the relationships of categorical predictor variables to responses on the dependent variables.

True to its description as general, the general linear model can be used to perform analyses with categorical predictor variables which are coded using either of the two basic methods that have been described.

### Summary of Computations

To conclude this discussion of the ways in which the general linear model extends and generalizes regression methods, the general linear model can be expressed as

**YM = Xb + e**

Here *Y*, *X*,* b*, and *e* are as described for the multivariate regression model and *M* is an *m* x *s *matrix of coefficients defining *s* linear transformation of the dependent variables. The normal equations are

**X’Xb = X’YM**

and a solution for the normal equations is given by

**b = (X’X)**^{–}**X’YM** Here the inverse of ** X’X **is a generalized inverse if

*X’X*contains redundant columns.

Add a provision for analyzing linear combinations of multiple dependent variables, add a method for dealing with redundant predictor variables and recoded categorical predictor variables, and the major limitations of multiple regression are overcome by the general linear model.

## Types of Analyses

A wide variety of types of designs can be analyzed using the general linear model. In fact, the flexibility of the general linear model allows it to handle so many different types of designs that it is difficult to develop simple typologies of the ways in which these designs might differ. Some general ways in which designs might differ can be suggested, but keep in mind that any particular design can be a “hybrid” in the sense that it could have combinations of features of a number of different types of designs.

In the following discussion, references will be made to the design matrix* X, *as well as sigma-restricted and overparameterized model coding. For an explanation of this terminology, refer to the section entitled Basic Ideas: The General Linear Model, or, for a brief summary, to the Summary of computations section.

A basic discussion to univariate and multivariate ANOVA techniques can also be found in the *ANOVA/MANOVA* topic; a discussion of multiple regression methods is also provided in the *Multiple Regression* topic.

### Between-Subject Designs

- Overview
- One-way ANOVA
- Main effect ANOVA
- Factorial ANOVA
- Nested designs
- Balanced ANOVA
- Simple regression
- Multiple regression
- Factorial regression
- Polynomial regression
- Response surface regression
- Mixture surface regression
- Analysis of covariance (ANCOVA)
- Separate slopes designs
- Homogeneity of slopes
- Mixed-model ANOVA and ANCOVA

**Overview.** The levels or values of the predictor variables in an analysis describe the differences between the *n* subjects or the *n* valid cases that are analyzed. Thus, when we speak of the between subject design (or simply the between design) for an analysis, we are referring to the nature, number, and arrangement of the predictor variables.

Concerning the nature or type of predictor variables, between designs which contain only categorical predictor variables can be called ANOVA (analysis of variance) designs, between designs which contain only continuous predictor variables can be called regression designs, and between designs which contain both categorical and continuous predictor variables can be called ANCOVA (analysis of covariance) designs. Further, continuous predictors are always considered to have fixed values, but the levels of categorical predictors can be considered to be fixed or to vary randomly. Designs which contain random categorical factors are called mixed-model designs (see the *Variance Components and Mixed Model ANOVA/ANCOVA* section).

Between designs may involve only a single predictor variable and therefore be described as simple (e.g., simple regression) or may employ numerous predictor variables (e.g., multiple regression).

Concerning the arrangement of predictor variables, some between designs employ only “main effect” or first-order terms for predictors, that is, the values for different predictor variables are independent and raised only to the first power. Other between designs may employ higher-order terms for predictors by raising the values for the original predictor variables to a power greater than 1 (e.g., in polynomial regression designs), or by forming products of different predictor variables (i.e., interaction terms). A common arrangement for ANOVA designs is the full-factorial design, in which every combination of levels for each of the categorical predictor variables is represented in the design. Designs with some but not all combinations of levels for each of the categorical predictor variables are aptly called fractional factorial designs. Designs with a hierarchy of combinations of levels for the different categorical predictor variables are called nested designs.

These basic distinctions about the nature, number, and arrangement of predictor variables can be used in describing a variety of different types of between designs. Some of the more common between designs can now be described.

**One-Way ANOVA**. A design with a single categorical predictor variable is called a one-way ANOVA design. For example, a study of 4 different fertilizers used on different individual plants could be analyzed via one-way ANOVA, with four levels for the factor *Fertilizer. *

In genera, consider a single categorical predictor variable *A* with 1 case in each of its 3 categories. Using the sigma-restricted coding of A into 2 quantitative contrast variables, the matrix ** X **defining the between design is

That is, cases in groups *A*_{1}, *A*_{2}, and *A*_{3} are all assigned values of 1 on *X*_{0} (the intercept), the case in group *A*_{1} is assigned a value of 1 on *X*_{1} and a value 0 on *X*_{2}, the case in group *A*_{2} is assigned a value of 0 on *X*_{1} and a value 1 on *X*_{2}, and the case in group *A*_{3} is assigned a value of -1 on *X*_{1} and a value -1 on *X*_{2}. Of course, any additional cases in any of the 3 groups would be coded similarly. If there were 1 case in group *A*_{1}, 2 cases in group *A*_{2}, and 1 case in group *A*_{3}, the ** X **matrix would be

where the first subscript for *A *gives the replicate number for the cases in each group. For brevity, replicates usually are not shown when describing ANOVA design matrices.

Note that in one-way designs with an equal number of cases in each group, sigma-restricted coding yields *X*_{1}* … X*_{k} variables all of which have means of 0.

Using the overparameterized model to represent A, the ** X **matrix defining the between design is simply

These simple examples show that the ** X **matrix actually serves two purposes. It specifies (1) the coding for the levels of the original predictor variables on the

*X*variables used in the analysis as well as (2) the nature, number, and arrangement of the X variables, that is, the between design.

**Main Effect ANOVA.** Main effect ANOVA designs contain separate one-way ANOVA designs for 2 or more categorical predictors. A good example of main effect ANOVA would be the typical analysis performed on *screening designs* as described in the context of the *Experimental Design* section.

Consider 2 categorical predictor variables *A* and *B *each with 2 categories. Using the sigma-restricted coding, the ** X **matrix defining the between design is

Note that if there are equal numbers of cases in each group, the sum of the cross-products of values for the *X*_{1} and *X*_{2} columns is 0, for example, with 1 case in each group (1*1)+(1*-1)+(-1*1)+(-1*-1)=0. Using the overparameterized model, the matrix ** X **defining the between design is

Comparing the two types of coding, it can be seen that the overparameterized coding takes almost twice as many values as the sigma-restricted coding to convey the same information.

**Factorial ANOVA.** Factorial ANOVA designs contain *X *variables representing combinations of the levels of 2 or more categorical predictors (e.g., a study of boys and girls in four age groups, resulting in a *2 (Gender) x 4 (Age Group) *design). In particular, full-factorial designs represent all possible combinations of the levels of the categorical predictors. A full-factorial design with 2 categorical predictor variables *A* and *B *each with 2 levels each would be called a 2 x 2 full-factorial design. Using the sigma-restricted coding, the ** X **matrix for this design would be

Several features of this ** X **matrix deserve comment. Note that the

*X*

_{1}and

*X*

_{2}columns represent main effect contrasts for one variable, (i.e.,

*A*and

*B*, respectively) collapsing across the levels of the other variable. The

*X*

_{3 }column instead represents a contrast between different combinations of the levels of

*A*and

*B*. Note also that the values for

*X*

_{3}are products of the corresponding values for

*X*

_{1}and

*X*

_{2}. Product variables such as

*X*

_{3 }represent the multiplicative or interaction effects of their factors, so

*X*

_{3}would be said to represent the 2-way interaction of

*A*and

*B*. The relationship of such product variables to the dependent variables indicate the interactive influences of the factors on responses above and beyond their independent (i.e., main effect) influences on responses. Thus, factorial designs provide more information about the relationships between categorical predictor variables and responses on the dependent variables than is provided by corresponding one-way or main effect designs.

When many factors are being investigated, however, full-factorial designs sometimes require more data than reasonably can be collected to represent all possible combinations of levels of the factors, and high-order interactions between many factors can become difficult to interpret. With many factors, a useful alternative to the full-factorial design is the fractional factorial design. As an example, consider a 2 x 2 x 2 fractional factorial design to degree 2 with 3 categorical predictor variables each with 2 levels. The design would include the main effects for each variable, and all 2-way interactions between the three variables, but would not include the 3-way interaction between all three variables. Using the overparameterized model, the ** X **matrix for this design is

The 2-way interactions are the highest degree effects included in the design. These types of designs are discussed in detail the *2**(k-p) Fractional Factorial Designs* section of the *Experimental Design* topic.

**Nested ANOVA Designs.** Nested designs are similar to fractional factorial designs in that all possible combinations of the levels of the categorical predictor variables are not represented in the design. In nested designs, however, the omitted effects are lower-order effects. Nested effects are effects in which the nested variables never appear as main effects. Suppose that for 2 variables *A *and *B *with 3 and 2 levels, respectively, the design includes the main effect for *A *and the effect of *B* nested within the levels of *A. *The ** X **matrix for this design using the overparameterized model is

Note that if the sigma-restricted coding were used, there would be only 2 columns in the ** X **matrix for the

*B*nested within

*A*effect instead of the 6 columns in the

**matrix for this effect when the overparameterized model coding is used (i.e., columns**

*X***through**

*X*_{4}**The sigma-restricted coding method is overly-restrictive for nested designs, so only the overparameterized model is used to represent nested designs.**

*X*_{9}).**Balanced ANOVA. **Most of the between designs discussed in this section can be analyzed much more efficiently, when they are balanced, i.e., when all cells in the ANOVA design have equal n, when there are no missing cells in the design, and, if nesting is present, when the nesting is balanced so that equal numbers of levels of the factors that are nested appear in the levels of the factor(s) that they are nested in. In that case, the * X’X* matrix (where

*stands for the design matrix) is a diagonal matrix, and many of the computations necessary to compute the ANOVA results (such as matrix inversion) are greatly simplified.*

**X****Simple Regression. **Simple regression designs involve a single continuous predictor variable. If there were 3 cases with values on a predictor variable *P* of, say, 7, 4, and 9, and the design is for the first-order effect of *P*, the *X *matrix would be

and using *P *for *X _{1}* the regression equation would be

Y = b_{0} + b_{1}P

If the simple regression design is for a higher-order effect of *P, *say the quadratic effect, the values in the *X _{1} *column of the design matrix would be raised to the 2nd power, that is, squared

and using *P ^{2} *for

*X*the regression equation would be

_{1}Y = b_{0} + b_{1}P^{2}

The sigma-restricted and overparameterized coding methods do not apply to simple regression designs and any other design containing only continuous predictors (since there are no categorical predictors to code). Regardless of which coding method is chosen, values on the continuous predictor variables are raised to the desired power and used as the values for the *X* variables. No recoding is performed. It is therefore sufficient, in describing regression designs, to simply describe the regression equation without explicitly describing the design matrix *X*.

**Multiple Regression.** Multiple regression designs are to continuous predictor variables as main effect ANOVA designs are to categorical predictor variables, that is, multiple regression designs contain the separate simple regression designs for 2 or more continuous predictor variables. The regression equation for a multiple regression design for the first-order effects of 3 continuous predictor variables *P*, *Q*, and *R* would be

Y = b_{0} + b_{1}P + b_{2}Q + b_{3}R

**Factorial Regression. **Factorial regression designs are similar to factorial ANOVA designs, in which combinations of the levels of the factors are represented in the design. In factorial regression designs, however, there may be many more such possible combinations of distinct levels for the continuous predictor variables than there are cases in the data set. To simplify matters, full-factorial regression designs are defined as designs in which all possible products of the continuous predictor variables are represented in the design. For example, the full-factorial regression design for two continuous predictor variables *P *and *Q* would include the main effects (i.e., the first-order effects) of *P *and *Q *and their 2-way *P *by *Q* interaction effect, which is represented by the product of *P *and *Q* scores for each case. The regression equation would be

Y = b_{0} + b_{1}P + b_{2}Q + b_{3}P*Q

Factorial regression designs can also be fractional, that is, higher-order effects can be omitted from the design. A fractional factorial design to degree 2 for 3 continuous predictor variables *P*, *Q*, and *R* would include the main effects and all 2-way interactions between the predictor variables

Y = b_{0} + b_{1}P + b_{2}Q + b_{3}R + b_{4}P*Q + b_{5}P*R + b_{6}Q*R

**Polynomial Regression.** Polynomial regression designs are designs which contain main effects and higher-order effects for the continuous predictor variables but do not include interaction effects between predictor variables. For example, the polynomial regression design to degree 2 for three continuous predictor variables *P, Q, *and *R* would include the main effects (i.e., the first-order effects) of *P, Q, *and *R* and their quadratic (i.e., second-order)* *effects*, *but not the 2-way interaction effects or the *P *by *Q* by *R* 3-way interaction effect.

Y = b_{0} + b_{1}P + b_{2}P^{2} + b_{3}Q + b_{4}Q^{2} + b_{5}R + b_{6}R^{2}

Polynomial regression designs do not have to contain all effects up to the same degree for every predictor variable. For example, main, quadratic, and cubic effects could be included in the design for some predictor variables, and effects up the fourth degree could be included in the design for other predictor variables.

**Response Surface Regression.** Quadratic response surface regression designs are a hybrid type of design with characteristics of both polynomial regression designs and fractional factorial regression designs. Quadratic response surface regression designs contain all the same effects of polynomial regression designs to degree 2 and additionally the 2-way interaction effects of the predictor variables. The regression equation for a quadratic response surface regression design for 3 continuous predictor variables *P, Q, *and *R* would be

Y = b_{0} + b_{1}P + b_{2}P^{2} + b_{3}Q + b_{4}Q^{2} + b_{5}R + b_{6}R^{2} + b_{7}P*Q + b_{8}P*R + b_{9}Q*R

These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the *Experimental Design* topic (see *Central composite designs*).

**Mixture Surface Regression.** Mixture surface regression designs are identical to factorial regression designs to degree 2 except for the omission of the intercept. Mixtures, as the name implies, add up to a constant value; the sum of the proportions of ingredients in different recipes for some material all must add up 100%. Thus, the proportion of one ingredient in a material is redundant with the remaining ingredients. Mixture surface regression designs deal with this redundancy by omitting the intercept from the design. The design matrix for a mixture surface regression design for 3 continuous predictor variables *P, Q, *and *R* would be

Y = b_{1}P + b_{2}Q + b_{3}R + b_{4}P*Q + b_{5}P*R + b_{6}Q*R

These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the *Experimental Design* topic (see *Mixture designs and triangular surfaces*).

**Analysis of Covariance.** In general, between designs which contain both categorical and continuous predictor variables can be called ANCOVA designs. Traditionally, however, ANCOVA designs have referred more specifically to designs in which the first-order effects of one or more continuous predictor variables are taken into account when assessing the effects of one or more categorical predictor variables. A basic introduction to analysis of covariance can also be found in the *Analysis of covariance (ANCOVA)* section of the *ANOVA/MANOVA* topic.

To illustrate, suppose a researcher wants to assess the influences of a categorical predictor variable *A* with 3 levels on some outcome, and that measurements on a continuous predictor variable *P*, known to covary with the outcome, are available. If the data for the analysis are

then the sigma-restricted ** X **matrix for the design that includes the separate first-order effects of

*P*and

*A*would be

The *b** _{2}* and

*b*

*coefficients in the regression equation*

_{3}Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}

represent the influences of group membership on the *A *categorical predictor variable, controlling for the influence of scores on the *P* continuous predictor variable. Similarly, the *b** _{1}* coefficient represents the influence of scores on

*P*controlling for the influences of group membership on

*A.*This traditional ANCOVA analysis gives a more sensitive test of the influence of

*A*to the extent that

*P*reduces the prediction error, that is, the residuals for the outcome variable.

The ** X **matrix for the same design using the overparameterized model would be

The interpretation is unchanged except that the influences of group membership on the *A* categorical predictor variables are represented by the *b** _{2}*,

*b*

*and*

_{3}*b*

*coefficients in the regression equation*

_{4}Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4}

**Separate Slope Designs.** The traditional analysis of covariance (ANCOVA) design for categorical and continuous predictor variables is inappropriate when the categorical and continuous predictors interact in influencing responses on the outcome. The appropriate design for modeling the influences of the predictors in this situation is called the separate slope design. For the same example data used to illustrate traditional ANCOVA, the overparameterized ** X **matrix for the design that includes the main effect of the three-level categorical predictor

*A*and the 2-way interaction of

*P*by

*A*would be

The *b** _{4}*,

*b*

*, and*

_{5}*b*

_{6 coefficients in the regression equation }

Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5} + b_{6}X_{6}

give the separate slopes for the regression of the outcome on *P* within each group on *A*, controlling for the main effect of *A*.

As with nested ANOVA designs, the sigma-restricted coding of effects for separate slope designs is overly restrictive, so only the overparameterized model is used to represent separate slope designs. In fact, separate slope designs are identical in form to nested ANOVA designs, since the main effects for continuous predictors are omitted in separate slope designs.

**Homogeneity of Slopes.** The appropriate design for modeling the influences of continuous and categorical predictor variables depends on whether the continuous and categorical predictors interact in influencing the outcome. The traditional analysis of covariance (ANCOVA) design for continuous and categorical predictor variables is appropriate when the continuous and categorical predictors do not interact in influencing responses on the outcome, and the separate slope design is appropriate when the continuous and categorical predictors do interact in influencing responses. The homogeneity of slopes designs can be used to test whether the continuous and categorical predictors interact in influencing responses, and thus, whether the traditional ANCOVA design or the separate slope design is appropriate for modeling the effects of the predictors. For the same example data used to illustrate the traditional ANCOVA and separate slope designs, the overparameterized ** X **matrix for the design that includes the main effect of

*P*, the main effect of the three-level categorical predictor

*A*, and the 2-way interaction of

*P*by

*A*would be

If the *b** _{5}*,

*b*

*, or*

_{6}*b*

*coefficient in the regression equation*

_{7}Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5} + b_{6}X_{6} + b_{7}X_{7}

is non-zero, the separate slope model should be used. If instead all 3 of these regression coefficients are zero the traditional ANCOVA design should be used.

The sigma-restricted ** X **matrix for the homogeneity of slopes design would be

Using this ** X **matrix, if the

*b*

*, or*

_{4}*b*

*coefficient in the regression equation*

_{5}Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5}

is non-zero, the separate slope model should be used. If instead both of these regression coefficients are zero the traditional ANCOVA design should be used.

**Mixed Model ANOVA and ANCOVA.** Designs that contain random effects for one or more categorical predictor variables are called mixed-model designs. Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. The solution for the normal equations in mixed-model designs is identical to the solution for fixed-effect designs (i.e., designs which do not contain Random effects. Mixed-model designs differ from fixed-effect designs only in the way in which effects are tested for significance. In fixed-effect designs, between effects are always tested using the mean squared residual as the error term. In mixed-model designs, between effects are tested using relevant error terms based on the covariation of random sources of variation in the design. Specifically, this is done using Satterthwaite’s method of denominator synthesis (Satterthwaite, 1946), which finds the linear combinations of sources of random variation that serve as appropriate error terms for testing the significance of the respective effect of interest. A basic discussion of these types of designs, and methods for estimating variance components for the random effects can also be found in the *Variance Components and Mixed Model ANOVA/ANCOVA* topic.

Mixed-model designs, like nested designs and separate slope designs, are designs in which the sigma-restricted coding of categorical predictors is overly restrictive. Mixed-model designs require estimation of the covariation between the levels of categorical predictor variables, and the sigma-restricted coding of categorical predictors suppresses this covariation. Thus, only the overparameterized model is used to represent mixed-model designs (some programs will use the sigma-restricted approach and a so-called “restricted model” for random effects; however, only the overparameterized model as described in General Linear Models applies to both balanced and unbalanced designs, as well as designs with missing cells; see Searle, Casella, & McCullock, 1992, p. 127). It is important to recognize, however, that sigma-restricted coding can be used to represent any between design, with the exceptions of mixed-model, nested, and separate slope designs. Furthermore, some types of hypotheses can only be tested using the sigma-restricted coding (i.e., the effective hypothesis, Hocking, 1996), thus the greater generality of the overparameterized model for representing between designs does not justify it being used exclusively for representing categorical predictors in the general linear model.

### Within-Subject (Repeated Measures) Designs

- Overview
- One-way within-subject designs
- Multi-way within-subject designs
- The multivariate approach to Repeated Measures
- Doubly multivariate within-subject designs

**Overview. **It is quite common for researchers to administer the same test to the same subjects repeatedly over a period of time or under varying circumstances. In essence, we are interested in examining differences within each subject, for example, subjects’ improvement over time. Such designs are referred to as within-subject designs or repeated measures designs. A basic introduction to repeated measures designs is also provided in the *Between-groups and repeated measures* section of the *ANOVA/MANOVA* topic.

For example, imagine that we want to monitor the improvement of students’ algebra skills over two months of instruction. A standardized algebra test is administered after one month (level 1 of the repeated measures factor), and a comparable test is administered after two months (level 2 of the repeated measures factor). Thus, the repeated measures factor (*Time*) has 2 levels.** **Now, suppose that scores for the 2 algebra tests (i.e., values on the *Y** _{1} *and

*Y*

*variables at*

_{2}*Time 1*and

*Time 2*, respectively) are transformed into scores on a new composite variable (i.e., values on the

*T*

*), using the linear transformation*

_{1}**T = YM**

where ** M **is an orthonormal contrast matrix. Specifically, if

then the difference of the mean score on *T** _{1}* from 0 indicates the improvement (or deterioration) of scores across the 2 levels of

*Time*.

**One-Way Within-Subject Designs.** The example algebra skills study with the *Time *repeated measures factor (see also within-subjects design *Overview*) illustrates a one-way within-subject design. In such designs, orthonormal contrast transformations of the scores on the original dependent *Y *variables are performed via the ** M **transformation (orthonormal transformations correspond to orthogonal rotations of the original variable axes). If any

*b*

*coefficient in the regression of a transformed*

_{0}*T*variable on the intercept is non-zero, this indicates a change in responses across the levels of the repeated measures factor, that is, the presence of a main effect for the repeated measure factor on responses.

What if the between design includes effects other than the intercept? If any of the *b** _{1}* through

*b*

*coefficients in the regression of a transformed*

_{k}*T*variable on

**are non-zero, this indicates a different change in responses across the levels of the repeated measures factor for different levels of the corresponding between effect, i.e., the presence of a within by between interaction effect on responses.**

*X*The same between-subject effects that can be tested in designs with no repeated-measures factors can also be tested in designs that do include repeated-measures factors. This is accomplished by creating a transformed dependent variable which is the sum of the original dependent variables divided by the square root of the number of original dependent variables. The same tests of between-subject effects that are performed in designs with no repeated-measures factors (including tests of the between intercept) are performed on this transformed dependent variable.

**Multi-Way Within-Subject Designs.** Suppose that in the example algebra skills study with the *Time *repeated measures factor (see the within-subject designs Overview), students were given a number problem test and then a word problem test on each testing occasion. *Test* could then be considered as a second repeated measures factor, with scores on the number problem tests representing responses at level 1 of the *Test* repeated measure factor, and scores on the word problem tests representing responses at level 2 of the *Test* repeated measure factor. The within subject design for the study would be a 2 (*Time) *by 2 (*Test*) full-factorial design, with effects for *Time, Test, *and the *Time *by *Test *interaction.

To construct transformed dependent variables representing the effects of *Time, Test, *and the *Time *by *Test *interaction, three respective ** M **transformations of the original dependent

*Y*variables are performed. Assuming that the original

*Y*variables are in the order

*Time 1 – Test 1, Time 1 – Test 2, Time 2 – Test 1,*and

*Time 2 – Test 2,*the

*M*matrices for the

*Time, Test,*and the

*Time*by

*Test*interaction would be

The differences of the mean scores on the transformed *T* variables from 0 are then used to interpret the corresponding within-subject effects. If the *b** _{0}* coefficient in the regression of a transformed

*T*variable on the intercept is non-zero, this indicates a change in responses across the levels of a repeated measures effect, that is, the presence of the corresponding main or interaction effect for the repeated measure factors on responses.

Interpretation of within by between interaction effects follow the same procedures as for one-way within designs, except that now within by between interactions are examined for each within effect by between effect combination.

**Multivariate Approach to Repeated Measures.** When the repeated measures factor has more than 2 levels, then the ** M **matrix will have more than a single column. For example, for a repeated measures factor with 3 levels (e.g.,

*Time 1*,

*Time 2*,

*Time 3*), the

*M*matrix will have 2 columns (e.g., the two transformations of the dependent variables could be (1)

*Time 1*vs.

*Time 2*and

*Time 3*combined, and (2)

*Time 2*vs.

*Time 3*). Consequently, the nature of the design is really multivariate, that is, there are two simultaneous dependent variables, which are transformations of the original dependent variables. Therefore, when testing repeated measures effects involving more than a single degree of freedom (e.g., a repeated measures main effect with more than 2 levels), you can compute multivariate test statistics to test the respective hypotheses. This is a different (and usually the preferred) approach than the univariate method that is still widely used. For a further discussion of the multivariate approach to testing repeated measures effects, and a comparison to the traditional univariate approach, see the

*Sphericity and compound symmetry*section of the

*ANOVA/MANOVA*topic.

**Doubly Multivariate Designs.** If the product of the number of levels for each within-subject factor is equal to the number of original dependent variables, the within-subject design is called a univariate repeated measures design. The within design is univariate because there is one dependent variable representing each combination of levels of the within-subject factors. Note that this use of the term *univariate design *is not to be confused with the univariate and multivariate approach to the analysis of repeated measures designs, both of which can be used to analyze such univariate (single-dependent-variable-only) designs. When there are two or more dependent variables for each combination of levels of the within-subject factors, the within-subject design is called a multivariate repeated measures design, or more commonly, a doubly multivariate within-subject design. This term is used because the analysis for each dependent measure can be done via the multivariate approach; so when there is more than one dependent measure, the design can be considered doubly-multivariate.

Doubly multivariate design are analyzed using a combination of univariate repeated measures and multivariate analysis techniques. To illustrate, suppose in an algebra skills study, tests are administered three times (repeated measures factor *Time *with 3 levels). Two test scores are recorded at each level of *Time*: a *Number Problem *score and a *Word Problem *score. Thus, scores on the two types of tests could be treated as multiple measures on which improvement (or deterioration) across *Time* could be assessed. ** M **transformed variables could be computed for each set of test measures, and multivariate tests of significance could be performed on the multiple transformed measures, as well as on the each individual test measure.

### Multivariate Designs

**Overview.** When there are multiple dependent variables in a design, the design is said to be multivariate. Multivariate measures of association are by nature more complex than their univariate counterparts (such as the correlation coefficient, for example). This is because multivariate measures of association must take into account not only the relationships of the predictor variables with responses on the dependent variables, but also the relationships among the multiple dependent variables. By doing so, however, these measures of association provide information about the strength of the relationships between predictor and dependent variables independent of the dependent variable interrelationships. A basic discussion of multivariate designs is also presented in the *Multivariate Designs* section in the *ANOVA/MANOVA* topic.

The most commonly used multivariate measures of association all can be expressed as functions of the eigenvalues of the product matrix

**E ^{-1}H**

where ** E **is the error SSCP matrix (i.e., the matrix of sums of squares and cross-products for the dependent variables that are not accounted for by the predictors in the between design), and

*H*is a hypothesis SSCP matrix (i.e., the matrix of sums of squares and cross-products for the dependent variables that are accounted for by all the predictors in the between design, or the sums of squares and cross-products for the dependent variables that are accounted for by a particular effect). If

l_{i} = the ordered eigenvalues of **E ^{-1}H**, if

**E**exists

^{-1}then the 4 commonly used multivariate measures of association are

Wilks’ lambda = **P**[1/(1+l_{i})]

Pillai’s trace = **S**l_{i}/(1+l_{i})

Hotelling-Lawley trace = **S**l_{i}

Roy’s largest root = l_{1}

These 4 measures have different upper and lower bounds, with Wilks’ lambda perhaps being the most easily interpretable of the 4 measures. Wilks’ lambda can range from 0 to 1, with 1 indicating no relationship of predictors to responses and 0 indicating a perfect relationship of predictors to responses. 1 – Wilks’ lambda can be interpreted as the multivariate counterpart of a univariate R-squared, that is, it indicates the proportion of generalized variance in the dependent variables that is accounted for by the predictors.

The 4 measures of association are also used to construct multivariate tests of significance. These multivariate tests are covered in detail in a number of sources (e.g., Finn, 1974; Tatsuoka, 1971).

## Estimation and Hypothesis Testing

The following sections discuss details concerning hypothesis testing in the context of *STATISTICA*‘s *GLM* module, for example, how the test for the overall model fit is computed, the options for computing tests for categorical effects in unbalanced or incomplete designs, how and when custom-error terms can be chosen, and the logic of testing custom-hypotheses in factorial or regression designs.

### Whole Model Tests

**Partitioning Sums of Squares.** A fundamental principle of least squares methods is that variation on a dependent variable can be partitioned, or divided into parts, according to the sources of the variation. Suppose that a dependent variable is regressed on one or more predictor variables, and that for convenience the dependent variable is scaled so that its mean is 0. Then a basic least squares identity is that the total sum of squared values on the dependent variable equals the sum of squared predicted values plus the sum of squared residual values. Stated more generally,

S(y – y-bar)^{2} = S(y-hat – y-bar)^{2} + S(y – y-hat)^{2}

where the term on the left is the total sum of squared deviations of the observed values on the dependent variable from the dependent variable mean, and the respective terms on the right are (1) the sum of squared deviations of the predicted values for the dependent variable from the dependent variable mean and (2) the sum of the squared deviations of the observed values on the dependent variable from the predicted values, that is, the sum of the squared residuals. Stated yet another way,

Total SS = Model SS + Error SS

Note that the Total SS is always the same for any particular data set, but that the Model SS and the Error SS depend on the regression equation. Assuming again that the dependent variable is scaled so that its mean is 0, the Model SS and the Error SS can be computed using

Model SS =** b’X’Y**

Error SS = **Y’Y – b’X’Y**

**Testing the Whole Model**. Given the Model SS and the Error SS, we can perform a test that all the regression coefficients for the *X* variables (*b**1* through *b**k*) are zero. This test is equivalent to a comparison of the fit of the regression surface defined by the predicted values (computed from the whole model regression equation) to the fit of the regression surface defined solely by the dependent variable mean (computed from the reduced regression equation containing only the intercept). Assuming that ** X’X **is full-rank, the whole model hypothesis mean square

MSH = (Model SS)/k

is an estimate of the variance of the predicted values. The error mean square

s^{2} = MSE = (Error SS)/(n-k-1)

is an unbiased estimate of the residual or error variance. The test statistic is

F = MSH/MSE

where *F *has (*k, n – k – *1) degrees of freedom.

If ** X’X **is not full rank,

*r*+ 1

*is substituted for*

*k*, where

*r*is the rank or the number of non-redundant columns of

*X’X*.

Note that in the case of non-intercept models, some multiple regression programs will compute the full model test based on the proportion of variance around 0 (zero) accounted for by the predictors; for more information (see Kvålseth, 1985; Okunade, Chang, and Evans, 1993), while other will actually compute both values (i.e., based on the residual variance around 0, and around the respective dependent variable means.

**Limitations of Whole Model Tests**. For designs such as one-way ANOVA or simple regression designs, the whole model test by itself may be sufficient for testing general hypotheses about whether or not the single predictor variable is related to the outcome. In more complex designs, however, hypotheses about specific *X *variables or subsets of *X *variables are usually of interest. For example, you might want to make inferences about whether a subset of regression coefficients are 0, or you might want to test whether subpopulation means corresponding to combinations of specific *X *variables differ*. *The whole model test is usually insufficient for such purposes.

A variety of methods have been developed for testing specific hypotheses. Like whole model tests, many of these methods rely on comparisons of the fit of different models (e.g., Type I, Type II, and the effective hypothesis sums of squares). Other methods construct tests of linear combinations of regression coefficients in order to test mean differences (e.g., Type III, Type IV, and Type V sums of squares). For designs that contain only first-order effects of continuous predictor variables (i.e., multiple regression designs), many of these methods are equivalent (i.e., Type II through Type V sums of squares all test the significance of partial regression coefficients). However, there are important distinctions between the different hypothesis testing techniques for certain types of ANOVA designs (i.e., designs with unequal cell *n*‘s and/or missing cells).

All methods for testing hypotheses, however, involve the same hypothesis testing strategy employed in whole model tests, that is, the sums of squares attributable to an effect (using a given criterion) is computed, and then the mean square for the effect is tested using an appropriate error term.

When there are categorical predictors in the model, arranged in a factorial ANOVA design, then we are typically interested in the main effects for and interaction effects between the categorical predictors. However, when the design is not balanced (has unequal cell *n’*s, and consequently, the coded effects for the categorical factors are usually correlated), or when there are missing cells in a full factorial ANOVA design, then there is ambiguity regarding the specific comparisons between the (population, or least-squares) cell means that constitute the main effects and interactions of interest. These issues are discussed in great detail in Milliken and Johnson (1986), and if you routinely analyze incomplete factorial designs, you should consult their discussion of various problems and approaches to solving them.

In addition to the widely used methods that are commonly labeled *Type I, II*, *III*, and *IV* sums of squares (see Goodnight, 1980), we also offer different methods for testing effects in incomplete designs, that are widely used in other areas (and traditions) of research.

**Type V sums of squares.** Specifically, we propose the term *Type V* *sums of squares *to denote the approach that is widely used in industrial experimentation, to analyze fractional factorial designs; these types of designs are discussed in detail in the *2**(k-p) Fractional Factorial Designs* section of the *Experimental Design* topic. In effect, for those effects for which tests are performed all population marginal means (least squares means) are estimable.

**Type VI sums of squares.** Second, in keeping with the *Type i *labeling convention, we propose the term *Type VI sums of squares *to denote the approach that is often used in programs that only implement the sigma-restricted model (which is not well suited for certain types of designs; we offer a choice between the sigma-restricted and overparameterized model models). This approach is identical to what is described as the *effective hypothesis *method in Hocking (1996).

**Contained Effects**. The following descriptions will use the term *contained effect*. An effect *E1 *(e.g., *A * B *interaction) is contained in another effect *E2 *if:

- Both effects involve the same continuous predictor variable (if included in the model; e.g.,
*A * B * X*would be contained in*A * C * X*, where*A*,*B*, and*C*are categorical predictors, and*X*is a continuous predictor); or *E2*has more categorical predictors than does*E1*, and, if*E1*includes any categorical predictors, they also appear in*E2*(e.g.,*A * B*would be contained in the*A * B * C*interaction).

**Type I Sums of Squares**. Type I sums of squares involve a sequential partitioning of the whole model sums of squares. A hierarchical series of regression equations are estimated, at each step adding an additional effect into the model. In Type I sums of squares, the sums of squares for each effect are determined by subtracting the predicted sums of squares with the effect in the model from the predicted sums of squares for the preceding model not including the effect. Tests of significance for each effect are then performed on the increment in the predicted sums of squares accounted for by the effect. Type I sums of squares are therefore sometimes called sequential or hierarchical sums of squares.

Type I sums of squares are appropriate to use in balanced (equal *n*) ANOVA designs in which effects are entered into the model in their natural order (i.e., any main effects are entered before any two-way interaction effects, any two-way interaction effects are entered before any three-way interaction effects, and so on). Type I sums of squares are also useful in polynomial regression designs in which any lower-order effects are entered before any higher-order effects. A third use of Type I sums of squares is to test hypotheses for hierarchically nested designs, in which the first effect in the design is nested within the second effect, the second effect is nested within the third, and so on.

One important property of Type I sums of squares is that the sums of squares attributable to each effect add up to the whole model sums of squares. Thus, Type I sums of squares provide a complete decomposition of the predicted sums of squares for the whole model. This is not generally true for any other type of sums of squares. An important limitation of Type I sums of squares, however, is that the sums of squares attributable to a specific effect will generally depend on the order in which the effects are entered into the model. This lack of invariance to order of entry into the model limits the usefulness of Type I sums of squares for testing hypotheses for certain designs (e.g., fractional factorial designs).

**Type II Sums of Squares**. Type II sums of squares are sometimes called partially sequential sums of squares. Like Type I sums of squares, Type II sums of squares for an effect controls for the influence of other effects. Which other effects to control for, however, is determined by a different criterion. In Type II sums of squares, the sums of squares for an effect is computed by controlling for the influence of all other effects of equal or lower degree. Thus, sums of squares for main effects control for all other main effects, sums of squares for two-way interactions control for all main effects and all other two-way interactions, and so on.

Unlike Type I sums of squares, Type II sums of squares are invariant to the order in which effects are entered into the model. This makes Type II sums of squares useful for testing hypotheses for multiple regression designs, for main effect ANOVA designs, for full-factorial ANOVA designs with equal cell *n*s, and for hierarchically nested designs.

There is a drawback to the use of Type II sums of squares for factorial designs with unequal cell *n*s. In these situations, Type II sums of squares test hypotheses that are complex functions of the cell *ns* that ordinarily are not meaningful. Thus, a different method for testing hypotheses is usually preferred.

**Type III Sums of Squares**. Type I and Type II sums of squares usually are not appropriate for testing hypotheses for factorial ANOVA designs with unequal *n*s. For ANOVA designs with unequal *n*s, however, Type III sums of squares test the same hypothesis that would be tested if the cell *n*s were equal, provided that there is at least one observation in every cell. Specifically, in no-missing-cell designs, Type III sums of squares test hypotheses about differences in subpopulation (or marginal) means. When there are no missing cells in the design, these subpopulation means are least squares means, which are the best linear-unbiased estimates of the marginal means for the design (see, Milliken and Johnson, 1986).

Tests of differences in least squares means have the important property that they are invariant to the choice of the coding of effects for categorical predictor variables (e.g., the use of the sigma-restricted or overparameterized model) and to the choice of the particular g2 inverse of ** X’X **used to solve the normal equations. Thus, tests of linear combinations of least squares means in general, including Type III tests of differences in least squares means, are said to not depend on the parameterization of the design. This makes Type III sums of squares useful for testing hypotheses for any design for which Type I or Type II sums of squares are appropriate, as well as for any unbalanced ANOVA design with no missing cells.

The Type III sums of squares attributable to an effect is computed as the sums of squares for the effect controlling for any effects of equal or lower degree and orthogonal to any higher-order interaction effects (if any) that contain it. The orthogonality to higher-order containing interactions is what gives Type III sums of squares the desirable properties associated with linear combinations of least squares means in ANOVA designs with no missing cells. But for ANOVA designs with missing cells, Type III sums of squares generally do not test hypotheses about least squares means, but instead test hypotheses that are complex functions of the patterns of missing cells in higher-order containing interactions and that are ordinarily not meaningful. In this situation Type V sums of squares or tests of the effective hypothesis (Type VI sums of squares) are preferred.

**Type IV Sums of Squares**. Type IV sums of squares were designed to test “balanced” hypotheses for lower-order effects in ANOVA designs with missing cells. Type IV sums of squares are computed by equitably distributing cell contrast coefficients for lower-order effects across the levels of higher-order containing interactions.

Type IV sums of squares are not recommended for testing hypotheses for lower-order effects in ANOVA designs with missing cells, even though this is the purpose for which they were developed. This is because Type IV sum-of-squares are invariant to some but not all g2 inverses of ** X’X **that could be used to solve the normal equations. Specifically, Type IV sums of squares are invariant to the choice of a g2 inverse of

**given a particular ordering of the levels of the categorical predictor variables, but are not invariant to different orderings of levels. Furthermore, as with Type III sums of squares, Type IV sums of squares test hypotheses that are complex functions of the patterns of missing cells in higher-order containing interactions and that are ordinarily not meaningful.**

*X’X*Statisticians who have examined the usefulness of Type IV sums of squares have concluded that Type IV sums of squares are not up to the task for which they were developed:

- Milliken & Johnson (1992, p. 204) write: “It seems likely that few, if any, of the hypotheses tested by the Type IV analysis of
*[some programs]*will be of particular interest to the experimenter.” - Searle (1987, p. 463-464) writes: “In general, [Type IV] hypotheses determined in this nature are not necessarily of any interest.”; and (p. 465) “This characteristic of Type IV sums of squares for rows depending on the sequence of rows establishes their non-uniqueness, and this in turn emphasizes that the hypotheses they are testing are by no means necessarily of any general interest.”
- Hocking (1985, p. 152), in an otherwise comprehensive introduction to general linear models, writes: “For the missing cell problem,
*[some programs]*offers a fourth analysis, Type IV, which we shall not discuss.”

So, we recommend that you use the Type IV sums of squares solution with caution, and that you understand fully the nature of the (often non-unique) hypotheses that are being testing, before attempting interpretations of the results. Furthermore, in ANOVA designs with no missing cells, Type IV sums of squares are always equal to Type III sums of squares, so the use of Type IV sums of squares is either (potentially) inappropriate, or unnecessary, depending on the presence of missing cells in the design.

**Type V Sums of Squares**. Type V sums of squares were developed as an alternative to Type IV sums of squares for testing hypotheses in ANOVA designs in missing cells. Also, this approach is widely used in industrial experimentation, to analyze fractional factorial designs; these types of designs are discussed in detail in the *2**(k-p) Fractional Factorial Designs* section of the *Experimental Design* topic. In effect, for effects for which tests are performed all population marginal means (least squares means) are estimable.

Type V sums of squares involve a combination of the methods employed in computing Type I and Type III sums of squares. Specifically, whether or not an effect is eligible to be dropped from the model is determined using Type I procedures, and then hypotheses are tested for effects not dropped from the model using Type III procedures. Type V sums of squares can be illustrated by using a simple example. Suppose that the effects considered are *A*, *B, *and *A *by *B, *in that order, and that *A* and *B* are both categorical predictors with, say, 3 and 2 levels, respectively. The intercept is first entered into the model. Then *A* is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for *A* in ** X’X, **given the intercept). If

*A*‘s degrees of freedom are less than 2 (i.e., its number of levels minus 1), it is eligible to be dropped. Then

*B*is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for

*B*in

**, given the intercept and**

*X’X**A*). If

*B*‘s degrees of freedom are less than 1 (i.e., its number of levels minus 1), it is eligible to be dropped. Finally,

*A*by

*B*is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for

*A*by

*B*in

**, given the intercept,**

*X’X**A*, and

*B*). If

*B*‘s degrees of freedom are less than 2 (i.e., the product of the degrees of freedom for its factors if there were no missing cells), it is eligible to be dropped. Type III sums of squares are then computed for the effects that were not found to be eligible to be dropped, using the reduced model in which any eligible effects are dropped. Tests of significance, however, use the error term for the whole model prior to dropping any eligible effects.

Note that Type V sums of squares involve determining a reduced model for which all effects remaining in the model have at least as many degrees of freedom as they would have if there were no missing cells. This is equivalent to finding a subdesign with no missing cells such that the Type III sums of squares for all effects in the subdesign reflect differences in least squares means.

Appropriate caution should be exercised when using Type V sums of squares. Dropping an effect from a model is the same as assuming that the effect is unrelated to the outcome (see, e.g., Hocking, 1996). The reasonableness of the assumption does not necessarily insure its validity, so when possible the relationships of dropped effects to the outcome should be inspected. It is also important to note that Type V sums of squares are not invariant to the order in which eligibility for dropping effects from the model is evaluated. Different orders of effects could produce different reduced models.

In spite of these limitations, Type V sums of squares for the reduced model have all the same properties of Type III sums of squares for ANOVA designs with no missing cells. Even in designs with many missing cells (such as fractional factorial designs, in which many high-order interaction effects are assumed to be zero), Type V sums of squares provide tests of meaningful hypotheses, and sometimes hypotheses that cannot be tested using any other method.

**Type VI (Effective Hypothesis) Sums of Squares**. Type I through Type V sums of squares can all be viewed as providing tests of hypotheses that subsets of partial regression coefficients (controlling for or orthogonal to appropriate additional effects) are zero. Effective hypothesis tests (developed by Hocking, 1996) are based on the philosophy that the only unambiguous estimate of an effect is the proportion of variability on the outcome that is uniquely attributable to the effect. The overparameterized coding of effects for categorical predictor variables generally cannot be used to provide such unique estimates for lower-order effects. Effective hypothesis tests, which we propose to call Type VI sums of squares, use the sigma-restricted coding of effects for categorical predictor variables to provide unique effect estimates even for lower-order effects.

The method for computing Type VI sums of squares is straightforward. The sigma-restricted coding of effects is used, and for each effect, its Type VI sums of squares is the difference of the model sums of squares for all other effects from the whole model sums of squares. As such, the Type VI sums of squares provide an unambiguous estimate of the variability of predicted values for the outcome uniquely attributable to each effect.

In ANOVA designs with missing cells, Type VI sums of squares for effects can have fewer degrees of freedom than they would have if there were no missing cells, and for some missing cell designs, can even have zero degrees of freedom. The philosophy of Type VI sums of squares is to test as much as possible of the original hypothesis given the observed cells. If the pattern of missing cells is such that no part of the original hypothesis can be tested, so be it. The inability to test hypotheses is simply the price we pay for having no observations at some combinations of the levels of the categorical predictor variables. The philosophy is that it is better to admit that a hypothesis cannot be tested than it is to test a distorted hypothesis that may not meaningfully reflect the original hypothesis.

Type VI sums of squares cannot generally be used to test hypotheses for nested ANOVA designs, separate slope designs, or mixed-model designs, because the sigma-restricted coding of effects for categorical predictor variables is overly restrictive in such designs. This limitation, however, does not diminish the fact that Type VI sums of squares can b

### Error Terms for Tests

**Lack-of-Fit Tests using Pure Error**. Whole model tests and tests based on the 6 types of sums of squares use the mean square residual as the error term for tests of significance. For certain types of designs, however, the residual sum of squares can be further partitioned into meaningful parts which are relevant for testing hypotheses. One such type of design is a simple regression design in which there are subsets of cases all having the same values on the predictor variable. For example, performance on a task could be measured for subjects who work on the task under several different room temperature conditions. The test of significance for the *Temperature* effect in the linear regression of *Performance* on *Temperature* would not necessarily provide complete information on how *Temperature *relates to *Performance*; the regression coefficient for *Temperature *only reflects its linear effect on the outcome.

One way to glean additional information from this type of design is to partition the residual sums of squares into lack-of-fit and pure error components. In the example just described, this would involve determining the difference between the sum of squares that cannot be predicted by *Temperature *levels, given the linear effect of *Temperature *(residual sums of squares) and the pure error; this difference would be the sums of squares associated with the lack-of-fit (in this example, of the linear model). The test of lack-of-fit, using the mean square pure error as the error term, would indicate whether non-linear effects of *Temperature* are needed to adequately model *Tempature’s* influence on the outcome. Further, the linear effect could be tested using the pure error term, thus providing a more sensitive test of the linear effect independent of any possible nonlinear effect.

**Designs with Zero Degrees of Freedom for Error**. When the model degrees of freedom equal the number of cases or subjects, the residual sums of squares will have zero degrees of freedom and preclude the use of standard hypothesis tests. This sometimes occurs for overfitted designs (designs with many predictors, or designs with categorical predictors having many levels). However, in some designed experiments, such as experiments using split-plot designs or highly fractionalized factorial designs as commonly used in industrial experimentation, it is no accident that the residual sum of squares has zero degrees of freedom. In such experiments, mean squares for certain effects are planned to be used as error terms for testing other effects, and the experiment is designed with this in mind. It is entirely appropriate to use alternatives to the mean square residual as error terms for testing hypotheses in such designs.

**Tests in Mixed Model Designs**. Designs which contain random effects for one or more categorical predictor variables are called mixed-model designs. These types of designs, and the analysis of those designs, is also described in detail in the *Variance Components and Mixed Model ANOVA/ANCOVA* topic. Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. The solution for the normal equations in mixed-model designs is identical to the solution for fixed-effect designs (i.e., designs which do not contain random effects). Mixed-model designs differ from fixed-effect designs only in the way in which effects are tested for significance. In fixed-effect designs, between effects are always tested using the mean square residual as the error term. In mixed-model designs, between effects are tested using relevant error terms based on the covariation of sources of variation in the design. Also, only the overparameterized model is used to code effects for categorical predictors in mixed-models, because the sigma-restricted model is overly restrictive.

The covariation of sources of variation in the design is estimated by the elements of a matrix called the Expected Mean Squares (EMS) matrix. This non-square matrix contains elements for the covariation of each combination of pairs of sources of variation and for each source of variation with *Error*. Specifically, each element is the mean square for one effect (indicated by the column) that is expected to be accounted by another effect (indicated by the row), given the observed covariation in their levels. Note that expected mean squares can be computing using any type of sums of squares from Type I through Type V. Once the EMS matrix is computed, it is used to the solve for the linear combinations of sources of random variation that are appropriate to use as error terms for testing the significance of the respective effects. This is done using Satterthwaite’s method of denominator synthesis (Satterthwaite, 1946). Detailed discussions of methods for testing effects in mixed-models, and related methods for estimating variance components for random effects, can be found in the *Variance Components and Mixed Model ANOVA/ANCOVA* topic.

### Testing Specific Hypotheses

Whole model tests and tests based on sums of squares attributable to specific effects illustrate two general types of hypotheses that can be tested using the general linear model. Still, there may be other types of hypotheses the researcher wishes to test that do not fall into either of these categories. For example, hypotheses about subsets of effects may be of interest, or hypotheses involving comparisons of specific levels of categorical predictor variables may be of interest.

**Estimability of Hypotheses**. Before considering tests of specific hypotheses of this sort, it is important to address the issue of estimability. A test of a specific hypothesis using the general linear model must be framed in terms of the regression coefficients for the solution of the normal equations. If the ** X’X **matrix is less than full rank, the regression coefficients depend on the particular g2 inverse used for solving the normal equations, and the regression coefficients will not be unique. When the regression coefficients are not unique, linear functions (

*f*) of the regression coefficients having the form

f = Lb

where *L* is a vector of coefficients, will also in general not be unique. However, *Lb *for an *L *which satisfies

L = L(X’X)^{–}X’X

is invariant for all possible g2 inverses, and is therefore called an estimable function.

The theory of estimability of linear functions is an advanced topic in the theory of algebraic invariants (Searle, 1987, provides a comprehensive introduction), but its implications are clear enough. One instance of non-estimability of a hypothesis has been encountered in tests of the effective hypothesis which have zero degrees of freedom. On the other hand, Type III sums of squares for categorical predictor variable effects in ANOVA designs with no missing cells (and the least squares means in such designs) provide an example of estimable functions which do not depend on the model parameterization (i.e., the particular g2 inverse used to solve the normal equations). The general implication of the theory of estimability of linear functions is that hypotheses which cannot be expressed as linear combinations of the rows of *X* (i.e., the combinations of observed levels of the categorical predictor variables) are not estimable, and therefore cannot be tested. Stated another way, we simply cannot test specific hypotheses that are not represented in the data. The notion of estimability is valuable because the test for estimability makes explicit which specific hypotheses can be tested and which cannot.

Linear Combinations of Effects. In multiple regression designs, it is common for hypotheses of interest to involve subsets of effects. In mixture designs, for example, we might be interested in simultaneously testing whether the main effect and any of the two-way interactions involving a particular predictor variable are non-zero. It is also common in multiple regression designs for hypotheses of interest to involves comparison of slopes. For example, we might be interested in whether the regression coefficients for two predictor variables differ. In both factorial regression and factorial ANOVA designs with many factors, it is often of interest whether sets of effects, say, all three-way and higher-order interactions, are nonzero. Tests of these types of specific hypotheses involve (1) constructing one or more *L*s reflecting the hypothesis, (2) testing the estimability of the hypothesis by determining whether

L = L(X’X)^{–}X’X

and if so, using (3)

(Lb)’-L’)^{-1}(Lb)

to estimate the sums of squares accounted for by the hypothesis. Finally, (4) the hypothesis is tested for significance using the usual mean square residual as the error term. To illustrate this 4-step procedure, suppose that a test of the difference in the regression slopes is desired for the (intercept plus) 2 predictor variables in a first-order multiple regression design. The coefficients for *L* would be

L = [0 1 -1]

(note that the first coefficient 0 excludes the intercept from the comparison) for which *Lb *is estimable if the 2 predictor variables are not redundant with each other. The hypothesis sums of squares reflect the difference in the partial regression coefficients for the 2 predictor variables, which is tested for significance using the mean square residual as the error term.

Planned Comparisons of Least Square Means. Usually, experimental hypotheses are stated in terms that are more specific than simply main effects or interactions. We may have the specific hypothesis that a particular textbook will improve math skills in males, but not in females, while another book would be about equally effective for both genders, but less effective overall for males. Now generally, we are predicting an interaction here: the effectiveness of the book is modified (qualified) by the student’s gender. However, we have a particular prediction concerning the nature of the interaction: we expect a significant difference between genders for one book, but not the other. This type of specific prediction is usually tested by testing planned comparisons of least squares means (estimates of the population marginal means), or as it is sometimes called, contrast analysis.

Briefly, contrast analysis allows us to test the statistical significance of predicted specific differences in particular parts of our complex design. The 4-step procedure for testing specific hypotheses is used to specify and test specific predictions. Contrast analysis is a major and indispensable component of the analysis of many complex experimental designs (see also for details).

To learn more about the logic and interpretation of contrast analysis refer to the *ANOVA/MANOVA* topic *Overview* section.

**Post-Hoc Comparisons.** Sometimes we find effects in an experiment that were not expected. Even though in most cases a creative experimenter will be able to explain almost any pattern of means, it would not be appropriate to analyze and evaluate that pattern as if we had predicted it all along. The problem here is one of capitalizing on chance when performing multiple tests post-hoc, that is, without a priori hypotheses. To illustrate this point, let’s consider the following “experiment.” Imagine we were to write down a number between 1 and 10 on 100 pieces of paper. We then put all of those pieces into a hat and draw 20 samples (of pieces of paper) of 5 observations each, and compute the means (from the numbers written on the pieces of paper) for each group. How likely do you think it is that we will find two sample means that are significantly different from each other? It is very likely! Selecting the extreme means obtained from 20 samples is very different from taking only 2 samples from the hat in the first place, which is what the test via the contrast analysis implies. Without going into further detail, there are several so-called post-hoc tests that are explicitly based on the first scenario (taking the extremes from 20 samples), that is, they are based on the assumption that we have chosen for our comparison the most extreme (different) means out of *k* total means in the design. Those tests apply “corrections” that are designed to offset the advantage of post-hoc selection of the most extreme comparisons. Whenever we find unexpected results in an experiment, we should use those post-hoc procedures to test their statistical significance.

### Testing Hypotheses for Repeated Measures and Dependent Variables

In the discussion of different hypotheses that can be tested using the general linear model, the tests have been described as tests for “the dependent variable” or “the outcome.” This has been done solely to simplify the discussion. When there are multiple dependent variables reflecting the levels of repeated measure factors, the general linear model performs tests using orthonormalized *M*-transformations of the dependent variables. When there are multiple dependent variables but no repeated measure factors, the general linear model performs tests using the hypothesis sums of squares and cross-products for the multiple dependent variables, which are tested against the residual sums of squares and cross-products for the multiple dependent variables. Thus, the same hypothesis testing procedures which apply to univariate designs with a single dependent variable also apply to repeated measure and multivariate designs.

## Healthcare Intelligence Network

## New Chart: Top 5 Tools to Identify Patients at Risk for Readmissions

**SUMMARY:** Anxious to avoid looming financial penalties for excessive hospital readmissions, healthcare organizations have tightened coordination of care and management of care transitions for Medicare beneficiaries at risk of rehospitalization. We wanted to see how healthcare organizations identify patients most at risk for returning to the hospital.

**Click here to see a larger, printable version of this chart.**

HIN’s third annual survey on Reducing Hospital Readmissions, conducted in February 2012, documented the highest rates of targeted programs to reduce readmissions in the survey’s three-year history. The survey captured details on these programs to curb readmission rates, along with the conditions most likely to trigger readmissions and much more. According to 119 healthcare companies who responded to the survey, the top 5 tools used to identify patients most at risk for readmissions are:

- Risk stratification: 55.2 percent
- Chart review: 41.4 percent
- Health claims: 39.7 percent
- Predictive modeling: 36.2 percent
- EHRs: 32.8 percent

For additional research data and insights on this topic:

** Download the executive summary of 2012 Healthcare Benchmarks: Reducing Hospital Readmissions**.

## Distribution Fitting, Formulate Hypotheses

## General Purpose

In some research applications, we can formulate hypotheses about the specific distribution of the variable of interest. For example, variables whose values are determined by an infinite number of independent random events will be distributed following the normal distribution: we can think of a person’s height as being the result of very many independent factors such as numerous specific genetic predispositions, early childhood diseases, nutrition, etc. (see the animation below for an example of the normal distribution). As a result, height tends to be normally distributed in the U.S. population. On the other hand, if the values of a variable are the result of very rare events, then the variable will be distributed according to the *Poisson* distribution (sometimes called the distribution of rare events). For example, industrial accidents can be thought of as the result of the intersection of a series of unfortunate (and unlikely) events, and their frequency tends to be distributed according to the Poisson distribution. These and other distributions are described in greater detail in the respective glossary topics.

Another common application where distribution fitting procedures are useful is when we want to verify the assumption of normality before using some parametric test (see General Purpose of Nonparametric Tests). For example, you may want to use the Kolmogorov-Smirnov test for normality or the Shapiro-Wilks’ W test to test for normality.

## Fit of the Observed Distribution

For predictive purposes it is often desirable to understand the shape of the underlying distribution of the population. To determine this underlying distribution, it is common to fit the observed distribution to a theoretical distribution by comparing the frequencies observed in the data to the expected frequencies of the theoretical distribution (i.e., a Chi-square goodness of fit test). In addition to this type a test, some software packages also allow you to compute Maximum Likelihood tests and Method of Matching Moments (see Fitting Distributions by Moments in the Process Analysis topic) tests.

**Which Distribution to use. **As described above, certain types of variables follow specific distributions. Variables whose values are determined by an infinite number of independent random events will be distributed following the normal distribution, whereas variables whose values are the result of an extremely rare event would follow the Poisson distribution. The major distributions that have been proposed for modeling survival or failure times are the exponential (and linear exponential) distribution, the Weibull distribution of extreme events, and the Gompertz distribution. The section on types of distributions contains a number of distributions generally giving a brief example of what type of data would most commonly follow a specific distribution as well as the probability density function (pdf) for each distribution.

## Types of Distributions

**Bernoulli Distribution**. This distribution best describes all situations where a “trial” is made resulting in either “success” or “failure,” such as when tossing a coin, or when modeling the success or failure of a surgical procedure. The Bernoulli distribution is defined as:

**f(x) = p ^{x} *(1-p)^{1-x}, for x = 0, 1**

where

p |
is the probability that a particular event (e.g., success) will occur. |

**Beta Distribution. **The beta distribution arises from a transformation of the F distribution and is typically used to model the distribution of order statistics. Because the beta distribution is bounded on both sides, it is often used for representing processes with natural lower and upper limits. For examples, refer to Hahn and Shapiro (1967). The beta distribution is defined as:

**f(x) = G (?+?)/[G(?)G(?)] * x ^{?-1}*(1-x)^{?-1}, for 0 < x < 1, ? > 0, ? > 0**

where

G |
is the Gamma function |

?, ? |
are the shape parameters (Shape1 and Shape2, respectively) |

The animation above shows the beta distribution as the two shape parameters change.

**Binomial Distribution. **The binomial distribution is useful for describing distributions of binomial events, such as the number of males and females in a random sample of companies, or the number of defective components in samples of 20 units taken from a production process. The binomial distribution is defined as:

**f(x) = [n!/(x!*(n-x)!)]*p ^{x} * q^{n-x}, for x = 0,1,2,…,n**

where

p |
is the probability that the respective event will occur |

q |
is equal to 1-p |

n |
is the maximum number of independent trials. |

**Cauchy Distribution. **The Cauchy distribution is interesting for theoretical reasons. Although its mean can be taken as zero, since it is symmetrical about zero, the expectation, variance, higher moments, and moment generating function do not exist. The Cauchy distribution is defined as:

**f(x) = 1/(?*p*{1+[(x- ?)/ ?] ^{2}}), for 0 < ? **

where

? |
is the location parameter (median) |

? |
is the scale parameter |

p |
is the constant Pi (3.1415…) |

The animation above shows the changing shape of the Cauchy distribution when the location parameter equals 0 and the scale parameter equals 1, 2, 3, and 4.

**Chi-square Distribution. **The sum of n independent squared random variables, each distributed following the standard normal distribution, is distributed as Chi-square with n degrees of freedom. This distribution is most frequently used in the modeling of random variables (e.g., representing frequencies) in statistical applications. The Chi-square distribution is defined by:

**f(x) = {1/[2 ^{?/2}* G(?/2)]} * [x^{(?/2)-1} * e^{-x/2}], for ? = 1, 2, …, 0 < x **

where

? | is the degrees of freedom |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

G |
(gamma) is the Gamma function. |

The above animation shows the shape of the Chi-square distribution as the degrees of freedom increase (1, 2, 5, 10, 25 and 50).

**Exponential Distribution. **If *T* is the time between occurrences of rare events that happen on the average with a rate l per unit of time, then *T* is distributed exponentially with parameter l (lambda). Thus, the exponential distribution is frequently used to model the time interval between successive random events. Examples of variables distributed in this manner would be the gap length between cars crossing an intersection, life-times of electronic devices, or arrivals of customers at the check-out counter in a grocery store. The exponential distribution function is defined as:

**f(x) = ? *e ^{-? x} for 0 = x < 8, ? > 0**

where

? |
is an exponential function parameter (an alternative parameterization is scale parameter b=1/l) |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

**Extreme Value. **The extreme value distribution is often used to model extreme events, such as the size of floods, gust velocities encountered by airplanes, maxima of stock market indices over a given year, etc.; it is also often used in reliability testing, for example in order to represent the distribution of failure times for electric circuits (see Hahn and Shapiro, 1967). The extreme value (Type I) distribution has the probability density function:

**f(x) = 1/b * e^[-(x-a)/b] * e^{-e^[-(x-a)/b]}, for -8 < x < 8, b > 0**

where

a |
is the location parameter |

b |
is the scale parameter |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

**F Distribution. **Snedecor’s F distribution is most commonly used in tests of variance (e.g., ANOVA). The ratio of two chi-squares divided by their respective degrees of freedom is said to follow an F distribution. The F distribution (for x > 0) has the probability density function (for n = 1, 2, …; w = 1, 2, …):

**f(x) = [G{(?+?)/2}]/[G(?/2)G(?/2)] * (?/?) ^{(?/2)} * x^{[(?/2)-1]} * {1+[(?/?)*x]}^{[-(?+?)/2]}, for 0 = x < 8, ?=1,2,…, ?=1,2,…**

where

?, ? |
are the shape parameters, degrees of freedom |

G |
is the Gamma function |

The animation above shows various tail areas (p-values) for an F distribution with both degrees of freedom equal to 10.

**Gamma Distribution. **The probability density function of the exponential distribution has a mode of zero. In many instances, it is known *a priori* that the mode of the distribution of a particular random variable of interest is not equal to zero (e.g., when modeling the distribution of the life-times of a product such as an electric light bulb, or the serving time taken at a ticket booth at a baseball game). In those cases, the gamma distribution is more appropriate for describing the underlying distribution. The gamma distribution is defined as:

**f(x) = {1/[bG(c)]}*[x/b] ^{c-1}*e^{-x/b} for 0 = x, c > 0**

where

G |
is the Gamma function |

c |
is the Shape parameter |

b |
is the Scale parameter. |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

The animation above shows the gamma distribution as the shape parameter changes from 1 to 6.

**Geometric Distribution. **If independent Bernoulli trials are made until a “success” occurs, then the total number of trials required is a geometric random variable. The geometric distribution is defined as:

**f(x) = p*(1-p) ^{x}, for x = 1,2,…**

where

p |
is the probability that a particular event (e.g., success) will occur. |

**Gompertz Distribution. **The Gompertz distribution is a theoretical distribution of survival times. Gompertz (1825) proposed a probability model for human mortality, based on the assumption that the “average exhaustion of a man’s power to avoid death to be such that at the end of equal infinitely small intervals of time he lost equal portions of his remaining power to oppose destruction which he had at the commencement of these intervals” (Johnson, Kotz, Balakrishnan, 1995, p. 25). The resultant hazard function:

**r(x)=Bc ^{x}, for x = 0, B > 0, c = 1**

is often used in survival analysis. See Johnson, Kotz, Balakrishnan (1995) for additional details.

**Laplace Distribution. **For interesting mathematical applications of the Laplace distribution see Johnson and Kotz (1995). The Laplace (or Double Exponential) distribution is defined as:

**f(x) = 1/(2b) * e ^{[-(|x-a|/b)]}, for -8 < x < 8**

where

a |
is the location parameter (mean) |

b |
is the scale parameter |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

The graphic above shows the changing shape of the Laplace distribution when the location parameter equals 0 and the scale parameter equals 1, 2, 3, and 4.

**Logistic Distribution. **The logistic distribution is used to model binary responses (e.g., Gender) and is commonly used in logistic regression. The logistic distribution is defined as:

**f(x) = (1/b) * e ^{[-(x-a)/b]} * {1+e^{[-(x-a)/b]}^-2}, for -8 < x < 8, 0 < b**

where

a |
is the location parameter (mean) |

b |
is the scale parameter |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

The graphic above shows the changing shape of the logistic distribution when the location parameter equals 0 and the scale parameter equals 1, 2, and 3.

**Log-normal Distribution. **The log-normal distribution is often used in simulations of variables such as personal incomes, age at first marriage, or tolerance to poison in animals. In general, if x is a sample from a normal distribution, then y = e^{x} is a sample from a log-normal distribution. Thus, the log-normal distribution is defined as:

**f(x) = 1/[xs(2p) ^{1/2}] * e^{-[log(x)-µ]**2/2s**2}, for 0 < x < 8, µ > 0, s > 0**

where

µ |
is the scale parameter |

s |
is the shape parameter |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

The animation above shows the log-normal distribution with mu equal to 0 and sigma equals .10, .30, .50, .70, and .90.

**Normal Distribution. **The normal distribution (the “bell-shaped curve” which is symmetrical about the mean) is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions (see also Elementary Concepts). In general, the normal distribution provides a good model for a random variable, when:

- There is a strong tendency for the variable to take a central value;
- Positive and negative deviations from this central value are equally likely;
- The frequency of deviations falls off rapidly as the deviations become larger.

As an underlying mechanism that produces the normal distribution, we can think of an infinite number of independent random (binomial) events that bring about the values of a particular variable. For example, there are probably a nearly infinite number of factors that determine a person’s height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be normally distributed in the population. The normal distribution function is determined by the following formula:

**f(x) = 1/[(2*p) ^{1/2}*s] * e**{-1/2*[(x-µ)/s]^{2} }, for -8 < x < 8 **

where

µ |
is the mean |

s |
is the standard deviation |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

p |
is the constant Pi (3.14…) |

The animation above shows several tail areas of the standard normal distribution (i.e., the normal distribution with a mean of 0 and a standard deviation of 1). The standard normal distribution is often used in hypothesis testing.

**Pareto Distribution. **The Pareto distribution is commonly used in monitoring production processes (see Quality Control and Process Analysis). For example, a machine which produces copper wire will occasionally generate a flaw at some point along the wire. The Pareto distribution can be used to model the length of wire between successive flaws. The standard Pareto distribution is defined as:

**f(x) = c/x ^{c+1}, for 1 = x, c < 0**

where

c |
is the shape parameter |

The animation above shows the Pareto distribution for the shape parameter equal to 1, 2, 3, 4, and 5.

**Poisson Distribution. **The Poisson distribution is also sometimes referred to as the distribution of rare events. Examples of Poisson distributed variables are number of accidents per person, number of sweepstakes won per person, or the number of catastrophic defects found in a production process. It is defined as:

**f(x) = (? ^{x}*e^{-?})/x!, for x = 0,1,2,…, 0 < ?**

where

? |
(lambda) is the expected value of x (the mean) |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

**Rayleigh Distribution. **If two independent variables y_{1} and y_{2} are independent from each other and normally distributed with equal variance, then the variable x = Ö(y_{1}^{2}+ y_{2}^{2}) will follow the Rayleigh distribution. Thus, an example (and appropriate metaphor) for such a variable would be the distance of darts from the target in a dart-throwing game, where the errors in the two dimensions of the target plane are independent and normally distributed. The Rayleigh distribution is defined as:

**f(x) = x/b ^{2} * e^[-(x^{2}/2b^{2})], for 0 = x < 8, b > 0**

where

b |
is the scale parameter |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

The graphic above shows the changing shape of the Rayleigh distribution when the scale parameter equals 1, 2, and 3.

**Rectangular Distribution. **The rectangular distribution is useful for describing random variables with a constant probability density over the defined range a<b.

**f(x) = 1/(b-a), for a<x<b
= 0 , elsewhere**

where

a<b |
are constants. |

**Student’s t Distribution. **The student’s t distribution is symmetric about zero, and its general shape is similar to that of the standard normal distribution. It is most commonly used in testing hypothesis about the mean of a particular population. The student’s t distribution is defined as (for n = 1, 2, . . .):

**f(x) = G[(?+1)/2] / G(?/2) * (?*p) ^{-1/2} * [1 + (x^{2}/?)^{-(?+1)/2}] **

where

? |
is the shape parameter, degrees of freedom |

G |
is the Gamma function |

p |
is the constant Pi (3.14 . . .) |

The shape of the student’s t distribution is determined by the degrees of freedom. As shown in the animation above, its shape changes as the degrees of freedom increase.

**Weibull Distribution. **As described earlier, the exponential distribution is often used as a model of time-to-failure measurements, when the failure (hazard) rate is constant over time. When the failure probability varies over time, then the Weibull distribution is appropriate. Thus, the Weibull distribution is often used in reliability testing (e.g., of electronic relays, ball bearings, etc.; see Hahn and Shapiro, 1967). The Weibull distribution is defined as:

**f(x) = c/b*(x/b) ^{(c-1)} * e^{[-(x/b)^c]}, for 0 = x < 8, b > 0, c > 0**

where

b |
is the scale parameter |

c |
is the shape parameter |

e |
is the base of the natural logarithm, sometimes called Euler’s e (2.71…) |

The animation above shows the Weibull distribution as the shape parameter increases (.5, 1, 2, 3, 4, 5, and 10).

## Discriminant Function Analysis

# Discover Which Variables Discriminate Between Groups, Discriminant Function Analysis

## General Purpose

Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups. For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education. For that purpose the researcher could collect data on numerous variables prior to students’ graduation. After graduation, most students will naturally fall into one of the three categories. *Discriminant Analysis* could then be used to determine which variable(s) are the best predictors of students’ subsequent educational choice.

A medical researcher may record different variables relating to patients’ backgrounds in order to learn which variables best predict whether a patient is likely to recover completely (group 1), partially (group 2), or not at all (group 3). A biologist could record different characteristics of similar types (groups) of flowers, and then perform a discriminant function analysis to determine the set of characteristics that allows for the best discrimination between the types.

## Computational Approach

Computationally, discriminant function analysis is very similar to analysis of variance (ANOVA). Let us consider a simple example. Suppose we measure height in a random sample of 50 males and 50 females. Females are, on the average, not as tall as males, and this difference will be reflected in the difference in means (for the variable *Height*). Therefore, variable height allows us to discriminate between males and females with a better than chance probability: if a person is tall, then he is likely to be a male, if a person is short, then she is likely to be a female.

We can generalize this reasoning to groups and variables that are less “trivial.” For example, suppose we have two groups of high school graduates: Those who choose to attend college after graduation and those who do not. We could have measured students’ stated intention to continue on to college one year prior to graduation. If the means for the two groups (those who actually went to college and those who did not) are different, then we can say that intention to attend college as stated one year prior to graduation allows us to discriminate between those who are and are not college bound (and this information may be used by career counselors to provide the appropriate guidance to the respective students).

To summarize the discussion so far, the basic idea underlying discriminant function analysis is to determine whether groups differ with regard to the mean of a variable, and then to use that variable to predict group membership (e.g., of new cases).

**Analysis of Variance. **Stated in this manner, the discriminant function problem can be rephrased as a one-way analysis of variance (ANOVA) problem. Specifically, one can ask whether or not two or more groups are *significantly different* from each other with respect to the mean of a particular variable. To learn more about how one can test for the statistical significance of differences between means in different groups you may want to read the Overview section to ANOVA/MANOVA. However, it should be clear that, if the means for a variable are significantly different in different groups, then we can say that this variable discriminates between the groups.

In the case of a single variable, the final significance test of whether or not a variable discriminates between groups is the *F* test. As described in Elementary Concepts and ANOVA /MANOVA, *F* is essentially computed as the ratio of the between-groups variance in the data over the pooled (average) within-group variance. If the between-group variance is significantly larger then there must be significant differences between means.

**Multiple Variables. **Usually, one includes several variables in a study in order to see which one(s) contribute to the discrimination between groups. In that case, we have a matrix of total variances and covariances; likewise, we have a matrix of pooled within-group variances and covariances. We can compare those two matrices via multivariate *F* tests in order to determined whether or not there are any significant differences (with regard to all variables) between groups. This procedure is identical to multivariate analysis of variance or MANOVA. As in MANOVA, one could first perform the multivariate test, and, if statistically significant, proceed to see which of the variables have significantly different means across the groups. Thus, even though the computations with multiple variables are more complex, the principal reasoning still applies, namely, that we are looking for variables that discriminate between groups, as evident in observed mean differences.

## Stepwise Discriminant Analysis

Probably the most common application of discriminant function analysis is to include many measures in the study, in order to determine the ones that discriminate between groups. For example, an educational researcher interested in predicting high school graduates’ choices for further education would probably include as many measures of personality, achievement motivation, academic performance, etc. as possible in order to learn which one(s) offer the best prediction.

**Model. **Put another way, we want to build a “model” of how we can best predict to which group a case belongs. In the following discussion we will use the term “in the model” in order to refer to variables that are included in the prediction of group membership, and we will refer to variables as being “not in the model” if they are not included.

**Forward stepwise analysis.** In stepwise discriminant function analysis, a model of discrimination is built step-by-step. Specifically, at each step all variables are reviewed and evaluated to determine which one will contribute most to the discrimination between groups. That variable will then be included in the model, and the process starts again.

**Backward stepwise analysis. **One can also step backwards; in that case all variables are included in the model and then, at each step, the variable that contributes least to the prediction of group membership is eliminated. Thus, as the result of a successful discriminant function analysis, one would only keep the “important” variables in the model, that is, those variables that contribute the most to the discrimination between groups.

** F to enter, F to remove.** The stepwise procedure is “guided” by the respective

*F*to enter and

*F*to remove values. The

*F*value for a variable indicates its statistical significance in the discrimination between groups, that is, it is a measure of the extent to which a variable makes a unique contribution to the prediction of group membership. If you are familiar with stepwise multiple regression procedures, then you may interpret the

*F*to enter/remove values in the same way as in stepwise regression.

**Capitalizing on chance. **A common misinterpretation of the results of stepwise discriminant analysis is to take statistical significance levels at face value. By nature, the stepwise procedures will capitalize on chance because they “pick and choose” the variables to be included in the model so as to yield maximum discrimination. Thus, when using the stepwise approach the researcher should be aware that the significance levels do not reflect the true *alpha* error rate, that is, the probability of erroneously rejecting *H _{0}* (the null hypothesis that there is no discrimination between groups).

## Interpreting a Two-Group Discriminant Function

In the two-group case, discriminant function analysis can also be thought of as (and is analogous to) multiple regression (see Multiple Regression; the two-group discriminant analysis is also called *Fisher linear discriminant analysis* after Fisher, 1936; computationally all of these approaches are analogous). If we code the two groups in the analysis as *1* and *2*, and use that variable as the dependent variable in a multiple regression analysis, then we would get results that are analogous to those we would obtain via *Discriminant Analysis*. In general, in the two-group case we fit a linear equation of the type:

Group = a + b_{1}*x_{1} + b_{2}*x_{2} + … + b_{m}*x_{m}

where *a* is a constant and *b _{1}* through

*b*are regression coefficients. The interpretation of the results of a two-group problem is straightforward and closely follows the logic of multiple regression: Those variables with the largest (standardized) regression coefficients are the ones that contribute most to the prediction of group membership.

_{m}## Discriminant Functions for Multiple Groups

When there are more than two groups, then we can estimate more than one discriminant function like the one presented above. For example, when there are three groups, we could estimate (1) a function for discriminating between group 1 and groups 2 and 3 combined, and (2) another function for discriminating between group 2 and group 3. For example, we could have one function that discriminates between those high school graduates that go to college and those who do not (but rather get a job or go to a professional or trade school), and a second function to discriminate between those graduates that go to a professional or trade school versus those who get a job. The *b* coefficients in those discriminant functions could then be interpreted as before.

**Canonical analysis. **When actually performing a multiple group discriminant analysis, we do not have to specify how to combine groups so as to form different discriminant functions. Rather, you can automatically determine some optimal combination of variables so that the first function provides the most overall discrimination between groups, the second provides second most, and so on. Moreover, the functions will be independent or *orthogonal*, that is, their contributions to the discrimination between groups will not overlap. Computationally, you will perform a canonical correlation analysis (see also Canonical Correlation) that will determine the successive functions and canonical *roots* (the term root refers to the eigenvalues that are associated with the respective canonical function). The maximum number of functions will be equal to the number of groups minus one, or the number of variables in the analysis, whichever is smaller.

**Interpreting the discriminant functions. **As before, we will get *b* (and standardized *beta*) coefficients for each variable in each discriminant (now also called *canonical*) function, and they can be interpreted as usual: the larger the standardized coefficient, the greater is the contribution of the respective variable to the discrimination between groups. (Note that we could also interpret the *structure coefficients*; see below.) However, these coefficients do not tell us between which of the groups the respective functions discriminate. We can identify the nature of the discrimination for each discriminant (canonical) function by looking at the means for the functions across groups. We can also visualize how the two functions discriminate between groups by plotting the individual scores for the two discriminant functions (see the example graph below).

In this example, *Root* (function) *1* seems to discriminate mostly between groups *Setosa*, and *Virginic* and *Versicol* combined. In the vertical direction (*Root 2*), a slight trend of *Versicol* points to fall below the center line (*0*) is apparent.

**Factor structure matrix. **Another way to determine which variables “mark” or define a particular discriminant function is to look at the factor structure. The factor structure coefficients are the correlations between the variables in the model and the discriminant functions; if you are familiar with factor analysis (see *Factor Analysis*) you may think of these correlations as factor *loadings* of the variables on each discriminant function.

Some authors have argued that these structure coefficients should be used when interpreting the substantive “meaning” of discriminant functions. The reasons given by those authors are that (1) supposedly the structure coefficients are more stable, and (2) they allow for the interpretation of factors (discriminant functions) in the manner that is analogous to factor analysis. However, subsequent Monte Carlo research (Barcikowski & Stevens, 1975; Huberty, 1975) has shown that the discriminant function coefficients and the structure coefficients are about equally unstable, unless the n is fairly large (e.g., if there are 20 times more cases than there are variables). The most important thing to remember is that the discriminant function coefficients denote the unique (partial) contribution of each variable to the discriminant function(s), while the structure coefficients denote the simple correlations between the variables and the function(s). If one wants to assign substantive “meaningful” labels to the discriminant functions (akin to the interpretation of factors in factor analysis), then the structure coefficients should be used (interpreted); if one wants to learn what is each variable’s unique contribution to the discriminant function, use the discriminant function coefficients (weights).

**Significance of discriminant functions. **One can test the number of roots that add *significantly* to the discrimination between group. Only those found to be statistically significant should be used for interpretation; non-significant functions (roots) should be ignored.

**Summary. **To summarize, when interpreting multiple discriminant functions, which arise from analyses with more than two groups and more than one variable, one would first test the different functions for statistical significance, and only consider the significant functions for further examination. Next, we would look at the standardized *b* coefficients for each variable for each significant function. The larger the standardized *b* coefficient, the larger is the respective variable’s unique contribution to the discrimination specified by the respective discriminant function. In order to derive substantive “meaningful” labels for the discriminant functions, one can also examine the factor structure matrix with the correlations between the variables and the discriminant functions. Finally, we would look at the means for the significant discriminant functions in order to determine between which groups the respective functions seem to discriminate.

## Assumptions

As mentioned earlier, discriminant function analysis is computationally very similar to MANOVA, and all assumptions for *MANOVA* mentioned in ANOVA/MANOVA apply. In fact, you may use the wide range of diagnostics and statistical tests of assumption that are available to examine your data for the discriminant analysis.

**Normal distribution.** It is assumed that the data (for the variables) represent a sample from a multivariate normal distribution. You can examine whether or not variables are normally distributed with histograms of frequency distributions. However, note that violations of the normality assumption are usually not “fatal,” meaning, that the resultant significance tests etc. are still “trustworthy.” You may use specific tests for normality in addition to graphs.

**Homogeneity of variances/covariances. **It is assumed that the variance/covariance matrices of variables are homogeneous across groups. Again, minor deviations are not that important; however, before accepting final conclusions for an important study it is probably a good idea to review the within-groups variances and correlation matrices. In particular a scatterplot matrix can be produced and can be very useful for this purpose. When in doubt, try re-running the analyses excluding one or two groups that are of less interest. If the overall results (interpretations) hold up, you probably do not have a problem. You may also use the numerous tests available to examine whether or not this assumption is violated in your data. However, as mentioned in *ANOVA/MANOVA*, the multivariate Box *M* test for homogeneity of variances/covariances is particularly sensitive to deviations from multivariate normality, and should not be taken too “seriously.”

**Correlations between means and variances. **The major “real” threat to the validity of significance tests occurs when the means for variables across groups are correlated with the variances (or standard deviations). Intuitively, if there is large variability in a group with particularly high means on some variables, then those high means are not reliable. However, the overall significance tests are based on pooled variances, that is, the average variance across all groups. Thus, the significance tests of the relatively larger means (with the large variances) would be based on the relatively smaller pooled variances, resulting erroneously in statistical significance. In practice, this pattern may occur if one group in the study contains a few extreme outliers, who have a large impact on the means, and also increase the variability. To guard against this problem, inspect the descriptive statistics, that is, the means and standard deviations or variances for such a correlation.

**The matrix ill-conditioning problem. **Another assumption of discriminant function analysis is that the variables that are used to discriminate between groups are not completely redundant. As part of the computations involved in discriminant analysis, you will invert the variance/covariance matrix of the variables in the model. If any one of the variables is completely redundant with the other variables then the matrix is said to be *ill-conditioned*, and it cannot be inverted. For example, if a variable is the sum of three other variables that are also in the model, then the matrix is ill-conditioned.

**Tolerance values. **In order to guard against matrix ill-conditioning, constantly check the so-called tolerance value for each variable. This tolerance value is computed as *1 minus R-square* of the respective variable with all other variables included in the current model. Thus, it is the proportion of variance that is unique to the respective variable. You may also refer to Multiple Regression to learn more about multiple regression and the interpretation of the tolerance value. In general, when a variable is almost completely redundant (and, therefore, the matrix ill-conditioning problem is likely to occur), the tolerance value for that variable will approach 0.

## Classification

Another major purpose to which discriminant analysis is applied is the issue of predictive classification of cases. Once a model has been finalized and the discriminant functions have been derived, how well can we *predict* to which group a particular case belongs?

** A priori and post hoc predictions. **Before going into the details of different estimation procedures, we would like to make sure that this difference is clear. Obviously, if we estimate, based on some data set, the discriminant functions that best discriminate between groups, and then use the

*same*data to evaluate how accurate our prediction is, then we are very much capitalizing on chance. In general, one will

*always*get a worse classification when predicting cases that were not used for the estimation of the discriminant function. Put another way,

*post hoc*predictions are always better than

*a priori*predictions. (The trouble with predicting the future

*a priori*is that one does not know what will happen; it is much easier to find ways to predict what we already know has happened.) Therefore, one should never base one’s confidence regarding the correct classification of future observations on the same data set from which the discriminant functions were derived; rather, if one wants to classify cases predictively, it is necessary to collect new data to “try out” (cross-validate) the utility of the discriminant functions.

**Classification functions. **These are not to be confused with the discriminant functions. The classification functions can be used to determine to which group each case most likely belongs. There are as many classification functions as there are groups. Each function allows us to compute *classification scores* for each case for each group, by applying the formula:

S_{i} = c_{i} + w_{i1}*x_{1} + w_{i2}*x_{2} + … + w_{im}*x_{m}

In this formula, the subscript *i* denotes the respective group; the subscripts *1, 2, …, m* denote the *m* variables; *c _{i}* is a constant for the

*i*‘th group,

*w*is the weight for the

_{ij}*j*‘th variable in the computation of the classification score for the

*i*‘th group;

*x*is the observed value for the respective case for the

_{j}*j*‘th variable.

*S*is the resultant classification score.

_{i}We can use the classification functions to directly compute classification scores for some new observations.

**Classification of cases. **Once we have computed the classification scores for a case, it is easy to decide how to classify the case: in general we classify the case as belonging to the group for which it has the highest classification score (unless the *a priori* classification probabilities are widely disparate; see below). Thus, if we were to study high school students’ post-graduation career/educational choices (e.g., attending college, attending a professional or trade school, or getting a job) based on several variables assessed one year prior to graduation, we could use the classification functions to predict what each student is most likely to do after graduation. However, we would also like to know the *probability* that the student will make the predicted choice. Those probabilities are called *posterior* probabilities, and can also be computed. However, to understand how those probabilities are derived, let us first consider the so-called *Mahalanobis* distances.

**Mahalanobis distances. **You may have read about these distances in other parts of the manual. In general, the Mahalanobis distance is a measure of distance between two points in the space defined by two or more correlated variables. For example, if there are two variables that are uncorrelated, then we could plot points (cases) in a standard two-dimensional scatterplot; the Mahalanobis distances between the points would then be identical to the Euclidean distance; that is, the distance as, for example, measured by a ruler. If there are three uncorrelated variables, we could also simply use a ruler (in a 3-D plot) to determine the distances between points. If there are more than 3 variables, we cannot represent the distances in a plot any more. Also, when the variables are correlated, then the axes in the plots can be thought of as being *non-orthogonal*; that is, they would not be positioned in right angles to each other. In those cases, the simple Euclidean distance is not an appropriate measure, while the Mahalanobis distance will adequately account for the correlations.

**Mahalanobis distances and classification. **For each group in our sample, we can determine the location of the point that represents the means for all variables in the multivariate space defined by the variables in the model. These points are called group *centroids*. For each case we can then compute the Mahalanobis distances (of the respective case) from each of the group centroids. Again, we would classify the case as belonging to the group to which it is closest, that is, where the Mahalanobis distance is smallest.

**Posterior classification probabilities. **Using the Mahalanobis distances to do the classification, we can now derive probabilities. The probability that a case belongs to a particular group is basically proportional to the Mahalanobis distance from that group centroid (it is not exactly proportional because we assume a multivariate normal distribution around each centroid). Because we compute the location of each case from our prior knowledge of the values for that case on the variables in the model, these probabilities are called *posterior* probabilities. In summary, the posterior probability is the probability, based on our knowledge of the values of other variables, that the respective case belongs to a particular group. Some software packages will automatically compute those probabilities for all cases (or for selected cases only for cross-validation studies).

**A priori classification probabilities. **There is one additional factor that needs to be considered when classifying cases. Sometimes, we know ahead of time that there are more observations in one group than in any other; thus, the *a priori* probability that a case belongs to that group is higher. For example, if we know ahead of time that 60% of the graduates from our high school usually go to college (20% go to a professional school, and another 20% get a job), then we should adjust our prediction accordingly: *a priori*, and all other things being equal, it is more likely that a student will attend college that choose either of the other two options. You can specify different *a priori* probabilities, which will then be used to adjust the classification of cases (and the computation of posterior probabilities) accordingly.

In practice, the researcher needs to ask him or herself whether the unequal number of cases in different groups in the sample is a reflection of the true distribution in the population, or whether it is only the (random) result of the sampling procedure. In the former case, we would set the *a priori* probabilities to be proportional to the sizes of the groups in our sample, in the latter case we would specify the *a priori* probabilities as being equal in each group. The specification of different *a priori* probabilities can greatly affect the accuracy of the prediction.

**Summary of the prediction. **A common result that one looks at in order to determine how well the current classification functions predict group membership of cases is the *classification matrix*. The classification matrix shows the number of cases that were correctly classified (on the diagonal of the matrix) and those that were misclassified.

**Another word of caution. **To reiterate, *post hoc* predicting of what has happened in the past is not that difficult. It is not uncommon to obtain very good classification if one uses the same cases from which the classification functions were computed. In order to get an idea of how well the current classification functions “perform,” one must classify (*a priori*) *different* cases, that is, cases that were not used to estimate the classification functions. You can include or exclude cases from the computations; thus, the classification matrix can be computed for “old” cases as well as “new” cases. Only the classification of new cases allows us to assess the predictive validity of the classification functions (see also cross-validation); the classification of old cases only provides a useful diagnostic tool to identify outliers or areas where the classification function seems to be less adequate.

**Summary. **In general *Discriminant Analysis* is a very useful tool (1) for detecting the variables that allow the researcher to discriminate between different (naturally occurring) groups, and (2) for classifying cases into different groups with a better than chance accuracy.

## StatSoft Recognized for “Top Commercial Tool” in Poll

*“For the first time, the number of users of free/open source software exceeded the number of users of commercial software. The usage of Big Data software grew five-fold. R, Excel, and RapidMiner were the most popular tools, **with StatSoft STATISTICA getting the top commercial tool spot.”* – KDnuggets.com

The 13th Annual KDnuggets™ Software Poll asked: **What Analytics, Data Mining, or Big Data software have you used in the past 12 months for a real project (not just evaluation)?**

This May 2012 poll attracted “a very large number of participants and used email verification” to ensure one vote per respondent. Once again, StatSoft’s *STATISTICA* received very high marks, earning “top commercial tool” in this poll.

Complete poll results and analysis can be found at KDnuggets.com.

KDnuggets.com is a data mining portal and newsletter publisher for the data mining community with more than 12,000 subscribers.

## The 13th Annual KDnuggets™ Software Poll – STATISTICA

*“For the first time, the number of users of free/open source software exceeded the number of users of commercial software. The usage of Big Data software grew five-fold. R, Excel, and RapidMiner were the most popular tools, with StatSoft STATISTICA getting the top commercial tool spot.”* – KDnuggets.com

The 13th Annual KDnuggets™ Software Poll asked: **What Analytics, Data Mining, or Big Data software have you used in the past 12 months for a real project (not just evaluation)?**

This May 2012 poll attracted “a very large number of participants and used email verification” to ensure one vote per respondent. Once again, StatSoft’s *STATISTICA* received very high marks, earning “top commercial tool” in this poll.

Complete poll results and analysis can be found at http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html.

KDnuggets.com is a data mining portal and newsletter publisher for the data mining community with more than 12,000 subscribers.