Monthly Archives: June 2012
Design of Experiments: Science, Industrial DOE
DOE Overview
Experiments in Science and Industry
Experimental methods are widely used in research as well as in industrial settings, however, sometimes for very different purposes. The primary goal in scientific research is usually to show the statistical significance of an effect that a particular factor exerts on the dependent variable of interest (for details concerning the concept of statistical significance see Elementary Concepts).
In industrial settings, the primary goal is usually to extract the maximum amount of unbiased information regarding the factors affecting a production process from as few (costly) observations as possible. While in the former application (in science) analysis of variance (ANOVA) techniques are used to uncover the interactive nature of reality, as manifested in higherorder interactions of factors, in industrial settings interaction effects are often regarded as a “nuisance” (they are often of no interest; they only complicate the process of identifying important factors).
Differences in Techniques
These differences in purpose have a profound effect on the techniques that are used in the two settings. If you review a standard ANOVA text for the sciences, for example the classic texts by Winer (1962) or Keppel (1982), you will find that they will primarily discuss designs with up to, perhaps, five factors (designs with more than six factors are usually impractical; see ANOVA/MANOVA). The focus of these discussions is how to derive valid and robust statistical significance tests. However, if you review standard texts on experimentation in industry (Box, Hunter, and Hunter, 1978; Box and Draper, 1987; Mason, Gunst, and Hess, 1989; Taguchi, 1987) you will find that they will primarily discuss designs with many factors (e.g., 16 or 32) in which interaction effects cannot be evaluated, and the primary focus of the discussion is how to derive unbiased main effect (and, perhaps, twoway interaction) estimates with a minimum number of observations.
This comparison can be expanded further, however, a more detailed description of experimental design in industry will now be discussed and other differences will become clear. Note that the General Linear Models and ANOVA/MANOVA topics contain detailed discussions of typical design issues in scientific research; the General Linear Model procedure is a very comprehensive implementation of the general linear model approach to ANOVA/MANOVA (univariate and multivariate ANOVA). There are of course applications in industry where general ANOVA designs, as used in scientific research, can be immensely useful. You may want to read the General Linear Models and ANOVA/MANOVA topics to gain a more general appreciation of the range of methods encompassed by the term Experimental Design.
Overview
The general ideas and principles on which experimentation in industry is based, and the types of designs used will be discussed in the following paragraphs. The following paragraphs are meant to be introductory in nature. However, it is assumed that you are familiar with the basic ideas of analysis of variance and the interpretation of main effects and interactions in ANOVA. Otherwise, it is strongly recommended that you read the Introductory Overview section for ANOVA/MANOVA and General Linear Models.
General Ideas
In general, every machine used in a production process allows its operators to adjust various settings, affecting the resultant quality of the product manufactured by the machine. Experimentation allows the production engineer to adjust the settings of the machine in a systematic manner and to learn which factors have the greatest impact on the resultant quality. Using this information, the settings can be constantly improved until optimum quality is obtained. To illustrate this reasoning, here are a few examples:
Example 1: Dyestuff manufacture. Box and Draper (1987, page 115) report an experiment concerned with the manufacture of certain dyestuff. Quality in this context can be described in terms of a desired (specified) hue and brightness and maximum fabric strength. Moreover, it is important to know what to change in order to produce a different hue and brightness should the consumers’ taste change. Put another way, the experimenter would like to identify the factors that affect the brightness, hue, and strength of the final product. In the example described by Box and Draper, there are 6 different factors that are evaluated in a 2**(60) design (the 2**(kp) notation is explained below). The results of the experiment show that the three most important factors determining fabric strength are the Polysulfide index, Time, and Temperature (see Box and Draper, 1987, page 116). You can summarize the expected effect (predicted means) for the variable of interest (i.e., fabric strength in this case) in a so called cubeplot. This plot shows the expected (predicted) mean fabric strength for the respective low and high settings for each of the three variables (factors).
Example 1.1: Screening designs. In the previous example, 6 different factors were simultaneously evaluated. It is not uncommon, that there are very many (e.g., 100) different factors that may potentially be important. Special designs (e.g., PlackettBurman designs, see Plackett and Burman, 1946) have been developed to screen such large numbers of factors in an efficient manner, that is, with the least number of observations necessary. For example, you can design and analyze an experiment with 127 factors and only 128 runs (observations); still, you will be able to estimate the main effects for each factor, and thus, you can quickly identify which ones are important and most likely to yield improvements in the process under study.
Example 2: 3**3 design. Montgomery (1976, page 204) describes an experiment conducted in order identify the factors that contribute to the loss of soft drink syrup due to frothing during the filling of five gallon metal containers. Three factors were considered: (a) the nozzle configuration, (b) the operator of the machine, and (c) the operating pressure. Each factor was set at three different levels, resulting in a complete 3**(30) experimental design (the 3**(kp) notation is explained below).
Moreover, two measurements were taken for each combination of factor settings, that is, the 3**(30) design was completely replicated once.
Example 3: Maximizing yield of a chemical reaction. The yield of many chemical reactions is a function of time and temperature. Unfortunately, these two variables often do not affect the resultant yield in a linear fashion. In other words, it is not so that “the longer the time, the greater the yield” and “the higher the temperature, the greater the yield.” Rather, both of these variables are usually related in a curvilinear fashion to the resultant yield.
Thus, in this example your goal as experimenter would be to optimize the yield surface that is created by the two variables: time and temperature.
Example 4: Testing the effectiveness of four fuel additives. Latin square designs are useful when the factors of interest are measured at more than two levels, and the nature of the problem suggests some blocking. For example, imagine a study of 4 fuel additives on the reduction in oxides of nitrogen (see Box, Hunter, and Hunter, 1978, page 263). You may have 4 drivers and 4 cars at your disposal. You are not particularly interested in any effects of particular cars or drivers on the resultant oxide reduction; however, you do not want the results for the fuel additives to be biased by the particular driver or car. Latin square designs allow you to estimate the main effects of all factors in the design in an unbiased manner. With regard to the example, the arrangement of treatment levels in a Latin square design assures that the variability among drivers or cars does not affect the estimation of the effect due to different fuel additives.
Example 5: Improving surface uniformity in the manufacture of polysilicon wafers. The manufacture of reliable microprocessors requires very high consistency in the manufacturing process. Note that in this instance, it is equally, if not more important to control the variability of certain product characteristics than it is to control the average for a characteristic. For example, with regard to the average surface thickness of the polysilicon layer, the manufacturing process may be perfectly under control; yet, if the variability of the surface thickness on a wafer fluctuates widely, the resultant microchips will not be reliable. Phadke (1989) describes how different characteristics of the manufacturing process (such as deposition temperature, deposition pressure, nitrogen flow, etc.) affect the variability of the polysilicon surface thickness on wafers. However, no theoretical model exists that would allow the engineer to predict how these factors affect the uniformness of wafers. Therefore, systematic experimentation with the factors is required to optimize the process. This is a typical example where Taguchi robust design methods would be applied.
Example 6: Mixture designs. Cornell (1990, page 9) reports an example of a typical (simple) mixture problem. Specifically, a study was conducted to determine the optimum texture of fish patties as a result of the relative proportions of different types of fish (Mullet, Sheepshead, and Croaker) that made up the patties. Unlike in nonmixture experiments, the total sum of the proportions must be equal to a constant, for example, to 100%. The results of such experiments are usually graphically represented in socalled triangular (or ternary) graphs.
In general, the overall constraint — that the three components must sum to a constant — is reflected in the triangular shape of the graph (see above).
Example 6.1: Constrained mixture designs. It is particularly common in mixture designs that the relative amounts of components are further constrained (in addition to the constraint that they must sum to, for example, 100%). For example, suppose we wanted to design the besttasting fruit punch consisting of a mixture of juices from five fruits. Since the resulting mixture is supposed to be a fruit punch, pure blends consisting of the pure juice of only one fruit are necessarily excluded. Additional constraints may be placed on the “universe” of mixtures due to cost constraints or other considerations, so that one particular fruit cannot, for example, account for more than 30% of the mixtures (otherwise the fruit punch would be too expensive, the shelflife would be compromised, the punch could not be produced in large enough quantities, etc.). Such socalled constrained experimental regions present numerous problems, which, however, can be addressed.
In general, under those conditions, you seek to design an experiment that can potentially extract the maximum amount of information about the respective response function (e.g., taste of the fruit punch) in the experimental region of interest.
Computational Problems
There are basically two general issues to which Experimental Design is addressed:
 How to design an optimal experiment, and
 How to analyze the results of an experiment.
With regard to the first question, there are different considerations that enter into the different types of designs, and they will be discussed shortly. In the most general terms, the goal is always to allow the experimenter to evaluate in an unbiased (or least biased) way, the consequences of changing the settings of a particular factor, that is, regardless of how other factors were set. In more technical terms, you attempt to generate designs where main effects are unconfounded among themselves, and in some cases, even unconfounded with the interaction of factors.
Components of Variance, Denominator Synthesis
There are several statistical methods for analyzing designs with random effects (see Methods for Analysis of Variance). The Variance Components and Mixed Model ANOVA/ANCOVA topic discusses numerous options for estimating variance components for random effects, and for performing approximate F tests based on synthesized error terms.
Summary
Experimental methods are finding increasing use in manufacturing to optimize the production process. Specifically, the goal of these methods is to identify the optimum settings for the different factors that affect the production process. In the discussion so far, the major classes of designs that are typically used in industrial experimentation have been introduced: 2**(kp) (twolevel, multifactor) designs, screening designs for large numbers of factors, 3**(kp) (threelevel, multifactor) designs (mixed designs with 2 and 3 level factors are also supported), central composite (or response surface) designs, Latin square designs, Taguchi robust design analysis, mixture designs, and special procedures for constructing experiments in constrained experimental regions. Interestingly, many of these experimental techniques have “made their way” from the production plant into management, and successful implementations have been reported in profit planning in business, cashflow optimization in banking, etc. (e.g., see Yokyama and Taguchi, 1975).
These techniques will now be described in greater detail in the following sections:
 2**(kp) Fractional Factorial Designs
 2**(kp) Maximally Unconfounded and Minimum Aberration Designs
 3**(kp) , BoxBehnken, and Mixed 2 and 3 Level Factorial Designs
 Central Composite and NonFactorial Response Surface Designs
 Latin Square Designs
 Taguchi Methods: Robust Design Experiments
 Mixture designs and triangular surfaces
 Designs for constrained surfaces and mixtures
 Constructing D and Aoptimal designs for surfaces and mixtures
2**(kp) Fractional Factorial Designs at 2 Levels
Basic Idea
In many cases, it is sufficient to consider the factors affecting the production process at two levels. For example, the temperature for a chemical process may either be set a little higher or a little lower, the amount of solvent in a dyestuff manufacturing process can either be slightly increased or decreased, etc. The experimenter would like to determine whether any of these changes affect the results of the production process. The most intuitive approach to study those factors would be to vary the factors of interest in a full factorial design, that is, to try all possible combinations of settings. This would work fine, except that the number of necessary runs in the experiment (observations) will increase geometrically. For example, if you want to study 7 factors, the necessary number of runs in the experiment would be 2**7 = 128. To study 10 factors you would need 2**10 = 1,024 runs in the experiment. Because each run may require timeconsuming and costly setting and resetting of machinery, it is often not feasible to require that many different production runs for the experiment. In these conditions, fractional factorials are used that “sacrifice” interaction effects so that main effects may still be computed correctly.
Generating the Design
A technical description of how fractional factorial designs are constructed is beyond the scope of this introduction. Detailed accounts of how to design 2**(kp) experiments can be found, for example, in Bayne and Rubin (1986), Box and Draper (1987), Box, Hunter, and Hunter (1978), Montgomery (1991), Daniel (1976), Deming and Morgan (1993), Mason, Gunst, and Hess (1989), or Ryan (1989), to name only a few of the many text books on this subject. In general, it will successively “use” the highestorder interactions to generate new factors. For example, consider the following design that includes 11 factors but requires only 16 runs (observations).
Design: 2**(117), Resolution III  

Run  A  B  C  D  E  F  G  H  I  J  K 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
Reading the design. The design displayed above should be interpreted as follows. Each column contains +1’s or 1’s to indicate the setting of the respective factor (high or low, respectively). So for example, in the first run of the experiment, set all factors A through K to the plus setting (e.g., a little higher than before); in the second run, set factors A, B, and C to the positive setting, factor D to the negative setting, and so on. Note that there are numerous options provided to display (and save) the design using notation other than ±1 to denote factor settings. For example, you may use actual values of factors (e.g., 90 degrees Celsius and 100 degrees Celsius) or text labels (Low temperature, High temperature).
Randomizing the runs. Because many other things may change from production run to production run, it is always a good practice to randomize the order in which the systematic runs of the designs are performed.
The Concept of Design Resolution
The design above is described as a 2**(117) design of resolution III (three). This means that you study overall k = 11 factors (the first number in parentheses); however, p = 7 of those factors (the second number in parentheses) were generated from the interactions of a full 2**[(117) = 4] factorial design. As a result, the design does not give full resolution; that is, there are certain interaction effects that are confounded with (identical to) other effects. In general, a design of resolution R is one where no lway interactions are confounded with any other interaction of order less than Rl. In the current example, R is equal to 3. Here, no l = 1 level interactions (i.e., main effects) are confounded with any other interaction of order less than Rl = 31 = 2. Thus, main effects in this design are confounded with two way interactions; and consequently, all higherorder interactions are equally confounded. If you had included 64 runs, and generated a 2**(115) design, the resultant resolution would have been R = IV (four). You would have concluded that no l=1way interaction (main effect) is confounded with any other interaction of order less than Rl = 41 = 3. In this design then, main effects are not confounded with twoway interactions, but only with threeway interactions. What about the twoway interactions? No l=2way interaction is confounded with any other interaction of order less than Rl = 42 = 2. Thus, the twoway interactions in that design are confounded with each other.
PlackettBurman (Hadamard Matrix) Designs for Screening
When you need to screen a large number of factors to identify those that may be important (i.e., those that are related to the dependent variable of interest), you want to employ a design that allows you to test the largest number of factor main effects with the least number of observations, that is to construct a resolution III design with as few runs as possible. One way to design such experiments is to confound all interactions with “new” main effects. Such designs are also sometimes called saturated designs, because all information in those designs is used to estimate the parameters, leaving no degrees of freedom to estimate the error term for the ANOVA. Because the added factors are created by equating (aliasing, see below), the “new” factors with the interactions of a full factorial design, these designs always will have 2**k runs (e.g., 4, 8, 16, 32, and so on). Plackett and Burman (1946) showed how full factorial design can be fractionalized in a different manner, to yield saturated designs where the number of runs is a multiple of 4, rather than a power of 2. These designs are also sometimes called Hadamard matrix designs. Of course, you do not have to use all available factors in those designs, and, in fact, sometimes you want to generate a saturated design for one more factor than you are expecting to test. This will allow you to estimate the random error variability, and test for the statistical significance of the parameter estimates.
Enhancing Design Resolution via Foldover
One way in which a resolution III design can be enhanced and turned into a resolution IV design is via foldover (e.g., see Box and Draper, 1987, Deming and Morgan, 1993): Suppose you have a 7factor design in 8 runs:
Design: 2**(74) design  

Run  A  B  C  D  E  F  G 
1 2 3 4 5 6 7 8 
1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 
This is a resolution III design, that is, the twoway interactions will be confounded with the main effects. You can turn this design into a resolution IV design via the Foldover (enhance resolution) option. The foldover method copies the entire design and appends it to the end, reversing all signs:
Design: 2**(74) design (+Foldover)  

Run 
A 
B 
C 
D 
E 
F 
G 
New: H 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
Thus, the standard run number 1 was 1, 1, 1, 1, 1, 1, 1; the new run number 9 (the first run of the “foldedover” portion) has all signs reversed: 1, 1, 1, 1, 1, 1, 1. In addition to enhancing the resolution of the design, we also have gained an 8’th factor (factor H), which contains all +1’s for the first eight runs, and 1’s for the foldedover portion of the new design. Note that the resultant design is actually a 2**(84) design of resolution IV (see also Box and Draper, 1987, page 160).
Aliases of Interactions: Design Generators
To return to the example of the resolution R = III design, now that you know that main effects are confounded with twoway interactions, you may ask the question, “Which interaction is confounded with which main effect?”
Factor 
Fractional Design Generators 2**(117) design (Factors are denoted by numbers) Alias 

5 6 7 8 9 10 11 
123 234 134 124 1234 12 13 
Design generators. The design generators shown above are the “key” to how factors 5 through 11 were generated by assigning them to particular interactions of the first 4 factors of the full factorial 2**4 design. Specifically, factor 5 is identical to the 123 (factor 1 by factor 2 by factor 3) interaction. Factor 6 is identical to the 234 interaction, and so on. Remember that the design is of resolution III (three), and you expect some main effects to be confounded with some twoway interactions; indeed, factor 10 (ten) is identical to the 12 (factor 1 by factor 2) interaction, and factor 11 (eleven) is identical to the 13 (factor 1 by factor 3) interaction. Another way in which these equivalencies are often expressed is by saying that the main effect for factor 10 (ten) is an alias for the interaction of 1 by 2. (The term alias was first used by Finney, 1945).
To summarize, whenever you want to include fewer observations (runs) in your experiment than would be required by the full factorial 2**k design, you “sacrifice” interaction effects and assign them to the levels of factors. The resulting design is no longer a full factorial but a fractional factorial.
The fundamental identity. Another way to summarize the design generators is in a simple equation. Namely, if, for example, factor 5 in a fractional factorial design is identical to the 123 (factor 1 by factor 2 by factor 3) interaction, then it follows that multiplying the coded values for the 123 interaction by the coded values for factor 5 will always result in +1 (if all factor levels are coded ±1); or:
I = 1235
where I stands for +1 (using the standard notation as, for example, found in Box and Draper, 1987). Thus, we also know that factor 1 is confounded with the 235 interaction, factor 2 with the 135, interaction, and factor 3 with the 125 interaction, because, in each instance their product must be equal to 1. The confounding of twoway interactions is also defined by this equation, because the 12 interaction multiplied by the 35 interaction must yield 1, and hence, they are identical or confounded. Therefore, you can summarize all confounding in a design with such a fundamental identity equation.
Blocking
In some production processes, units are produced in natural “chunks” or blocks. You want to make sure that these blocks do not bias your estimates of main effects. For example, you may have a kiln to produce special ceramics, but the size of the kiln is limited so that you cannot produce all runs of your experiment at once. In that case you need to break up the experiment into blocks. However, you do not want to run positive settings of all factors in one block, and all negative settings in the other. Otherwise, any incidental differences between blocks would systematically affect all estimates of the main effects of the factors of interest. Rather, you want to distribute the runs over the blocks so that any differences between blocks (i.e., the blocking factor) do not bias your results for the factor effects of interest. This is accomplished by treating the blocking factor as another factor in the design. Consequently, you “lose” another interaction effect to the blocking factor, and the resultant design will be of lower resolution. However, these designs often have the advantage of being statistically more powerful, because they allow you to estimate and control the variability in the production process that is due to differences between blocks.
Replicating the Design
It is sometimes desirable to replicate the design, that is, to run each combination of factor levels in the design more than once. This will allow you to later estimate the socalled pure error in the experiment. The analysis of experiments is further discussed below; however, it should be clear that, when replicating the design, you can compute the variability of measurements within each unique combination of factor levels. This variability will give an indication of the random error in the measurements (e.g., due to uncontrolled factors, unreliability of the measurement instrument, etc.), because the replicated observations are taken under identical conditions (settings of factor levels). Such an estimate of the pure error can be used to evaluate the size and statistical significance of the variability that can be attributed to the manipulated factors.
Partial replications. When it is not possible or feasible to replicate each unique combination of factor levels (i.e., the full design), you can still gain an estimate of pure error by replicating only some of the runs in the experiment. However, you must be careful to consider the possible bias that may be introduced by selectively replicating only some runs. If you only replicate those runs that are most easily repeated (e.g., gathers information at the points where it is “cheapest”), you may inadvertently only choose those combinations of factor levels that happen to produce very little (or very much) random variability – causing you to underestimate (or overestimate) the true amount of pure error. Thus, you should carefully consider, typically based on your knowledge about the process that is being studied, which runs should be replicated, that is, which runs will yield a good (unbiased) estimate of pure error.
Adding Center Points
Designs with factors that are set at two levels implicitly assume that the effect of the factors on the dependent variable of interest (e.g., fabric Strength) is linear. It is impossible to test whether or not there is a nonlinear (e.g., quadratic) component in the relationship between a factor A and a dependent variable, if A is only evaluated at two points (.i.e., at the low and high settings). If you suspects that the relationship between the factors in the design and the dependent variable is rather curvelinear, then you should include one or more runs where all (continuous) factors are set at their midpoint. Such runs are called centerpoint runs (or center points), since they are, in a sense, in the center of the design (see graph).
Later in the analysis (see below), you can compare the measurements for the dependent variable at the center point with the average for the rest of the design. This provides a check for curvature (see Box and Draper, 1987): If the mean for the dependent variable at the center of the design is significantly different from the overall mean at all other points of the design, then you have good reason to believe that the simple assumption that the factors are linearly related to the dependent variable, does not hold.
Analyzing the Results of a 2**(kp) Experiment
Analysis of variance. Next, you need to determine exactly which of the factors significantly affected the dependent variable of interest. For example, in the study reported by Box and Draper (1987, page 115), it is desired to learn which of the factors involved in the manufacture of dyestuffs affected the strength of the fabric. In this example, factors 1 (Polysulfide), 4 (Time), and 6 (Temperature) significantly affected the strength of the fabric. Note that to simplify matters, only main effects are shown below.
ANOVA; Var.:STRENGTH; Rsqr = .60614; Adj:.56469 (fabrico.sta)  

2**(60) design; MS Residual = 3.62509 DV: STRENGTH 

SS  df  MS  F  p  
(1)POLYSUFD (2)REFLUX (3)MOLES (4)TIME (5)SOLVENT (6)TEMPERTR Error Total SS 
48.8252 7.9102 .1702 142.5039 2.7639 115.8314 206.6302 524.6348 
1 1 1 1 1 1 57 63 
48.8252 7.9102 .1702 142.5039 2.7639 115.8314 3.6251 
13.46867 2.18206 .04694 39.31044 .76244 31.95269 
.000536 .145132 .829252 .000000 .386230 .000001 
Pure error and lack of fit. If the experimental design is at least partially replicated, then you can estimate the error variability for the experiment from the variability of the replicated runs. Since those measurements were taken under identical conditions, that is, at identical settings of the factor levels, the estimate of the error variability from those runs is independent of whether or not the “true” model is linear or nonlinear in nature, or includes higherorder interactions. The error variability so estimated represents pure error, that is, it is entirely due to unreliabilities in the measurement of the dependent variable. If available, you can use the estimate of pure error to test the significance of the residual variance, that is, all remaining variability that cannot be accounted for by the factors and their interactions that are currently in the model. If, in fact, the residual variability is significantly larger than the pure error variability, then you can conclude that there is still some statistically significant variability left that is attributable to differences between the groups, and hence, that there is an overall lack of fit of the current model.
ANOVA; Var.:STRENGTH; Rsqr = .58547; Adj:.56475 (fabrico.sta)  

2**(30) design; MS Pure Error = 3.594844 DV: STRENGTH 

SS  df  MS  F  p  
(1)POLYSUFD (2)TIME (3)TEMPERTR Lack of Fit Pure Error Total SS 
48.8252 142.5039 115.8314 16.1631 201.3113 524.6348 
1 1 1 4 56 63 
48.8252 142.5039 115.8314 4.0408 3.5948 
13.58200 39.64120 32.22154 1.12405 
.000517 .000000 .000001 .354464 
For example, the table above shows the results for the three factors that were previously identified as most important in their effect on fabric strength; all other factors were ignored in the analysis. As you can see in the row with the label Lack of Fit, when the residual variability for this model (i.e., after removing the three main effects) is compared against the pure error estimated from the withingroup variability, the resulting F test is not statistically significant. Therefore, this result additionally supports the conclusion that, indeed, factors Polysulfide, Time, and Temperature significantly affected resultant fabric strength in an additive manner (i.e., there are no interactions). Or, put another way, all differences between the means obtained in the different experimental conditions can be sufficiently explained by the simple additive model for those three variables.
Parameter or effect estimates. Now, look at how these factors affected the strength of the fabrics.
Effect  Std.Err.  t (57)  p  

Mean/Interc. (1)POLYSUFD (2)REFLUX (3)MOLES (4)TIME (5)SOLVENT (6)TEMPERTR 
11.12344 1.74688 .70313 .10313 2.98438 .41562 2.69062 
.237996 .475992 .475992 .475992 .475992 .475992 .475992 
46.73794 3.66997 1.47718 .21665 6.26980 .87318 5.65267 
.000000 .000536 .145132 .829252 .000000 .386230 .000001 
The numbers above are the effect or parameter estimates. With the exception of the overall Mean/Intercept, these estimates are the deviations of the mean of the negative settings from the mean of the positive settings for the respective factor. For example, if you change the setting of factor Time from low to high, then you can expect an improvement in Strength by 2.98; if you set the value for factor Polysulfd to its high setting, you can expect a further improvement by 1.75, and so on.
As you can see, the same three factors that were statistically significant show the largest parameter estimates; thus the settings of these three factors were most important for the resultant strength of the fabric.
For analyses including interactions, the interpretation of the effect parameters is a bit more complicated. Specifically, the twoway interaction parameters are defined as half the difference between the main effects of one factor at the two levels of a second factor (see Mason, Gunst, and Hess, 1989, page 127); likewise, the threeway interaction parameters are defined as half the difference between the twofactor interaction effects at the two levels of a third factor, and so on.
Regression coefficients. You can also look at the parameters in the multiple regression model (see Multiple Regression). To continue this example, consider the following prediction equation:
Strength = const + b_{1} *x_{1} +… + b_{6} *x_{6}
Here x_{1} through x_{6} stand for the 6 factors in the analysis. The Effect Estimates shown earlier also contains these parameter estimates:
Coeff. 
Std.Err. Coeff. 
95.% Cnf.Limt 
+95.% Cnf.Limt 


Mean/Interc. (1)POLYSUFD (2)REFLUX (3)MOLES (4)TIME (5)SOLVENT (6)TEMPERTR 
11.12344 .87344 .35156 .05156 1.49219 .20781 1.34531 
.237996 .237996 .237996 .237996 .237996 .237996 .237996 
10.64686 .39686 .12502 .42502 1.01561 .68439 .86873 
11.60002 1.35002 .82814 .52814 1.96877 .26877 1.82189 
Actually, these parameters contain little “new” information, as they simply are onehalf of the parameter values (except for the Mean/Intercept) shown earlier. This makes sense since now, the coefficient can be interpreted as the deviation of the highsetting for the respective factors from the center. However, note that this is only the case if the factor values (i.e., their levels) are coded as 1 and +1, respectively. Otherwise, the scaling of the factor values will affect the magnitude of the parameter estimates. In the example data reported by Box and Draper (1987, page 115), the settings or values for the different factors were recorded on very different scales:
data file: FABRICO.STA [ 64 cases with 9 variables ] 2**(60) Design, Box & Draper, p. 117 


POLYSUFD  REFLUX  MOLES  TIME  SOLVENT  TEMPERTR  STRENGTH  HUE  BRIGTHNS  
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 . . . 
6 7 6 7 6 7 6 7 6 7 6 7 6 7 6 . . . 
150 150 170 170 150 150 170 170 150 150 170 170 150 150 170 . . . 
1.8 1.8 1.8 1.8 2.4 2.4 2.4 2.4 1.8 1.8 1.8 1.8 2.4 2.4 2.4 . . . 
24 24 24 24 24 24 24 24 36 36 36 36 36 36 36 . . . 
30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 . . . 
120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 . . . 
3.4 9.7 7.4 10.6 6.5 7.9 10.3 9.5 14.3 10.5 7.8 17.2 9.4 12.1 9.5 . . . 
15.0 5.0 23.0 8.0 20.0 9.0 13.0 5.0 23.0 1.0 11.0 5.0 15.0 8.0 15.0 . . . 
36.0 35.0 37.0 34.0 30.0 32.0 28.0 38.0 40.0 32.0 32.0 28.0 34.0 26.0 30.0 . . . 
Shown below are the regression coefficient estimates based on the uncoded original factor values:
Regressn Coeff. 
Std.Err. 
t (57) 
p 


Mean/Interc. (1)POLYSUFD (2)REFLUX (3)MOLES (4)TIME (5)SOLVENT (6)TEMPERTR 
46.0641 1.7469 .0352 .1719 .2487 .0346 .2691 
8.109341 .475992 .023800 .793320 .039666 .039666 .047599 
5.68037 3.66997 1.47718 .21665 6.26980 .87318 5.65267 
.000000 .000536 .145132 .829252 .000000 .386230 .000001 
Because the metric for the different factors is no longer compatible, the magnitudes of the regression coefficients are not compatible either. This is why it is usually more informative to look at the ANOVA parameter estimates (for the coded values of the factor levels), as shown before. However, the regression coefficients can be useful when you want to make predictions for the dependent variable based on the original metric of the factors.
Graph Options
Diagnostic plots of residuals. To start with, before accepting a particular “model” that includes a particular number of effects (e.g., main effects for Polysulfide, Time, and Temperature in the current example), you should always examine the distribution of the residual values. These are computed as the difference between the predicted values (as predicted by the current model) and the observed values. You can compute the histogram for these residual values, as well as probability plots (as shown below).
The parameter estimates and ANOVA table are based on the assumption that the residuals are normally distributed (see also Elementary Concepts). The histogram provides one way to check (visually) whether this assumption holds. The socalled normal probability plot is another common tool to assess how closely a set of observed values (residuals in this case) follows a theoretical distribution. In this plot the actual residual values are plotted along the horizontal Xaxis; the vertical Yaxis shows the expected normal values for the respective values, after they were rankordered. If all values fall onto a straight line, then you can be satisfied that the residuals follow the normal distribution.
Pareto chart of effects. The Pareto chart of effects is often an effective tool for communicating the results of an experiment, in particular to laymen.
In this graph, the ANOVA effect estimates are sorted from the largest absolute value to the smallest absolute value. The magnitude of each effect is represented by a column, and often, a line going across the columns indicates how large an effect has to be (i.e., how long a column must be) to be statistically significant.
Normal probability plot of effects. Another useful, albeit more technical summary graph, is the normal probability plot of the estimates. As in the normal probability plot of the residuals, first the effect estimates are rank ordered, and then a normal z score is computed based on the assumption that the estimates are normally distributed. This z score is plotted on the Yaxis; the observed estimates are plotted on the Xaxis (as shown below).
Square and cube plots. These plots are often used to summarize predicted values for the dependent variable, given the respective high and low setting of the factors. The square plot (see below) will show the predicted values (and, optionally, their confidence intervals) for two factors at a time. The cube plot will show the predicted values (and, optionally, confidence intervals) for three factors at a time.
Interaction plots. A general graph for showing the means is the standard interaction plot, where the means are indicated by points connected by lines. This plot (see below) is particularly useful when there are significant interaction effects in the model.
Surface and contour plots. When the factors in the design are continuous in nature, it is often also useful to look at surface and contour plots of the dependent variable as a function of the factors.
These types of plots will further be discussed later in this section, in the context of 3**(kp), and central composite and response surface designs.
Summary
2**(kp) designs are the “workhorse” of industrial experiments. The impact of a large number of factors on the production process can simultaneously be assessed with relative efficiency (i.e., with few experimental runs). The logic of these types of experiments is straightforward (each factor has only two settings).
Disadvantages. The simplicity of these designs is also their major flaw. As mentioned before, underlying the use of twolevel factors is the belief that the resultant changes in the dependent variable (e.g., fabric strength) are basically linear in nature. This is often not the case, and many variables are related to quality characteristics in a nonlinear fashion. In the example above, if you were to continuously increase the temperature factor (which was significantly related to fabric strength), you would of course eventually hit a “peak,” and from there on the fabric strength would decrease as the temperature increases. While this types of curvature in the relationship between the factors in the design and the dependent variable can be detected if the design included center point runs, you cannot fit explicit nonlinear (e.g., quadratic) models with 2**(kp) designs (however, central composite designs will do exactly that).
Another problem of fractional designs is the implicit assumption that higherorder interactions do not matter; but sometimes they do, for example, when some other factors are set to a particular level, temperature may be negatively related to fabric strength. Again, in fractional factorial designs, higherorder interactions (greater than twoway) particularly will escape detection.
2**(kp) Maximally Unconfounded and Minimum Aberration Designs
Basic Idea
2**(kp) fractional factorial designs are often used in industrial experimentation because of the economy of data collection that they provide. For example, suppose dan engineer needed to investigate the effects of varying 11 factors, each with 2 levels, on a manufacturing process. Let’s call the number of factors k, which would be 11 for this example. An experiment using a full factorial design, wherfire the effects of every combination of levels of each factor are studied, would require 2**(k) experimental runs, or 2048 runs for this example. To minimize the data collection effort, the engineer might decide to forego investigation of higherorder interaction effects of the 11 factors, and focus instead on identifying the main effects of the 11 factors and any loworder interaction effects that could be estimated from an experiment using a smaller, more reasonable number of experimental runs. There is another, more theoretical reason for not conducting huge, full factorial 2 level experiments. In general, it is not logical to be concerned with identifying higherorder interaction effects of the experimental factors, while ignoring lowerorder nonlinear effects, such as quadratic or cubic effects, which cannot be estimated if only 2 levels of each factor are employed. So although practical considerations often lead to the need to design experiments with a reasonably small number of experimental runs, there is a logical justification for such experiments.
The alternative to the 2**(k) full factorial design is the 2**(kp) fractional factorial design, which requires only a “fraction” of the data collection effort required for full factorial designs. For our example with k=11 factors, if only 64 experimental runs can be conducted, a 2**(115) fractional factorial experiment would be designed with 2**6 = 64 experimental runs. In essence, a kp = 6 way full factorial experiment is designed, with the levels of the p factors being “generated” by the levels of selected higher order interactions of the other 6 factors. Fractional factorials “sacrifice” higher order interaction effects so that lower order effects may still be computed correctly. However, different criteria can be used in choosing the higher order interactions to be used as generators, with different criteria sometimes leading to different “best” designs.
2**(kp) fractional factorial designs can also include blocking factors. In some production processes, units are produced in natural “chunks” or blocks. To make sure that these blocks do not bias your estimates of the effects for the k factors, blocking factors can be added as additional factors in the design. Consequently, you may “sacrifice” additional interaction effects to generate the blocking factors, but these designs often have the advantage of being statistically more powerful, because they allow you to estimate and control the variability in the production process that is due to differences between blocks.
Design Criteria
Many of the concepts discussed in this overview are also addressed in the Overview of 2**(kp) Fractional factorial designs. However, a technical description of how fractional factorial designs are constructed is beyond the scope of either introductory overview. Detailed accounts of how to design 2**(kp) experiments can be found, for example, in Bayne and Rubin (1986), Box and Draper (1987), Box, Hunter, and Hunter (1978), Montgomery (1991), Daniel (1976), Deming and Morgan (1993), Mason, Gunst, and Hess (1989), or Ryan (1989), to name only a few of the many text books on this subject.
In general, the 2**(kp) maximally unconfounded and minimum aberration designs techniques will successively select which higherorder interactions to use as generators for the p factors. For example, consider the following design that includes 11 factors but requires only 16 runs (observations).
Design: 2**(117), Resolution III  

Run  A  B  C  D  E  F  G  H  I  J  K 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
Interpreting the design. The design displayed in the Scrollsheet above should be interpreted as follows. Each column contains +1’s or 1’s to indicate the setting of the respective factor (high or low, respectively). So for example, in the first run of the experiment, all factors A through K are set to the higher level, and in the second run, factors A, B, and C are set to the higher level, but factor D is set to the lower level, and so on. Notice that the settings for each experimental run for factor E can be produced by multiplying the respective settings for factors A, B, and C. The A x B x C interaction effect therefore cannot be estimated independently of the factor E effect in this design because these two effects are confounded. Likewise, the settings for factor F can be produced by multiplying the respective settings for factors B, C, and D. We say that ABC and BCD are the generators for factors E and F, respectively.
The maximum resolution design criterion. In the Scrollsheet shown above, the design is described as a 2**(117) design of resolution III (three). This means that you study overall k = 11 factors, but p = 7 of those factors were generated from the interactions of a full 2**[(117) = 4] factorial design. As a result, the design does not give full resolution; that is, there are certain interaction effects that are confounded with (identical to) other effects. In general, a design of resolution R is one where no lway interactions are confounded with any other interaction of order less than R – l. In the current example, R is equal to 3. Here, no l = 1way interactions (i.e., main effects) are confounded with any other interaction of order less than R – l = 3 1 = 2. Thus, main effects in this design are unconfounded with each other, but are confounded with twofactor interactions; and consequently, with other higherorder interactions. One obvious, but nevertheless very important overall design criterion is that the higherorder interactions to be used as generators should be chosen such that the resolution of the design is as high as possible.
The maximum unconfounding design criterion. Maximizing the resolution of a design, however, does not by itself ensure that the selected generators produce the “best” design. Consider, for example, two different resolution IV designs. In both designs, main effects would be unconfounded with each other and 2factor interactions would be unconfounded with main effects, i.e, no l = 2way interactions are confounded with any other interaction of order less than R – l = 4 – 2 = 2. The two designs might be different, however, with regard to the degree of confounding for the 2factor interactions. For resolution IV designs, the “crucial order,” in which confounding of effects first appears, is for 2factor interactions. In one design, none of the “crucial order,” 2factor interactions might be unconfounded with all other 2factor interactions, while in the other design, virtually all of the 2factor interactions might be unconfounded with all of the other 2factor interactions. The second “almost resolution V” design would be preferable to the first “just barely resolution IV” design. This suggests that even though the maximum resolution design criterion should be the primary criterion, a subsidiary criterion might be that generators should be chosen such that the maximum number of interactions of less than or equal to the crucial order, given the resolution, are unconfounded with all other interactions of the crucial order. This is called the maximum unconfounding design criterion, and is one of the optional, subsidiary design criterion to use in a search for a 2**(kp) design.
The minimum aberration design criterion. The minimum aberration design criterion is another optional, subsidiary criterion to use in a search for a 2**(kp) design. In some respects, this criterion is similar to the maximum unconfounding design criterion. Technically, the minimum aberration design is defined as the design of maximum resolution “which minimizes the number of words in the defining relation that are of minimum length” (Fries & Hunter, 1980). Less technically, the criterion apparently operates by choosing generators that produce the smallest number of pairs of confounded interactions of the crucial order. For example, the minimum aberration resolution IV design would have the minimum number of pairs of confounded 2factor interactions.
To illustrate the difference between the maximum unconfounding and minimum aberration criteria, consider the maximally unconfounded 2**(94) design and the minimum aberration 2**(94) design, as for example, listed in Box, Hunter, and Hunter (1978). If you compare these two designs, you will find that in the maximally unconfounded design, 15 of the 36 2factor interactions are unconfounded with any other 2factor interactions, while in the minimum aberration design, only 8 of the 36 2factor interactions are unconfounded with any other 2factor interactions. The minimum aberration design, however, produces 18 pairs of confounded interactions, while the maximally unconfounded design produces 21 pairs of confounded interactions. So, the two criteria lead to the selection of generators producing different “best” designs.
Fortunately, the choice of whether to use the maximum unconfounding criterion or the minimum aberration criterion makes no difference in the design which is selected (except for, perhaps, relabeling of the factors) when there are 11 or fewer factors, with the single exception of the 2**(94) design described above (see Chen, Sun, & Wu, 1993). For designs with more than 11 factors, the two criteria can lead to the selection of very different designs, and for lack of better advice, we suggest using both criteria, comparing the designs that are produced, and choosing the design that best suits your needs. We will add, editorially, that maximizing the number of totally unconfounded effects often makes more sense than minimizing the number of pairs of confounded effects.
Summary
2**(kp) fractional factorial designs are probably the most frequently used type of design in industrial experimentation. Things to consider in designing any 2**(kp) fractional factorial experiment include the number of factors to be investigated, the number of experimental runs, and whether there will be blocks of experimental runs. Beyond these basic considerations, you should also take into account whether the number of runs will allow a design of the required resolution and degree of confounding for the crucial order of interactions, given the resolution.
3**(kp), BoxBehnken, and Mixed 2 and 3 Level Factorial Designs
Overview
In some cases, factors that have more than 2 levels need to be examined. For example, if you suspect that the effect of the factors on the dependent variable of interest is not simply linear, then, as discussed earlier (see 2**(kp) designs), you need at least 3 levels in order to test for the linear and quadratic effects (and interactions) for those factors. Also, sometimes some factors may be categorical in nature, with more than 2 categories. For example, you may have three different machines that produce a particular part.
Designing 3**(kp) Experiments
The general mechanism of generating fractional factorial designs at 3 levels (3**(kp) designs) is very similar to that described in the context of 2**(kp) designs. Specifically, you start with a full factorial design, and then use the interactions of the full design to construct “new” factors (or blocks) by making their factor levels identical to those for the respective interaction terms (i.e., by making the new factors aliases of the respective interactions).
For example, consider the following simple 3**(31) factorial design:
3**(31) fractional factorial design, 1 block , 9 runs 


Standard Run 
A 
B 
C 
1 2 3 4 5 6 7 8 9 
0 0 0 1 1 1 2 2 2 
0 1 2 0 1 2 0 1 2 
0 2 1 2 1 0 1 0 2 
As in the case of 2**(kp) designs, the design is constructed by starting with the full 31=2 factorial design; those factors are listed in the first two columns (factors A and B). Factor C is constructed from the interaction AB of the first two factors. Specifically, the values for factor C are computed as
C = 3 – mod_{3} (A+B)
Here, mod_{3}(x) stands for the socalled modulo3 operator, which will first find a number y that is less than or equal to x, and that is evenly divisible by 3, and then compute the difference (remainder) between number y and x. For example, mod_{3}(0) is equal to 0, mod_{3}(1) is equal to 1, mod_{3}(3) is equal to 0, mod_{3}(5) is equal to 2 (3 is the largest number that is less than or equal to 5, and that is evenly divisible by 3; finally, 53=2), and so on.
Fundamental identity. If you apply this function to the sum of columns A and B shown above, you will obtain the third column C. Similar to the case of 2**(kp) designs (see 2**(kp) designs for a discussion of the fundamental identity in the context of 2**(kp) designs), this confounding of interactions with “new” main effects can be summarized in an expression:
0 = mod_{3} (A+B+C)
If you look back at the 3**(31) design shown earlier, you will see that, indeed, if you add the numbers in the three columns they will all sum to either 0, 3, or 6, that is, values that are evenly divisible by 3 (and hence: mod_{3}(A+B+C)=0). Thus, you could write as a shortcut notation ABC=0, in order to summarize the confounding of factors in the fractional 3**(kp) design.
Some of the designs will have fundamental identities that contain the number 2 as a multiplier; e.g.,
0 = mod_{3} (B+C*2+D+E*2+F)
This notation can be interpreted exactly as before, that is, the modulo_{3} of the sum B+2*C+D+2*E+F must be equal to 0. The next example shows such an identity.
An Example 3**(41) Design in 9 Blocks
Here is the summary for a 4factor 3level fractional factorial design in 9 blocks that requires only 27 runs.
SUMMARY: 3**(41) fractional factorial
Design generators: ABCD
Block generators: AB,AC2
Number of factors (independent variables): 4
Number of runs (cases, experiments): 27
Number of blocks: 9
This design will allow you to test for linear and quadratic main effects for 4 factors in 27 observations, which can be gathered in 9 blocks of 3 observations each. The fundamental identity or design generator for the design is ABCD, thus the modulo_{3} of the sum of the factor levels across the four factors is equal to 0. The fundamental identity also allows you to determine the confounding of factors and interactions in the design (see McLean and Anderson, 1984, for details).
Unconfounded Effects (experi3.sta)  

EXPERIM. DESIGN 
List of uncorrelated factors and interactions 3**(41) fractional factorial design, 9 blocks, 27 runs 

Unconf. Effects (excl. blocks) 
Unconfounded if blocks included? 

1 2 3 4 5 6 7 8 
(1)A (L) A (Q) (2)B (L) B (Q) (3)C (L) C (Q) (4)D (L) D (Q) 
Yes Yes Yes Yes Yes Yes Yes Yes 
As you can see, in this 3**(41) design the main effects are not confounded with each other, even when the experiment is run in 9 blocks.
BoxBehnken Designs
In the case of 2**(kp) designs, Plackett and Burman (1946) developed highly fractionalized designs to screen the maximum number of (main) effects in the least number of experimental runs. The equivalent in the case of 3**(kp) designs are the socalled BoxBehnken designs (Box and Behnken, 1960; see also Box and Draper, 1984). These designs do not have simple design generators (they are constructed by combining twolevel factorial designs with incomplete block designs), and have complex confounding of interaction. However, the designs are economical and therefore particularly useful when it is expensive to perform the necessary experimental runs.
Analyzing the 3**(kp) Design
The analysis of these types of designs proceeds basically in the same way as was described in the context of 2**(kp) designs. However, for each effect, upi can now test for the linear effect and the quadratic (nonlinear effect). For example, when studying the yield of chemical process, then temperature may be related in a nonlinear fashion, that is, the maximum yield may be attained when the temperature is set at the medium level. Thus, nonlinearity often occurs when a process performs near its optimum.
ANOVA Parameter Estimates
To estimate the ANOVA parameters, the factors levels for the factors in the analysis are internally recoded so that you can test the linear and quadratic components in the relationship between the factors and the dependent variable. Thus, regardless of the original metric of factor settings (e.g., 100 degrees C, 110 degrees C, 120 degrees C), you can always recode those values to 1, 0, and +1 to perform the computations. The resultant ANOVA parameter estimates can be interpreted analogously to the parameter estimates for 2**(kp) designs.
For example, consider the following ANOVA results:
Factor  Effect  Std.Err.  t (69)  p 

Mean/Interc. BLOCKS(1) BLOCKS(2) (1)TEMPERAT (L) TEMPERAT (Q) (2)TIME (L) TIME (Q) (3)SPEED (L) SPEED (Q) 1L by 2L 1L by 2Q 1Q by 2L 1Q by 2Q 
103.6942 .8028 1.2307 .3245 .5111 .0017 .0045 10.3073 3.7915 3.9256 .4384 .4747 2.7499 
.390591 1.360542 1.291511 .977778 .809946 .977778 .809946 .977778 .809946 1.540235 1.371941 1.371941 .995575 
265.4805 .5901 .9529 .3319 .6311 .0018 .0056 10.5415 4.6812 2.5487 .3195 .3460 2.7621 
0.000000 .557055 .343952 .740991 .530091 .998589 .995541 .000000 .000014 .013041 .750297 .730403 .007353 
Maineffect estimates. By default, the Effect estimate for the linear effects (marked by the L next to the factor name) can be interpreted as the difference between the average response at the low and high settings for the respective factors. The estimate for the quadratic (nonlinear) effect (marked by the Q next to the factor name) can be interpreted as the difference between the average response at the center (medium) settings and the combined high and low settings for the respective factors.
Interaction effect estimates. As in the case of 2**(kp) designs, the linearbylinear interaction effect can be interpreted as half the difference between the linear main effect of one factor at the high and low settings of another. Analogously, the interactions by the quadratic components can be interpreted as half the difference between the quadratic main effect of one factor at the respective settings of another; that is, either the high or low setting (quadratic by linear interaction), or the medium or high and low settings combined (quadratic by quadratic interaction).
In practice, and from the standpoint of “interpretability of results,” you would usually try to avoid quadratic interactions. For example, a quadraticbyquadratic AbyB interaction indicates that the non linear effect of factor A is modified in a nonlinear fashion by the setting of B. This means that there is a fairly complex interaction between factors present in the data that will make it difficult to understand and optimize the respective process. Sometimes, performing nonlinear transformations (e.g., performing a log transformation) of the dependent variable values can remedy the problem.
Centered and noncentered polynomials. As mentioned above, the interpretation of the effect estimates applies only when you use the default parameterization of the model. In that case, you would code the quadratic factor interactions so that they become maximally “untangled” from the linear main effects.
Graphical Presentation of Results
The same diagnostic plots (e.g., of residuals) are available for 3**(kp) designs as were described in the context of 2**(kp) designs. Thus, before interpreting the final results, you should always first look at the distribution of the residuals for the final fitted model. The ANOVA assumes that the residuals (errors) are normally distributed.
Plot of means. When an interaction involves categorical factors (e.g., type of machine, specific operator of machine, and some distinct setting of the machine), then the best way to understand interactions is to look at the respective interaction plot of means.
Surface plot. When the factors in an interaction are continuous in nature, you may want to look at the surface plot that shows the response surface applied by the fitted model. Note that this graph also contains the prediction equation (in terms of the original metric of factors), that produces the respective response surface.
Designs for Factors at 2 and 3 Levels
You can also generate standard designs with 2 and 3 level factors. Specifically, you can generate the standard designs as enumerated by Connor and Young for the US National Bureau of Standards (see McLean and Anderson, 1984). The technical details of the method used to generate these designs are beyond the scope of this introduction. However, in general the technique is, in a sense, a combination of the procedures described in the context of 2**(kp) and 3**(kp) designs. It should be noted however, that, while all of these designs are very efficient, they are not necessarily orthogonal with respect to all main effects. This is, however, not a problem, if you use a general algorithm for estimating the ANOVA parameters and sums of squares, that does not require orthogonality of the design.
The design and analysis of these experiments proceeds along the same lines as discussed in the context of 2**(kp) and 3**(kp) experiments.
Central Composite and NonFactorial Response Surface Designs
Overview
The 2**(kp) and 3**(kp) designs all require that the levels of the factors are set at, for example, 2 or 3 levels. In many instances, such designs are not feasible, because, for example, some factor combinations are constrained in some way (e.g., factors A and B cannot be set at their high levels simultaneously). Also, for reasons related to efficiency, which will be discussed shortly, it is often desirable to explore the experimental region of interest at particular points that cannot be represented by a factorial design.
The designs (and how to analyze them) discussed in this section all pertain to the estimation (fitting) of response surfaces, following the general model equation:
y = b_{0} +b_{1} *x_{1} +…+b_{k} *x_{k} + b_{12} *x_{1} *x_{2} +b_{13} *x_{1} *x_{3} +…+b_{k1,k} *x_{k1} *x_{k} + b_{11} *x_{1}² +…+b_{kk} *x_{k}²
Put into words, you are fitting a model to the observed values of the dependent variable y, that include (1) main effects for factors x_{1} , …, x_{k}, (2) their interactions (x_{1}*x_{2}, x_{1}*x_{3}, … ,x_{k1}*x_{k}), and (3) their quadratic components (x_{1}**2, …, x_{k}**2). No assumptions are made concerning the “levels” of the factors, and you can analyze any set of continuous values for the factors.
There are some considerations concerning design efficiency and biases, which have led to standard designs that are ordinarily used when attempting to fit these response surfaces, and those standard designs will be discussed shortly (e.g., see Box, Hunter, and Hunter, 1978; Box and Draper, 1987; Khuri and Cornell, 1987; Mason, Gunst, and Hess, 1989; Montgomery, 1991). But, as will be discussed later, in constrained surface designs and D and Aoptimal designs, these standard designs can sometimes not be used for practical reasons. However, the central composite design analysis options do not make any assumptions about the structure of your data file, that is, the number of distinct factor values, or their combinations across the runs of the experiment, and, hence, these options can be used to analyze any type of design, to fit to the data the general model described above.
Design Considerations
Orthogonal designs. One desirable characteristic of any design is that the main effect and interaction estimates of interest are independent of each other. For example, suppose you had a two factor experiments, with both factors at two levels. Your design consists of four runs:
A  B  

Run 1 Run 2 Run 3 Run 4 
1 1 1 1 
1 1 1 1 
For the first two runs, both factors A and B are set at their high levels (+1). In the last two runs, both are set at their low levels (1). Suppose you wanted to estimate the independent contributions of factors A and B to the prediction of the dependent variable of interest. Clearly this is a silly design, because there is no way to estimate the A main effect and the B main effect. You can only estimate one effect – the difference between Runs 1+2 vs. Runs 3+4 – which represents the combined effect of A and B.
The point here is that, in order to assess the independent contributions of the two factors, the factor levels in the four runs must be set so that the “columns” in the design (under A and B in the illustration above) are independent of each other. Another way to express this requirement is to say that the columns of the design matrix (with as many columns as there are main effect and interaction parameters that you want to estimate) should be orthogonal (this term was first used by Yates, 1933). For example, if the four runs in the design are arranged as follows:
A  B  

Run 1 Run 2 Run 3 Run 4 
1 1 1 1 
1 1 1 1 
then the A and B columns are orthogonal. Now you can estimate the A main effect by comparing the high level for A within each level of B, with the low level for A within each level of B; the B main effect can be estimated in the same way.
Technically, two columns in a design matrix are orthogonal if the sum of the products of their elements within each row is equal to zero. In practice, you often encounter situations where the columns of the design matrix are not completely orthogonal, for example due to loss of some data in some runs or other constraints. In general, the rule here is that the more orthogonal the columns are, the better the design, that is, the more independent information can be extracted from the design regarding the respective effects of interest. Therefore, one consideration for choosing standard central composite designs is to find designs that are orthogonal or nearorthogonal.
Rotatable designs. The second consideration is related to the first requirement, in that it also has to do with how best to extract the maximum amount of (unbiased) information from the design, or specifically, from the experimental region of interest. Without going into details (see Box, Hunter, and Hunter, 1978; Box and Draper, 1987, Chapters 14; see also Deming and Morgan, 1993, Chapter 13), it can be shown that the standard error for the prediction of dependent variable values is proportional to:
(1 + f(x)’ * (X’X)¨¹ * f(x))**½
where f(x) stands for the (coded) factor effects for the respective model (f(x) is a vector, f(x)’ is the transpose of that vector), and X is the design matrix for the experiment, that is, the matrix of coded factor effects for all runs; X’X**1 is the inverse of the crossproduct matrix. Deming and Morgan (1993) refer to this expression as the normalized uncertainty; this function is also related to the variance function as defined by Box and Draper (1987). The amount of uncertainty in the prediction of dependent variable values depends on the variability of the design points, and their covariance over the runs. (Note that it is inversely proportional to the determinant of X’X; this issue is further discussed in the section on D and Aoptimal designs).
The point here is that, again, you want to choose a design that extracts the most information regarding the dependent variable, and leaves the least amount of uncertainty for the prediction of future values. It follows, that the amount of information (or normalized information according to Deming and Morgan, 1993) is the inverse of the normalized uncertainty.
For the simple 4run orthogonal experiment shown earlier, the information function is equal to
I_{x} = 4/(1 + x_{1}² + x_{2}²)
where x_{1} and x_{2} stand for the factor settings for factors A and B, respectively (see Box and Draper, 1987).
Inspection of this function in a plot (see above) shows that it is constant on circles centered at the origin. Thus any kind of rotation of the original design points will generate the same amount of information, that is, generate the same information function. Therefore, the 2by2 orthogonal design in 4 runs shown earlier is said to be rotatable.
As pointed out before, in order to estimate the second order, quadratic, or nonlinear component of the relationship between a factor and the dependent variable, you need at least 3 levels for the respective factors. What does the information function look like for a simple 3by3 factorial design,for the secondorder quadratic model as shown at the beginning of this section?
As it turns out (see Box and Draper, 1987 and Montgomery, 1991; refer also to the manual), this function looks more complex, contains “pockets” of highdensity information at the edges (which are probably of little particular interest to the experimenter), and clearly it is not constant on circles around the origin. Therefore, it is not rotatable, meaning different rotations of the design points will extract different amounts of information from the experimental region.
Starpoints and rotatable secondorder designs. It can be shown that by adding socalled star points to the simple (square or cube) 2level factorial design points, you can achieve rotatable, and often orthogonal or nearly orthogonal designs. For example, adding to the simple 2by2 orthogonal design shown earlier the following points, will produce a rotatable design.
A  B  

Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Run 7 Run 8 Run 9 Run 10 
1 1 1 1 1.414 1.414 0 0 0 0 
1 1 1 1 0 0 1.414 1.414 0 0 
The first four runs in this design are the previous 2by2 factorial design points (or square points or cube points); runs 5 through 8 are the socalled star points or axial points, and runs 9 and 10 are center points.
The information function for this design for the secondorder (quadratic) model is rotatable, that is, it is constant on the circles around the origin.
Alpha for Rotatability and Orthogonality
The two design characteristics discussed so far — orthogonality and rotatability — depend on the number of center points in the design and on the socalled axial distance (alpha), which is the distance of the star points from the center of the design (i.e., 1.414 in the design shown above). It can be shown (e.g., see Box, Hunter, and Hunter, 1978; Box and Draper, 1987, Khuri and Cornell, 1987; Montgomery, 1991) that a design is rotatable if:
= ( n_{c} )^{¼}
where n_{c} stands for the number of cube points in the design (i.e., points in the factorial portion of the design).
A central composite design is orthogonal if you choose the axial distance so that:
= {[( n_{c} + n_{s} + n_{0} )^{½} – n_{c}^{½}]² * n_{c}/4}^{¼}
where
n_{c} is the number of cube points in the design
n_{s} is the number of star points in the design
n_{0} is the number of center points in the design
To make a design both (approximately) orthogonal and rotatable, you would first choose the axial distance for rotatability, and then add center points (see Khuri and Cornell, 1987), so that:
n_{0} 4*n_{c}^{½} + 4 – 2k
where k stands for the number of factors in the design.
Finally, if blocking is involved, Box and Draper (1987) give the following formula for computing the axial distance to achieve orthogonal blocking, and in most cases also reasonable information function contours, that is, contours that are close to spherical:
= [k*(l+n_{s0}/n_{s})/(1+n_{c0}/n_{c})]^{½}
where
n_{s0} is the number of center points in the star portion of the design
n_{s} is the number of noncenter star points in the design
n_{c0} is the number of center points in the cube portion of the design
n_{c} is the number of noncenter cube points in the design
Available Standard Designs
The standard central composite designs are usually constructed from a 2**(kp) design for the cube portion of the design, which is augmented with center points and star points. Box and Draper (1987) list a number of such designs.
Small composite designs. In the standard designs, the cube portion of the design is typically of resolution V (or higher). This is, however, not necessary, and in cases when the experimental runs are expensive, or when it is not necessary to perform a statistically powerful test of model adequacy, then you could choose for the cube portion designs of resolution III. For example, it could be constructed from highly fractionalized PlackettBurman designs. Hartley (1959) described such designs.
Analyzing Central Composite Designs
The analysis of central composite designs proceeds in much the same way as for the analysis of 3**(kp) designs. You fit to the data the general model described above; for example, for two variables you would fit the model:
y = b_{0} + b_{1}*x_{1} + b_{2}*x_{2} + b_{12}*x_{1}*x_{2} + b_{11}*x_{1}^{2} + b_{22}*x_{2}^{2}
The Fitted Response Surface
The shape of the fitted overall response can best be summarized in graphs and you can generate both contour plots and response surface plots (see examples below) for the fitted model.
Categorized Response Surfaces
You can fit 3D surfaces to your data, categorized by some other variable. For example, if you replicated a standard central composite design 4 times, it may be very informative to see how similar the surfaces are when fitted to each replication.
This would give you a graphical indication of the reliability of the results and where (e.g., in which region of the surface) deviations occur.
Clearly, the third replication produced a different surface. In replications 1, 2, and 4, the fitted surfaces are very similar to each other. Thus, you should investigate what could have caused this noticeable difference in the third replication of the design.
Latin Square Designs
Overview
Latin square designs (the term Latin square was first used by Euler, 1782) are used when the factors of interest have more than two levels and you know ahead of time that there are no (or only negligible) interactions between factors. For example, if you wanted to examine the effect of 4 fuel additives on reduction in oxides of nitrogen and had 4 cars and 4 drivers at your disposal, then you could of course run a full 4 x 4 x 4 factorial design, resulting in 64 experimental runs. However, you are not really interested in any (minor) interactions between the fuel additives and drivers, fuel additives and cars, or cars and drivers. You are mostly interested in estimating main effects, in particular the one for the fuel additives factor. At the same time, you want to make sure that the main effects for drivers and cars do not affect (bias) your estimate of the main effect for the fuel additive.
If you labeled the additives with the letters A, B, C, and D, the Latin square design that would allow you to derive unconfounded main effects estimates could be summarized as follows (see also Box, Hunter, and Hunter, 1978, page 263):
Car 


Driver 
1 
2 
3 
4 
1

A

B

D

C

Latin Square Designs
The example shown above is actually only one of the three possible arrangements in effect estimates. These “arrangements” are also called Latin square. The example above constitutes a 4 x 4 Latin square; and rather than requiring the 64 runs of the complete factorial, you can complete the study in only 16 runs.
GrecoLatin square. A nice feature of Latin Squares is that they can be superimposed to form what are called GrecoLatin squares (this term was first used by Fisher and Yates, 1934). For example, the following two 3 x 3 Latin squares can be superimposed to form a GrecoLatin square:
In the resultant GrecoLatin square design, you can evaluate the main effects of four 3level factors (row factor, column factor, Roman letters, Greek letters) in only 9 runs.
HyperGreco Latin square. For some numbers of levels, there are more than two possible Latin square arrangements. For example, there are three possible arrangements for 4level Latin squares. If all three of them are superimposed, you get a HyperGreco Latin square design. In that design you can estimate the main effects of all five 4level factors with only 16 runs in the experiment.
Analyzing the Design
Analyzing Latin square designs is straightforward. Also, plots of means can be produced to aid in the interpretation of results.
Very Large Designs, Random Effects, Unbalanced Nesting
Note that there are several other statistical methods that can also analyze these types of designs; see the section on Methods for Analysis of Variance for details. In particular, Variance Components and Mixed Model ANOVA/ANCOVA discusses very efficient methods for analyzing designs with unbalanced nesting (when the nested factors have different numbers of levels within the levels of the factors in which they are nested), very large nested designs (e.g., with more than 200 levels overall), or hierarchically nested designs (with or without random factors).
Taguchi Methods: Robust Design Experiments
Overview
Applications. Taguchi methods have become increasingly popular in recent years. The documented examples of sizable quality improvements that resulted from implementations of these methods (see, for example, Phadke, 1989; Noori, 1989) have added to the curiosity among American manufacturers. In fact, some of the leading manufacturers in this country have begun to use these methods with usually great success. For example, AT&T is using these methods in the manufacture of very large scale integrated (VLSI) circuits; also, Ford Motor Company has gained significant quality improvements due to these methods (American Supplier Institute, 1984 to 1988). However, as the details of these methods are becoming more widely known, critical appraisals are also beginning to appear (for example, Bhote, 1988; Tribus and Szonyi, 1989).
Overview. Taguchi robust design methods are set apart from traditional quality control procedures (see Quality Control and Process Analysis) and industrial experimentation in various respects. Of particular importance are:
 The concept of quality loss functions,
 The use of signaltonoise (S/N) ratios, and
 The use of orthogonal arrays.
These basic aspects of robust design methods will be discussed in the following sections. Several books have recently been published on these methods, for example, Peace (1993), Phadke (1989), Ross (1988), and Roy (1990), to name a few, and it is recommended that you refer to those books for further specialized discussions. Introductory overviews of Taguchi’s ideas about quality and quality improvement can also be found in Barker (1986), Garvin (1987), Kackar (1986), and Noori (1989).
Quality and Loss Functions
What is quality. Taguchi’s analysis begins with the question of how to define quality. It is not easy to formulate a simple definition of what constitutes quality; however, when your new car stalls in the middle of a busy intersection – putting yourself and other motorists at risk – you know that your car is not of high quality. Put another way, the definition of the inverse of quality is rather straightforward: it is the total loss to you and society due to functional variations and harmful side effects associated with the respective product. Thus, as an operational definition, you can measure quality in terms of this loss, and the greater the quality loss the lower the quality.
Discontinuous (stepshaped) loss function. You can formulate hypotheses about the general nature and shape of the loss function. Assume a specific ideal point of highest quality; for example, a perfect car with no quality problems. It is customary in statistical process control (SPC; see also Process Analysis) to define tolerances around the nominal ideal point of the production process. According to the traditional view implied by common SPC methods, as long as you are within the manufacturing tolerances you do not have a problem. Put another way, within the tolerance limits the quality loss is zero; once you move outside the tolerances, the quality loss is declared to be unacceptable. Thus, according to traditional views, the quality loss function is a discontinuous step function: as long as you are within the tolerance limits, quality loss is negligible; when you step outside those tolerances, quality loss becomes unacceptable.
Quadratic loss function. Is the step function implied by common SPC methods a good model of quality loss? Return to the “perfect automobile” example. Is there a difference between a car that, within one year after purchase, has nothing wrong with it, and a car where minor rattles develop, a few fixtures fall off, and the clock in the dashboard breaks (all inwarranty repairs, mind you…)? If you ever bought a new car of the latter kind, you know very well how annoying those admittedly minor quality problems can be. The point here is that it is not realistic to assume that, as you move away from the nominal specification in your production process, the quality loss is zero as long as you stay within the set tolerance limits. Rather, if you are not exactly “on target,” then loss will result, for example in terms of customer satisfaction. Moreover, this loss is probably not a linear function of the deviation from nominal specifications, but rather a quadratic function (inverted U). A rattle in one place in your new car is annoying, but you would probably not get too upset about it; add two more rattles, and you might declare the car “junk.” Gradual deviations from the nominal specifications do not produce proportional increments in loss, but rather squared increments.
Conclusion: Controlling variability. If, in fact, quality loss is a quadratic function of the deviation from a nominal value, then the goal of your quality improvement efforts should be to minimize the squared deviations or variance of the product around nominal (ideal) specifications, rather than the number of units within specification limits (as is done in traditional SPC procedures).
SignaltoNoise (S/N) Ratios
Measuring quality loss. Even though you have concluded that the quality loss function is probably quadratic in nature, you still do not know precisely how to measure quality loss. However, you know that whatever measure you decide upon should reflect the quadratic nature of the function.
Signal, noise, and control factors. The product of ideal quality should always respond in exactly the same manner to the signals provided by the user. When you turn the key in the ignition of your car you expect that the starter motor turns and the engine starts. In the idealquality car, the starting process would always proceed in exactly the same manner — for example, after three turns of the starter motor the engine comes to life. If, in response to the same signal (turning the ignition key) there is random variability in this process, then you have less than ideal quality. For example, due to such uncontrollable factors as extreme cold, humidity, engine wear, etc. the engine may sometimes start only after turning over 20 times and finally not start at all. This example illustrates the key principle in measuring quality according to Taguchi: You want to minimize the variability in the product’s performance in response to noise factors while maximizing the variability in response to signal factors.
Noise factors are those that are not under the control of the operator of a product. In the car example, those factors include temperature changes, different qualities of gasoline, engine wear, etc. Signal factors are those factors that are set or controlled by the operator of the product to make use of its intended functions (turning the ignition key to start the car).
Finally, the goal of your quality improvement effort is to find the best settings of factors under your control that are involved in the production process, in order to maximize the S/N ratio; thus, the factors in the experiment represent control factors.
S/N ratios. The conclusion of the previous paragraph is that quality can be quantified in terms of the respective product’s response to noise factors and signal factors. The ideal product will only respond to the operator’s signals and will be unaffected by random noise factors (weather, temperature, humidity, etc.). Therefore, the goal of your quality improvement effort can be stated as attempting to maximize the signaltonoise (S/N) ratio for the respective product. The S/N ratios described in the following paragraphs have been proposed by Taguchi (1987).
Smallerthebetter. In cases where you want to minimize the occurrences of some undesirable product characteristics, you would compute the following S/N ratio:
Eta = 10 * log_{10} [(1/n) * (y_{i}^{2})] for i = 1 to no. vars see outer arrays
Here, Eta is the resultant S/N ratio; n is the number of observations on the particular product, and y is the respective characteristic. For example, the number of flaws in the paint on an automobile could be measured as the y variable and analyzed via this S/N ratio. The effect of the signal factors is zero, since zero flaws is the only intended or desired state of the paint on the car. Note how this S/N ratio is an expression of the assumed quadratic nature of the loss function. The factor 10 ensures that this ratio measures the inverse of “bad quality;” the more flaws in the paint, the greater is the sum of the squared number of flaws, and the smaller (i.e., more negative) the S/N ratio. Thus, maximizing this ratio will increase quality.
Nominalthebest. Here, you have a fixed signal value (nominal value), and the variance around this value can be considered the result of noise factors:
Eta = 10 * log_{10} (Mean^{2}/Variance)
This signaltonoise ratio could be used whenever ideal quality is equated with a particular nominal value. For example, the size of piston rings for an automobile engine must be as close to specification as possible to ensure high quality.
Largerthebetter. Examples of this type of engineering problem are fuel economy (miles per gallon) of an automobile, strength of concrete, resistance of shielding materials, etc. The following S/N ratio should be used:
Eta = 10 * log_{10} [(1/n) * (1/y_{i}^{2})] for i = 1 to no. vars see outer arrays
Signed target. This type of S/N ratio is appropriate when the quality characteristic of interest has an ideal value of 0 (zero), and both positive and negative values of the quality characteristic may occur. For example, the dc offset voltage of a differential operational amplifier may be positive or negative (see Phadke, 1989). The following S/N ratio should be used for these types of problems:
Eta = 10 * log_{10}(s^{2}) for i = 1 to no. vars see outer arrays
where s^{2} stands for the variance of the quality characteristic across the measurements (variables).
Fraction defective. This S/N ratio is useful for minimizing scrap, minimizing the percent of patients who develop sideeffects to a drug, etc. Taguchi also refers to the resultant Eta values as Omegas; note that this S/N ratio is identical to the familiar logit transformation (see also Nonlinear Estimation):
Eta = 10 * log_{10}[p/(1p)]
where
p is the proportion defective
Ordered categories (the accumulation analysis). In some cases, measurements on a quality characteristic can only be obtained in terms of categorical judgments. For example, consumers may rate a product as excellent, good, average, or below average. In that case, you would attempt to maximize the number of excellent or good ratings. Typically, the results of an accumulation analysis are summarized graphically in a stacked bar plot.
Orthogonal Arrays
The third aspect of Taguchi robust design methods is the one most similar to traditional techniques. Taguchi has developed a system of tabulated designs (arrays) that allow for the maximum number of main effects to be estimated in an unbiased (orthogonal) manner, with a minimum number of runs in the experiment. Latin square designs, 2**(kp) designs (PlackettBurman designs, in particular), and BoxBehnken designs main are also aimed at accomplishing this goal. In fact, many of the standard orthogonal arrays tabulated by Taguchi are identical to fractional twolevel factorials, PlackettBurman designs, BoxBehnken designs, Latin square, GrecoLatin squares, etc.
Analyzing Designs
Most analyses of robust design experiments amount to a standard ANOVA of the respective S/N ratios, ignoring twoway or higherorder interactions. However, when estimating error variances, you customarily pool together main effects of negligible size.
Analyzing S/N ratios in standard designs. It should be noted at this point that, of course, all of the designs discussed up to this point (e.g., 2**(kp), 3**(kp), mixed 2 and 3 level factorials, Latin squares, central composite designs) can be used to analyze S/N ratios that you computed. In fact, the many additional diagnostic plots and other options available for those designs (e.g., estimation of quadratic components, etc.) may prove very useful when analyzing the variability (S/N ratios) in the production process.
Plot of means. A visual summary of the experiment is the plot of the average Eta (S/N ratio) by factor levels. In this plot, the optimum setting (i.e., largest S/N ratio) for each factor can easily be identified.
Verification experiments. For prediction purposes, you can compute the expected S/N ratio given a userdefined combination of settings of factors (ignoring factors that were pooled into the error term). These predicted S/N ratios can then be used in a verification experiment, where the engineer actually sets the machine accordingly and compares the resultant observed S/N ratio with the predicted S/N ratio from the experiment. If major deviations occur, you must conclude that the simple main effect model is not appropriate.
In those cases, Taguchi (1987) recommends transforming the dependent variable to accomplish additivity of factors, that is, to “make” the main effects model fit. Phadke (1989, Chapter 6) also discusses in detail methods for achieving additivity of factors.
Accumulation Analysis
When analyzing ordered categorical data, ANOVA is not appropriate. Rather, you produce a cumulative plot of the number of observations in a particular category. For each level of each factor, you plot the cumulative proportion of the number of defectives. Thus, this graph provides valuable information concerning the distribution of the categorical counts across the different factor settings.
Summary
To briefly summarize, when using Taguchi methods you first need to determine the design or control factors that can be set by the designer or engineer. Those are the factors in the experiment for which you will try different levels. Next, you decide to select an appropriate orthogonal array for the experiment. Next, you need to decide on how to measure the quality characteristic of interest. Remember that most S/N ratios require that multiple measurements are taken in each run of the experiment; for example, the variability around the nominal value cannot otherwise be assessed. Finally, you conduct the experiment and identify the factors that most strongly affect the chosen S/N ratio, and you reset your machine or production process accordingly.
Mixture Designs and Triangular Surfaces
Overview
Special issues arise when analyzing mixtures of components that must sum to a constant. For example, if you wanted to optimize the taste of a fruitpunch, consisting of the juices of 5 fruits, then the sum of the proportions of all juices in each mixture must be 100%. Thus, the task of optimizing mixtures commonly occurs in foodprocessing, refining, or the manufacturing of chemicals. A number of designs have been developed to address specifically the analysis and modeling of mixtures (see, for example, Cornell, 1990a, 1990b; Cornell and Khuri, 1987; Deming and Morgan, 1993; Montgomery, 1991).
Triangular Coordinates
The common manner in which mixture proportions can be summarized is via triangular (ternary) graphs. For example, suppose you have a mixture that consists of 3 components A, B, and C. Any mixture of the three components can be summarized by a point in the triangular coordinate system defined by the three variables.
For example, take the following 6 different mixtures of the 3 components.
A  B  C 

1 0 0 0.5 0.5 0 
0 1 0 0.5 0 0.5 
0 0 1 0 0.5 0.5 
The sum for each mixture is 1.0, so the values for the components in each mixture can be interpreted as proportions. If you graph these data in a regular 3D scatterplot, it becomes apparent that the points form a triangle in the 3D space. Only the points inside the triangle where the sum of the component values is equal to 1 are valid mixtures. Therefore, you can simply plot only the triangle to summarize the component values (proportions) for each mixture.
To readoff the coordinates of a point in the triangular graph, you would simply “drop” a line from each respective vertex to the side of the triangle below.
At the vertex for the particular factor, there is a pure blend, that is, one that only contains the respective component. Thus, the coordinates for the vertex point is 1 (or 100%, or however else the mixtures are scaled) for the respective component, and 0 (zero) for all other components. At the side opposite to the respective vertex, the value for the respective component is 0 (zero), and .5 (or 50%, etc.) for the other components.
Triangular Surfaces and Contours
You can now add to the triangle a fourth dimension that is perpendicular to the first three. Using that dimension, you could plot the values for a dependent variable or function (surface) that was fit to the dependent variable. Note that the response surface can either be shown in 3D, where the predicted response (Taste rating) is indicated by the distance of the surface from the triangular plane, or it can be indicated in a contour plot where the contours of constant height are plotted on the 2D triangle.
It should be mentioned at this point that you can produce categorized ternary graphs. These are very useful, because they allow you to fit to a dependent variable (e.g., Taste) a response surface, for different levels of a fourth component.
The Canonical Form of Mixture Polynomials
Fitting a response surface to mixture data is, in principle, done in the same manner as fitting surfaces to, for example, data from central composite designs. However, there is the issue that mixture data are constrained, that is, the sum of all component values must be constant.
Consider the simple case of two factors A and B. You may want to fit the simple linear model:
y = b_{0} + b_{A}*x_{A} + b_{B}*x_{B}
Here y stands for the dependent variable values, b_{A} and b_{B} stand for the regression coefficients, x_{A} and x_{B} stand for the values of the factors. Suppose that x_{A} and x_{B} must sum to 1; you can multiple b_{0} by 1=(x_{A} + x_{B}):
y = (b_{0}*x_{A} + b_{0}*x_{B}) + b_{A}*x_{A} + b_{B}*x_{B}
or:
y = b’_{A}*x_{A} + b’_{B}*x_{B}
where b’_{A} = b_{0} + b_{A} and b’_{B} = b_{0} + b_{B}. Thus, the estimation of this model comes down to fitting a no intercept multiple regression model. (See also Multiple Regression, for details concerning multiple regression.)
Common Models for Mixture Data
The quadratic and cubic model can be similarly simplified (as illustrated for the simple linear model above), yielding four standard models that are customarily fit to the mixture data. Here are the formulas for the 3variable case for those models (see Cornell, 1990, for additional details).
Linear model:
y = b_{1}*x_{1} + b_{2}*x_{2} + b_{3}*x_{3}
Quadratic model:
y = b_{1}*x_{1} + b_{2}*x_{2} + b_{3}*x_{3} + b_{12}*x_{1}*x_{2} + b_{13}*x_{1}*x_{3} + b_{23}*x_{2}*x_{3}
Special cubic model:
y = b_{1}*x_{1} + b_{2}*x_{2} + b_{3}*x_{3} + b_{12}*x_{1}*x_{2} + b_{13}*x_{1}*x_{3} + b_{23}*x_{2}*x_{3} + b_{123}*x_{1}*x_{2}*x_{3}
Full cubic model:
y = b_{1}*x_{1} + b_{2}*x_{2} + b_{3}*x_{3} + b_{12}*x_{1}*x_{2} + b_{13}*x_{1}*x_{3} + b_{23}*x_{2}*x_{3} + d_{12}*x_{1}*x_{2}*(x_{1} – x_{2}) + d_{13}*x_{1}*x_{3}*(x_{1} – x_{3}) + d_{23}*x_{2}*x_{3}*(x_{2} – x_{3}) + b_{123}*x_{1}*x_{2}*x_{3}
(Note that the d_{ij}‘s are also parameters of the model.)
Standard Designs for Mixture Experiments
Two different types of standard designs are commonly used for experiments with mixtures. Both of them will evaluate the triangular response surface at the vertices (i.e., the corners of the triangle) and the centroids (sides of the triangle). Sometimes, those designs are enhanced with additional interior points.
Simplexlattice designs. In this arrangement of design points, m+1 equally spaced proportions are tested for each factor or component in the model:
x_{i} = 0, 1/m, 2/m, …, 1 i = 1,2,…,q
and all combinations of factor levels are tested. The resulting design is called a {q,m} simplex lattice design. For example, a {q=3, m=2} simplex lattice design will include the following mixtures:
A  B  C 

1 0 0 .5 .5 0 
0 1 0 .5 0 .5 
0 0 1 0 .5 .5 
A {q=3,m=3} simplex lattice design will include the points:
A  B  C 

1 0 0 1/3 1/3 0 2/3 2/3 0 1/3 
0 1 0 2/3 0 1/3 1/3 0 2/3 1/3 
0 0 1 0 2/3 2/3 0 1/3 1/3 1/3 
Simplexcentroid designs. An alternative arrangement of settings introduced by Scheffé (1963) is the socalled simplexcentroid design. Here the design points correspond to all permutations of the pure blends (e.g., 1 0 0; 0 1 0; 0 0 1), the permutations of the binary blends (½ ½ 0; ½ 0 ½; 0 ½ ½), the permutations of the blends involving three components, and so on. For example, for 3 factors the simplex centroid design consists of the points:
A  B  C 

1 0 0 1/2 1/2 0 1/3 
0 1 0 1/2 0 1/2 1/3 
0 0 1 0 1/2 1/2 1/3 
Adding interior points. These designs are sometimes augmented with interior points (see Khuri and Cornell, 1987, page 343; Mason, Gunst, Hess; 1989; page 230). For example, for 3 factors you could add the interior points:
A  B  C 

2/3 1/6 1/6 
1/6 2/3 1/6 
1/6 1/6 2/3 
If you plot these points in a scatterplot with triangular coordinates, you can see how these designs evenly cover the experimental region defined by the triangle.
Lower Constraints
The designs described above all require vertex points, that is, pure blends consisting of only one ingredient. In practice, those points may often not be valid, that is, pure blends cannot be produced because of cost or other constraints. For example, suppose you wanted to study the effect of a food additive on the taste of the fruitpunch. The additional ingredient may only be varied within small limits, for example, it may not exceed a certain percentage of the total. Clearly, a fruit punch that is a pure blend, consisting only of the additive, would not be a fruit punch at all, or worse, may be toxic. These types of constraints are very common in many applications of mixture experiments.
Let’s consider a 3component example, where component A is constrained so that x_{A}.3. The total of the 3component mixture must be equal to 1. This constraint can be visualized in a triangular graph by a line at the triangular coordinate for x_{A}=.3, that is, a line that is parallel to the triangle’s edge opposite to the A vertex point.
You can now construct the design as before, except that one side of the triangle is defined by the constraint. Later, in the analysis, you can review the parameter estimates for the socalled pseudocomponents, treating the constrained triangle as if it were a full triangle.
Multiple constraints. Multiple lower constraints can be treated analogously, that is, you can construct the subtriangle within the full triangle, and then place the design points in that subtriangle according to the chosen design.
Upper and Lower Constraints
When there are both upper and lower constraints (as is often the case in experiments involving mixtures), then the standard simplexlattice and simplexcentroid designs can no longer be constructed, because the subregion defined by the constraints is no longer a triangle. There is a general algorithm for finding the vertex and centroid points for such constrained designs.
Note that you can still analyze such designs by fitting the standard models to the data.
Analyzing Mixture Experiments
The analysis of mixture experiments amounts to a multiple regression with the intercept set to zero. As explained earlier, the mixture constraint — that the sum of all components must be constant — can be accommodated by fitting multiple regression models that do not include an intercept term. If you are not familiar with multiple regression, you may want to review at this point Multiple Regression.
The specific models that are usually considered were described earlier. To summarize, you fit to the dependent variable response surfaces of increasing complexity, that is, starting with the linear model, then the quadratic model, special cubic model, and full cubic model. Shown below is a table with the number of terms or parameters in each model, for a selected number of components (see also Table 4, Cornell, 1990):
Model (Degree of Polynomial) 


No. of



Special

Full

2

2

3

—

—

Analysis of Variance
To decide which of the models of increasing complexity provides a sufficiently good fit to the observed data, you usually compare the models in a hierarchical, stepwise fashion. For example, consider a 3 component mixture to which the full cubic model was fitted.
ANOVA; Var.:DV (mixt4.sta)  

3 Factor mixture design; Mixture total=1., 14 Runs Sequential fit of models of increasing complexity 

Model 
SS Effect 
df Effect 
MS Effect 
SS Error 
df Error 
MS Error 
F 
p 
Rsqr 
Rsqr Adj. 
Linear Quadratic Special Cubic Cubic Total Adjusted 
44.755 30.558 .719 8.229 91.627 
2 3 1 3 13 
22.378 10.186 .719 2.743 7.048 
46.872 16.314 15.596 7.367 
11 8 7 4 
4.2611 2.0393 2.2279 1.8417 
5.2516 4.9949 .3225 1.4893 
.0251 .0307 .5878 .3452 
.4884 .8220 .8298 .9196 
.3954 .7107 .6839 .7387 
First, the linear model was fit to the data. Even though this model has 3 parameters, one for each component, this model has only 2 degrees of freedom. This is because of the overall mixture constraint, that the sum of all component values is constant. The simultaneous test for all parameters of this model is statistically significant (F(2,11)=5.25; p<.05). The addition of the 3 quadratic model parameters (b_{12}*x_{1}*x_{2}, b_{13}*x_{1}*x_{3}, b_{23}*x_{2}*x_{3}) further significantly improves the fit of the model (F(3,8)=4.99; p<.05). However, adding the parameters for the special cubic and cubic models does not significantly improve the fit of the surface. Thus you could conclude that the quadratic model provides an adequate fit to the data (of course, pending further examination of the residuals for outliers, etc.).
Rsquare. The Rsquare value can be interpreted as the proportion of variability around the mean for the dependent variable, that can be accounted for by the respective model. (Note that for non intercept models, some multiple regression programs will only compute the Rsquare value pertaining to the proportion of variance around 0 (zero) accounted for by the independent variables; for more information, see Kvalseth, 1985; Okunade, Chang, and Evans, 1993.)
Pure error and lack of fit. The usefulness of the estimate of pure error for assessing the overall lack of fit was discussed in the context of central composite designs. If some runs in the design were replicated, then you can compute an estimate of error variability based only on the variability between replicated runs. This variability provides a good indication of the unreliability in the measurements, independent of the model that was fit to the data, since it is based on identical factor settings (or blends in this case). You can test the residual variability after fitting the current model against this estimate of pure error. If this test is statistically significant, that is, if the residual variability is significantly larger than the pure error variability, then you can conclude that, most likely, there are additional significant differences between blends that cannot be accounted for by the current model. Thus, there may be an overall lack of fit of the current model. In that case, try a more complex model, perhaps by only adding individual terms of the next higherorder model (e.g., only the b_{13}*x_{1}*x_{3} to the linear model).
Parameter Estimates
Usually, after fitting a particular model, you would next review the parameter estimates. Remember that the linear terms in mixture models are constrained, that is, the sum of the components must be constant. Hence, independent statistical significance tests for the linear components cannot be performed.
PseudoComponents
To allow for scaleindependent comparisons of the parameter estimates, during the analysis, the component settings are customarily recoded to socalled pseudocomponents so that (see also Cornell, 1993, Chapter 3):
x’_{i} = (x_{i}L_{i})/(TotalL)
Here, x’_{i} stands for the i‘th pseudocomponent, x_{i} stands for the original component value, L_{i} stands for the lower constraint (limit) for the i‘th component, L stands for the sum of all lower constraints (limits) for all components in the design, and Total is the mixture total.
The issue of lower constraints was also discussed earlier in this section. If the design is a standard simplexlattice or simplexcentroid design (see above), then this transformation amounts to a rescaling of factors so as to form a subtriangle (subsimplex) as defined by the lower constraints. However, you can compute the parameter estimates based on the original (untransformed) metric of the components in the experiment. If you want to use the fitted parameter values for prediction purposes (i.e., to predict dependent variable values), then the parameters for the untransformed components are often more convenient to use. Note that the results dialog for mixture experiments contains options to make predictions for the dependent variable for userdefined values of the components, in their original metric.
Graph Options
Surface and contour plots. The respective fitted model can be visualized in triangular surface plots or contour plots, which, optionally, can also include the respective fitted function.
Note that the fitted function displayed in the surface and contour plots always pertains to the parameter estimates for the pseudocomponents.
Categorized surface plots. If your design involves replications (and the replications are coded in your data file), then you can use 3D Ternary Plots to look at the respective fit, replication by replication.
Of course, if you have other categorical variables in your study (e.g., operator or experimenter; machine, etc.) you can also categorize the 3D surface plot by those variables.
Trace plots. One aid for interpreting the triangular response surface is the socalled trace plot. Suppose you looked at the contour plot of the response surface for three components. Then, determine a reference blend for two of the components, for example, hold the values for A and B at 1/3 each. Keeping the relative proportions of A and B constant (i.e., equal proportions in this case), you can then plot the estimated response (values for the dependent variable) for different values of C.
If the reference blend for A and B is 1:1, then the resulting line or response trace is the axis for factor C; that is, the line from the C vertex point connecting with the opposite side of the triangle at a right angle. However, trace plots for other reference blends can also be produced. Typically, the trace plot contains the traces for all components, given the current reference blend.
Residual plots. Finally, it is important, after deciding on a model, to review the prediction residuals, in order to identify outliers or regions of misfitfit. In addition, you should review the standard normal probability plot of residuals and the scatterplot of observed versus predicted values. Remember that the multiple regression analysis (i.e., the process of fitting the surface) assumes that the residuals are normally distributed, and you should carefully review the residuals for any apparent outliers.
Designs for Constrained Surfaces and Mixtures
Overview
As mentioned in the context of mixture designs, it often happens in realworld studies that the experimental region of interest is constrained, that is, that not all factors settings can be combined with all settings for the other factors in the study. There is an algorithm suggested by Piepel (1988) and Snee (1985) for finding the vertices and centroids for such constrained regions.
Designs for Constrained Experimental Regions
When in an experiment with many factors, there are constraints concerning the possible values of those factors and their combinations, it is not clear how to proceed. A reasonable approach is to include in the experiments runs at the extreme vertex points and centroid points of the constrained region, which should usually provide good coverage of the constrained experimental region (e.g., see Piepel, 1988; Snee, 1975). In fact, the mixture designs reviewed in the previous section provide examples for such designs, since they are typically constructed to include the vertex and centroid points of the constrained region that consists of a triangle (simplex).
Linear Constraints
One general way in which you can summarize most constraints that occur in real world experimentation is in terms of a linear equation (see Piepel, 1988):
A_{1}x_{1} + A_{2}x_{2} + … + A_{q}x_{q} + A_{0} 0
Here, A_{0}, .., A_{q} are the parameters for the linear constraint on the q factors, and x_{1},.., x_{q} stands for the factor values (levels) for the q factors. This general formula can accommodate even very complex constraints. For example, suppose that in a twofactor experiment the first factor must always be set at least twice as high as the second, that is, x_{1} 2*x_{2}. This simple constraint can be rewritten as x_{1}2*x_{2} 0. The ratio constraint 2*x_{1} /x_{2} 1 can be rewritten as 2*x_{1} – x_{2} 0, and so on.
The problem of multiple upper and lower constraints on the component values in mixtures was discussed earlier, in the context of mixture experiments. For example, suppose in a threecomponent mixture of fruit juices, the upper and lower constraints on the components are (see example 3.2, in Cornell 1993):
40% Watermelon (x_{1}) 80%
10% Pineapple (x_{2}) 50%
10% Orange (x_{3}) 30%
These constraints can be rewritten as linear constraints into the form:
Watermelon: 
x_{1}400 x_{1}+800 
Pineapple: 
x_{2}100 x_{2}+500 
Orange: 
x_{3}100 x_{3}+300 
Thus, the problem of finding design points for mixture experiments with components with multiple upper and lower constraints is only a special case of general linear constraints.
The Piepel & Snee Algorithm
For the special case of constrained mixtures, algorithms such as the XVERT algorithm (see, for example, Cornell, 1990) are often used to find the vertex and centroid points of the constrained region (inside the triangle of three components, tetrahedron of four components, etc.). The general algorithm proposed by Piepel (1988) and Snee (1979) for finding vertices and centroids can be applied to mixtures as well as nonmixtures. The general approach of this algorithm is described in detail by Snee (1979).
Specifically, it will consider onebyone each constraint, written as a linear equation as described above. Each constraint represents a line (or plane) through the experimental region. For each successive constraint you will evaluate whether or not the current (new) constraint crosses into the current valid region of the design. If so, new vertices will be computed which define the new valid experimental region, updated for the most recent constraint. It will then check whether or not any of the previously processed constraints have become redundant, that is, define lines or planes in the experimental region that are now entirely outside the valid region. After all constraints have been processed, it will then compute the centroids for the sides of the constrained region (of the order requested by the user). For the twodimensional (twofactor) case, you can easily recreate this process by simply drawing lines through the experimental region, one for each constraint; what is left is the valid experimental region.
For more information, see Piepel (1988) or Snee (1979).
Choosing Points for the Experiment
Once the vertices and centroids have been computed, you may face the problem of having to select a subset of points for the experiment. If each experimental run is costly, then it may not be feasible to simply run all vertex and centroid points. In particular, when there are many factors and constraints, then the number of centroids can quickly get very large.
If you are screening a large number of factors, and are not interested in nonlinear effects, then choosing the vertex points only will usually yield good coverage of the experimental region. To increase statistical power (to increase the degrees of freedom for the ANOVA error term), you may also want to include a few runs with the factors set at the overall centroid of the constrained region.
If you are considering a number of different models that you might fit once the data have been collected, then you may want to use the D and Aoptimal design options. Those options will help you select the design points that will extract the maximum amount of information from the constrained experimental region, given your models.
Analyzing Designs for Constrained Surfaces and Mixtures
As mentioned in the section on central composite designs and mixture designs, once the constrained design points have been chosen for the final experiment, and the data for the dependent variables of interest have been collected, the analysis of these designs can proceed in the standard manner.
For example, Cornell (1990, page 68) describes an experiment of three plasticizers, and their effect on resultant vinyl thickness (for automobile seat covers). The constraints for the three plasticizers components x_{1}, x_{2}, and x_{3} are:
.409 x_{1} .849
.000 x_{2} .252
.151 x_{3} .274
(Note that these values are already rescaled, so that the total for each mixture must be equal to 1.) The vertex and centroid points generated are:
x_{1}  x_{2}  x_{3} 

.8490 .7260 .4740 .5970 .6615 .7875 .6000 .5355 .7230 
.0000 .0000 .2520 .2520 .1260 .0000 .1260 .2520 .1260 
.1510 .2740 .2740 .1510 .2125 .2125 .2740 .2125 .1510 
Constructing D and AOptimal Designs
Overview
In the sections on standard factorial designs (see 2**(kp) Fractional Factorial Designs and 3**(kp), Box Behnken, and Mixed 2 and 3 Level Factorial Designs) and Central Composite Designs, the property of orthogonality of factor effects was discussed. In short, when the factor level settings for two factors in an experiment are uncorrelated, that is, when they are varied independently of each other, then they are said to be orthogonal to each other. (If you are familiar with matrix and vector algebra, two column vectors X_{1} and X_{2} in the design matrix are orthogonal if X_{1}‘*X_{2}= 0). Intuitively, it should be clear that you can extract the maximum amount of information regarding a dependent variable from the experimental region (the region defined by the settings of the factor levels), if all factor effects are orthogonal to each other. Conversely, suppose you ran a fourrun experiment for two factors as follows:
x_{1}  x_{2}  

Run 1 Run 2 Run 3 Run 4 
1 1 1 1 
1 1 1 1 
Now the columns of factor settings for X_{1} and X_{2} are identical to each other (their correlation is 1), and there is no way in the results to distinguish between the main effect for X_{1} and X_{2}.
The D and Aoptimal design procedures provide various options to select from a list of valid (candidate) points (i.e., combinations of factor settings) those points that will extract the maximum amount of information from the experimental region, given the respective model that you expect to fit to the data. You need to supply the list of candidate points, for example the vertex and centroid points computed by the Designs for constrained surface and mixtures option, specify the type of model you expect to fit to the data, and the number of runs for the experiment. It will then construct a design with the desired number of cases, that will provide as much orthogonality between the columns of the design matrix as possible.
The reasoning behind D and Aoptimality is discussed, for example, in Box and Draper (1987, Chapter 14). The different algorithms used for searching for optimal designs are described in Dykstra (1971), Galil and Kiefer (1980), and Mitchell (1974a, 1974b). A detailed comparison study of the different algorithms is discussed in Cook and Nachtsheim (1980).
Basic Ideas
A technical discussion of the reasoning (and limitations) of D and Aoptimal designs is beyond the scope of this introduction. However, the general ideas are fairly straightforward. Consider again the simple twofactor experiment in four runs.
x_{1}  x_{2}  

Run 1 Run 2 Run 3 Run 4 
1 1 1 1 
1 1 1 1 
As mentioned above, this design, of course, does not allow you to test, independently, the statistical significance of the two variables’ contribution to the prediction of the dependent variable. If you computed the correlation matrix for the two variables, they would correlate at 1:
x_{1}  x_{2}  

x_{1} x_{2} 
1.0 1.0 
1.0 1.0 
Normally, you would run this experiment so that the two factors are varied independently of each other:
x_{1}  x_{2}  

Run 1 Run 2 Run 3 Run 4 
1 1 1 1 
1 1 1 1 
Now the two variables are uncorrelated, that is, the correlation matrix for the two factors is:
x_{1}  x_{2}  

x_{1} x_{2} 
1.0 0.0 
0.0 1.0 
Another term that is customarily used in this context is that the two factors are orthogonal. Technically, if the sum of the products of the elements of two columns (vectors) in the design (design matrix) is equal to 0 (zero), then the two columns are orthogonal.
The determinant of the design matrix. The determinant D of a square matrix (like the 2by2 correlation matrices shown above) is a specific numerical value, that reflects the amount of independence or redundancy between the columns and rows of the matrix. For the 2by2 case, it is simply computed as the product of the diagonal elements minus the offdiagonal elements of the matrix (for larger matrices the computations are more complex). For example, for the two matrices shown above, the determinant D is:
D_{1} = 
1.0 1.0 1.0 1.0 
= 1*1 – 1*1 = 0 
D_{2} = 
1.0 0.0 0.0 1.0 
= 1*1 – 0*0 = 1 
Thus, the determinant for the first matrix computed from completely redundant factor settings is equal to 0. The determinant for the second matrix, when the factors are orthogonal, is equal to 1.
Doptimal designs. This basic relationship extends to larger design matrices, that is, the more redundant the vectors (columns) of the design matrix, the closer to 0 (zero) is the determinant of the correlation matrix for those vectors; the more independent the columns, the larger is the determinant of that matrix. Thus, finding a design matrix that maximizes the determinant D of this matrix means finding a design where the factor effects are maximally independent of each other. This criterion for selecting a design is called the Doptimality criterion.
Matrix notation. Actually, the computations are commonly not performed on the correlation matrix of vectors, but on the simple crossproduct matrix. In matrix notation, if the design matrix is denoted by X, then the quantity of interest here is the determinant of X’X (X– transposed times X). Thus, the search for Doptimal designs aims to maximize X’X, where the vertical lines (..) indicate the determinant.
Aoptimal designs. Looking back at the computations for the determinant, another way to look at the issue of independence is to maximize the diagonal elements of the X’X matrix, while minimizing the offdiagonal elements. The socalled trace criterion or Aoptimality criterion expresses this idea. Technically, the Acriterion is defined as:
A = trace(X’X)^{1}
where trace stands for the sum of the diagonal elements (of the (X’X)^{1} matrix).
The information function. It should be mentioned at this point that Doptimal designs minimize the expected prediction error for the dependent variable, that is, those designs will maximize the precision of prediction, and thus the information (which is defined as the inverse of the error) that is extracted from the experimental region of interest.
Measuring Design Efficiency
A number of standard measures have been proposed to summarize the efficiency of a design.
Defficiency. This measure is related to the Doptimality criterion:
Defficiency = 100 * (X’X^{1/p}/N)
Here, p is the number of factor effects in the design (columns in X), and N is the number of requested runs. This measure can be interpreted as the relative number of runs (in percent) that would be required by an orthogonal design to achieve the same value of the determinant X’X. However, remember that an orthogonal design may not be possible in many cases, that is, it is only a theoretical “yardstick.” Therefore, you should use this measure rather as a relative indicator of efficiency, to compare other designs of the same size, and constructed from the same design points candidate list. Also note that this measure is only meaningful (and will only be reported) if you chose to recode the factor settings in the design (i.e., the factor settings for the design points in the candidate list), so that they have a minimum of 1 and a maximum of +1.
Aefficiency. This measure is related to the Aoptimality criterion:
Aefficiency = 100 * p/trace(N*(X’X)^{1})
Here, p stands for the number of factor effects in the design, N is the number of requested runs, and trace stands for the sum of the diagonal elements (of (N*(X’X)^{1}) ). This measure can be interpreted as the relative number of runs (in percent) that would be required by an orthogonal design to achieve the same value of the trace of (X’X)^{1}. However, again you should use this measure as a relative indicator of efficiency, to compare other designs of the same size and constructed from the same design points candidate list; also this measure is only meaningful if you chose to recode the factor settings in the design to the 1 to +1 range.
Gefficiency. This measure is computed as:
Gefficiency = 100 * square root(p/N)/_{M}
Again, p stands for the number of factor effects in the design and N is the number of requested runs; _{M} (sigma_{M}) stands for the maximum standard error for prediction across the list of candidate points. This measure is related to the socalled G optimality criterion; Goptimal designs are defined as those that will minimize the maximum value of the standard error of the predicted response.
Constructing Optimal Designs
The optimal design facilities will “search for” optimal designs, given a list of “candidate points.” Put another way, given a list of points that specifies which regions of the design are valid or feasible, and given a userspecified number of runs for the final experiment, it will select points to optimize the respective criterion. This “searching for” the best design is not an exact method, but rather an algorithmic procedure that employs certain search strategies to find the best design (according to the respective optimality criterion).
The search procedures or algorithms that have been proposed are described below (for a review and detailed comparison, see Cook and Nachtsheim, 1980). They are reviewed here in the order of speed, that is, the Sequential or Dykstra method is the fastest method, but often most likely to fail, that is, to yield a design that is not optimal (e.g., only locally optimal; this issue will be discussed shortly).
Sequential or Dykstra method. This algorithm is due to Dykstra (1971). Starting with an empty design, it will search through the candidate list of points, and choose in each step the one that maximizes the chosen criterion. There are no iterations involved, they will simply pick the requested number of points sequentially. Thus, this method is the fastest of the ones discussed. Also, by default, this method is used to construct the initial designs for the remaining methods.
Simple exchange (WynnMitchell) method. This algorithm is usually attributed to Mitchell and Miller (1970) and Wynn (1972). The method starts with an initial design of the requested size (by default constructed via the sequential search algorithm described above). In each iteration, one point (run) in the design will be dropped from the design and another added from the list of candidate points. The choice of points to be dropped or added is sequential, that is, at each step the point that contributes least with respect to the chosen optimality criterion (D or A) is dropped from the design; then the algorithm chooses a point from the candidate list so as to optimize the respective criterion. The algorithm stops when no further improvement is achieved with additional exchanges.
DETMAX algorithm (exchange with excursions). This algorithm, due to Mitchell (1974b), is probably the best known and most widely used optimal design search algorithm. Like the simple exchange method, first an initial design is constructed (by default, via the sequential search algorithm described above). The search begins with a simple exchange as described above. However, if the respective criterion (D or A) does not improve, the algorithm will undertake excursions. Specifically, the algorithm will add or subtract more than one point at a time, so that, during the search, the number of points in the design may vary between N_{D}+ N_{excursion} and N_{D}– N_{excursion}, where N_{D} is the requested design size, and N_{excursion} refers to the maximum allowable excursion, as specified by the user. The iterations will stop when the chosen criterion (D or A) no longer improves within the maximum excursion.
Modified Fedorov (simultaneous switching). This algorithm represents a modification (Cook and Nachtsheim, 1980) of the basic Fedorov algorithm described below. It also begins with an initial design of the requested size (by default constructed via the sequential search algorithm). In each iteration, the algorithm will exchange each point in the design with one chosen from the candidate list, so as to optimize the design according to the chosen criterion (D or A). Unlike the simple exchange algorithm described above, the exchange is not sequential, but simultaneous. Thus, in each iteration each point in the design is compared with each point in the candidate list, and the exchange is made for the pair that optimizes the design. The algorithm terminates when there are no further improvements in the respective optimality criterion.
Fedorov (simultaneous switching). This is the original simultaneous switching method proposed by Fedorov (see Cook and Nachtsheim, 1980). The difference between this procedure and the one described above (modified Fedorov) is that in each iteration only a single exchange is performed, that is, in each iteration all possible pairs of points in the design and those in the candidate list are evaluated. The algorithm will then exchange the pair that optimizes the design (with regard to the chosen criterion). Thus, it is easy to see that this algorithm potentially can be somewhat slow, since in each iteration N_{D}*N_{C} comparisons are performed, in order to exchange a single point.
General Recommendations
If you think about the basic strategies represented by the different algorithms described above, it should be clear that there are usually no exact solutions to the optimal design problem. Specifically, the determinant of the X’X matrix (and trace of its inverse) are complex functions of the list of candidate points. In particular, there are usually several “local minima” with regard to the chosen optimality criterion; for example, at any point during the search a design may appear optimal unless you simultaneously discard half of the points in the design and choose certain other points from the candidate list; but, if you only exchange individual points or only a few points (via DETMAX), then no improvement occurs.
Therefore, it is important to try a number of different initial designs and algorithms. If after repeating the optimization several times with random starts the same, or very similar, final optimal design results, then you can be reasonably sure that you are not “caught” in a local minimum or maximum.
Also, the methods described above vary greatly with regard to their ability to get “trapped” in local minima or maxima. As a general rule, the slower the algorithm (i.e., the further down on the list of algorithms described above), the more likely is the algorithm to yield a truly optimal design. However, note that the modified Fedorov algorithm will practically perform just as well as the unmodified algorithm (see Cook and Nachtsheim, 1980); therefore, if time is not a consideration, we recommend the modified Fedorov algorithm as the best method to use.
Doptimality and Aoptimality. For computational reasons (see Galil and Kiefer, 1980), updating the trace of a matrix (for the Aoptimality criterion) is much slower than updating the determinant (for Doptimality). Thus, when you choose the Aoptimality criterion, the computations may require significantly more time as compared to the Doptimality criterion. Since in practice, there are many other factors that will affect the quality of an experiment (e.g., the measurement reliability for the dependent variable), we generally recommend that you use the D optimality criterion. However, in difficult design situations, for example, when there appear to be many local maxima for the D criterion, and repeated trials yield very different results, you may want to run several optimization trials using the A criterion to learn more about the different types of designs that are possible.
Avoiding Matrix Singularity
It may happen during the search process that it cannot compute the inverse of the X’X matrix (for Aoptimality), or that the determinant of the matrix becomes almost 0 (zero). At that point, the search can usually not continue. To avoid this situation, perform the optimization based on an augmented X’X matrix:
X’X_{augmented} = X’X + *(X_{0}‘X_{0}/N_{0})
where X_{0} stands for the design matrix constructed from the list of all N_{0} candidate points, and (alpha) is a userdefined small constant. Thus, you can turn off this feature by setting to 0 (zero).
“Repairing” Designs
The optimal design features can be used to “repair” designs. For example, suppose you ran an orthogonal design, but some data were lost (e.g., due to equipment malfunction), and now some effects of interest can no longer be estimated. You could of course make up the lost runs, but suppose you do not have the resources to redo them all. In that case, you can set up the list of candidate points from among all valid points for the experimental region, add to that list all the points that you have already run, and instruct it to always force those points into the final design (and never to drop them out; you can mark points in the candidate list for such forced inclusion). It will then only consider to exclude those points from the design that you did not actually run. In this manner you can, for example, find the best single run to add to an existing experiment, that would optimize the respective criterion.
Constrained Experimental Regions and Optimal Design
A typical application of the optimal design features is to situations when the experimental region of interest is constrained. As described earlier in this section, there are facilities for finding vertex and centroid points for linearly constrained regions and mixtures. Those points can then be submitted as the candidate list for constructing an optimal design of a particular size for a particular model. Thus, these two facilities combined provide a very powerful tool to cope with the difficult design situation when the design region of interest is subject to complex constraints, and you want to fit particular models with the least number of runs.
Special Topics
The following sections introduce several analysis techniques. The sections describe Response/desirability profiling, conducting Residual analyses, and performing BoxCox transformations of the dependent variable.
See also ANOVA/MANOVA, Methods for Analysis of Variance, and Variance Components and Mixed Model ANOVA/ANCOVA.
Profiling Predicted Responses and Response Desirability
Basic Idea. A typical problem in product development is to find a set of conditions, or levels of the input variables, that produces the most desirable product in terms of its characteristics, or responses on the output variables. The procedures used to solve this problem generally involve two steps: (1) predicting responses on the dependent, or Y variables, by fitting the observed responses using an equation based on the levels of the independent, or X variables, and (2) finding the levels of the X variables which simultaneously produce the most desirable predicted responses on the Y variables. Derringer and Suich (1980) give, as an example of these procedures, the problem of finding the most desirable tire tread compound. There are a number of Y variables, such as PICO Abrasion Index, 200 percent modulus, elongation at break, and hardness. The characteristics of the product in terms of the response variables depend on the ingredients, the X variables, such as hydrated silica level, silane coupling agent level, and sulfur. The problem is to select the levels for the X’s which will maximize the desirability of the responses on the Y’s. The solution must take into account the fact that the levels for the X’s that maximize one response may not maximize a different response.
When analyzing 2**(kp) (twolevel factorial) designs, 2level screening designs, 2**(kp) maximally unconfounded and minimum aberration designs, 3**(kp) and Box Behnken designs, Mixed 2 and 3 level designs, central composite designs, and mixture designs, Response/desirability profiling allows you to inspect the response surface produced by fitting the observed responses using an equation based on levels of the independent variables.
Prediction Profiles. When you analyze the results of any of the designs listed above, a separate prediction equation for each dependent variable (containing different coefficients but the same terms) is fitted to the observed responses on the respective dependent variable. Once these equations are constructed, predicted values for the dependent variables can be computed at any combination of levels of the predictor variables. A prediction profile for a dependent variable consists of a series of graphs, one for each independent variable, of the predicted values for the dependent variable at different levels of one independent variable, holding the levels of the other independent variables constant at specified values, called current values. If appropriate current values for the independent variables have been selected, inspecting the prediction profile can show which levels of the predictor variables produce the most desirable predicted response on the dependent variable.
You might be interested in inspecting the predicted values for the dependent variables only at the actual levels at which the independent variables were set during the experiment. Alternatively, you also might be interested in inspecting the predicted values for the dependent variables at levels other than the actual levels of the independent variables used during the experiment, to see if there might be intermediate levels of the independent variables that could produce even more desirable responses. Also, returning to the Derringer and Suich (1980) example, for some response variables, the most desirable values may not necessarily be the most extreme values, for example, the most desirable value of elongation may fall within a narrow range of the possible values.
Response Desirability. Different dependent variables might have different kinds of relationships between scores on the variable and the desirability of the scores. Lessfilling beer may be more desirable, but bettertasting beer can also be more desirable – lower “fillingness” scores and higher “taste” scores are both more desirable. The relationship between predicted responses on a dependent variable and the desirability of responses is called the desirability function. Derringer and Suich (1980) developed a procedure for specifying the relationship between predicted responses on a dependent variable and the desirability of the responses, a procedure that provides for up to three “inflection” points in the function. Returning to the tire tread compound example described above, their procedure involved transforming scores on each of the four tire tread compound outcome variables into desirability scores that could range from 0.0 for undesirable to 1.0 for very desirable. For example, their desirability function for hardness of the tire tread compound was defined by assigning a desirability value of 0.0 to hardness scores below 60 or above 75, a desirability value of 1.0 to midpoint hardness scores of 67.5, a desirability value that increased linearly from 0.0 up to 1.0 for hardness scores between 60 and 67.5 and a desirability value that decreased linearly from 1.0 down to 0.0 for hardness scores between 67.5 and 75.0. More generally, they suggested that procedures for defining desirability functions should accommodate curvature in the “falloff” of desirability between inflection points in the functions.
After transforming the predicted values of the dependent variables at different combinations of levels of the predictor variables into individual desirability scores, the overall desirability of the outcomes at different combinations of levels of the predictor variables can be computed. Derringer and Suich (1980) suggested that overall desirability be computed as the geometric mean of the individual desirabilities (which makes intuitive sense, because if the individual desirability of any outcome is 0.0, or unacceptable, the overall desirability will be 0.0, or unacceptable, no matter how desirable the other individual outcomes are – the geometric mean takes the product of all of the values, and raises the product to the power of the reciprocal of the number of values). Derringer and Suich’s procedure provides a straightforward way for transforming predicted values for multiple dependent variables into a single overall desirability score. The problem of simultaneously optimization of several response variables then boils down to selecting the levels of the predictor variables that maximize the overall desirability of the responses on the dependent variables.
Summary. When you are developing a product whose characteristics are known to depend on the “ingredients” of which it is constituted, producing the best product possible requires determining the effects of the ingredients on each characteristic of the product, and then finding the balance of ingredients that optimizes the overall desirability of the product. In data analytic terms, the procedure that is followed to maximize product desirability is to (1) find adequate models (i.e., prediction equations) to predict characteristics of the product as a function of the levels of the independent variables, and (2) determine the optimum levels of the independent variables for overall product quality. These two steps, if followed faithfully, will likely lead to greater success in product improvement than the fabled, but statistically dubious technique of hoping for accidental breakthroughs and discoveries that radically improve product quality.
Residuals Analysis
Basic Idea. Extended residuals analysis is a collection of methods for inspecting different residual and predicted values, and thus to examine the adequacy of the prediction model, the need for transformations of the variables in the model, and the existence of outliers in the data.
Residuals are the deviations of the observed values on the dependent variable from the predicted values, given the current model. The ANOVA models used in analyzing responses on the dependent variable make certain assumptions about the distributions of residual (but not predicted) values on the dependent variable. These assumptions can be summarized by saying that the ANOVA model assumes normality, linearity, homogeneity of variances and covariances, and independence of residuals. All of these properties of the residuals for a dependent variable can be inspected using Residuals analysis.
BoxCox Transformations of Dependent Variables
Basic Idea. It is assumed in analysis of variance that the variances in the different groups (experimental conditions) are homogeneous, and that they are uncorrelated with the means. If the distribution of values within each experimental condition is skewed, and the means are correlated with the standard deviations, then you can often apply an appropriate power transformation to the dependent variable to stabilize the variances, and to reduce or eliminate the correlation between the means and standard deviations. The BoxCox transformation is useful for selecting an appropriate (power) transformation of the dependent variable.
Selecting the BoxCox transformation option will produce a plot of the Residual Sum of Squares, given the model, as a function of the value of lambda, where lambda is used to define a transformation of the dependent variable,
y’ = ( y**(lambda) – 1 ) / ( g**(lambda1) * lambda)  if lambda 0 
y’ = g * natural log(y)  if lambda = 0 
in which g is the geometric mean of the dependent variable and all values of the dependent variable are nonnegative. The value of lambda for which the Residual Sum of Squares is a minimum is the maximum likelihood estimate for this parameter. It produces the variance stabilizing transformation of the dependent variable that reduces or eliminates the correlation between the group means and standard deviations.
In practice, it is not important that you use the exact estimated value of lambda for transforming the dependent variable. Rather, as a rule of thumb, you should consider the following transformations:
Approximate lambda 
Suggested transformation of y 

1 0.5 0 0.5 1 
Reciprocal Reciprocal square root Natural logarithm Square root None 
For additional information regarding this family of transformations, see Box and Cox (1964), Box and Draper (1987), and Maddala (1977).
Distribution Fitting, Formulate Hypotheses
General Purpose
In some research applications, we can formulate hypotheses about the specific distribution of the variable of interest. For example, variables whose values are determined by an infinite number of independent random events will be distributed following the normal distribution: we can think of a person’s height as being the result of very many independent factors such as numerous specific genetic predispositions, early childhood diseases, nutrition, etc. (see the animation below for an example of the normal distribution). As a result, height tends to be normally distributed in the U.S. population. On the other hand, if the values of a variable are the result of very rare events, then the variable will be distributed according to the Poisson distribution (sometimes called the distribution of rare events). For example, industrial accidents can be thought of as the result of the intersection of a series of unfortunate (and unlikely) events, and their frequency tends to be distributed according to the Poisson distribution. These and other distributions are described in greater detail in the respective glossary topics.
Another common application where distribution fitting procedures are useful is when we want to verify the assumption of normality before using some parametric test (see General Purpose of Nonparametric Tests). For example, you may want to use the KolmogorovSmirnov test for normality or the ShapiroWilks’ W test to test for normality.
Fit of the Observed Distribution
For predictive purposes it is often desirable to understand the shape of the underlying distribution of the population. To determine this underlying distribution, it is common to fit the observed distribution to a theoretical distribution by comparing the frequencies observed in the data to the expected frequencies of the theoretical distribution (i.e., a Chisquare goodness of fit test). In addition to this type a test, some software packages also allow you to compute Maximum Likelihood tests and Method of Matching Moments (see Fitting Distributions by Moments in the Process Analysis topic) tests.
Which Distribution to use. As described above, certain types of variables follow specific distributions. Variables whose values are determined by an infinite number of independent random events will be distributed following the normal distribution, whereas variables whose values are the result of an extremely rare event would follow the Poisson distribution. The major distributions that have been proposed for modeling survival or failure times are the exponential (and linear exponential) distribution, the Weibull distribution of extreme events, and the Gompertz distribution. The section on types of distributions contains a number of distributions generally giving a brief example of what type of data would most commonly follow a specific distribution as well as the probability density function (pdf) for each distribution.
Types of Distributions
Bernoulli Distribution. This distribution best describes all situations where a “trial” is made resulting in either “success” or “failure,” such as when tossing a coin, or when modeling the success or failure of a surgical procedure. The Bernoulli distribution is defined as:
f(x) = p^{x} *(1p)^{1x}, for x = 0, 1
where
p  is the probability that a particular event (e.g., success) will occur. 
Beta Distribution. The beta distribution arises from a transformation of the F distribution and is typically used to model the distribution of order statistics. Because the beta distribution is bounded on both sides, it is often used for representing processes with natural lower and upper limits. For examples, refer to Hahn and Shapiro (1967). The beta distribution is defined as:
f(x) = G (?+?)/[G(?)G(?)] * x^{?1}*(1x)^{?1}, for 0 < x < 1, ? > 0, ? > 0
where
G  is the Gamma function 
?, ?  are the shape parameters (Shape1 and Shape2, respectively) 
The animation above shows the beta distribution as the two shape parameters change.
Binomial Distribution. The binomial distribution is useful for describing distributions of binomial events, such as the number of males and females in a random sample of companies, or the number of defective components in samples of 20 units taken from a production process. The binomial distribution is defined as:
f(x) = [n!/(x!*(nx)!)]*p^{x} * q^{nx}, for x = 0,1,2,…,n
where
p  is the probability that the respective event will occur 
q  is equal to 1p 
n  is the maximum number of independent trials. 
Cauchy Distribution. The Cauchy distribution is interesting for theoretical reasons. Although its mean can be taken as zero, since it is symmetrical about zero, the expectation, variance, higher moments, and moment generating function do not exist. The Cauchy distribution is defined as:
f(x) = 1/(?*p*{1+[(x ?)/ ?]^{2}}), for 0 < ?
where
?  is the location parameter (median) 
?  is the scale parameter 
p  is the constant Pi (3.1415…) 
The animation above shows the changing shape of the Cauchy distribution when the location parameter equals 0 and the scale parameter equals 1, 2, 3, and 4.
Chisquare Distribution. The sum of n independent squared random variables, each distributed following the standard normal distribution, is distributed as Chisquare with n degrees of freedom. This distribution is most frequently used in the modeling of random variables (e.g., representing frequencies) in statistical applications. The Chisquare distribution is defined by:
f(x) = {1/[2^{?/2}* G(?/2)]} * [x^{(?/2)1} * e^{x/2}], for ? = 1, 2, …, 0 < x
where
?  is the degrees of freedom 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
G  (gamma) is the Gamma function. 
The above animation shows the shape of the Chisquare distribution as the degrees of freedom increase (1, 2, 5, 10, 25 and 50).
Exponential Distribution. If T is the time between occurrences of rare events that happen on the average with a rate l per unit of time, then T is distributed exponentially with parameter l (lambda). Thus, the exponential distribution is frequently used to model the time interval between successive random events. Examples of variables distributed in this manner would be the gap length between cars crossing an intersection, lifetimes of electronic devices, or arrivals of customers at the checkout counter in a grocery store. The exponential distribution function is defined as:
f(x) = ? *e^{? x} for 0 = x < 8, ? > 0
where
?  is an exponential function parameter (an alternative parameterization is scale parameter b=1/l) 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
Extreme Value. The extreme value distribution is often used to model extreme events, such as the size of floods, gust velocities encountered by airplanes, maxima of stock market indices over a given year, etc.; it is also often used in reliability testing, for example in order to represent the distribution of failure times for electric circuits (see Hahn and Shapiro, 1967). The extreme value (Type I) distribution has the probability density function:
f(x) = 1/b * e^[(xa)/b] * e^{e^[(xa)/b]}, for 8 < x < 8, b > 0
where
a  is the location parameter 
b  is the scale parameter 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
F Distribution. Snedecor’s F distribution is most commonly used in tests of variance (e.g., ANOVA). The ratio of two chisquares divided by their respective degrees of freedom is said to follow an F distribution. The F distribution (for x > 0) has the probability density function (for n = 1, 2, …; w = 1, 2, …):
f(x) = [G{(?+?)/2}]/[G(?/2)G(?/2)] * (?/?)^{(?/2)} * x^{[(?/2)1]} * {1+[(?/?)*x]}^{[(?+?)/2]}, for 0 = x < 8, ?=1,2,…, ?=1,2,…
where
?, ?  are the shape parameters, degrees of freedom 
G  is the Gamma function 
The animation above shows various tail areas (pvalues) for an F distribution with both degrees of freedom equal to 10.
Gamma Distribution. The probability density function of the exponential distribution has a mode of zero. In many instances, it is known a priori that the mode of the distribution of a particular random variable of interest is not equal to zero (e.g., when modeling the distribution of the lifetimes of a product such as an electric light bulb, or the serving time taken at a ticket booth at a baseball game). In those cases, the gamma distribution is more appropriate for describing the underlying distribution. The gamma distribution is defined as:
f(x) = {1/[bG(c)]}*[x/b]^{c1}*e^{x/b} for 0 = x, c > 0
where
G  is the Gamma function 
c  is the Shape parameter 
b  is the Scale parameter. 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
The animation above shows the gamma distribution as the shape parameter changes from 1 to 6.
Geometric Distribution. If independent Bernoulli trials are made until a “success” occurs, then the total number of trials required is a geometric random variable. The geometric distribution is defined as:
f(x) = p*(1p)^{x}, for x = 1,2,…
where
p  is the probability that a particular event (e.g., success) will occur. 
Gompertz Distribution. The Gompertz distribution is a theoretical distribution of survival times. Gompertz (1825) proposed a probability model for human mortality, based on the assumption that the “average exhaustion of a man’s power to avoid death to be such that at the end of equal infinitely small intervals of time he lost equal portions of his remaining power to oppose destruction which he had at the commencement of these intervals” (Johnson, Kotz, Balakrishnan, 1995, p. 25). The resultant hazard function:
r(x)=Bc^{x}, for x = 0, B > 0, c = 1
is often used in survival analysis. See Johnson, Kotz, Balakrishnan (1995) for additional details.
Laplace Distribution. For interesting mathematical applications of the Laplace distribution see Johnson and Kotz (1995). The Laplace (or Double Exponential) distribution is defined as:
f(x) = 1/(2b) * e^{[(xa/b)]}, for 8 < x < 8
where
a  is the location parameter (mean) 
b  is the scale parameter 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
The graphic above shows the changing shape of the Laplace distribution when the location parameter equals 0 and the scale parameter equals 1, 2, 3, and 4.
Logistic Distribution. The logistic distribution is used to model binary responses (e.g., Gender) and is commonly used in logistic regression. The logistic distribution is defined as:
f(x) = (1/b) * e^{[(xa)/b]} * {1+e^{[(xa)/b]}^2}, for 8 < x < 8, 0 < b
where
a  is the location parameter (mean) 
b  is the scale parameter 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
The graphic above shows the changing shape of the logistic distribution when the location parameter equals 0 and the scale parameter equals 1, 2, and 3.
Lognormal Distribution. The lognormal distribution is often used in simulations of variables such as personal incomes, age at first marriage, or tolerance to poison in animals. In general, if x is a sample from a normal distribution, then y = e^{x} is a sample from a lognormal distribution. Thus, the lognormal distribution is defined as:
f(x) = 1/[xs(2p)^{1/2}] * e^{[log(x)µ]**2/2s**2}, for 0 < x < 8, µ > 0, s > 0
where
µ  is the scale parameter 
s  is the shape parameter 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
The animation above shows the lognormal distribution with mu equal to 0 and sigma equals .10, .30, .50, .70, and .90.
Normal Distribution. The normal distribution (the “bellshaped curve” which is symmetrical about the mean) is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions (see also Elementary Concepts). In general, the normal distribution provides a good model for a random variable, when:
 There is a strong tendency for the variable to take a central value;
 Positive and negative deviations from this central value are equally likely;
 The frequency of deviations falls off rapidly as the deviations become larger.
As an underlying mechanism that produces the normal distribution, we can think of an infinite number of independent random (binomial) events that bring about the values of a particular variable. For example, there are probably a nearly infinite number of factors that determine a person’s height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be normally distributed in the population. The normal distribution function is determined by the following formula:
f(x) = 1/[(2*p)^{1/2}*s] * e**{1/2*[(xµ)/s]^{2} }, for 8 < x < 8
where
µ  is the mean 
s  is the standard deviation 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
p  is the constant Pi (3.14…) 
The animation above shows several tail areas of the standard normal distribution (i.e., the normal distribution with a mean of 0 and a standard deviation of 1). The standard normal distribution is often used in hypothesis testing.
Pareto Distribution. The Pareto distribution is commonly used in monitoring production processes (see Quality Control and Process Analysis). For example, a machine which produces copper wire will occasionally generate a flaw at some point along the wire. The Pareto distribution can be used to model the length of wire between successive flaws. The standard Pareto distribution is defined as:
f(x) = c/x^{c+1}, for 1 = x, c < 0
where
c  is the shape parameter 
The animation above shows the Pareto distribution for the shape parameter equal to 1, 2, 3, 4, and 5.
Poisson Distribution. The Poisson distribution is also sometimes referred to as the distribution of rare events. Examples of Poisson distributed variables are number of accidents per person, number of sweepstakes won per person, or the number of catastrophic defects found in a production process. It is defined as:
f(x) = (?^{x}*e^{?})/x!, for x = 0,1,2,…, 0 < ?
where
?  (lambda) is the expected value of x (the mean) 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
Rayleigh Distribution. If two independent variables y_{1} and y_{2} are independent from each other and normally distributed with equal variance, then the variable x = Ö(y_{1}^{2}+ y_{2}^{2}) will follow the Rayleigh distribution. Thus, an example (and appropriate metaphor) for such a variable would be the distance of darts from the target in a dartthrowing game, where the errors in the two dimensions of the target plane are independent and normally distributed. The Rayleigh distribution is defined as:
f(x) = x/b^{2} * e^[(x^{2}/2b^{2})], for 0 = x < 8, b > 0
where
b  is the scale parameter 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
The graphic above shows the changing shape of the Rayleigh distribution when the scale parameter equals 1, 2, and 3.
Rectangular Distribution. The rectangular distribution is useful for describing random variables with a constant probability density over the defined range a<b.
f(x) = 1/(ba), for a<x<b
= 0 , elsewhere
where
a<b  are constants. 
Student’s t Distribution. The student’s t distribution is symmetric about zero, and its general shape is similar to that of the standard normal distribution. It is most commonly used in testing hypothesis about the mean of a particular population. The student’s t distribution is defined as (for n = 1, 2, . . .):
f(x) = G[(?+1)/2] / G(?/2) * (?*p)^{1/2} * [1 + (x^{2}/?)^{(?+1)/2}]
where
?  is the shape parameter, degrees of freedom 
G  is the Gamma function 
p  is the constant Pi (3.14 . . .) 
The shape of the student’s t distribution is determined by the degrees of freedom. As shown in the animation above, its shape changes as the degrees of freedom increase.
Weibull Distribution. As described earlier, the exponential distribution is often used as a model of timetofailure measurements, when the failure (hazard) rate is constant over time. When the failure probability varies over time, then the Weibull distribution is appropriate. Thus, the Weibull distribution is often used in reliability testing (e.g., of electronic relays, ball bearings, etc.; see Hahn and Shapiro, 1967). The Weibull distribution is defined as:
f(x) = c/b*(x/b)^{(c1)} * e^{[(x/b)^c]}, for 0 = x < 8, b > 0, c > 0
where
b  is the scale parameter 
c  is the shape parameter 
e  is the base of the natural logarithm, sometimes called Euler’s e (2.71…) 
The animation above shows the Weibull distribution as the shape parameter increases (.5, 1, 2, 3, 4, 5, and 10).
Discriminant Function Analysis
Discover Which Variables Discriminate Between Groups, Discriminant Function Analysis
General Purpose
Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups. For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education. For that purpose the researcher could collect data on numerous variables prior to students’ graduation. After graduation, most students will naturally fall into one of the three categories. Discriminant Analysis could then be used to determine which variable(s) are the best predictors of students’ subsequent educational choice.
A medical researcher may record different variables relating to patients’ backgrounds in order to learn which variables best predict whether a patient is likely to recover completely (group 1), partially (group 2), or not at all (group 3). A biologist could record different characteristics of similar types (groups) of flowers, and then perform a discriminant function analysis to determine the set of characteristics that allows for the best discrimination between the types.
Computational Approach
Computationally, discriminant function analysis is very similar to analysis of variance (ANOVA). Let us consider a simple example. Suppose we measure height in a random sample of 50 males and 50 females. Females are, on the average, not as tall as males, and this difference will be reflected in the difference in means (for the variable Height). Therefore, variable height allows us to discriminate between males and females with a better than chance probability: if a person is tall, then he is likely to be a male, if a person is short, then she is likely to be a female.
We can generalize this reasoning to groups and variables that are less “trivial.” For example, suppose we have two groups of high school graduates: Those who choose to attend college after graduation and those who do not. We could have measured students’ stated intention to continue on to college one year prior to graduation. If the means for the two groups (those who actually went to college and those who did not) are different, then we can say that intention to attend college as stated one year prior to graduation allows us to discriminate between those who are and are not college bound (and this information may be used by career counselors to provide the appropriate guidance to the respective students).
To summarize the discussion so far, the basic idea underlying discriminant function analysis is to determine whether groups differ with regard to the mean of a variable, and then to use that variable to predict group membership (e.g., of new cases).
Analysis of Variance. Stated in this manner, the discriminant function problem can be rephrased as a oneway analysis of variance (ANOVA) problem. Specifically, one can ask whether or not two or more groups are significantly different from each other with respect to the mean of a particular variable. To learn more about how one can test for the statistical significance of differences between means in different groups you may want to read the Overview section to ANOVA/MANOVA. However, it should be clear that, if the means for a variable are significantly different in different groups, then we can say that this variable discriminates between the groups.
In the case of a single variable, the final significance test of whether or not a variable discriminates between groups is the F test. As described in Elementary Concepts and ANOVA /MANOVA, F is essentially computed as the ratio of the betweengroups variance in the data over the pooled (average) withingroup variance. If the betweengroup variance is significantly larger then there must be significant differences between means.
Multiple Variables. Usually, one includes several variables in a study in order to see which one(s) contribute to the discrimination between groups. In that case, we have a matrix of total variances and covariances; likewise, we have a matrix of pooled withingroup variances and covariances. We can compare those two matrices via multivariate F tests in order to determined whether or not there are any significant differences (with regard to all variables) between groups. This procedure is identical to multivariate analysis of variance or MANOVA. As in MANOVA, one could first perform the multivariate test, and, if statistically significant, proceed to see which of the variables have significantly different means across the groups. Thus, even though the computations with multiple variables are more complex, the principal reasoning still applies, namely, that we are looking for variables that discriminate between groups, as evident in observed mean differences.
Stepwise Discriminant Analysis
Probably the most common application of discriminant function analysis is to include many measures in the study, in order to determine the ones that discriminate between groups. For example, an educational researcher interested in predicting high school graduates’ choices for further education would probably include as many measures of personality, achievement motivation, academic performance, etc. as possible in order to learn which one(s) offer the best prediction.
Model. Put another way, we want to build a “model” of how we can best predict to which group a case belongs. In the following discussion we will use the term “in the model” in order to refer to variables that are included in the prediction of group membership, and we will refer to variables as being “not in the model” if they are not included.
Forward stepwise analysis. In stepwise discriminant function analysis, a model of discrimination is built stepbystep. Specifically, at each step all variables are reviewed and evaluated to determine which one will contribute most to the discrimination between groups. That variable will then be included in the model, and the process starts again.
Backward stepwise analysis. One can also step backwards; in that case all variables are included in the model and then, at each step, the variable that contributes least to the prediction of group membership is eliminated. Thus, as the result of a successful discriminant function analysis, one would only keep the “important” variables in the model, that is, those variables that contribute the most to the discrimination between groups.
F to enter, F to remove. The stepwise procedure is “guided” by the respective F to enter and F to remove values. The F value for a variable indicates its statistical significance in the discrimination between groups, that is, it is a measure of the extent to which a variable makes a unique contribution to the prediction of group membership. If you are familiar with stepwise multiple regression procedures, then you may interpret the F to enter/remove values in the same way as in stepwise regression.
Capitalizing on chance. A common misinterpretation of the results of stepwise discriminant analysis is to take statistical significance levels at face value. By nature, the stepwise procedures will capitalize on chance because they “pick and choose” the variables to be included in the model so as to yield maximum discrimination. Thus, when using the stepwise approach the researcher should be aware that the significance levels do not reflect the true alpha error rate, that is, the probability of erroneously rejecting H_{0} (the null hypothesis that there is no discrimination between groups).
Interpreting a TwoGroup Discriminant Function
In the twogroup case, discriminant function analysis can also be thought of as (and is analogous to) multiple regression (see Multiple Regression; the twogroup discriminant analysis is also called Fisher linear discriminant analysis after Fisher, 1936; computationally all of these approaches are analogous). If we code the two groups in the analysis as 1 and 2, and use that variable as the dependent variable in a multiple regression analysis, then we would get results that are analogous to those we would obtain via Discriminant Analysis. In general, in the twogroup case we fit a linear equation of the type:
Group = a + b_{1}*x_{1} + b_{2}*x_{2} + … + b_{m}*x_{m}
where a is a constant and b_{1} through b_{m} are regression coefficients. The interpretation of the results of a twogroup problem is straightforward and closely follows the logic of multiple regression: Those variables with the largest (standardized) regression coefficients are the ones that contribute most to the prediction of group membership.
Discriminant Functions for Multiple Groups
When there are more than two groups, then we can estimate more than one discriminant function like the one presented above. For example, when there are three groups, we could estimate (1) a function for discriminating between group 1 and groups 2 and 3 combined, and (2) another function for discriminating between group 2 and group 3. For example, we could have one function that discriminates between those high school graduates that go to college and those who do not (but rather get a job or go to a professional or trade school), and a second function to discriminate between those graduates that go to a professional or trade school versus those who get a job. The b coefficients in those discriminant functions could then be interpreted as before.
Canonical analysis. When actually performing a multiple group discriminant analysis, we do not have to specify how to combine groups so as to form different discriminant functions. Rather, you can automatically determine some optimal combination of variables so that the first function provides the most overall discrimination between groups, the second provides second most, and so on. Moreover, the functions will be independent or orthogonal, that is, their contributions to the discrimination between groups will not overlap. Computationally, you will perform a canonical correlation analysis (see also Canonical Correlation) that will determine the successive functions and canonical roots (the term root refers to the eigenvalues that are associated with the respective canonical function). The maximum number of functions will be equal to the number of groups minus one, or the number of variables in the analysis, whichever is smaller.
Interpreting the discriminant functions. As before, we will get b (and standardized beta) coefficients for each variable in each discriminant (now also called canonical) function, and they can be interpreted as usual: the larger the standardized coefficient, the greater is the contribution of the respective variable to the discrimination between groups. (Note that we could also interpret the structure coefficients; see below.) However, these coefficients do not tell us between which of the groups the respective functions discriminate. We can identify the nature of the discrimination for each discriminant (canonical) function by looking at the means for the functions across groups. We can also visualize how the two functions discriminate between groups by plotting the individual scores for the two discriminant functions (see the example graph below).
In this example, Root (function) 1 seems to discriminate mostly between groups Setosa, and Virginic and Versicol combined. In the vertical direction (Root 2), a slight trend of Versicol points to fall below the center line (0) is apparent.
Factor structure matrix. Another way to determine which variables “mark” or define a particular discriminant function is to look at the factor structure. The factor structure coefficients are the correlations between the variables in the model and the discriminant functions; if you are familiar with factor analysis (see Factor Analysis) you may think of these correlations as factor loadings of the variables on each discriminant function.
Some authors have argued that these structure coefficients should be used when interpreting the substantive “meaning” of discriminant functions. The reasons given by those authors are that (1) supposedly the structure coefficients are more stable, and (2) they allow for the interpretation of factors (discriminant functions) in the manner that is analogous to factor analysis. However, subsequent Monte Carlo research (Barcikowski & Stevens, 1975; Huberty, 1975) has shown that the discriminant function coefficients and the structure coefficients are about equally unstable, unless the n is fairly large (e.g., if there are 20 times more cases than there are variables). The most important thing to remember is that the discriminant function coefficients denote the unique (partial) contribution of each variable to the discriminant function(s), while the structure coefficients denote the simple correlations between the variables and the function(s). If one wants to assign substantive “meaningful” labels to the discriminant functions (akin to the interpretation of factors in factor analysis), then the structure coefficients should be used (interpreted); if one wants to learn what is each variable’s unique contribution to the discriminant function, use the discriminant function coefficients (weights).
Significance of discriminant functions. One can test the number of roots that add significantly to the discrimination between group. Only those found to be statistically significant should be used for interpretation; nonsignificant functions (roots) should be ignored.
Summary. To summarize, when interpreting multiple discriminant functions, which arise from analyses with more than two groups and more than one variable, one would first test the different functions for statistical significance, and only consider the significant functions for further examination. Next, we would look at the standardized b coefficients for each variable for each significant function. The larger the standardized b coefficient, the larger is the respective variable’s unique contribution to the discrimination specified by the respective discriminant function. In order to derive substantive “meaningful” labels for the discriminant functions, one can also examine the factor structure matrix with the correlations between the variables and the discriminant functions. Finally, we would look at the means for the significant discriminant functions in order to determine between which groups the respective functions seem to discriminate.
Assumptions
As mentioned earlier, discriminant function analysis is computationally very similar to MANOVA, and all assumptions for MANOVA mentioned in ANOVA/MANOVA apply. In fact, you may use the wide range of diagnostics and statistical tests of assumption that are available to examine your data for the discriminant analysis.
Normal distribution. It is assumed that the data (for the variables) represent a sample from a multivariate normal distribution. You can examine whether or not variables are normally distributed with histograms of frequency distributions. However, note that violations of the normality assumption are usually not “fatal,” meaning, that the resultant significance tests etc. are still “trustworthy.” You may use specific tests for normality in addition to graphs.
Homogeneity of variances/covariances. It is assumed that the variance/covariance matrices of variables are homogeneous across groups. Again, minor deviations are not that important; however, before accepting final conclusions for an important study it is probably a good idea to review the withingroups variances and correlation matrices. In particular a scatterplot matrix can be produced and can be very useful for this purpose. When in doubt, try rerunning the analyses excluding one or two groups that are of less interest. If the overall results (interpretations) hold up, you probably do not have a problem. You may also use the numerous tests available to examine whether or not this assumption is violated in your data. However, as mentioned in ANOVA/MANOVA, the multivariate Box M test for homogeneity of variances/covariances is particularly sensitive to deviations from multivariate normality, and should not be taken too “seriously.”
Correlations between means and variances. The major “real” threat to the validity of significance tests occurs when the means for variables across groups are correlated with the variances (or standard deviations). Intuitively, if there is large variability in a group with particularly high means on some variables, then those high means are not reliable. However, the overall significance tests are based on pooled variances, that is, the average variance across all groups. Thus, the significance tests of the relatively larger means (with the large variances) would be based on the relatively smaller pooled variances, resulting erroneously in statistical significance. In practice, this pattern may occur if one group in the study contains a few extreme outliers, who have a large impact on the means, and also increase the variability. To guard against this problem, inspect the descriptive statistics, that is, the means and standard deviations or variances for such a correlation.
The matrix illconditioning problem. Another assumption of discriminant function analysis is that the variables that are used to discriminate between groups are not completely redundant. As part of the computations involved in discriminant analysis, you will invert the variance/covariance matrix of the variables in the model. If any one of the variables is completely redundant with the other variables then the matrix is said to be illconditioned, and it cannot be inverted. For example, if a variable is the sum of three other variables that are also in the model, then the matrix is illconditioned.
Tolerance values. In order to guard against matrix illconditioning, constantly check the socalled tolerance value for each variable. This tolerance value is computed as 1 minus Rsquare of the respective variable with all other variables included in the current model. Thus, it is the proportion of variance that is unique to the respective variable. You may also refer to Multiple Regression to learn more about multiple regression and the interpretation of the tolerance value. In general, when a variable is almost completely redundant (and, therefore, the matrix illconditioning problem is likely to occur), the tolerance value for that variable will approach 0.
Classification
Another major purpose to which discriminant analysis is applied is the issue of predictive classification of cases. Once a model has been finalized and the discriminant functions have been derived, how well can we predict to which group a particular case belongs?
A priori and post hoc predictions. Before going into the details of different estimation procedures, we would like to make sure that this difference is clear. Obviously, if we estimate, based on some data set, the discriminant functions that best discriminate between groups, and then use the same data to evaluate how accurate our prediction is, then we are very much capitalizing on chance. In general, one will always get a worse classification when predicting cases that were not used for the estimation of the discriminant function. Put another way, post hoc predictions are always better than a priori predictions. (The trouble with predicting the future a priori is that one does not know what will happen; it is much easier to find ways to predict what we already know has happened.) Therefore, one should never base one’s confidence regarding the correct classification of future observations on the same data set from which the discriminant functions were derived; rather, if one wants to classify cases predictively, it is necessary to collect new data to “try out” (crossvalidate) the utility of the discriminant functions.
Classification functions. These are not to be confused with the discriminant functions. The classification functions can be used to determine to which group each case most likely belongs. There are as many classification functions as there are groups. Each function allows us to compute classification scores for each case for each group, by applying the formula:
S_{i} = c_{i} + w_{i1}*x_{1} + w_{i2}*x_{2} + … + w_{im}*x_{m}
In this formula, the subscript i denotes the respective group; the subscripts 1, 2, …, m denote the m variables; c_{i} is a constant for the i‘th group, w_{ij} is the weight for the j‘th variable in the computation of the classification score for the i‘th group; x_{j} is the observed value for the respective case for the j‘th variable. S_{i} is the resultant classification score.
We can use the classification functions to directly compute classification scores for some new observations.
Classification of cases. Once we have computed the classification scores for a case, it is easy to decide how to classify the case: in general we classify the case as belonging to the group for which it has the highest classification score (unless the a priori classification probabilities are widely disparate; see below). Thus, if we were to study high school students’ postgraduation career/educational choices (e.g., attending college, attending a professional or trade school, or getting a job) based on several variables assessed one year prior to graduation, we could use the classification functions to predict what each student is most likely to do after graduation. However, we would also like to know the probability that the student will make the predicted choice. Those probabilities are called posterior probabilities, and can also be computed. However, to understand how those probabilities are derived, let us first consider the socalled Mahalanobis distances.
Mahalanobis distances. You may have read about these distances in other parts of the manual. In general, the Mahalanobis distance is a measure of distance between two points in the space defined by two or more correlated variables. For example, if there are two variables that are uncorrelated, then we could plot points (cases) in a standard twodimensional scatterplot; the Mahalanobis distances between the points would then be identical to the Euclidean distance; that is, the distance as, for example, measured by a ruler. If there are three uncorrelated variables, we could also simply use a ruler (in a 3D plot) to determine the distances between points. If there are more than 3 variables, we cannot represent the distances in a plot any more. Also, when the variables are correlated, then the axes in the plots can be thought of as being nonorthogonal; that is, they would not be positioned in right angles to each other. In those cases, the simple Euclidean distance is not an appropriate measure, while the Mahalanobis distance will adequately account for the correlations.
Mahalanobis distances and classification. For each group in our sample, we can determine the location of the point that represents the means for all variables in the multivariate space defined by the variables in the model. These points are called group centroids. For each case we can then compute the Mahalanobis distances (of the respective case) from each of the group centroids. Again, we would classify the case as belonging to the group to which it is closest, that is, where the Mahalanobis distance is smallest.
Posterior classification probabilities. Using the Mahalanobis distances to do the classification, we can now derive probabilities. The probability that a case belongs to a particular group is basically proportional to the Mahalanobis distance from that group centroid (it is not exactly proportional because we assume a multivariate normal distribution around each centroid). Because we compute the location of each case from our prior knowledge of the values for that case on the variables in the model, these probabilities are called posterior probabilities. In summary, the posterior probability is the probability, based on our knowledge of the values of other variables, that the respective case belongs to a particular group. Some software packages will automatically compute those probabilities for all cases (or for selected cases only for crossvalidation studies).
A priori classification probabilities. There is one additional factor that needs to be considered when classifying cases. Sometimes, we know ahead of time that there are more observations in one group than in any other; thus, the a priori probability that a case belongs to that group is higher. For example, if we know ahead of time that 60% of the graduates from our high school usually go to college (20% go to a professional school, and another 20% get a job), then we should adjust our prediction accordingly: a priori, and all other things being equal, it is more likely that a student will attend college that choose either of the other two options. You can specify different a priori probabilities, which will then be used to adjust the classification of cases (and the computation of posterior probabilities) accordingly.
In practice, the researcher needs to ask him or herself whether the unequal number of cases in different groups in the sample is a reflection of the true distribution in the population, or whether it is only the (random) result of the sampling procedure. In the former case, we would set the a priori probabilities to be proportional to the sizes of the groups in our sample, in the latter case we would specify the a priori probabilities as being equal in each group. The specification of different a priori probabilities can greatly affect the accuracy of the prediction.
Summary of the prediction. A common result that one looks at in order to determine how well the current classification functions predict group membership of cases is the classification matrix. The classification matrix shows the number of cases that were correctly classified (on the diagonal of the matrix) and those that were misclassified.
Another word of caution. To reiterate, post hoc predicting of what has happened in the past is not that difficult. It is not uncommon to obtain very good classification if one uses the same cases from which the classification functions were computed. In order to get an idea of how well the current classification functions “perform,” one must classify (a priori) different cases, that is, cases that were not used to estimate the classification functions. You can include or exclude cases from the computations; thus, the classification matrix can be computed for “old” cases as well as “new” cases. Only the classification of new cases allows us to assess the predictive validity of the classification functions (see also crossvalidation); the classification of old cases only provides a useful diagnostic tool to identify outliers or areas where the classification function seems to be less adequate.
Summary. In general Discriminant Analysis is a very useful tool (1) for detecting the variables that allow the researcher to discriminate between different (naturally occurring) groups, and (2) for classifying cases into different groups with a better than chance accuracy.
Data Mining Techniques
 Data Mining
 Crucial Concepts in Data Mining
 Data Warehousing
 OnLine Analytic Processing (OLAP)
 Exploratory Data Analysis (EDA) and Data Mining Techniques
 Neural Networks
Data Mining
Data Mining is an analytic process designed to explore data (usually large amounts of data – typically business or market related – also known as “big data”) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction – and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration. This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and – in case of data sets with large numbers of variables (“fields”) – performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere between a simple choice of straightforward predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.
Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal – many of which are based on socalled “competitive evaluation of models,” that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques – which are often considered the core of predictive data mining – include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and MetaLearning.
Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it shares with them both some components of its general approaches and specific techniques.
However, an important general difference in the focus and purpose between Data Mining and the traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards applications than the basic nature of the underlying phenomena. In other words, Data Mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among others a “black box” approach to data exploration or knowledge discovery and uses not only the traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.
Data Mining is often considered to be “a blend of statistics, AI (artificial intelligence), and data base research” (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of interest for statisticians, and was even considered by some “a dirty word in Statistics” (Pregibon, 1997, p. 8). Due to its applied importance, however, the field emerges as a rapidly growing and major area (also in statistics) where important theoretical advances are being made (see, for example, the recent annual International Conferences on Knowledge Discovery and Data Mining, cohosted by the American Statistical Association).
For information on Data Mining techniques, review the summary topics included below. There are numerous books that review the theory and practice of data mining; the following books offer a representative sample of recent general books on data mining, representing a variety of approaches and perspectives:
Berry, M., J., A., & Linoff, G., S., (2000). Mastering data mining. New York: Wiley.
Edelstein, H., A. (1999). Introduction to data mining and knowledge discovery (3rd ed). Potomac, MD: Two Crows Corp.
Fayyad, U. M., PiatetskyShapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery & data mining. Cambridge, MA: MIT Press.
Han, J., Kamber, M. (2000). Data mining: Concepts and Techniques. New York: MorganKaufman.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning : Data mining, inference, and prediction. New York: Springer.
Pregibon, D. (1997). Data Mining. Statistical Computing and Graphics, 7, 8.
Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: MorganKaufman.
Westphal, C., Blaxton, T. (1998). Data mining solutions. New York: Wiley.
Witten, I. H., & Frank, E. (2000). Data mining. New York: MorganKaufmann.
Crucial Concepts in Data Mining
Bagging (Voting, Averaging)
The concept of bagging (voting for classification, averaging for regressiontype problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (prediction) from multiple models, or from the same type of model for different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the dataset from which to train the model (learning data set, which contains observed classifications) is relatively small. You could repeatedly subsample (with replacement) from the dataset, and apply, for example, a tree classifier (e.g., C&RT and CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small data sets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some simple voting: The final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the Boosting procedure.
Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging).
A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low). In the context of C&RT for example, different misclassification costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the reweighted data).
Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an “expert” in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification.
Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random subsampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the boosting procedure).
CRISP
See Models for Data Mining.
Data Preparation (in Data Mining)
Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying “garbageingarbageout” is particularly applicable to the typical data mining projects where large data sets collected via some automatic methods (e.g., via the Web) serve as the input into the analyses. Often, the method by which the data where gathered was not tightly controlled, and so the data may contain outofrange values (e.g., Income: 100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that has not been carefully screened for such problems can produce highly misleading results, in particular in predictive data mining.
Data Reduction (for Data Mining)
The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large datasets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated techniques like clustering, principal components analysis, etc.
See also predictive data mining, drilldown analysis.
Deployment
The concept of deployment in predictive data mining refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, we usually want to deploy those models so that predictions or predicted classifications can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models (e.g., neural networks, metalearner) to quickly identify transactions which have a high probability of being fraudulent.
DrillDown Analysis
The concept of drilldown analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases. The process of drilldown analyses begins by considering some simple breakdowns of the data by a few variables of interest (e.g., Gender, geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next, we may want to “drilldown” to expose and further analyze the data “underneath” one of the categorizations, for example, we might want to further review the data for males from the midwest. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further breakdowns by other variables (e.g., income, age, etc.). At the lowest (“bottom”) level are the raw data: For example, you may want to review the addresses of male customers from one region, for a certain income group, etc., and to offer to those customers some particular services of particular utility to that group.
Feature Selection
One of the preliminary stage in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerized) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalized linear models, or general linear models become impractical when the number of predictors exceed more than a few hundred variables.
Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. Therefore, this is used as a preprocessor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.
Machine Learning
Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining, to denote the application of generic modelfitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the “models” or techniques that are used to generate the prediction is interpretable or open to simple explanation. Good examples of this type of technique often applied to predictive data mining are neural networks or metalearning techniques such as boosting, etc. These methods usually involve the fitting of very complex “generic” models, that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classification in crossvalidation samples.
MetaLearning
The concept of metalearning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).
Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a crossvalidation sample, from which overall goodnessoffit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). The predictions from different classifiers can be used as input into a metalearner, which will attempt to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network metaclassifier, which will attempt to “learn” from the data how to combine the predictions from the different models to yield maximum classification accuracy.
We can apply metalearners to the results from different metalearners to create “metameta”learners, and so on; however, in practice such exponential increase in the amount of data processing, in order to derive an accurate prediction, will yield less and less marginal utility.
Models for Data Mining
In the business environment, complex data mining projects may require the coordinate efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various “general frameworks” have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.
One such model, CRISP (CrossIndustry Standard Process for data mining) was proposed in the mid1990s by a European consortium of companies to serve as a nonproprietary standard process model for data mining. This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects:
Another approach – the Six Sigma methodology – is a wellstructured, datadriven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to gain favor worldwide. It postulated a sequence of, socalled, DMAIC steps –
– that grew up from the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including “production of services,” i.e., service industries).
Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute called SEMMA –
– which is focusing more on the technical activities typically involved in a data mining project.
All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to “convert data into information,” how to involve important stakeholders, and how to disseminate the information in a form that can easily be converted by stakeholders into resources for strategic decision making.
Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks.
The general underlying philosophy of StatSoft’s STATISTICA Data Miner is to provide a flexible data mining workbench that can be integrated into any organization, industry, or organizational culture, regardless of the general data mining processmodel that the organization chooses to adopt. For example, STATISTICA Data Miner can include the complete set of (specific) necessary tools for ongoing company wide Six Sigma quality control efforts, and users can take advantage of its (still optional) DMAICcentric user interface for industrial data mining tools. It can equally well be integrated into ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either the CRISP or SEMMA approach – it fits both of them perfectly well without favoring either one. Also, STATISTICA Data Miner offers all the advantages of a general data mining oriented “development kit” that includes easy to use tools for incorporating into your projects not only such components as custom database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems of access privileges, workgroup management, and other collaborative work tools that allow you to design large scale, enterprisewide systems (e.g., following the CRISP, SEMMA, or a combination of both models) that involve your entire organization.
Predictive Data Mining
The term Predictive Data Mining is usually applied to identify data mining projects with the goal to identify a statistical or neural network model or set of models that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining, to derive a (trained) model or set of models (e.g., neural networks, metalearner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to identify cluster or segments of customers), in which case drilldown descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks).
SEMMA
See Models for Data Mining.
Stacked Generalization
See Stacking.
Stacking (Stacked Generalization)
The concept of stacking (Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different.
Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a crossvalidation sample, from which overall goodnessoffit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). In stacking, the predictions from different classifiers are used as input into a metalearner, which attempts to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network metaclassifier, which will attempt to “learn” from the data how to combine the predictions from the different models to yield maximum classification accuracy.
Other methods for combining the prediction from multiple models or methods (e.g., from multiple datasets used for learning) are Boosting and Bagging (Voting).
Text Mining
While Data Mining is typically concerned with the detection of patterns in numeric data, very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous, and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc. and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (e.g., to determine cooccurrences of concepts, key phrases, names, addresses, product names, etc.).
Voting
See Bagging.
Data Warehousing
StatSoft defines data warehousing as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes.
The most efficient data warehousing architecture will be capable of incorporating or at least referencing all data available in the relevant enterprisewide information management systems, using designated technology suitable for corporate data base management (e.g., Oracle, Sybase, MS SQL Server. Also, a flexible, highperformance (see the IDP technology), open architecture approach to data warehousing – that flexibly integrates with the existing corporate systems and allows the users to organize and efficiently reference for analytic purposes enterprise repositories of data of practically any complexity – is offered in StatSoft enterprise systems such as STATISTICA Enterprise and STATISTICA Enterprise/QC , which can also work in conjunction with STATISTICA Data Miner and STATISTICA Enterprise Server.
OnLine Analytic Processing (OLAP)
The term OnLine Analytic Processing – OLAP (or Fast Analysis of Shared Multidimensional Information – FASMI) refers to technology that allows users of multidimensional databases to generate online descriptive or comparative summaries (“views”) of data and other analytic queries. Note that despite its name, analyses referred to as OLAP do not need to be performed truly “online” (or in realtime); the term applies to analyses of multidimensional databases (that may, obviously, contain dynamically updated information) through efficient “multidimensional” queries that reference various types of data. OLAP facilities can be integrated into corporate (enterprisewide) database systems and they allow analysts and managers to monitor the performance of the business (e.g., such as various aspects of the manufacturing process or numbers and types of completed transactions at different locations) or the market. The final result of OLAP techniques can be very simple (e.g., frequency tables, descriptive statistics, simple crosstabulations) or more complex (e.g., they may involve seasonal adjustments, removal of outliers, and other forms of cleaning the data). Although Data Mining techniques can operate on any kind of unprocessed or even unstructured information, they can also be applied to the data views and summaries generated by OLAP to provide more indepth and often more multidimensional knowledge. In this sense, Data Mining techniques could be considered to represent either a different analytic approach (serving different purposes than OLAP) or as an analytic extension of OLAP.
Exploratory Data Analysis (EDA)
EDA vs. Hypothesis Testing
As opposed to traditional hypothesis testing designed to verify a priori hypotheses about relations between variables (e.g., “There is a positive correlation between the AGE of a person and his/her RISK TAKING disposition”), exploratory data analysis (EDA) is used to identify systematic relations between variables when there are no (or not complete) a priori expectations as to the nature of those relations. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques in the search for systematic patterns.
Computational EDA techniques
Computational exploratory data analysis methods include both simple basic statistics and more advanced, designated multivariate exploratory techniques designed to identify patterns in multivariate data sets.
Basic statistical exploratory methods. The basic statistical exploratory methods include such techniques as examining distributions of variables (e.g., to identify highly skewed or nonnormal, such as bimodal patterns), reviewing large correlation matrices for coefficients that meet certain thresholds (see example above), or examining multiway frequency tables (e.g., “slice by slice” systematically reviewing combinations of levels of control variables).
Multivariate exploratory techniques. Multivariate exploratory techniques designed specifically to identify patterns in multivariate (or univariate, such as sequences of measurements) data sets include: Cluster Analysis, Factor Analysis, Discriminant Function Analysis, Multidimensional Scaling, Loglinear Analysis, Canonical Correlation, Stepwise Linear and Nonlinear (e.g., Logit) Regression, Correspondence Analysis, Time Series Analysis, and Classification Trees.
Neural Networks. Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of socalled learning from existing data.
For more information, see Neural Networks; see also STATISTICA Neural Networks.
Graphical (Data Visualization) EDA Techniques
A large selection of powerful exploratory data analytic techniques is also offered by graphical data visualization methods that can identify relations, trends, and biases “hidden” in unstructured data sets.
Brushing. Perhaps the most common and historically first widely used technique explicitly identified as graphical exploratory data analysis is brushing, an interactive method allowing us to select onscreen specific data points or subsets of data and identify their (e.g., common) characteristics, or to examine their effects on relations between relevant variables. Those relations between variables can be visualized by fitted functions (e.g., 2D lines or 3D surfaces) and their confidence intervals, thus, for example, we can examine changes in those functions by interactively (temporarily) removing or adding specific subsets of data. For example, one of many applications of the brushing technique is to select (i.e., highlight) in a matrix scatterplot all data points that belong to a certain category (e.g., a “medium” income level, see the highlighted subset in the fourth component graph of the first row in the illustration left) in order to examine how those specific observations contribute to relations between other variables in the same data set (e.g, the correlation between the “debt” and “assets” in the current example). If the brushing facility supports features such as “animated brushing” or “automatic function refitting,” we can define a dynamic brush that would move over the consecutive ranges of a criterion variable (e.g., “income” measured on a continuous scale or a discrete [3level] scale as on the illustration above) and examine the dynamics of the contribution of the criterion variable to the relations between other relevant variables in the same data set.
Other graphical EDA techniques. Other graphical exploratory analytic techniques include function fitting and plotting, data smoothing, overlaying and merging of multiple displays, categorizing data, splitting/merging subsets of data in graphs, aggregating data in graphs, identifying and marking subsets of data that meet specific conditions, icon plots,
shading, plotting confidence intervals and confidence areas (e.g., ellipses),
generating tessellations, spectral planes,
integrated layered compressions,
and projected contours, data image reduction techniques, interactive (and continuous) rotation
with animated stratification (crosssections) of 3D displays, and selective highlighting of specific series and blocks of data.
Verification of Results of EDA
The exploration of data can only serve as the first stage of data analysis and its results can be treated as tentative at best as long as they are not confirmed, e.g., crossvalidated, using a different data set (or an independent subset). If the result of the exploratory stage suggests a particular model, then its validity can be verified by applying it to a new data set and testing its fit (e.g., testing its predictive validity). Case selection conditions can be used to quickly define subsets of data (e.g., for estimation and verification), and for testing the robustness of results.
Neural Networks
(see also, the Neural Networks topic)
Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of socalled learning from existing data. Neural Networks is one of the Data Mining techniques.
The first step is to design a specific network architecture (that includes a specific number of “layers” each consisting of a certain number of “neurons”). The size and structure of the network needs to match the nature (e.g., the formal complexity) of the investigated phenomenon. Because the latter is obviously not known very well at this early stage, this task is not easy and often involves multiple “trials and errors.” (Now, there is, however, neural network software that applies artificial intelligence techniques to aid in that tedious task and finds “the best” network architecture.)
The new network is then subjected to the process of “training.” In that phase, neurons apply an iterative process to the number of inputs (variables) to adjust the weights of the network in order to optimally predict (in traditional terms, we could say find a “fit” to) the sample data on which the “training” is performed. After the phase of learning from an existing data set, the new network is ready and it can then be used to generate predictions.
The resulting “network” developed in the process of “learning” represents a pattern detected in the data. Thus, in this approach, the “network” is the functional equivalent of a model of relations between variables in the traditional model building approach. However, unlike in the traditional models, in the “network,” those relations cannot be articulated in the usual terms used in statistics or methodology to describe relations between variables (such as, for example, “A is positively correlated with B but only for observations where the value of C is low and D is high”). Some neural networks can produce highly accurate predictions; they represent, however, a typical atheoretical (one can say, “a black box”) research approach. That approach is concerned only with practical considerations, that is, with the predictive validity of the solution and its applied relevance and not with the nature of the underlying mechanism or its relevance for any “theory” of the underlying phenomena.
However, it should be mentioned that Neural Network techniques can also be used as a component of analyses designed to build explanatory models because Neural Networks can help explore data sets in search for relevant variables or groups of variables; the results of such explorations can then facilitate the process of model building. Moreover, now there is neural network software that uses sophisticated algorithms to search for the most relevant input variables, thus potentially contributing directly to the model building process.
One of the major advantages of neural networks is that, theoretically, they are capable of approximating any continuous function, and thus the researcher does not need to have any hypotheses about the underlying model, or even to some extent, which variables matter. An important disadvantage, however, is that the final solution depends on the initial conditions of the network, and, as stated before, it is virtually impossible to “interpret” the solution in traditional, analytic terms, such as those used to build theories that explain phenomena.
Some authors stress the fact that neural networks use, or we should say are expected to use, massively parallel computation models. For example Haykin (1994) defines neural network as:
“a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: (1) Knowledge is acquired by the network through a learning process, and (2) Interneuron connection strengths known as synaptic weights are used to store the knowledge.” (p. 2).
However, as Ripley (1996) points out, the vast majority of contemporary neural network applications run on singleprocessor computers and he argues that a large speedup can be achieved not only by developing software that will take advantage of multiprocessor hardware by also by designing better (more efficient) learning algorithms.
Neural networks is one of the methods used in Data Mining; see also Exploratory Data Analysis. For more information on neural networks, see Haykin (1994), Masters (1995), Ripley (1996), and Welstead (1994). For a discussion of neural networks as statistical tools, see Warner and Misra (1996). See also, STATISTICA Neural Networks.
Correspondence Analysis
How To Analyze Simple TwoWay and MultiWay Table, Correspondence Analysis
General Purpose
Correspondence analysis is a descriptive/exploratory technique designed to analyze simple twoway and multiway tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by Factor Analysis techniques, and they allow you to explore the structure of categorical variables included in the table. The most common kind of table of this type is the twoway frequency crosstabulation table (see, for example, Basic Statistics or LogLinear).
In a typical correspondence analysis, a crosstabulation table of frequencies is first standardized, so that the relative frequencies across all cells sum to 1.0. One way to state the goal of a typical analysis is to represent the entries in the table of relative frequencies in terms of the distances between individual rows and/or columns in a lowdimensional space. This is best illustrated by a simple example, which will be described below. There are several parallels in interpretation between correspondence analysis and Factor Analysis, and some similar concepts will also be pointed out below.
For a comprehensive description of this method, computational details, and its applications (in the English language), refer to the classic text by Greenacre (1984). These methods were originally developed primarily in France by JeanPaul Benzérci in the early 1960’s and 1970’s (e.g., see Benzérci, 1973; see also Lebart, Morineau, and Tabard, 1977), but have only more recently gained increasing popularity in Englishspeaking countries (see, for example, Carrol, Green, and Schaffer, 1986; Hoffman and Franke, 1986). (Note that similar techniques were developed independently in several countries, where they were known as optimal scaling, reciprocal averaging, optimal scoring, quantification method, or homogeneity analysis). In the following paragraphs, a general introduction to correspondence analysis will be presented.
Overview. Suppose you collected data on the smoking habits of different employees in a company. The following data set is presented in Greenacre (1984, p. 55).
Smoking Category  

Staff Group 
(1) None 
(2) Light 
(3) Medium 
(4) Heavy 
Row Totals 
(1) Senior Managers (2) Junior Managers (3) Senior Employees (4) Junior Employees (5) Secretaries 
4 4 25 18 10 
2 3 10 24 6 
3 7 12 33 7 
2 4 4 13 2 
11 18 51 88 25 
Column Totals  61  45  62  25  193 
You can think of the 4 column values in each row of the table as coordinates in a 4dimensional space, and you could compute the (Euclidean) distances between the 5 row points in the 4 dimensional space. The distances between the points in the 4dimensional space summarize all information about the similarities between the rows in the table above. Now suppose you could find a lowerdimensional space, in which to position the row points in a manner that retains all, or almost all, of the information about the differences between the rows. You could then present all information about the similarities between the rows (types of employees in this case) in a simple 1, 2, or 3dimensional graph. While this may not appear to be particularly useful for small tables such as the one shown above, you can easily imagine how the presentation and interpretation of very large tables (e.g., differential preference for 10 consumer items among 100 groups of respondents in a consumer survey) could greatly benefit from the simplification that can be achieved via correspondence analysis (e.g., represent the 10 consumer items in a two dimensional space).
Mass. To continue with the simpler example of the twoway table presented above, computationally, the program will first compute the relative frequencies for the frequency table, so that the sum of all table entries is equal to 1.0 (each element will be divided by the total, i.e., 193). You could say that this table now shows how one unit of mass is distributed across the cells. In the terminology of correspondence analysis, the row and column totals of the matrix of relative frequencies are called the row mass and column mass, respectively.
Inertia. The term inertia in correspondence analysis is used by analogy with the definition in applied mathematics of “moment of inertia,” which stands for the integral of mass times the squared distance to the centroid (e.g., Greenacre, 1984, p. 35). Inertia is defined as the total Pearson Chisquare for the twoway divided by the total sum (193 in the present example)..
Inertia and row and column profiles. If the rows and columns in a table are completely independent of each other, the entries in the table (distribution of mass) can be reproduced from the row and column totals alone, or row and column profiles in the terminology of correspondence analysis. According to the wellknown formula for computing the Chisquare statistic for twoway tables, the expected frequencies in a table, where the column and rows are independent of each other, are equal to the respective column total times the row total, divided by the grand total. Any deviations from the expected values (expected under the hypothesis of complete independence of the row and column variables) will contribute to the overall Chisquare. Thus, another way of looking at correspondence analysis is to consider it a method for decomposing the overall Chisquare statistic (or Inertia=Chi square/Total N) by identifying a small number of dimensions in which the deviations from the expected values can be represented. This is similar to the goal of Factor Analysis, where the total variance is decomposed, so as to arrive at a lowerdimensional representation of the variables that allows you to reconstruct most of the variance/covariance matrix of variables.
Analyzing rows and columns. This simple example began with a discussion of the rowpoints in the table shown above. However, you may rather be interested in the column totals, in which case you could plot the column points in a smalldimensional space, which satisfactorily reproduces the similarity (and distances) between the relative frequencies for the columns, across the rows, in the table shown above. In fact it is customary to simultaneously plot the column points and the row points in a single graph, to summarize the information contained in a twoway table.
Reviewing results. Let’s now look at some of the results for the table shown above. First, shown below are the socalled singular values , eigenvalues, percentages of inertia explained, cumulative percentages, and the contribution to the overall Chi square.
Eigenvalues and Inertia for all Dimensions Input Table (Rows x Columns): 5 x 4 Total Inertia = .08519 Chi² = 16.442 


No. of Dims 
Singular Values 
Eigen Values 
Perc. of Inertia 
Cumulatv Percent 
Chi Squares 
1 2 3 
.273421 .100086 .020337 
.074759 .010017 .000414 
87.75587 11.75865 .48547 
87.7559 99.5145 100.0000 
14.42851 1.93332 .07982 
Note that the dimensions are “extracted” so as to maximize the distances between the row or column points, and successive dimensions (which are independent of or orthogonal to each other) will “explain” less and less of the overall Chisquare value (and, thus, inertia). Thus, the extraction of the dimensions is similar to the extraction of principal components in Factor Analysis.
First, it appears that, with a single dimension, 87.76% of the inertia can be “explained,” that is, the relative frequency values that can be reconstructed from a single dimension can reproduce 87.76% of the total Chisquare value (and, thus, of the inertia) for this twoway table; two dimensions allow you to explain 99.51%.
Maximum number of dimensions. Since the sums of the frequencies across the columns must be equal to the row totals, and the sums across the rows equal to the column totals, there are in a sense only (no. of columns1) independent entries in each row, and (no. of rows1) independent entries in each column of the table (once you know what these entries are, you can fill in the rest based on your knowledge of the column and row marginal totals). Thus, the maximum number of eigenvalues that can be extracted from a two way table is equal to the minimum of the number of columns minus 1, and the number of rows minus 1. If you choose to extract (i.e., interpret) the maximum number of dimensions that can be extracted, then you can reproduce exactly all information contained in the table.
Row and column coordinates. Next look at the coordinates for the twodimensional solution.
Row Name  Dim. 1  Dim. 2 

(1) Senior Managers (2) Junior Managers (3) Senior Employees (4) Junior Employees (5) Secretaries 
.065768 .258958 .380595 .232952 .201089 
.193737 .243305 .010660 .057744 .078911 
Of course, you can plot these coordinates in a twodimensional scatterplot. Remember that the purpose of correspondence analysis is to reproduce the distances between the row and/or column points in a twoway table in a lowerdimensional display; note that, as in Factor Analysis, the actual rotational orientation of the axes is arbitrarily chosen so that successive dimensions “explain” less and less of the overall Chisquare value (or inertia). You could, for example, reverse the signs in each column in the table shown above, thereby effectively rotating the respective axis in the plot by 180 degrees.
What is important are the distances of the points in the twodimensional display, which are informative in that row points that are close to each other are similar with regard to the pattern of relative frequencies across the columns. If you have produced this plot you will see that, along the most important first axis in the plot, the Senior employees and Secretaries are relatively close together on the left side of the origin (scale position 0). If you looked at the table of relative row frequencies (i.e., frequencies standardized, so that their sum in each row is equal to 100%), you will see that these two groups of employees indeed show very similar patterns of relative frequencies across the categories of smoking intensity.
Percentages of Row Totals  

Smoking Category  
Staff Group 
(1) None 
(2) Light 
(3) Medium 
(4) Heavy 
Row Totals 
(1) Senior Managers (2) Junior Managers (3) Senior Employees (4) Junior Employees (5) Secretaries 
36.36 22.22 49.02 20.45 40.00 
18.18 16.67 19.61 27.27 24.00 
27.27 38.89 23.53 37.50 28.00 
18.18 22.22 7.84 14.77 8.00 
100.00 100.00 100.00 100.00 100.00 
Obviously the final goal of correspondence analysis is to find theoretical interpretations (i.e., meaning) for the extracted dimensions. One method that may aid in interpreting extracted dimensions is to plot the column points. Shown below are the column coordinates for the first and second dimension.
Smoking category 
Dim. 1 
Dim. 2 

None Light Medium Heavy 
.393308 .099456 .196321 .293776 
.030492 .141064 .007359 .197766 
It appears that the first dimension distinguishes mostly between the different degrees of smoking, and in particular between category None and the others. Thus, you can interpret the greater similarity of Senior Managers with Secretaries, with regard to their position on the first axis, as mostly deriving from the relatively large numbers of Nonesmokers in these two groups of employees.
Compatibility of row and column coordinates. It is customary to summarize the row and column coordinates in a single plot. However, it is important to remember that in such plots, you can only interpret the distances between row points, and the distances between column points, but not the distances between row points and column points.
To continue with this example, it would not be appropriate to say that the category None is similar to Senior Employees (the two points are very close in the simultaneous plot of row and column coordinates). However, as was indicated earlier, it is appropriate to make general statements about the nature of the dimensions, based on which side of the origin particular points fall. For example, because category None is the only column point on the left side of the origin for the first axis, and since employee group Senior Employees also falls onto that side of the first axis, you may conclude that the first axis separates None smokers from the other categories of smokers, and that Senior Employees are different from, for example, Junior Employees, in that there are relatively more nonsmoking Senior Employees.
Scaling of the coordinates (standardization options). Another important decision that the analyst must make concerns the scaling of the coordinates. The nature of the choice pertains to whether or not you want to analyze the relative row percentages, column percentages, or both. In the context of the example described above, the row percentages were shown to illustrate how the patterns of those percentages across the columns are similar for points which appear more closely together in the graphical display of the row coordinates. Put another way, the coordinates are based on the analysis of the row profile matrix, where the sum of the table entries in a row, across all columns, is equal to 1.0 (each entry r_{ij} in the row profile matrix can be interpreted as the conditional probability that a case belongs to column j, given its membership in row i). Thus, the coordinates are computed so as to maximize the differences between the points with respect to the row profiles (row percentages). The row coordinates are computed from the row profile matrix, the column coordinates are computed from the column profile matrix.
A fourth option, Canonical standardization (see Gifi, 1981), is also provided, and it amounts to a standardization of the columns and rows of the matrix of relative frequencies. This standardization amounts to a rescaling of the coordinates based on the row profile standardization and the column profile standardization, and this type of standardization is not widely used. Note also that a variety of other custom standardizations can be easily performed if you have the raw eigenvalues and eigenvector matrices.
Metric of coordinate system. In several places in this introduction, the term distance was (loosely) used to refer to the differences between the pattern of relative frequencies for the rows across the columns, and columns across the rows, which are to be reproduced in a lowerdimensional solution as a result of the correspondence analysis. Actually, these distances represented by the coordinates in the respective space are not simple Euclidean distances computed from the relative row or column frequencies, but rather, they are weighted distances. Specifically, the weighting that is applied is such that the metric in the lower dimensional space is a Chisquare metric, provided that (1) you are comparing row points, and chose either rowprofile standardization or both row and columnprofile standardization, or (2) you are comparing column points, and chose either columnprofile standardization or both row and columnprofile standardization.
In that case (but not if you chose the canonical standardization), the squared Euclidean distance between, for example, two row points i and i’ in the respective coordinate system of a given number of dimensions actually approximates a weighted (i.e., Chisquare) distance between the relative frequencies (see Hoffman and Franke, 1986, formula 21):
d_{ii ‘}^{2} = _{j} (1/c_{j} (p_{ij} /r_{i} – p^{2}_{i ‘ j} /r_{i ‘}))
In this formula, d_{ii ‘} stands for the squared distance between the two points, c_{j} stands for the column total for the j‘th column of the standardized frequency table (where the sum of all entries or mass is equal to 1.0), p_{ij} stands for the individual cell entries in the standardized frequency table (row i, column j), r_{i} stands for the row total for the i‘th column of the relative frequency table, and the summation is over the columns of the table. To reiterate, only the distances between row points, and correspondingly, between column points are interpretable in this manner; the distances between row points and column points cannot be interpreted.
Judging the quality of a solution. A number of auxiliary statistics are reported, to aid in the evaluation of the quality of the respective chosen numbers of dimensions. The general concern here is that all (or at least most) points are properly represented by the respective solution, that is, that their distances to other points can be approximated to a satisfactory degree. Shown below are all statistics reported for the row coordinates for the example table discussed so far, based on a onedimensional solution only (i.e., only one dimension is used to reconstruct the patterns of relative frequencies across the columns).
Row Coordinates and Contributions to Inertia  

Staff Group 
Coordin. Dim.1 
Mass 
Quality 
Relative Inertia 
Inertia Dim.1 
Cosine² Dim.1 
(1) Senior Managers (2) Junior Managers (3) Senior Employees (4) Junior Employees (5) Secretaries 
.065768 .258958 .380595 .232952 .201089 
.056995 .093264 .264249 .455959 .129534 
.092232 .526400 .999033 .941934 .865346 
.031376 .139467 .449750 .308354 .071053 
.003298 .083659 .512006 .330974 .070064 
.092232 .526400 .999033 .941934 .865346 
Coordinates. The first numeric column shown in the table above contains the coordinates, as discussed in the previous paragraphs. To reiterate, the specific interpretation of these coordinates depends on the standardization chosen for the solution (see above). The number of dimensions is chosen by the user (in this case we chose only one dimension), and coordinate values will be shown for each dimension (i.e., there will be one column with coordinate values for each dimension).
Mass. The Mass column contains the row totals (since these are the row coordinates) for the table of relative frequencies (i.e., for the table where each entry is the respective mass, as discussed earlier in this section). Remember that the coordinates are computed based on the matrix of conditional probabilities shown in the Mass column.
Quality. The Quality column contains information concerning the quality of representation of the respective row point in the coordinate system defined by the respective numbers of dimensions, as chosen by the user. In the table shown above, only one dimension was chosen, and the numbers in the Quality column pertain to the quality of representation in the onedimensional space. To reiterate, computationally, the goal of the correspondence analysis is to reproduce the distances between points in a lowdimensional space. If you extracted (i.e., interpreted) the maximum number of dimensions (which is equal to the minimum of the number of rows and the number of columns, minus 1), you could reconstruct all distances exactly. The Quality of a point is defined as the ratio of the squared distance of the point from the origin in the chosen number of dimensions, over the squared distance from the origin in the space defined by the maximum number of dimensions (remember that the metric here is Chisquare, as described earlier). By analogy to Factor Analysis, the quality of a point is similar in its interpretation to the communality for a variable in factor analysis. .
Note that the Quality measure reported is independent of the chosen method of standardization, and always pertains to the default standardization (i.e., the distance metric is Chisquare, and the quality measure can be interpreted as the “proportion of Chi square accounted for” for the respective row, given the respective number of dimensions). A low quality means that the current number of dimensions does not well represent the respective row (or column). In the table shown above, the quality for the first row (Senior Managers) is less than .1, indicating that this row point is not well represented by the one dimensional representation of the points.
Relative inertia. The Quality of a point (see above) represents the proportion of the contribution of that point to the overall inertia (Chisquare) that can be accounted for by the chosen number of dimensions. However, it does not indicate whether or not, and to what extent, the respective point does in fact contribute to the overall inertia (Chi square value). The relative inertia represents the proportion of the total inertia accounted for by the respective point, and it is independent of the number of dimensions chosen by the user. Note that a particular solution may represent a point very well (high Quality), but the same point may not contribute much to the overall inertia (e.g., a row point with a pattern of relative frequencies across the columns that is similar to the average pattern across all rows).
Relative inertia for each dimension. This column contains the relative contribution of the respective (row) point to the inertia “accounted for” by the respective dimension. Thus, this value will be reported for each (row or column) point, for each dimension.
Cosine^{²} (quality or squared correlations with each dimension). This column contains the quality for each point, by dimension. The sum of the values in these columns across the dimensions is equal to the total Quality value discussed above (since in the example table above, only one dimension was chose, the values in this column are identical to the values in the overall Quality column). This value may also be interpreted as the “correlation” of the respective point with the respective dimension. The term Cosine^{²} refers to the fact that this value is also the squared cosine value of the angle the point makes with the respective dimension (refer to Greenacre, 1984, for details concerning the geometric aspects of correspondence analysis).
A note about “statistical significance.” It should be noted at this point that correspondence analysis is an exploratory technique. Actually, the method was developed based on a philosophical orientation that emphasizes the development of models that fit the data, rather than the rejection of hypotheses based on the lack of fit (Benzecri’s “second principle” states that “The model must fit the data, not vice versa;” see Greenacre, 1984, p. 10). Therefore, there are no statistical significance tests that are customarily applied to the results of a correspondence analysis; the primary purpose of the technique is to produce a simplified (low dimensional) representation of the information in a large frequency table (or tables with similar measures of correspondence).
Supplementary Points
The introductory section provides an overview of how to interpret the coordinates and related statistics computed in a correspondence analysis. An important aid in the interpretation of the results from a correspondence analysis is to include supplementary row or column points, that were not used to perform the original analyses. For example, consider the following results which are based on the example given in the introductory (based on Greenacre, 1984).
Row Name  Dim. 1  Dim. 2 

(1) Senior Managers (2) Junior Managers (3) Senior Employees (4) Junior Employees (5) Secretaries 
.065768 .258958 .380595 .232952 .201089 
.193737 .243305 .010660 .057744 .078911 
National Average  .258368  .117648 
The table above shows the coordinate values (for two dimensions) computed for a frequency table of different types of employees by type of smoking habit. The row labeled National Average contains the coordinate values for the supplementary point, which is the national average (percentages) for the different smoking categories (which make up the columns of the table; those fictitious percentages reported in Greenacre (1984) are: Nonsmokers: 42%, light smokers: 29%, medium smokers, 20%; heavy smokers: 9%). If you plotted these coordinates in a twodimensional scatterplot, along with the column coordinates, it would be apparent that the National Average supplementary row point is plotted close to the point representing the Secretaries group, and on the same side of the horizontal axis (first dimension) as the Nonsmokers column point. If you refer back to the original twoway table shown in the introductory section, this finding is consistent with the entries in the table of row frequencies, that is, there are relatively more nonsmokers among the Secretaries, and in the National Average. Put another way, the sample represented in the original frequency table contains more smokers than the national average.
While this type of information could have been easily gleaned from the original frequency table (that was used as the input to the analysis), in the case of very large tables, such conclusions may not be as obvious.
Quality. Another interesting result for supplementary points concerns the quality of their representation in the chosen number of dimensions (see the introductory section for a more detailed discussion of the concept of quality). To reiterate, the goal of the correspondence analysis is to reproduce the distances between the row or column coordinates (patterns of relative frequencies across the columns or rows, respectively) in a lowdimensional solution. Given such a solution, you may ask whether particular supplementary points of interest can be represented equally well in the final space, that is, whether or not their distances from the other points in the table can also be represented in the chosen numbers of dimensions. Shown below are the summary statistics for the original points, and the supplementary row point National Average, for the twodimensional solution of representation of representation of supplementary points.
Staff Group 
Quality 
Cosine² Dim.1 
Cosine² Dim.2 

(1) Senior Managers (2) Junior Managers (3) Senior Employees (4) Junior Employees (5) Secretaries 
.892568 .991082 .999817 .999810 .998603 
.092232 .526400 .999033 .941934 .865346 
.800336 .464682 .000784 .057876 .133257 
National Average  .761324  .630578  .130746 
The statistics reported in the table above are discussed in the introductory section. In short, the Quality of a row or column point is defined as the ratio of the squared distance of the point from the origin in the chosen number of dimensions, over the squared distance from the origin in the space defined by the maximum number of dimensions (remember that the metric here is Chisquare, as described in the introductory section). In a sense, the overall quality is the “proportion of squared distancefromtheoverallcentroid accounted for.” The supplementary row point National Average has a quality of .76, indicating that it is reasonably well represented in the twodimensional solution. The Cosine^{²} statistic is the quality “accounted for” by the respective row point, by the respective dimension (the sum of the Cosine^{²} values over the respective number of dimensions is equal to the total Quality, see also the introductory section).
Multiple Correspondence Analysis (MCA)
Multiple correspondence analysis (MCA) may be considered to be an extension of simple correspondence analysis to more than two variables. For an introductory overview of simple correspondence analysis, refer to the introductory section . Multiple correspondence analysis is a simple correspondence analysis carried out on an indicator (or design) matrix with cases as rows and categories of variables as columns. Actually, one usually analyzes the inner product of such a matrix, called the Burt Table in an MCA; this will be discussed later. However, to clarify the interpretation of the results from a multiple correspondence analysis, it is easier to discuss the simple correspondence analysis of an indicator or design matrix.
Indicator or design matrix. Consider again the simple twoway table presented in the introductory section:
Smoking Category  

Staff Group 
(1) None 
(2) Light 
(3) Medium 
(4) Heavy 
Row Totals 
(1) Senior Managers (2) Junior Managers (3) Senior Employees (4) Junior Employees (5) Secretaries 
4 4 25 18 10 
2 3 10 24 6 
3 7 12 33 7 
2 4 4 13 2 
11 18 51 88 25 
Column Totals  61  45  62  25  193 
Suppose you had entered the data for this table in the following manner, as an indicator or designmatrix:
Staff Group  Smoking  

Case Number 
Senior Manager 
Junior Manager 
Senior Employee 
Junior Employee 
Secretary 
None 
Light 
Medium 
Heavy 
1 2 3 4 5 … … … 191 192 193 
1 1 1 1 1 . . . 0 0 0 
0 0 0 0 0 . . . 0 0 0 
0 0 0 0 0 . . . 0 0 0 
0 0 0 0 0 . . . 0 0 0 
0 0 0 0 0 . . . 1 1 1 
1 1 1 1 0 . . . 0 0 0 
0 0 0 0 1 . . . 0 0 0 
0 0 0 0 0 . . . 1 0 0 
0 0 0 0 0 . . . 0 1 1 
Each one of the 193 total cases in the table is represented by one case in this data file. For each case a 1 is entered into the category where the respective case “belongs,” and a 0 otherwise. For example, case 1 represents a Senior Manager who is a None smoker. As can be seen in the table above, there are a total of 4 such cases in the twoway table, and thus there will be four cases like this in the indicator matrix. In all, there will be 193cases in the indicator or design matrix.
Analyzing the design matrix. If you now analyzed this data file (design or indicator matrix) shown above as if it were a twoway frequency table, the results of the correspondence analysis would provide column coordinates that would allow you to relate the different categories to each other, based on the distances between the row points, i.e., between the individual cases. In fact, the twodimensional display you would obtain for the column coordinates would look very similar to the combined display for row and column coordinates, if you had performed the simple correspondence analysis on the twoway frequency table (note that the metric will be different, but the relative positions of the points will be very similar).
More than two variables. The approach to analyzing categorical data outlined above can easily be extended to more than two categorical variables. For example, the indicator or design matrix could contain two additional variables Male and Female, again coded 0 and 1, to indicate the subjects’ gender; and three variables could be added to indicate to which one of three age groups a case belongs. Thus, in the final display, you could represent the relationships (similarities) between Gender, Age, Smoking habits, and Occupation (Staff Groups).
Fuzzy coding. It is not necessary that each case is assigned exclusively to only one category of each categorical variable. Rather than the 0or1 coding scheme, you could enter probabilities for membership in a category, or some other measure that represents a fuzzy rule for group membership. Greenacre (1984) discusses different types of coding schemes of this kind. For example, suppose in the example design matrix shown earlier, you had missing data for a few cases regarding their smoking habits. Instead of discarding those cases entirely from the analysis (or creating a new category Missing data), you could assign to the different smoking categories proportions (which should add to 1.0) to represent the probabilities that the respective case belongs to the respective category (e.g., you could enter proportions based on your knowledge about estimates for the national averages for the different categories).
Interpretation of coordinates and other results. To reiterate, the results of a multiple correspondence analysis are identical to the results you would obtain for the column coordinates from a simple correspondence analysis of the design or indicator matrix. Therefore, the interpretation of coordinate values, quality values, cosine^{²}‘s and other statistics reported as the results from a multiple correspondence analysis can be interpreted in the same manner as described in the context of the simple correspondence analysis (see introductory section), however, these statistics pertain to the total inertia associated with the entire design matrix.
Supplementary column points and “multiple regression” for categorical variables. Another application of the analysis of design matrices via correspondence analysis techniques is that it allows you to perform the equivalent of a Multiple Regression for categorical variables, by adding supplementary columns to the design matrix. For example, suppose you added to the design matrix shown earlier two columns to indicate whether or not the respective subject had or had not been ill over the past year (i.e., you could add one column Ill and another column Not ill, and again enter 0‘s and 1‘s to indicate each subject’s health status). If, in a simple correspondence analysis of the design matrix, you added those columns as supplementary columns to the analysis, then (1) the summary statistics for the quality of representation (see the introductory section) for those columns would give you an indication of how well you can “explain” illness as a function of the other variables in the design matrix, and (2) the display of the column points in the final coordinate system would provide an indication of the nature (e.g., direction) of the relationships between the columns in the design matrix and the column points indicating illness; this technique (adding supplementary points to an MCA analysis) is also sometimes called predictive mapping.
The Burt table. The actual computations in multiple correspondence analysis are not performed on a design or indicator matrix (which, potentially, may be very large if there are many cases), but on the inner product of this matrix; this matrix is also called the Burt matrix. With frequency tables, this amounts to tabulating the stacked categories against each other; for example the Burt for the twoway frequency table presented earlier would look like this.
Employee  Smoking  

(1)  (2)  (3)  (4)  (5)  (1)  (2)  (3)  (4)  
(1) Senior Managers (2) Junior Managers (3) Senior Employees (4) Junior Employees (5) Secretaries (1) Smoking:None (2) Smoking:Light (3) Smoking:Medium (4) Smoking:Heavy 
11 0 0 0 0 4 2 3 2 
0 18 0 0 0 4 3 7 4 
0 0 51 0 0 25 10 12 4 
0 0 0 88 0 18 24 33 13 
0 0 0 0 25 10 6 7 2 
4 4 25 18 10 61 0 0 0 
2 3 10 24 6 0 45 0 0 
3 7 12 33 7 0 0 62 0 
2 4 4 13 2 0 0 0 25 
The Burt has a clearly defined structure. In the case of two categorical variables (shown above), it consists of 4 partitions: (1) the crosstabulation of variable Employee against itself, (2) the crosstabulation of variable Employee against variable Smoking, (3), the crosstabulation of variable Smoking against variable Employee, and (4) the crosstabulation of variable Smoking against itself. Note that the matrix is symmetrical, and that the sum of the diagonal elements in each partition representing the crosstabulation of a variable against itself must be the same (e.g., there were a total of 193 observations in the present example, and hence, the diagonal elements in the crosstabulation tables of variable Employee against itself, and Smokingagainst itself must also be equal to 193).
Note that the offdiagonal elements in the partitions representing the crosstabulations of a variable against itself are equal to 0 in the table shown above. However, this is not necessarily always the case, for example, when the Burt was derived from a design or indicator matrix that included fuzzy coding of category membership (see above).
Burt Tables
The Burt table is the result of the inner product of a design or indicator matrix, and the multiple correspondence analysis results are identical to the results one would obtain for the column points from a simple correspondence analysis of the indicator or design matrix (see also MCA).
For example, suppose you had entered data concerning the Survival for different Age groups in different Locations like this:
SURVIVAL  AGE  LOCATION  

Case No.  NO  YES  LESST50  A50TO69  OVER69  TOKYO  BOSTON  GLAMORGN 
1 2 3 4 … … … 762 763 764 
0 1 0 0 . . . 1 0 0 
1 0 1 1 . . . 0 1 1 
0 1 0 0 . . . 0 1 0 
1 0 1 0 . . . 1 0 1 
0 0 0 1 . . . 0 0 0 
0 1 0 0 . . . 1 0 0 
0 0 1 0 . . . 0 1 0 
1 0 0 1 . . . 0 0 1 
In this data arrangement, for each case a 1 was entered to indicate to which category, of a particular set of categories, a case belongs (e.g., Survival, with the categories No and Yes). For example, case 1 survived (a 0 was entered for variable No, and a 1 was entered for variable Yes), case 1 is between age 50 and 69 (a 1 was entered for variable A50to69), and was observed in Glamorgn). Overall there are 764observations in the data set.
If you denote the data (design or indicator matrix) shown above as matrix X, then matrix product X’X is a Burt table); shown below is an example of a Burt table that one might obtain in this manner.
SURVIVAL  AGE  LOCATION  

NO  YES  <50  5069  69+  TOKYO  BOSTON  GLAMORGN  
SURVIVAL:NO SURVIVAL:YES AGE:UNDER_50 LOCATION:TOKYO 
210 0 68 60 
0 554 212 230 
68 212 280 151 
93 258 0 120 
49 84 0 19 
60 230 151 290 
82 171 58 0 
68 153 71 0 
The Burt table has a clearly defined structure. Overall, the data matrix is symmetrical. In the case of 3 categorical variables (as shown above), the data matrix consists 3 x 3 = 9 partitions, created by each variable being tabulated against itself, and against the categories of all other variables. Note that the sum of the diagonal elements in each diagonal partition (i.e., where the respective variables are tabulated against themselves) is constant (equal to 764 in this case).
The offdiagonal elements in each diagonal partition in this example are all 0. If the cases in the design or indicator matrix are assigned to categories via fuzzy coding (i.e., if probabilities are used to indicate likelihood of membership in a category, rather than 0/1 coding to indicate actual membership), then the offdiagonal elements of the diagonal partitions are not necessarily equal to 0.
Classification and Regression Trees
Introductory Overview – Basic Ideas
Overview
C&RT, a recursive partitioning method, builds classification and regression trees for predicting continuous dependent variables (regression) and categorical predictor variables (classification). The classic C&RT algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996). A general introduction to treeclassifiers, specifically to the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm, is also presented in the context of the Classification Trees Analysis facilities, and much of the following discussion presents the same information, in only a slightly different context. Another, similar type of tree building algorithm is CHAID (Chisquare Automatic Interaction Detector; see Kass, 1980).
Classification and Regression Problems
There are numerous algorithms for predicting continuous variables or categorical variables from a set of continuous predictors and/or categorical factor effects. For example, in GLM (General Linear Models) and GRM (General Regression Models), we can specify a linear combination (design) of continuous predictors and categorical factor effects (e.g., with twoway and threeway interaction effects) to predict a continuous dependent variable. In GDA (General Discriminant Function Analysis), we can specify such designs for predicting categorical variables, i.e., to solve classification problems.
Regressiontype problems. Regressiontype problems are generally those where we attempt to predict the values of a continuous variable from one or more continuous and/or categorical predictor variables. For example, we may want to predict the selling prices of single family homes (a continuous dependent variable) from various other continuous predictors (e.g., square footage) as well as categorical predictors (e.g., style of home, such as ranch, twostory, etc.; zip code or telephone area code where the property is located, etc.; note that this latter variable would be categorical in nature, even though it would contain numeric values or codes). If we used simple multiple regression, or some general linear model (GLM) to predict the selling prices of single family homes, we would determine a linear equation for these variables that can be used to compute predicted selling prices. There are many different analytic procedures for fitting linear models (GLM, GRM, Regression), various types of nonlinear models (e.g., Generalized Linear/Nonlinear Models (GLZ), Generalized Additive Models (GAM), etc.), or completely customdefined nonlinear models (see Nonlinear Estimation), where we can type in an arbitrary equation containing parameters to be estimated. CHAID also analyzes regressiontype problems, and produces results that are similar (in nature) to those computed by C&RT. Note that various neural network architectures are also applicable to solve regressiontype problems.
Classificationtype problems. Classificationtype problems are generally those where we attempt to predict values of a categorical dependent variable (class, group membership, etc.) from one or more continuous and/or categorical predictor variables. For example, we may be interested in predicting who will or will not graduate from college, or who will or will not renew a subscription. These would be examples of simple binary classification problems, where the categorical dependent variable can only assume two distinct and mutually exclusive values. In other cases, we might be interested in predicting which one of multiple different alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure occurs with different types of engines. In those cases there are multiple categories or classes for the categorical dependent variable. There are a number of methods for analyzing classificationtype problems and to compute predicted classifications, either from simple continuous predictors (e.g., binomial or multinomial logit regression in GLZ), from categorical predictors (e.g., LogLinear analysis of multiway frequency tables), or both (e.g., via ANCOVAlike designs in GLZ or GDA). The CHAID also analyzes classificationtype problems, and produces results that are similar (in nature) to those computed by C&RT. Note that various neural network architectures are also applicable to solve classificationtype problems.
Classification and Regression Trees (C&RT)
In most general terms, the purpose of the analyses via treebuilding algorithms is to determine a set of ifthen logical (split) conditions that permit accurate prediction or classification of cases.
Classification Trees
For example, consider the widely referenced Iris data classification problem introduced by Fisher [1936; see also Discriminant Function Analysis and General Discriminant Analysis (GDA)]. The data file Irisdat reports the lengths and widths of sepals and petals of three types of irises (Setosa, Versicol, and Virginic). The purpose of the analysis is to learn how we can discriminate between the three types of flowers, based on the four measures of width and length of petals and sepals. Discriminant function analysis will estimate several linear combinations of predictor variables for computing classification scores (or probabilities) that allow the user to determine the predicted classification for each observation. A classification tree will determine a set of logical ifthen conditions (instead of linear equations) for predicting or classifying cases instead:
The interpretation of this tree is straightforward: If the petal width is less than or equal to 0.8, the respective flower would be classified as Setosa; if the petal width is greater than 0.8 and less than or equal to 1.75, then the respective flower would be classified as Virginic; else, it belongs to class Versicol.
Regression Trees
The general approach to derive predictions from few simple ifthen conditions can be applied to regression problems as well. This example is based on the data file Poverty, which contains 1960 and 1970 Census figures for a random selection of 30 counties. The research question (for that example) was to determine the correlates of poverty, that is, the variables that best predict the percent of families below the poverty line in a county. A reanalysis of those data, using the regression tree analysis [and vfold crossvalidation, yields the following results:
Again, the interpretation of these results is rather straightforward: Counties where the percent of households with a phone is greater than 72% have generally a lower poverty rate. The greatest poverty rate is evident in those counties that show less than (or equal to) 72% of households with a phone, and where the population change (from the 1960 census to the 170 census) is less than 8.3 (minus 8.3). These results are straightforward, easily presented, and intuitively clear as well: There are some affluent counties (where most households have a telephone), and those generally have little poverty. Then there are counties that are generally less affluent, and among those the ones that shrunk most showed the greatest poverty rate. A quick review of the scatterplot of observed vs. predicted values shows how the discrimination between the latter two groups is particularly well “explained” by the tree model.
Advantages of Classification and Regression Trees (C&RT) Methods
As mentioned earlier, there are a large number of methods that an analyst can choose from when analyzing classification or regression problems. Tree classification techniques, when they “work” and produce accurate predictions or predicted classifications based on few logical ifthen conditions, have a number of advantages over many of those alternative techniques.
Simplicity of results. In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations (it is much easier to evaluate just one or two logical conditions, than to compute classification scores for each possible group, or predicted values, based on all predictors and using possibly some complex nonlinear model equations), but can also often yield a much simpler “model” for explaining why observations are classified or predicted in a particular manner (e.g., when analyzing business problems, it is much easier to present a few simple ifthen statements to management, than some elaborate equations).
Tree methods are nonparametric and nonlinear. The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical ifthen conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific nonlinear link function [e.g., see Generalized Linear/Nonlinear Models (GLZ)], or that they are even monotonic in nature. For example, some continuous outcome variable of interest could be positively related to a variable Income if the income is less than some certain amount, but negatively related if it is more than that amount (i.e., the tree could reveal multiple splits based on the same variable Income, revealing such a nonmonotonic relationship between the variables). Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge nor any coherent set of theories or predictions regarding which variables are related and how. In those types of data analyses, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.
General Computation Issues and Unique Solutions of C&RT
The computational details involved in determining the best split conditions to construct a simple yet useful and informative tree are quite complex. Refer to Breiman et al. (1984) for a discussion of their CART® algorithm to learn more about the general theory of and specific computational solutions for constructing classification and regression trees. An excellent general discussion of tree classification and regression methods, and comparisons with other approaches to pattern recognition and neural networks, is provided in Ripley (1996).
Avoiding OverFitting: Pruning, Crossvalidation, and Vfold Crossvalidation
A major issue that arises when applying regression or classification trees to “real” data with much random error noise concerns the decision when to stop splitting. For example, if we had a data set with 10 cases, and performed 9 splits (determined 9 ifthen conditions), we could perfectly predict every single case. In general, if we only split a sufficient number of times, eventually we will be able to “predict” (“reproduce” would be the more appropriate term here) our original data (from which we determined the splits). Of course, it is far from clear whether such complex results (with many splits) will replicate in a sample of new observations; most likely they will not.
This general issue is also discussed in the literature on tree classification and regression methods, as well as neural networks, under the topic of “overlearning” or “overfitting.” If not stopped, the tree algorithm will ultimately “extract” all information from the data, including information that is not and cannot be predicted in the population with the current set of predictors, i.e., random or noise variation. The general approach to addressing this issue is first to stop generating new split nodes when subsequent splits only result in very little overall improvement of the prediction. For example, if we can predict 90% of all cases correctly from 10 splits, and 90.1% of all cases from 11 splits, then it obviously makes little sense to add that 11th split to the tree. There are many such criteria for automatically stopping the splitting (treebuilding) process.
Once the tree building algorithm has stopped, it is always useful to further evaluate the quality of the prediction of the current tree in samples of observations that did not participate in the original computations. These methods are used to “prune back” the tree, i.e., to eventually (and ideally) select a simpler tree than the one obtained when the tree building algorithm stopped, but one that is equally as accurate for predicting or classifying “new” observations.
Crossvalidation. One approach is to apply the tree computed from one set of observations (learning sample) to another completely independent set of observations (testing sample). If most or all of the splits determined by the analysis of the learning sample are essentially based on “random noise,” then the prediction for the testing sample will be very poor. Hence, we can infer that the selected tree is not very good (useful), and not of the “right size.”
Vfold crossvalidation. Continuing further along this line of reasoning (described in the context of crossvalidation above), why not repeat the analysis many times over with different randomly drawn samples from the data, for every tree size starting at the root of the tree, and applying it to the prediction of observations from randomly selected testing samples. Then use (interpret, or accept as our final result) the tree that shows the best average accuracy for crossvalidated predicted classifications or predicted values. In most cases, this tree will not be the one with the most terminal nodes, i.e., the most complex tree. This method for pruning a tree, and for selecting a smaller tree from a sequence of trees, can be very powerful, and is particularly useful for smaller data sets. It is an essential step for generating useful (for prediction) tree models, and because it can be computationally difficult to do, this method is often not found in tree classification or regression software.
Reviewing Large Trees: Unique Analysis Management Tools
Another general issue that arises when applying tree classification or regression methods is that the final trees can become very large. In practice, when the input data are complex and, for example, contain many different categories for classification problems and many possible predictors for performing the classification, then the resulting trees can become very large. This is not so much a computational problem as it is a problem of presenting the trees in a manner that is easily accessible to the data analyst, or for presentation to the “consumers” of the research.
Analyzing ANCOVAlike Designs
The classic (Breiman et. al., 1984) classification and regression trees algorithms can accommodate both continuous and categorical predictor. However, in practice, it is not uncommon to combine such variables into analysis of variance/covariance (ANCOVA) like predictor designs with main effects or interaction effects for categorical and continuous predictors. This method of analyzing coded ANCOVAlike designs is relatively new and. However, it is easy to see how the use of coded predictor designs expands these powerful classification and regression techniques to the analysis of data from experimental designs (e.g., see for example the detailed discussion of experimental design methods for quality improvement in the context of the Experimental Design module of Industrial Statistics).
Computational Details
The process of computing classification and regression trees can be characterized as involving four basic steps:
 Specifying the criteria for predictive accuracy
 Selecting splits
 Determining when to stop splitting
 Selecting the “rightsized” tree.
These steps are very similar to those discussed in the context of Classification Trees Analysis (see also Breiman et al., 1984, for more details). See also, Computational Formulas.
Specifying the Criteria for Predictive Accuracy
The classification and regression trees (C&RT) algorithms are generally aimed at achieving the best possible predictive accuracy. Operationally, the most accurate prediction is defined as the prediction with the minimum costs. The notion of costs was developed as a way to generalize, to a broader range of prediction situations, the idea that the best prediction has the lowest misclassification rate. In most applications, the cost is measured in terms of proportion of misclassified cases, or variance. In this context, it follows, therefore, that a prediction would be considered best if it has the lowest misclassification rate or the smallest variance. The need for minimizing costs, rather than just the proportion of misclassified cases, arises when some predictions that fail are more catastrophic than others, or when some predictions that fail occur more frequently than others.
Priors. In the case of a categorical response (classification problem), minimizing costs amounts to minimizing the proportion of misclassified cases when priors are taken to be proportional to the class sizes and when misclassification costs are taken to be equal for every class.
The a priori probabilities used in minimizing costs can greatly affect the classification of cases or objects. Therefore, care has to be taken while using the priors. If differential base rates are not of interest for the study, or if we know that there are about an equal number of cases in each class, then we would use equal priors. If the differential base rates are reflected in the class sizes (as they would be, if the sample is a probability sample), then we would use priors estimated by the class proportions of the sample. Finally, if we have specific knowledge about the base rates (for example, based on previous research), then we would specify priors in accordance with that knowledge The general point is that the relative size of the priors assigned to each class can be used to “adjust” the importance of misclassifications for each class. However, no priors are required when we are building a regression tree.
Misclassification costs. Sometimes more accurate classification of the response is desired for some classes than others for reasons not related to the relative class sizes. If the criterion for predictive accuracy is Misclassification costs, then minimizing costs would amount to minimizing the proportion of misclassified cases when priors are considered proportional to the class sizes and misclassification costs are taken to be equal for every class.
Case weights. Case weights are treated strictly as case multipliers. For example, the misclassification rates from an analysis of an aggregated data set using case weights will be identical to the misclassification rates from the same analysis where the cases are replicated the specified number of times in the data file.
However, note that the use of case weights for aggregated data sets in classification problems is related to the issue of minimizing costs. Interestingly, as an alternative to using case weights for aggregated data sets, we could specify appropriate priors and/or misclassification costs and produce the same results while avoiding the additional processing required to analyze multiple cases with the same values for all variables. Suppose that in an aggregated data set with two classes having an equal number of cases, there are case weights of 2 for all cases in the first class, and case weights of 3 for all cases in the second class. If we specified priors of .4 and .6, respectively, specified equal misclassification costs, and analyzed the data without case weights, we will get the same misclassification rates as we would get if we specified priors estimated by the class sizes, specified equal misclassification costs, and analyzed the aggregated data set using the case weights. We would also get the same misclassification rates if we specified priors to be equal, specified the costs of misclassifying class 1 cases as class 2 cases to be 2/3 of the costs of misclassifying class 2 cases as class 1 cases, and analyzed the data without case weights.
Selecting Splits
The second basic step in classification and regression trees is to select the splits on the predictor variables that are used to predict membership in classes of the categorical dependent variables, or to predict values of the continuous dependent (response) variable. In general terms, the split at each node will be found that will generate the greatest improvement in predictive accuracy. This is usually measured with some type of node impurity measure, which provides an indication of the relative homogeneity (the inverse of impurity) of cases in the terminal nodes. If all cases in each terminal node show identical values, then node impurity is minimal, homogeneity is maximal, and prediction is perfect (at least for the cases used in the computations; predictive validity for new cases is of course a different matter…).
For classification problems, C&RT gives the user the choice of several impurity measures: The Gini index, Chisquare, or Gsquare. The Gini index of node impurity is the measure most commonly chosen for classificationtype problems. As an impurity measure, it reaches a value of zero when only one class is present at a node. With priors estimated from class sizes and equal misclassification costs, the Gini measure is computed as the sum of products of all pairs of class proportions for classes present at the node; it reaches its maximum value when class sizes at the node are equal; the Gini index is equal to zero if all cases in a node belong to the same class. The Chisquare measure is similar to the standard Chisquare value computed for the expected and observed classifications (with priors adjusted for misclassification cost), and the Gsquare measure is similar to the maximumlikelihood Chisquare (as for example computed in the LogLinear module). For regressiontype problems, a leastsquares deviation criterion (similar to what is computed in least squares regression) is automatically used. Computational Formulas provides further computational details.
Determining When to Stop Splitting
As discussed in Basic Ideas, in principal, splitting could continue until all cases are perfectly classified or predicted. However, this wouldn’t make much sense since we would likely end up with a tree structure that is as complex and “tedious” as the original data file (with many nodes possibly containing single observations), and that would most likely not be very useful or accurate for predicting new observations. What is required is some reasonable stopping rule. In C&RT, two options are available that can be used to keep a check on the splitting process; namely Minimum n and Fraction of objects.
Minimum n. One way to control splitting is to allow splitting to continue until all terminal nodes are pure or contain no more than a specified minimum number of cases or objects. In C&RT this is done by using the option Minimum n that allows us to specify the desired minimum number of cases as a check on the splitting process. This option can be used when Prune on misclassification error, Prune on deviance, or Prune on variance is active as the Stopping rule for the analysis.
Fraction of objects. Another way to control splitting is to allow splitting to continue until all terminal nodes are pure or contain no more cases than a specified minimum fraction of the sizes of one or more classes (in the case of classification problems, or all cases in regression problems). This option can be used when FACTstyle direct stopping has been selected as the Stopping rule for the analysis. In C&RT, the desired minimum fraction can be specified as the Fraction of objects. For classification problems, if the priors used in the analysis are equal and class sizes are equal as well, then splitting will stop when all terminal nodes containing more than one class have no more cases than the specified fraction of the class sizes for one or more classes. Alternatively, if the priors used in the analysis are not equal, splitting will stop when all terminal nodes containing more than one class have no more cases than the specified fraction for one or more classes. See Loh and Vanichestakul, 1988 for details.
Pruning and Selecting the “RightSized” Tree
The size of a tree in the classification and regression trees analysis is an important issue, since an unreasonably big tree can only make the interpretation of results more difficult. Some generalizations can be offered about what constitutes the “rightsized” tree. It should be sufficiently complex to account for the known facts, but at the same time it should be as simple as possible. It should exploit information that increases predictive accuracy and ignore information that does not. It should, if possible, lead to greater understanding of the phenomena it describes. The options available in C&RT allow the use of either, or both, of two different strategies for selecting the “rightsized” tree from among all the possible trees. One strategy is to grow the tree to just the right size, where the right size is determined by the user, based on the knowledge from previous research, diagnostic information from previous analyses, or even intuition. The other strategy is to use a set of welldocumented, structured procedures developed by Breiman et al. (1984) for selecting the “rightsized” tree. These procedures are not foolproof, as Breiman et al. (1984) readily acknowledge, but at least they take subjective judgment out of the process of selecting the “rightsized” tree.
FACTstyle direct stopping. We will begin by describing the first strategy, in which the user specifies the size to grow the tree. This strategy is followed by selecting FACTstyle direct stopping as the stopping rule for the analysis, and by specifying the Fraction of objects that allows the tree to grow to the desired size. C&RT provides several options for obtaining diagnostic information to determine the reasonableness of the choice of size for the tree. Specifically, three options are available for performing crossvalidation of the selected tree; namely Test sample, Vfold, and Minimal costcomplexity.
Test sample crossvalidation. The first, and most preferred type of crossvalidation is the test sample crossvalidation. In this type of crossvalidation, the tree is computed from the learning sample, and its predictive accuracy is tested by applying it to predict the class membership in the test sample. If the costs for the test sample exceed the costs for the learning sample, then this is an indication of poor crossvalidation. In that case, a different sized tree might crossvalidate better. The test and learning samples can be formed by collecting two independent data sets, or if a large learning sample is available, by reserving a randomly selected proportion of the cases, say a third or a half, for use as the test sample.
In the C&RT module, test sample crossvalidation is performed by specifying a sample identifier variable that contains codes for identifying the sample (learning or test) to which each case or object belongs.
Vfold crossvalidation. The second type of crossvalidation available in C&RT is Vfold crossvalidation. This type of crossvalidation is useful when no test sample is available and the learning sample is too small to have the test sample taken from it. The userspecified ‘v’ value for vfold crossvalidation (its default value is 3) determines the number of random subsamples, as equal in size as possible, that are formed from the learning sample. A tree of the specified size is computed ‘v’ times, each time leaving out one of the subsamples from the computations, and using that subsample as a test sample for crossvalidation, so that each subsample is used (v – 1) times in the learning sample and just once as the test sample. The CV costs (crossvalidation cost) computed for each of the ‘v’ test samples are then averaged to give the vfold estimate of the CV costs.
Minimal costcomplexity crossvalidation pruning. In C&RT, minimal costcomplexity crossvalidation pruning is performed, if Prune on misclassification error has been selected as the Stopping rule. On the other hand, if Prune on deviance has been selected as the Stopping rule, then minimal deviancecomplexity crossvalidation pruning is performed. The only difference in the two options is the measure of prediction error that is used. Prune on misclassification error uses the costs that equals the misclassification rate when priors are estimated and misclassification costs are equal, while Prune on deviance uses a measure, based on maximumlikelihood principles, called the deviance (see Ripley, 1996). For details about the algorithms used in C&RT to implement Minimal costcomplexity crossvalidation pruning, see also the Introductory Overview and Computational Methods sections of Classification Trees Analysis.
The sequence of trees obtained by this algorithm have a number of interesting properties. They are nested, because the successively pruned trees contain all the nodes of the next smaller tree in the sequence. Initially, many nodes are often pruned going from one tree to the next smaller tree in the sequence, but fewer nodes tend to be pruned as the root node is approached. The sequence of largest trees is also optimally pruned, because for every size of tree in the sequence, there is no other tree of the same size with lower costs. Proofs and/or explanations of these properties can be found in Breiman et al. (1984).
Tree selection after pruning. The pruning, as discussed above, often results in a sequence of optimally pruned trees. So the next task is to use an appropriate criterion to select the “rightsized” tree from this set of optimal trees. A natural criterion would be the CV costs (crossvalidation costs). While there is nothing wrong with choosing the tree with the minimum CV costs as the “rightsized” tree, oftentimes there will be several trees with CV costs close to the minimum. Following Breiman et al. (1984) we could use the “automatic” tree selection procedure and choose as the “rightsized” tree the smallestsized (least complex) tree whose CV costs do not differ appreciably from the minimum CV costs. In particular, they proposed a “1 SE rule” for making this selection, i.e., choose as the “rightsized” tree the smallestsized tree whose CV costs do not exceed the minimum CV costs plus 1 times the standard error of the CV costs for the minimum CV costs tree. In C&RT, a multiple other than the 1 (the default) can also be specified for the SE rule. Thus, specifying a value of 0.0 would result in the minimal CV cost tree being selected as the “rightsized” tree. Values greater than 1.0 could lead to trees much smaller than the minimal CV cost tree being selected as the “rightsized” tree. One distinct advantage of the “automatic” tree selection procedure is that it helps to avoid “over fitting” and “under fitting” of the data.
As can be been seen, minimal costcomplexity crossvalidation pruning and subsequent “rightsized” tree selection is a truly “automatic” process. The algorithms make all the decisions leading to the selection of the “rightsized” tree, except for, perhaps, specification of a value for the SE rule. Vfold crossvalidation allows us to evaluate how well each tree “performs” when repeatedly crossvalidated in different samples randomly drawn from the data.
Computational Formulas
In Classification and Regression Trees, estimates of accuracy are computed by different formulas for categorical and continuous dependent variables (classification and regressiontype problems). For classificationtype problems (categorical dependent variable) accuracy is measured in terms of the true classification rate of the classifier, while in the case of regression (continuous dependent variable) accuracy is measured in terms of mean squared error of the predictor.
In addition to measuring accuracy, the following measures of node impurity are used for classification problems: The Gini measure, generalized Chisquare measure, and generalized Gsquare measure. The Chisquare measure is similar to the standard Chisquare value computed for the expected and observed classifications (with priors adjusted for misclassification cost), and the Gsquare measure is similar to the maximumlikelihood Chisquare (as for example computed in the LogLinear module). The Gini measure is the one most often used for measuring purity in the context of classification problems, and it is described below.
For continuous dependent variables (regressiontype problems), the least squared deviation (LSD) measure of impurity is automatically applied.
Estimation of Accuracy in Classification
In classification problems (categorical dependent variable), three estimates of the accuracy are used: resubstitution estimate, test sample estimate, and vfold crossvalidation. These estimates are defined here.
Resubstitution estimate. Resubstitution estimate is the proportion of cases that are misclassified by the classifier constructed from the entire sample. This estimate is computed in the following manner:
where X is the indicator function;
X = 1, if the statement is true
X = 0, if the statement is false
and d (x) is the classifier.
The resubstitution estimate is computed using the same data as used in constructing the classifier d .
Test sample estimate. The total number of cases are divided into two subsamples Z1 and Z2. The test sample estimate is the proportion of cases in the subsample Z2, which are misclassified by the classifier constructed from the subsample Z1. This estimate is computed in the following way.
Let the learning sample Z of size N be partitioned into subsamples Z1 and Z2 of sizes N and N2, respectively.
where Z2 is the sub sample that is not used for constructing the classifier.
vfold crossvalidation. The total number of cases are divided into v sub samples Z1, Z2, …, Zv of almost equal sizes. vfold cross validation estimate is the proportion of cases in the subsample Z that are misclassified by the classifier constructed from the subsample Z – Zv. This estimate is computed in the following way.
Let the learning sample Z of size N be partitioned into v sub samples Z1, Z2, …, Zv of almost sizes N1, N2, …, Nv, respectively.
where is computed from the sub sample Z – Zv .
Estimation of Accuracy in Regression
In the regression problem (continuous dependent variable) three estimates of the accuracy are used: resubstitution estimate, test sample estimate, and vfold crossvalidation. These estimates are defined here.
Resubstitution estimate. The resubstitution estimate is the estimate of the expected squared error using the predictor of the continuous dependent variable. This estimate is computed in the following way.
where the learning sample Z consists of (xi,yi),i = 1,2,…,N. The resubstitution estimate is computed using the same data as used in constructing the predictor d .
Test sample estimate. The total number of cases are divided into two subsamples Z1 and Z2. The test sample estimate of the mean squared error is computed in the following way:
Let the learning sample Z of size N be partitioned into subsamples Z1 and Z2 of sizes N and N2, respectively.
where Z2 is the subsample that is not used for constructing the predictor.
vfold crossvalidation. The total number of cases are divided into v sub samples Z1, Z2, …, Zv of almost equal sizes. The subsample Z – Zv is used to construct the predictor d. Then vfold cross validation estimate is computed from the subsample Zv in the following way:
Let the learning sample Z of size N be partitioned into v sub samples Z1, Z2, …, Zv of almost sizes N1, N2, …, Nv, respectively.
where is computed from the sub sample Z – Zv .
Estimation of Node Impurity: Gini Measure
The Gini measure is the measure of impurity of a node and is commonly used when the dependent variable is a categorical variable, defined as:
if costs of misclassification are not specified,
if costs of misclassification are specified,
where the sum extends over all k categories. p( j / t) is the probability of category j at the node t and C(i / j ) is the probability of misclassifying a category j case as category i.
Estimation of Node Impurity: LeastSquared Deviation
Leastsquared deviation (LSD) is used as the measure of impurity of a node when the response variable is continuous, and is computed as:
where Nw(t) is the weighted number of cases in node t, wi is the value of the weighting variable for case i, fi is the value of the frequency variable, yi is the value of the response variable, and y(t) is the weighted mean for node t.
Desktop STATISTICA Prevails Over Minitab in a Comparative Review
eKPI Solutions, a consulting and training company, has published on its portal a comparative evaluation and review of the desktop version of STATISTICA vs. Minitab, another popular application for quality improvement and Six Sigma applications.
In this review, the author concludes that users who need tools for “advanced (statistical) analysis knowledge discovery” should “get STATISTICA. STATISTICA provides all necessary statistical tools!” and that the StatSoft products “can be used by a very wide audience” of users.
The author goes on to list a number of “clear, qualitative advantages of STATISTICA over Minitab” including:
 STATISTICA can be used by “virtually anyone who needs to ‘review numerical reports’ as part of their job.”
 STATISTICA is designed to address the data collection and analysis needs of each stage of a project and can serve as the analytic foundation for project implementation at companies of any size.
 STATISTICA offers “more of the relevant tools than any other commercially available application.”
 Graphical capabilities in STATISTICA are “simply in a whole different category than those of Minitab, in almost every respect” including: selection, flexibility, presentation quality appearance, integration with analyses, breath of features/functionality, interactive nature of graphs, compatibility with other applications, object model support…”there is no comparison.”
 The breath of statistical analyses is “also in a completely different category” and the selection “goes far beyond what Minitab offers.”
 STATISTICA includes a Reporter tool which allows for tables and graphs to be automatically (or manually) directed to the Reporter or even Microsoft Word to “develop presentation quality reports, completely within STATISTICA” and by comparison “tables and graphs cannot even be collected into a single editor inside of Minitab let alone presented in a professional way or later modified.”
The author concludes the lengthy review concluding that “companies that deploy STATISTICA” will find a complete arsenal of tools specifically ‘preconfigured’ for implementations of Six Sigma strategies at any level of the organization and a unique set of customization facilities [that] will allow them to quickly convert STATISTICA into a tool that will look and work as if it were originally developed ‘only’ to meet the needs of their specific organization.”
First Look at ”STATISTICA” – Decision Management
First Look – StatSoft STATISTICA
James Taylor on Everything Decision Management
January 31, 2012
StatSoft was founded in 1984 and started building statistical software when it first became practical to deliver on the PC. STATISTICA is an enterprise predictive analytics platform on the Windows platform with rolebased access, connections to the various data sources that companies have and support for data exploration through to deployment. The product has four main pieces:
 Windowsbased analytics Workbench for analysts.
 Decision Management to combine models and rules to automate decisionmaking.
 Enterprise Server to support multiple users in a client/server environment.
 Enterprise Workspaces to capture the data analysis process from end to end and for managing metadata, decisionmaking workflow etc.
STATISTICA is a long time Windows platform user and Microsoft partner. As a result it offers a solution that is tightly coupled with Intel multicore chips very well integrated with Windows. Everything is available as an API call in .Net making it easy to integrate into SharePoint or other Windows applications.
The components get combined into various analytic applications such as a warranty analytics solution, credit scoring, collections, crosssell, insurance fraud detection, subrogation, price optimization, marketing mix optimization and more. These solutions can be completely automated, accessing multiple data sources, running tens or hundreds of predictive analytic models, writing results back into the database and monitoring the performance of the models.
With version 11, StatSoft released the STATISTICA Decisioning Platform that pulls together all the existing product capabilities with new rules management, integrated rules scoring, and other capabilities. The suite now includes:
 Templated data access
 Data preprocessing
 Rules management
 Modeling tools including accelerated logistic regression
 Version control
 Direct deployment
Everything is managed in an enterprise metadata repository deployed on a relational database. Workflows and other components for model creation or business rules are created, managed in the enterprise repository and deployed to a server for execution. Multiple projects and folders can be managed in the repository and permissions are layered onto these. Data access templates, analysis templates, decisioning flows and rules are all managed in this repository. Decision flows with models and rules are checked in and then used to drive reporting (integration with MS document tools), batch scoring for writing back to the production database or deployment to the STATISTICA Live Score Server for realtime decisioning using web services calls. There is a Monitoring and Alerting server for dashboards that monitor model performance and there is an integrated Document Management System for version control and approvals of models.
A decisioning flow involves several steps using the STATISTICA Enterprise Manager product. At each stage elements are retrieved from the repository based on the access defined for users and can be written back to the repository for management and reuse.
 The first step is to retrieve data from data connection and configuration templates. Users may have access to the underlying queries or just to the data. Data from multiple data connections can be used and a wide range of ETL functions are available in the data manipulation step.
 Data can be prepared and recoded, using Weight of Evidence for instance, and these transformations are then deployed as rules that can be versioned and reused. The rules are sequential and can assign text labels as well as transform the data. The rules are deployed to the enterprise server and can be associated with the data source. They can then be included in the defined workflow.
 Models can then be built using various modeling techniques and embedded in the flow. A wide range of modeling techniques are supported and the workflow can create multiple models, combine or compare them etc.
 Additional rules can be added to the workflow. The rules node contains a sequential set of rules built using an editor that has some integration with the data structures being manipulated and has a nice tree structure to allow rules to be collapsed. Temporary variables can be managed and models can be executed by the rules as necessary. Reason codes can be assigned using array handling that lets you build a set of reason codes. Rules can be reused across batch and realtime environment and multiple workflows. Rules have access to the full range of mathematical functions also.
 The whole workflow can then be deployed to the various deployment options.
A debugger allows a set of records to run through the flow and see which transactions fired which rules. While the rules do not offer conflict detection there is some error detection (use of a variable that is not defined for instance) and some tools in the enterprise platform to see which objects refer to which other objects. Users can run multiple paths in a workflow for comparison purposes and can then use analysis tools built into the modeling environment to see what difference a change would make or which approach would be more profitable.
Besides executing the complete decisioning workflows using the STATISTICA products, all of the models can also be deployed as C, C++, PMML, Visual Basic, SAS, Java or C# Stored Procedure. The tool also supports Visual Basic scripting and this can also be used to push things into the database programmatically.
StatSoft will be one of the vendors listed in the forthcoming report on Decision Management Systems platform technologies.